Difference between revisions of "Eosc/norbert"

From Nordic Language Processing Laboratory

Latest revision as of 22:19, 12 January 2021

These are just working notes. See the official NorBERT release announcement.

Working Notes for Norwegian BERT-Like Models

Report on the creation of FinBERT: https://arxiv.org/pdf/1912.07076.pdf

Working NVIDIA implementation workflow on Saga

NorBERT tools: https://github.uio.no/andreku/NorBERT

Status for the joint model

  1. Training corpus: prepared
  2. Training corpus segmentation (https://github.uio.no/andreku/NorBERT/blob/master/sentence_segment.slurm): complete
  3. SentencePiece vocabulary (https://github.uio.no/andreku/NorBERT/tree/master/vocabulary): created
  4. TFRecords for Phase 1 (sequence length 128): complete (/cluster/projects/nn9851k/andreku/norbert_data/norbert128/)
  5. Training for Phase 1: complete
  6. TFRecords for Phase 2 (sequence length 512): complete (/cluster/projects/nn9851k/andreku/norbert_data/norbert512/)
  7. Training for Phase 2: in progress

Available Bokmål Text Corpora

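We use

  • Norsk Aviskorpus (NAK; https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-4/); 1.7 billion words; sentences are ordered; clean;
  • Norwegian Wikipedia (https://dumps.wikimedia.org/nowiki/latest/); 160 million words; sentences are ordered; clean (more or less).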

NAK Bokmål: 712 145 669 word tokens

NAK Nynorsk: 47 180 985 word tokens

NAK unspecified: 1 119 744 725 word tokens

We start by training a joint BERT model. Meanwhile, we will run langid (https://github.com/saffsd/langid.py) on the texts with unspecified language, so that separate Bokmål and Nynorsk models can be trained later.
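A minimal sketch of how the unspecified texts could be routed once a language identifier is available. The function name split_by_language and the "other" bucket are assumptions made for illustration, not part of the NorBERT tools. The classify argument stands for any callable that returns a (language, score) pair, which is the shape returned by langid.classify.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Any callable mapping a text to (language code, score), e.g. langid.classify.
Classifier = Callable[[str], Tuple[str, float]]


def split_by_language(
    docs: Iterable[str],
    classify: Classifier,
    wanted: Tuple[str, ...] = ("nb", "nn"),
) -> Dict[str, List[str]]:
    """Group documents by predicted language (hypothetical helper).

    Documents whose predicted language is not in `wanted` are collected
    under the key "other" so that nothing is silently dropped.
    """
    buckets: Dict[str, List[str]] = defaultdict(list)
    for doc in docs:
        lang, _score = classify(doc)
        buckets[lang if lang in wanted else "other"].append(doc)
    return dict(buckets)
```

With langid.py installed, one would probably restrict the candidate set first with langid.set_languages(["nb", "nn"]) and then pass langid.classify as the classify argument.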

We currently do not use

  • noWAC; 700 million words; sentences are ordered; semi-clean;
  • CommonCrawl from CoNLL 2017; 1.3 billion words; sentences are shuffled; not clean;
  • NB Digital; 200 million words; sentences are ordered; semi-clean (OCR quality varies).

Special stuff for Norsk Aviskorpus

/cluster/projects/nn9851k/andreku/norbert_corpora/NAK/

1. Post-2011 archives contain XML files, one document per file, UTF-8 encoding. A simple XML reader will extract text from them easily. No problems here.

2. For years up to 2005 ('1998-2011/1/' subdirectory), the text is in one-token-per-line format. Special delimiters signal the beginning of a new document and provide the URLs. Will have to decide how exactly to convert it to running text. Moses de-tokenizer (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl)? DONE using a self-made tokenizer (in the repo).

3. Everything up to and including 2011 ('1998-2011/' subdirectory) is in the ISO 8859-1 encoding ('Latin-1'). The '1998-2011/3' subdirectory contains XML files which are in ISO 8859-1 as well, although some of them falsely claim (in their headers) to be UTF-8. Must convert to UTF-8 before any other pre-processing. DONE
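The encoding fix and the XML text extraction described in the items above can be sketched as follows. This is an illustration under stated assumptions, not the code from the repo: the function names are hypothetical, and because the NAK XML schema is not described in these notes, the extractor simply collects every text node instead of selecting specific elements.

```python
import re
import xml.etree.ElementTree as ET

# Encoding pseudo-attribute of an XML declaration at the very start of a file.
_DECL_ENCODING = re.compile(rb'^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2')


def latin1_xml_to_utf8(raw: bytes) -> bytes:
    """Re-encode Latin-1 bytes as UTF-8 and make the XML header agree.

    The bytes are always decoded as Latin-1, whatever the header claims,
    because some files are said to declare UTF-8 falsely.
    """
    utf8 = raw.decode("latin-1").encode("utf-8")
    # Rewrite the declared encoding (if any) so parsers trust the new bytes.
    return _DECL_ENCODING.sub(rb"\g<1>\g<2>UTF-8\g<2>", utf8, count=1)


def extract_text(xml_bytes: bytes) -> str:
    """Return all non-empty text nodes of the document, one per line."""
    root = ET.fromstring(xml_bytes)
    pieces = (piece.strip() for piece in root.itertext())
    return "\n".join(piece for piece in pieces if piece)


if __name__ == "__main__":
    # A Latin-1 file whose header wrongly claims UTF-8 (made-up example).
    raw = '<?xml version="1.0" encoding="UTF-8"?><doc><p>Blåbær</p></doc>'.encode("latin-1")
    print(extract_text(latin1_xml_to_utf8(raw)))
```

Doing the conversion on raw bytes before parsing matters here: a parser that trusts a false UTF-8 header would reject or mangle the Latin-1 bytes for characters such as å, æ and ø.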

Preprocessing and Tokenization

1. Check quotes after detokenization DONE

2. Obtain Nynorsk Wikipedia (since we are going to train a joint BERT model) DONE

3. Find out how Stanza evaluates sentence segmentation. Compare its performance with UDPipe and Punkt. (POSTPONED)

4. Train a joint Stanza sentence segmenter on Nynorsk and Bokmål if necessary (are embeddings needed?). (POSTPONED)

5. Sentence-segment the corpora. DONE
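For the older one-token-per-line NAK files, the conversion to running text and the quote check in item 1 could look roughly like the sketch below. It is not the self-made tool from the repo. The marker string "##DOC" is a placeholder, since these notes do not give the actual document delimiter, and the punctuation rules are a deliberately small assumption.

```python
import re
from typing import Iterable, List

# Placeholder only: the real NAK document delimiter is not given in these notes.
DOC_MARKER = "##DOC"


def detokenize(tokens: List[str]) -> str:
    """Join tokens with spaces, then reattach common punctuation."""
    text = " ".join(tokens)
    text = re.sub(r"\s+([.,;:!?)»])", r"\1", text)  # no space before closing marks
    text = re.sub(r"([(«])\s+", r"\1", text)        # no space after opening marks
    return text


def tokens_to_documents(lines: Iterable[str], marker: str = DOC_MARKER) -> List[str]:
    """Split a one-token-per-line stream into detokenized documents.

    A line starting with `marker` opens a new document; the rest of that
    line (for example a URL) is discarded. Tokens that appear before the
    first marker are ignored.
    """
    docs: List[List[str]] = []
    for line in lines:
        token = line.strip()
        if not token:
            continue
        if token.startswith(marker):
            docs.append([])
        elif docs:
            docs[-1].append(token)
    return [detokenize(tokens) for tokens in docs if tokens]
```

Straight double quotes are left untouched on purpose: whether such a quote opens or closes cannot be decided from the token alone, which is one reason the quotes need a separate check after detokenization.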

Vocabulary

The SentencePiece library (https://github.com/google/sentencepiece) finds 157 unique characters in the Norwegian Wikipedia dump.
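A character count like this can be cross-checked with a few lines of plain Python. The sketch below is not how SentencePiece computes its figure; it simply counts the distinct characters in an iterable of lines, and the function name char_inventory is an assumption.

```python
from collections import Counter
from typing import Iterable


def char_inventory(lines: Iterable[str]) -> Counter:
    """Count how often each character occurs, ignoring line breaks."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.rstrip("\n"))
    return counts
```

The number of unique characters is then len(char_inventory(lines)), and most_common() shows which rare characters make up the long tail.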

Should we assume that the input to the trained model will be tokenized text (punctuation marks separated from words) or not?

This is an issue of balancing between the needs of a naive user (who wants to avoid any pre-processing) and the needs of a computational linguist (who arguably wants to have more linguistically meaningful tokens at the output).

We decided that the default model is trained on raw text, but if time allows, a 'tokenized' model should be trained for comparison.

Training input file format

1. One sentence per line. These should ideally be actual sentences, not entire paragraphs or arbitrary spans of text, because BERT uses the sentence boundaries for the "next sentence prediction" task. Will do sentence-splitting with Stanza.

2. Blank lines between documents. Document boundaries are needed so that the "next sentence prediction" task does not span across documents.

3. Text files are converted to TFRecords. Each TFRecord file is about 60 times larger than the original gzipped text file. We need about 300 GB to store the TFRecords for sequence length 128 for our full training corpus.
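The text layout from items 1 and 2 can be produced with a small writer such as the one below. It is a sketch under the assumption that each document has already been split into a list of sentences; the function name is hypothetical and the TFRecord conversion itself is not shown.

```python
from typing import Iterable, List, TextIO


def write_pretraining_text(docs: Iterable[List[str]], out: TextIO) -> None:
    """Write one sentence per line, with one blank line between documents."""
    first = True
    for sentences in docs:
        # Drop empty sentences so they are not mistaken for document breaks.
        cleaned = [s.strip() for s in sentences if s.strip()]
        if not cleaned:
            continue
        if not first:
            out.write("\n")  # blank line marks the document boundary
        out.write("\n".join(cleaned) + "\n")
        first = False
```

Skipping empty sentences and empty documents keeps blank lines reserved for document boundaries, which is what the "next sentence prediction" sampling relies on.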

Training details

Batch size: 96 (EngBERT: 256, FinBERT: 140)

Global batch size (8 GPUs): 768 (EngBERT: 4096, FinBERT: 1120)

Target epochs over the full corpus: 3 (EngBERT: 40, FinBERT: 3)

Target training steps: 795 000 (EngBERT: 1 000 000, FinBERT: 1 000 000)

The model will train on approximately 680M sentences in the end (EngBERT: 4B, FinBERT: 1.1B).

Time for 1 epoch: 133 hours (about 5.5 days)
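The figures above are related by simple arithmetic, sketched here for checking. The helper names are assumptions, and the sketch treats every training step as consuming exactly one global batch of training sequences (what the notes call sentences).

```python
def global_batch(per_gpu_batch: int, n_gpus: int) -> int:
    """Sequences processed per training step across all GPUs."""
    return per_gpu_batch * n_gpus


def sequences_seen(global_batch_size: int, steps: int) -> int:
    """Total training sequences consumed after the given number of steps."""
    return global_batch_size * steps


def hours_to_days(hours: float) -> float:
    return hours / 24.0


if __name__ == "__main__":
    norbert_global = global_batch(96, 8)                  # 768
    print(norbert_global)
    print(sequences_seen(norbert_global, 795_000))        # NorBERT
    print(sequences_seen(4096, 1_000_000))                # EngBERT
    print(sequences_seen(1120, 1_000_000))                # FinBERT
    print(round(hours_to_days(3 * 133), 1))               # three epochs, in days
```

Under this assumption the products for EngBERT and FinBERT match the quoted 4B and 1.1B. For NorBERT the product is about 611M, somewhat below the 680M quoted above, so that figure presumably rests on details not recorded in these notes. Three epochs at 133 hours each come to roughly 16.6 days.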

Evaluation

Do we have available Norwegian test sets for typical NLP tasks to evaluate our NorBERT? Please see Eosc/norbert/benchmark for a discussion.