Difference between revisions of "Eosc/norbert"

From Nordic Language Processing Laboratory
 
*[https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989# CommonCrawl from CoNLL 2017]; 1.3 billion words; sentences are shuffled; not clean.

*[https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-34/ NB Digital]; 200 million words; sentences are ordered; semi-clean (OCR quality varies).

Revision as of 18:55, 2 December 2020

Working Notes for Norwegian BERT-Like Models

Report on the creation of FinBERT: https://arxiv.org/pdf/1912.07076.pdf

Working NVIDIA implementation workflow on Saga

Available Bokmål Text Corpora

Special stuff for Norsk Aviskorpus

/cluster/projects/nn9447k/andreku/norbert_corpora/NAK/

1. Post-2011 archives contain XML files, one document per file, in UTF-8 encoding. A simple XML reader extracts the text easily; no problems here.

2. For years up to 2005 (the '1998-2011/1/' subdirectory), the text is in one-token-per-line format, with special delimiters marking the beginning of each document and providing its URL. We still have to decide exactly how to convert this to running text.

3. Everything up to and including 2011 (the '1998-2011/' subdirectory) is in ISO 8859-1 ('Latin-1') encoding. The '1998-2011/3' subdirectory also contains XML files in 8859-1, although some of them falsely claim (in their headers) to be UTF-8. These files must be converted to UTF-8 before any other pre-processing.
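A minimal sketch of the re-encoding step from point 3, assuming we write a parallel directory tree; the function names and the idea of mirroring the tree are illustrative, not part of NAK itself:

```python
# Sketch: re-encode the pre-2012 NAK files from Latin-1 to UTF-8.
# Paths and layout are illustrative; adjust to the real subdirectories.
from pathlib import Path

def latin1_to_utf8(src: Path, dst: Path) -> None:
    """Read a Latin-1 file and write it back out as UTF-8."""
    text = src.read_text(encoding="iso-8859-1")
    dst.write_text(text, encoding="utf-8")

def convert_tree(root: Path, out_root: Path) -> None:
    """Mirror 'root' into 'out_root', converting every file's encoding."""
    for src in root.rglob("*"):
        if src.is_file():
            dst = out_root / src.relative_to(root)
            dst.parent.mkdir(parents=True, exist_ok=True)
            latin1_to_utf8(src, dst)
```

A side effect worth noting: once the '1998-2011/3' XML files are re-encoded this way, their headers claiming UTF-8 become correct, so no separate header fix should be needed.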

Preprocessing and Tokenization

The SentencePiece library finds 157 unique characters in the Norwegian Wikipedia dump.
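The character inventory can also be counted directly, which is useful for sanity-checking a dump before training; this is a small stand-alone sketch (the file name is illustrative), not the SentencePiece-internal computation:

```python
# Sketch: count the unique characters (and their frequencies) in a
# one-sentence-per-line UTF-8 text dump.
from collections import Counter

def unique_chars(path: str) -> Counter:
    """Return a Counter mapping each character in the file to its count."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.rstrip("\n"))
    return counts

# len(unique_chars("no_wiki.txt")) gives the alphabet size; per the note
# above, this comes out to 157 for the Norwegian Wikipedia dump.
```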

Input file format:

1. One sentence per line. These should ideally be actual sentences, not entire paragraphs or arbitrary spans of text, because BERT uses sentence boundaries for the next-sentence-prediction task.

2. Blank lines between documents. Document boundaries are needed so that the next-sentence-prediction task does not cross from one document into another.
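A sketch of writing this input format, assuming documents arrive as plain strings; the regex sentence splitter here is a naive placeholder, not the segmenter we would actually use for Norwegian:

```python
# Sketch: write documents in the BERT pretraining input format --
# one sentence per line, with a blank line between documents.
import re

def to_bert_input(documents, out_path):
    with open(out_path, "w", encoding="utf-8") as out:
        for doc in documents:
            # Naive split on sentence-final punctuation followed by
            # whitespace; replace with a real sentence segmenter.
            sentences = re.split(r"(?<=[.!?])\s+", doc.strip())
            for sent in sentences:
                if sent:
                    out.write(sent + "\n")
            out.write("\n")  # blank line marks the document boundary
```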

Evaluation

Do we have Norwegian test sets available for typical NLP tasks, so that we can evaluate our NorBERT?