Corpora/Top

From Nordic Language Processing Laboratory
Revision as of 23:04, 20 March 2017 by Oe (talk | contribs) (Activity D: Very Large Corpora)

Activity D: Very Large Corpora

For English, the UiO team has assessed the following:

* Reuters Corpus: custom license
* NANC: LDC-licensed
* Gigaword 5: LDC-licensed; newswire
* Wikipedia: Wikipedia Extractor (Wikipedia Corpus Builder on-going)
* COW: custom crawl; sentence-shuffled
* Common Crawl: 140 billion
* NOW:

So far, we have standardized on CoreNLP for corpus pre-processing, viz. sentence splitting, tokenization, part-of-speech tagging, and lemmatization. Each corpus is available in two forms: (a) ‘clean’ running text and (b) pre-processed text in the CoNLL-like (tab-separated) output format of CoreNLP.
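
As a rough illustration of how form (b) might be consumed, the sketch below reads tab-separated token lines into sentences. It rests on assumptions not stated on this page: that each token line starts with the columns index, word form, lemma, and part-of-speech tag (in that order), and that sentences are separated by blank lines. The names `Token` and `read_sentences` are hypothetical, and the sample lines are made up for illustration rather than taken from any of the corpora.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One token line; the column order is an assumption, see above."""
    index: int
    form: str
    lemma: str
    pos: str


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Group tab-separated token lines into sentences.

    A blank line is assumed to end the current sentence. Any columns
    beyond the first four are ignored.
    """
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank line: emit the sentence collected so far, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        sentence.append(Token(int(fields[0]), fields[1], fields[2], fields[3]))
    # Emit a trailing sentence that is not followed by a blank line.
    if sentence:
        yield sentence


# Made-up sample in the assumed layout (not actual corpus content).
sample = (
    "1\tCorpora\tcorpus\tNNS\n"
    "2\tgrow\tgrow\tVBP\n"
    "3\t.\t.\t.\n"
    "\n"
    "1\tThey\tthey\tPRP\n"
    "2\tdo\tdo\tVBP\n"
    "3\t.\t.\t.\n"
)
sentences = list(read_sentences(sample.splitlines()))
lemmas = [[token.lemma for token in sentence] for sentence in sentences]
```

For a real corpus file, the same generator could be fed an open file handle instead of `sample.splitlines()`, which avoids loading a very large corpus into memory at once. The column layout should be checked against the actual files before relying on it.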