Corpora/Top
Latest revision as of 05:17, 21 March 2017
Activity D: Very Large Corpora
For English, the UiO team has assessed the following:
* Reuters Corpus: custom license
* NANC: LDC-licensed
* GigaWords 5: LDC-licensed; newswire; 4.676 billion tokens
* Wikipedia: Wikipedia Extractor; 2.129 billion tokens (Wikipedia Corpus Builder on-going)
* COW: custom crawl; sentence-shuffled
* Common Crawl: 140 billion
* NOW:
So far, we have standardized on CoreNLP for corpus pre-processing, viz. sentence splitting, tokenization, part-of-speech tagging, and lemmatization. Each corpus is available in two forms: (a) ‘clean’ running text and (b) pre-processed into the CoNLL-like (tab-separated) output format of CoreNLP.
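To illustrate, the pre-processed form can be read back with a few lines of code. The sketch below is a minimal reader for a CoNLL-like tab-separated file, assuming one token per line with (at least) index, token, lemma, and POS columns and blank lines between sentences; the exact column inventory of the CoreNLP output used here may differ.

```python
# Minimal sketch of reading CoreNLP's CoNLL-like tab-separated output.
# The column layout (index, token, lemma, POS) is an assumption based on
# the annotators listed above; real output may carry further columns,
# which this reader simply ignores.

def read_conll(lines):
    """Yield sentences as lists of (index, token, lemma, pos) tuples."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # a blank line separates sentences
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        index, token, lemma, pos = fields[:4]
        sentence.append((int(index), token, lemma, pos))
    if sentence:  # flush a final sentence with no trailing blank line
        yield sentence

# Hypothetical three-token sample in the assumed format.
sample = [
    "1\tVery\tvery\tRB",
    "2\tlarge\tlarge\tJJ",
    "3\tcorpora\tcorpus\tNNS",
    "",
]
sentences = list(read_conll(sample))
```

Iterating over a file object instead of the `sample` list works unchanged, since the reader only assumes an iterable of lines.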