Difference between revisions of "Corpora/Top"
Latest revision as of 05:17, 21 March 2017
Activity D: Very Large Corpora
For English, the UiO team has assessed the following:
* Reuters Corpus: custom license
* NANC: LDC-licensed
* GigaWords 5: LDC-licensed; newswire; 4.676 billion tokens
* Wikipedia: Wikipedia Extractor 2.129 billion tokens (Wikipedia Corpus Builder on-going)
* COW: custom crawl; sentence-shuffled
* Common Crawl: 140 billion
* NOW:
So far, we have standardized on CoreNLP for corpus pre-processing, viz. sentence splitting, tokenization, part-of-speech tagging, and lemmatization. Each corpus is available in two forms: (a) ‘clean’ running text and (b) pre-processed into the CoNLL-like (tab-separated) output format of CoreNLP.
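As a rough illustration of how the pre-processed form might be consumed, the following Python sketch reads tab-separated token lines grouped into sentences. The column order (token index, word form, lemma, part-of-speech tag) and the use of blank lines as sentence separators are assumptions made for this example; the actual columns depend on how CoreNLP was configured for these corpora. The names `Token` and `read_sentences` are hypothetical helpers, not part of CoreNLP.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One token line; the column order here is an assumption."""
    index: int
    form: str
    lemma: str
    pos: str


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from tab-separated lines.

    Assumes sentences are separated by blank lines and that the first
    four columns are index, form, lemma, and POS tag. Extra columns,
    if any, are ignored.
    """
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        sentence.append(Token(int(fields[0]), fields[1], fields[2], fields[3]))
    # Emit the final sentence when the input lacks a trailing blank line.
    if sentence:
        yield sentence


# Made-up sample input, for illustration only.
sample = (
    "1\tCorpora\tcorpus\tNNS\n"
    "2\tgrow\tgrow\tVBP\n"
    "3\t.\t.\t.\n"
    "\n"
    "1\tYes\tyes\tUH\n"
)
sentences = list(read_sentences(sample.splitlines()))
lemmas = [[token.lemma for token in sentence] for sentence in sentences]
```

Because `read_sentences` is a generator, it can be passed an open file handle and will process a multi-billion-token corpus one sentence at a time without loading it into memory.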