Difference between revisions of "Corpora/Top"

From Nordic Language Processing Laboratory

Latest revision as of 05:17, 21 March 2017

Activity D: Very Large Corpora

For English, the UiO team has assessed the following:

* Reuters Corpus: custom license
* NANC: LDC-licensed
* Gigaword 5: LDC-licensed; newswire; 4.676 billion tokens
* Wikipedia: text extracted with Wikipedia Extractor; 2.129 billion tokens (extraction with Wikipedia Corpus Builder is ongoing)
* COW: custom crawl; sentence-shuffled
* Common Crawl: 140 billion
* NOW:

So far, we have standardized on CoreNLP for corpus pre-processing, i.e. sentence splitting, tokenization, part-of-speech tagging, and lemmatization. Each corpus is available in two forms: (a) ‘clean’ running text and (b) pre-processed text in the CoNLL-like (tab-separated) output format of CoreNLP.
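As an illustration of how the pre-processed form might be consumed, the following is a minimal Python sketch of a reader for a CoNLL-like, tab-separated file. It assumes one token per line, a blank line between sentences, and a column order of token index, word form, lemma, and part-of-speech tag. That column order, the `COLUMNS` constant, and the `read_sentences` helper are assumptions made for this example and are not taken from the corpus files themselves; adjust them to match the actual files.

```python
from typing import Dict, Iterable, Iterator, List, Sequence

# Assumed column layout (hypothetical); change to match the real files.
COLUMNS = ("index", "form", "lemma", "pos")


def read_sentences(
    lines: Iterable[str], columns: Sequence[str] = COLUMNS
) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts from CoNLL-like lines.

    Tokens are tab-separated, one per line; a blank line ends a sentence.
    Columns beyond those named in `columns` are ignored.
    """
    sentence: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank line: close the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        sentence.append(dict(zip(columns, fields)))
    # Emit a final sentence that is not followed by a blank line.
    if sentence:
        yield sentence


# Small in-memory sample standing in for a pre-processed corpus file.
sample = (
    "1\tCorpora\tcorpus\tNNS\n"
    "2\tgrow\tgrow\tVBP\n"
    "3\t.\t.\t.\n"
    "\n"
    "1\tDone\tdo\tVBN\n"
)
sentences = list(read_sentences(sample.splitlines()))
lemmas = [token["lemma"] for token in sentences[0]]
```

For a real corpus file, the same function can be applied to an open file handle, e.g. `read_sentences(open(path, encoding="utf-8"))`, which streams sentences one at a time rather than loading a multi-billion-token file into memory.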