Activity D: Very Large Corpora
For English, the UiO team has assessed the following:
* GigaWords 5.5
* Wikipedia
* COW
* Common Crawl
* NOW
So far, we have standardized on CoreNLP for corpus pre-processing, viz. sentence splitting, tokenization, part-of-speech tagging, and lemmatization. Each corpus is available in two forms: (a) ‘clean’ running text and (b) pre-processed into the CoNLL-like (tab-separated) output format of CoreNLP.
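
To illustrate how the pre-processed form might be consumed, here is a minimal Python sketch that reads CoNLL-like, tab-separated text into sentences. It assumes one token per line, a blank line between sentences, and that the first four columns are token index, word form, lemma, and part-of-speech tag. That column order is an assumption and should be checked against the actual files. The names `Token` and `read_sentences` are hypothetical and used only for this illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    index: int   # position of the token within its sentence
    form: str    # surface word form
    lemma: str   # lemma
    pos: str     # part-of-speech tag


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from CoNLL-like lines.

    Assumed layout (not confirmed by the corpus documentation):
    index<TAB>form<TAB>lemma<TAB>pos[<TAB>further columns...],
    with a blank line separating sentences.
    """
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            raise ValueError(f"expected at least 4 columns, got: {line!r}")
        sentence.append(Token(int(fields[0]), fields[1], fields[2], fields[3]))
    # The last sentence may not be followed by a blank line.
    if sentence:
        yield sentence


# Small made-up sample in the assumed layout (not taken from any corpus).
sample = (
    "1\tDogs\tdog\tNNS\n"
    "2\tbark\tbark\tVBP\n"
    "3\t.\t.\t.\n"
    "\n"
    "1\tThey\tthey\tPRP\n"
    "2\tsleep\tsleep\tVBP\n"
)
sentences = list(read_sentences(sample.splitlines()))
lemmas = [[token.lemma for token in sentence] for sentence in sentences]
```

For a real corpus file, the same function can be applied to an open file handle, e.g. `read_sentences(open(path, encoding="utf-8"))`, so that sentences are streamed rather than loaded into memory at once, which matters at this corpus scale.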