Corpora/Top

From Nordic Language Processing Laboratory

Revision as of 23:49, 20 March 2017

Activity D: Very Large Corpora

For English, the UiO team has assessed the following:

* GigaWords 5.5:
* Wikipedia:
* COW:
* Common Crawl:
* NOW:

So far, we have standardized on CoreNLP for corpus pre-processing, viz. sentence splitting, tokenization, part-of-speech tagging, and lemmatization. Each corpus is available in two forms: (a) ‘clean’ running text and (b) pre-processed text in the CoNLL-like (tab-separated) output format of CoreNLP.
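
As a rough illustration of how the pre-processed form might be consumed, the following Python sketch reads tab-separated token lines into sentences. The column layout is an assumption, not something this page specifies: it supposes one token per line, with the first four columns holding token index, word form, lemma, and part-of-speech tag, and a blank line between sentences. The names `Token` and `read_sentences` are hypothetical, and the column positions should be checked against the actual files.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One token line; the column order is assumed, not confirmed."""
    index: int
    form: str
    lemma: str
    pos: str


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of tokens from CoNLL-like lines.

    Assumes tab-separated columns (index, form, lemma, POS, ...) and
    blank lines as sentence boundaries. Extra columns are ignored.
    """
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        sentence.append(Token(int(fields[0]), fields[1], fields[2], fields[3]))
    # Emit the last sentence when the input lacks a trailing blank line.
    if sentence:
        yield sentence


# A small made-up sample in the assumed layout.
sample = (
    "1\tDogs\tdog\tNNS\n"
    "2\tbark\tbark\tVBP\n"
    "3\t.\t.\t.\n"
    "\n"
    "1\tHello\thello\tUH\n"
)
sentences = list(read_sentences(sample.splitlines()))
lemmas = [[tok.lemma for tok in sent] for sent in sentences]
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a very large corpus can be streamed sentence by sentence rather than loaded into memory at once.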
