Latest revision as of 16:06, 4 February 2019

Background

NLPL creates and makes available various very large collections of textual data, for example drawing on Wikipedia and the Common Crawl.

These data resources are available from the connected infrastructure, below the project directory /projects/nlpl/data/corpora/ on Abel and /proj/nlpl/data/corpora/ on Taito. As of early 2018, the NLPL corpus resources are not yet replicated across the two systems; please see the summary table below to determine which resource is available on which system.
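Because the project directory differs between the two systems, scripts that should run on both need to work out which root is present. The following is a minimal sketch in Python, assuming only the two paths quoted above; the helper names `find_corpora_root` and `corpus_dir` are hypothetical and not part of any NLPL tooling.

```python
import os

# Project directories as quoted on this page, keyed by system name.
CORPORA_ROOTS = {
    "Abel": "/projects/nlpl/data/corpora",
    "Taito": "/proj/nlpl/data/corpora",
}


def find_corpora_root(roots=CORPORA_ROOTS, exists=os.path.isdir):
    """Return (system, path) for the first root that exists, else None."""
    for system, path in roots.items():
        if exists(path):
            return system, path
    return None


def corpus_dir(name, roots=CORPORA_ROOTS, exists=os.path.isdir):
    """Return the directory of corpus `name` on this system, else None.

    None means either that no NLPL root was found or that the corpus
    is not installed on the current system.
    """
    found = find_corpora_root(roots, exists)
    if found is None:
        return None
    _system, root = found
    path = root + "/" + name
    return path if exists(path) else None


if __name__ == "__main__":
    # Directory names are taken from the Corpus Catalogue table.
    for name in ("conll17", "EngC3"):
        print(name, "->", corpus_dir(name))
```

The `exists` parameter is there only so the lookup can be exercised without access to either system; on Abel or Taito the default `os.path.isdir` is sufficient.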

Corpus Catalogue

Directory	Description	System	Install Date	Maintainer
conll17	Corpora for 46 Languages (from Wikipedia and Common Crawl)	Abel, Taito	November 2017	Jenna Kaverna
EngC3	130 Billion Tokens of ‘Clean’ Text from the Common Crawl	Abel, Taito	April 2018	Stephan Oepen

Many Languages: The Collection of Open Parallel Corpora (Helsinki)

OPUS provides parallel texts in many languages extracted from a broad range of freely available sources. In terms of the NLPL-internal project structure, it is actually a task of its own (because there are additional on-line services connected to OPUS); hence, there is a separate wiki page with additional information on OPUS.

Many Languages: The CoNLL 2017 Text Collection (Turku)

The collection consists of 90B words in 45 languages gathered from CommonCrawl and Wikipedia dumps. The data ranges from around 9B words for English to 28K words of Old Church Slavonic. A number of filtering and deduplication steps have been applied to arrive at relatively clean extracted text. The texts are segmented and fully parsed with the UDPipe v1.1 parser. The processing pipeline is described in greater detail in Section 2.2 of this paper: http://universaldependencies.org/conll17/proceedings/pdf/K17-3001.pdf

This collection was used as supporting data in the 2017 and 2018 CoNLL shared tasks. In addition, some of the word2vec and ELMo embeddings distributed within NLPL are based on this data.

The raw, unprocessed version of the crawls will be available (at least) on Taito, and re-parsing this data with one of the top-ranking parsers in the 2018 Shared Task is under consideration.
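For readers who want to consume the parsed texts, the sketch below shows one way to iterate over sentences, assuming the files are in the ten-column CoNLL-U format that UDPipe writes. That format, and everything about file naming and compression, is an assumption here rather than something stated on this page; `read_conllu` is a hypothetical helper, not an NLPL utility.

```python
def read_conllu(lines):
    """Yield sentences as lists of token dicts from CoNLL-U lines.

    Assumes the standard ten tab-separated columns. Comment lines,
    multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1")
    are skipped, as are lines with an unexpected column count.
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        sentence.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]) if cols[6].isdigit() else None,
            "deprel": cols[7],
        })
    if sentence:
        yield sentence


# A tiny hand-written sample, not taken from the collection.
SAMPLE = (
    "# sent_id = 1\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tHi\thi\tINTJ\t_\t_\t0\troot\t_\t_\n"
)

if __name__ == "__main__":
    for sent in read_conllu(SAMPLE.splitlines()):
        print(" ".join(tok["form"] for tok in sent))
```

Because the function is a generator over any iterable of lines, it can be fed an open file handle directly, which matters for a collection of this size where loading a whole file into memory is unlikely to be practical.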

English: 130 Billion Words Extracted from the Common Crawl (Oslo)

English: Two Variants of Text Extraction from Wikipedia (Oslo)


