Difference between revisions of "Corpora/home"

From Nordic Language Processing Laboratory
(Created page with "= Large Corpora = NLPL provides various large data sets. They are available from the connected infrastructure. Please, check the individual pages of each resource. * [[Corpo...")
 
(Corpus Catalogue)
(8 intermediate revisions by 2 users not shown)
Line 1: Line 1:
= Large Corpora =
+
= Background =
  
NLPL provides various large data sets. They are available from the connected infrastructure. Please, check the individual pages of each resource.
 
  
* [[Corpora/OPUS|OPUS - the collection of open parallel corpora]]
+
NLPL creates and makes available various very large collections of textual data, for example drawing on Wikipedia and the Common Crawl.
 +
 
 +
These data resources are available from the connected infrastructure, below the project directory
 +
<tt>/projects/nlpl/data/corpora/</tt> or <tt>/proj/nlpl/data/corpora/</tt> on Abel and Taito, respectively.
 +
As of early 2018, the NLPL corpora resources are not yet replicated across the two systems; please see
 +
the summary table below to determine which resource is available on which system.
 +
 
 +
= Corpus Catalogue =
 +
 
 +
{| class="wikitable"
 +
|-
 +
! Directory !! Description !! System !! Install Date !! Maintainer
 +
|-
 +
| conll17 || [http://hdl.handle.net/11234/1-1989 Corpora for 46 Languages (from Wikipedia and Common Crawl)] || Abel, Taito || November 2017 || Jenna Kaverna
 +
|-
 +
| EngC3 || [http://urn.nb.no/URN:NBN:no-60569 130 Billion Tokens of ‘Clean’ Text from the Common Crawl] || Abel, Taito || April 2018 || Stephan Oepen
 +
|}
 +
 
 +
= Many Languages: The Collection of Open Parallel Corpora (Helsinki) =
 +
 
 +
 
 +
[[Corpora/OPUS|OPUS]] provides parallel texts in many languages extracted from
 +
a broad range of freely available sources.
 +
In terms of the NLPL-internal
 +
project structure, it is actually a task of its own (because there are
 +
additional on-line services connected to OPUS), hence there is a
 +
[[Corpora/OPUS|separate wiki page]] with additional information on OPUS.
 +
 
 +
= Many Languages: The CoNLL 2017 Text Collection (Turku) =
 +
 
 +
 
 +
 
 +
= English: 130 Billion Words Extracted from the Common Crawl (Oslo) =
 +
 
 +
= English: Two Variants of Text Extraction from Wikipedia (Oslo) =

Revision as of 11:23, 19 September 2018

Background

NLPL creates and makes available various very large collections of textual data, for example drawing on Wikipedia and the Common Crawl.

These data resources are available from the connected infrastructure, below the project directory /projects/nlpl/data/corpora/ or /proj/nlpl/data/corpora/ on Abel and Taito, respectively. As of early 2018, the NLPL corpora resources are not yet replicated across the two systems; please see the summary table below to determine which resource is available on which system.
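
As a rough illustration of working with the two locations named above, the following Python sketch checks which corpora root exists on the current machine and lists the directories below it. The two paths are taken from the text; the function names (find_corpora_root, list_corpora) and the system labels used as dictionary keys are hypothetical and chosen only for this example, and the sketch assumes each corpus sits in its own sub-directory directly below the root.

```python
from pathlib import Path

# Corpora roots as given in the text; the keys are only labels for this sketch.
CORPORA_ROOTS = {
    "Abel": Path("/projects/nlpl/data/corpora"),
    "Taito": Path("/proj/nlpl/data/corpora"),
}


def find_corpora_root(roots=CORPORA_ROOTS):
    """Return (system, path) for the first root that exists here, else None."""
    for system, root in roots.items():
        if root.is_dir():
            return system, root
    return None


def list_corpora(root):
    """Return the sorted names of the sub-directories directly below root."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


if __name__ == "__main__":
    found = find_corpora_root()
    if found is None:
        print("No NLPL corpora directory found on this machine.")
    else:
        system, root = found
        print(f"{system}: {root}")
        for name in list_corpora(root):
            print("  " + name)
```

Because the resources are not necessarily replicated across both systems, the listing only reflects what is installed on the machine where the script runs; the catalogue table remains the reference for what is available where.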

Corpus Catalogue

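{| class="wikitable"
|-
! Directory !! Description !! System !! Install Date !! Maintainer
|-
| conll17 || [http://hdl.handle.net/11234/1-1989 Corpora for 46 Languages (from Wikipedia and Common Crawl)] || Abel, Taito || November 2017 || Jenna Kaverna
|-
| EngC3 || [http://urn.nb.no/URN:NBN:no-60569 130 Billion Tokens of ‘Clean’ Text from the Common Crawl] || Abel, Taito || April 2018 || Stephan Oepen
|}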

Many Languages: The Collection of Open Parallel Corpora (Helsinki)

OPUS provides parallel texts in many languages extracted from a broad range of freely available sources. In terms of the NLPL-internal project structure, it is actually a task of its own (because there are additional on-line services connected to OPUS), hence there is a separate wiki page with additional information on OPUS.

Many Languages: The CoNLL 2017 Text Collection (Turku)

English: 130 Billion Words Extracted from the Common Crawl (Oslo)

English: Two Variants of Text Extraction from Wikipedia (Oslo)