Difference between revisions of "Corpora/home"
Revision as of 13:13, 25 January 2018
Background
NLPL creates and makes available various very large collections of textual data, for example drawing on Wikipedia and the Common Crawl.
These data resources are available from the connected infrastructure, below the project directory /projects/nlpl/data/corpora/ on Abel and /proj/nlpl/data/corpora/ on Taito. As of early 2018, the NLPL corpus resources are not yet replicated across the two systems; please see the summary table below to determine which resource is available on which system.
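As a minimal sketch of how a script might locate a corpus on either system, the following Python snippet maps a system name to the project directory given above and appends a corpus directory name. The function names (`corpus_dir`, `available_systems`) and the lowercase system keys are illustrative assumptions, not part of any NLPL tooling.

```python
import os
from pathlib import PurePosixPath

# Corpus roots as stated in the text above; the lowercase keys are
# illustrative labels chosen for this sketch.
CORPORA_ROOTS = {
    "abel": PurePosixPath("/projects/nlpl/data/corpora"),
    "taito": PurePosixPath("/proj/nlpl/data/corpora"),
}


def corpus_dir(system: str, corpus: str) -> PurePosixPath:
    """Return the expected directory of a corpus on the given system."""
    try:
        root = CORPORA_ROOTS[system.lower()]
    except KeyError:
        raise ValueError(f"unknown system: {system!r}") from None
    return root / corpus


def available_systems(corpus: str) -> list:
    """List the systems whose corpus directory exists on this machine."""
    return [
        name
        for name in CORPORA_ROOTS
        if os.path.isdir(str(corpus_dir(name, corpus)))
    ]


if __name__ == "__main__":
    print(corpus_dir("taito", "conll17"))
    print(available_systems("conll17"))
```

Because the resources are not replicated, checking for the directory before reading from it avoids assuming that a corpus installed on one system is also present on the other.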
Corpus Catalogue
Directory | Description | System | Install Date | Maintainer |
---|---|---|---|---|
conll17 | | Taito | November 2017 | Jenna Kaverna |
Many Languages: The Collection of Open Parallel Corpora (Helsinki)
OPUS provides parallel texts in many languages extracted from a broad range of freely available sources. In terms of the NLPL-internal project structure, OPUS is a task of its own (because additional on-line services are connected to it); hence there is a separate wiki page with additional information on OPUS.