<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.nlpl.eu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Figint</id>
	<title>Nordic Language Processing Laboratory - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.nlpl.eu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Figint"/>
	<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/Special:Contributions/Figint"/>
	<updated>2026-05-16T06:23:02Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.31.10</generator>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Home&amp;diff=626</id>
		<title>Home</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Home&amp;diff=626"/>
		<updated>2019-02-04T16:13:13Z</updated>

		<summary type="html">&lt;p&gt;Figint: /* Resources */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The Nordic Language Processing Laboratory (NLPL) is a collaboration of&lt;br /&gt;
university research groups in Natural Language Processing (NLP) in Northern Europe.&lt;br /&gt;
Our vision is to implement a virtual laboratory for large-scale NLP research by&lt;br /&gt;
(a) implementing a common software, data, and service stack in multiple Nordic HPC centres to enable data- and compute-intensive Natural Language Processing research,&lt;br /&gt;
(b) pooling competencies within the user community and among expert support teams,&lt;br /&gt;
and (c) enabling internationally competitive, data-intensive research and experimentation&lt;br /&gt;
on a scale that would be difficult to sustain on commodity computing resources.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Activities =&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
[[File:neic.png|center]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As part of its ‘virtual laboratory’, NLPL prepares and maintains&lt;br /&gt;
[http://wiki.nlpl.eu/index.php/Infrastructure/software/catalogue software] and data infrastructures for&lt;br /&gt;
(A) [http://wiki.nlpl.eu/index.php/Infrastructure/home Collaboration and Software Management];&lt;br /&gt;
(B) [http://wiki.nlpl.eu/index.php/Translation/home Statistical and Neural Machine Translation];&lt;br /&gt;
(C) [http://wiki.nlpl.eu/index.php/Parsing/home Data-Driven Dependency Parsing];&lt;br /&gt;
(D) [http://wiki.nlpl.eu/index.php/Corpora/home Very Large Corpora];&lt;br /&gt;
(E) [http://wiki.nlpl.eu/index.php/Vectors/home Pre-Trained Word Embeddings];&lt;br /&gt;
(F) [http://wiki.nlpl.eu/index.php/Evaluation/home Automated Extrinsic Evaluation];&lt;br /&gt;
(G) [http://wiki.nlpl.eu/index.php/Corpora/OPUS Parallel Corpora and OPUS]; and&lt;br /&gt;
(H) [http://wiki.nlpl.eu/index.php/Community/home Community Formation and Outreach].&lt;br /&gt;
Please see the [http://wiki.nlpl.eu/index.php/Infrastructure/software/catalogue catalogue of available software]&lt;br /&gt;
and the links above for information on how to gain access to and use the NLPL virtual laboratory.&lt;br /&gt;
&lt;br /&gt;
= Resources =&lt;br /&gt;
&lt;br /&gt;
Since mid-2017, NLPL has been making some of its resources and services available to the public:&lt;br /&gt;
&lt;br /&gt;
* [http://hdl.handle.net/11234/1-1989 90 billion tokens of ‘raw’ text] extracted from web data, covering the 45 languages in the 2017 UD Parsing Shared Task; [http://wiki.nlpl.eu/index.php/Corpora/home see here]&lt;br /&gt;
* The [http://corpora.nlpl.eu/engc3/ EngC3] corpus of some [http://corpora.nlpl.eu/engc3/ 130 billion tokens of clean English text] extracted from the Common Crawl;&lt;br /&gt;
* The [http://epe.nlpl.eu Extrinsic Parser Evaluation 2017] (EPE) Shared Task at the DepLing and IWPT 2017 conferences;&lt;br /&gt;
* A [http://vectors.nlpl.eu/repository repository of pre-trained word embeddings] on very large corpora and [http://vectors.nlpl.eu/explore on-line explorer] for these models;&lt;br /&gt;
* The [http://opus.nlpl.eu Open Parallel Corpus] (OPUS; now maintained as a dedicated service under the NLPL umbrella);&lt;br /&gt;
* An annual [http://wiki.nlpl.eu/index.php/Community/training winter school series] on machine learning and scientific programming for NLP research.&lt;br /&gt;
&lt;br /&gt;
= Partners =&lt;br /&gt;
&lt;br /&gt;
The NLPL consortium comprises Nordic research groups in NLP and&lt;br /&gt;
the national e-infrastructure providers of Finland and Norway:&lt;br /&gt;
Helsinki University (Finland), IT University Copenhagen (Denmark),&lt;br /&gt;
University of Copenhagen (Denmark), University of Oslo (Norway),&lt;br /&gt;
Turku University (Finland), and Uppsala University (Sweden) are the&lt;br /&gt;
academic partners.&lt;br /&gt;
&lt;br /&gt;
Between 2017 and 2020, NLPL is supported by the [https://neic.nordforsk.org/ Nordic e-Infrastructure Collaboration]&lt;br /&gt;
(NeIC) and the national e-Infrastructure providers in Finland ([http://www.csc.fi CSC]) and Norway ([https://www.sigma2.no/ Sigma2]).&lt;br /&gt;
&lt;br /&gt;
= Associates =&lt;br /&gt;
&lt;br /&gt;
NLPL welcomes the involvement of additional Language Technology research groups in the Nordic and Baltic regions in the use of its virtual laboratory. The project has established an associate program through which users can gain access to NLPL resources.&lt;br /&gt;
Please email the contact address below to ask for access.&lt;br /&gt;
As part of your initial contact, please provide an indication of the&lt;br /&gt;
expected types of computing, software, and data to be used and the&lt;br /&gt;
anticipated group of users (including details on affiliation).&lt;br /&gt;
&lt;br /&gt;
As of October 2018, the following research groups are NLPL associates:&lt;br /&gt;
&lt;br /&gt;
* [https://clasp.gu.se/ Center for Linguistic Theory and Studies of Probability] at Gothenburg University (Sweden)&lt;br /&gt;
* [https://www.ling.su.se/english/nlp Section for Computational Linguistics] at Stockholm University (Sweden)&lt;br /&gt;
* [https://nlp.cs.ut.ee/ Natural Language Processing Research Group] at the University of Tartu (Estonia)&lt;br /&gt;
&lt;br /&gt;
= Contact =&lt;br /&gt;
&lt;br /&gt;
To email NLPL project management and its Steering Group, please use the address &amp;lt;code&amp;gt;contact&amp;lt;/code&amp;gt;&amp;lt;code&amp;gt;@&amp;lt;/code&amp;gt;&amp;lt;code&amp;gt;nlpl.eu&amp;lt;/code&amp;gt;.&lt;br /&gt;
Since mid-2017, the project has welcomed expressions of interest from additional NLP research groups in Northern Europe.&lt;br /&gt;
&lt;br /&gt;
For additional background and the archive of official project documents (including the work plan and Steering Group minutes), please&lt;br /&gt;
see the [https://wiki.neic.no/wiki/Nordic_language_processing_laboratory NLPL page on the NeIC wiki].&lt;/div&gt;</summary>
		<author><name>Figint</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Corpora/home&amp;diff=625</id>
		<title>Corpora/home</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Corpora/home&amp;diff=625"/>
		<updated>2019-02-04T16:06:04Z</updated>

		<summary type="html">&lt;p&gt;Figint: /* Many Languages: The CoNLL 2017 Text Collection (Turku) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Background =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
NLPL creates and makes available various very large collections of textual data, for example drawing on Wikipedia and the Common Crawl.&lt;br /&gt;
&lt;br /&gt;
These data resources are available from the connected infrastructure, below the project directory&lt;br /&gt;
&amp;lt;tt&amp;gt;/projects/nlpl/data/corpora/&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;/proj/nlpl/data/corpora/&amp;lt;/tt&amp;gt; on Abel and Taito, respectively.&lt;br /&gt;
In early 2018, the NLPL corpora resources are not yet replicated across the two systems; please see&lt;br /&gt;
the summary table below to determine which resource is available on which system.&lt;br /&gt;
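For example, a login-shell session might select the correct corpora root as follows (a sketch: the hostname patterns and the guarded listing are assumptions; the two paths are the ones given above):&lt;br /&gt;

```shell
# Pick the NLPL corpora root for the current system (hostname patterns
# are assumptions; the two paths come from the documentation above).
case "$(hostname)" in
  *abel*)  NLPL_CORPORA=/projects/nlpl/data/corpora ;;
  *taito*) NLPL_CORPORA=/proj/nlpl/data/corpora ;;
  *)       NLPL_CORPORA=/projects/nlpl/data/corpora ;;  # default guess
esac
echo "corpora root: $NLPL_CORPORA"
# List the available corpora, e.g. conll17/ and EngC3/; guarded, since
# the directory only exists on the clusters themselves.
ls "$NLPL_CORPORA" 2>/dev/null || true
```

On either cluster the listing should show the directories named in the catalogue table below.&lt;br /&gt;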
&lt;br /&gt;
= Corpus Catalogue =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Directory !! Description !! System !! Install Date !! Maintainer&lt;br /&gt;
|-&lt;br /&gt;
| conll17 || [http://hdl.handle.net/11234/1-1989 Corpora for 46 Languages (from Wikipedia and Common Crawl)] || Abel, Taito || November 2017 || Jenna Kanerva&lt;br /&gt;
|-&lt;br /&gt;
| EngC3 || [http://urn.nb.no/URN:NBN:no-60569 130 Billion Tokens of ‘Clean’ Text from the Common Crawl] || Abel, Taito || April 2018 || Stephan Oepen&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Many Languages: The Collection of Open Parallel Corpora (Helsinki) =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Corpora/OPUS|OPUS]] provides parallel texts in many languages extracted from&lt;br /&gt;
a broad range of freely available sources.&lt;br /&gt;
In terms of the NLPL-internal&lt;br /&gt;
project structure, it is actually a task of its own (because there are&lt;br /&gt;
additional on-line services connected to OPUS), hence there is a&lt;br /&gt;
[[Corpora/OPUS|separate wiki page]] with additional information on OPUS.&lt;br /&gt;
&lt;br /&gt;
= Many Languages: The CoNLL 2017 Text Collection (Turku) =&lt;br /&gt;
&lt;br /&gt;
The collection consists of 90B words in 45 languages gathered from Common Crawl and Wikipedia dumps. The data ranges from around 9B words for English down to 28K words of Old Church Slavonic. A number of filtering and deduplication steps have been applied to arrive at relatively clean extracted text. The texts are segmented and fully parsed with the UDPipe v1.1 parser. The processing pipeline is described in greater detail in Section 2.2 of this paper: [http://universaldependencies.org/conll17/proceedings/pdf/K17-3001.pdf]. This collection was used as supporting data in the 2017 and 2018 CoNLL shared tasks. In addition, some of the word2vec and ELMo embeddings distributed within NLPL are based on this data.&lt;br /&gt;
&lt;br /&gt;
The raw, unprocessed version of the crawls will be available (at least) on Taito and re-parsing this data with one of the top-ranking parsers in the 2018 Shared Task is under consideration.&lt;br /&gt;
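Since the parsed release is in CoNLL-U format (the output format of UDPipe), a few lines of Python suffice to iterate over its sentences. This is a generic sketch of reading CoNLL-U data, not code shipped with the collection:&lt;br /&gt;

```python
# Generic sketch for iterating over sentences in CoNLL-U data, the
# format produced by the UDPipe pipeline described above.
def sentences(lines):
    """Yield one list of token rows per sentence from CoNLL-U lines."""
    sent = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                     # a blank line ends a sentence
            if sent:
                yield sent
                sent = []
        elif not line.startswith("#"):   # skip comment/metadata lines
            cols = line.split("\t")
            if cols[0].isdigit():        # plain tokens only: excludes
                sent.append(cols)        # ranges "1-2" and empty nodes "1.1"
    if sent:                             # flush a final unterminated sentence
        yield sent

example = (
    "# text = Hello world\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
    "2\tworld\tworld\tNOUN\t_\t_\t1\tdiscourse\t_\t_\n"
    "\n"
)
print([len(s) for s in sentences(example.splitlines(True))])  # → [2]
```

Counting words over a whole file follows the same pattern, reading the file line by line instead of the in-memory example.&lt;br /&gt;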
&lt;br /&gt;
= English: 130 Billion Words Extracted from the Common Crawl (Oslo) =&lt;br /&gt;
&lt;br /&gt;
= English: Two Variants of Text Extraction from Wikipedia (Oslo) =&lt;/div&gt;</summary>
		<author><name>Figint</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Corpora/home&amp;diff=624</id>
		<title>Corpora/home</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Corpora/home&amp;diff=624"/>
		<updated>2019-02-04T16:04:29Z</updated>

		<summary type="html">&lt;p&gt;Figint: /* Many Languages: The CoNLL 2017 Text Collection (Turku) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Background =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
NLPL creates and makes available various very large collections of textual data, for example drawing on Wikipedia and the Common Crawl.&lt;br /&gt;
&lt;br /&gt;
These data resources are available from the connected infrastructure, below the project directory&lt;br /&gt;
&amp;lt;tt&amp;gt;/projects/nlpl/data/corpora/&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;/proj/nlpl/data/corpora/&amp;lt;/tt&amp;gt; on Abel and Taito, respectively.&lt;br /&gt;
In early 2018, the NLPL corpora resources are not yet replicated across the two systems; please see&lt;br /&gt;
the summary table below to determine which resource is available on which system.&lt;br /&gt;
&lt;br /&gt;
= Corpus Catalogue =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Directory !! Description !! System !! Install Date !! Maintainer&lt;br /&gt;
|-&lt;br /&gt;
| conll17 || [http://hdl.handle.net/11234/1-1989 Corpora for 46 Languages (from Wikipedia and Common Crawl)] || Abel, Taito || November 2017 || Jenna Kanerva&lt;br /&gt;
|-&lt;br /&gt;
| EngC3 || [http://urn.nb.no/URN:NBN:no-60569 130 Billion Tokens of ‘Clean’ Text from the Common Crawl] || Abel, Taito || April 2018 || Stephan Oepen&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Many Languages: The Collection of Open Parallel Corpora (Helsinki) =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Corpora/OPUS|OPUS]] provides parallel texts in many languages extracted from&lt;br /&gt;
a broad range of freely available sources.&lt;br /&gt;
In terms of the NLPL-internal&lt;br /&gt;
project structure, it is actually a task of its own (because there are&lt;br /&gt;
additional on-line services connected to OPUS), hence there is a&lt;br /&gt;
[[Corpora/OPUS|separate wiki page]] with additional information on OPUS.&lt;br /&gt;
&lt;br /&gt;
= Many Languages: The CoNLL 2017 Text Collection (Turku) =&lt;br /&gt;
&lt;br /&gt;
The collection consists of 90B words in 45 languages gathered from Common Crawl and Wikipedia dumps. The data ranges from around 9B words for English down to 28K words of Old Church Slavonic. A number of filtering and deduplication steps have been applied to arrive at relatively clean extracted text. The texts are segmented and fully parsed with the UDPipe v1.1 parser. The processing pipeline is described in greater detail in Section 2.2 of this paper: [http://universaldependencies.org/conll17/proceedings/pdf/K17-3001.pdf]. This collection was used as supporting data in the 2017 and 2018 CoNLL shared tasks. In addition, some of the word2vec and ELMo embeddings distributed within NLPL are based on this data.&lt;br /&gt;
&lt;br /&gt;
= English: 130 Billion Words Extracted from the Common Crawl (Oslo) =&lt;br /&gt;
&lt;br /&gt;
= English: Two Variants of Text Extraction from Wikipedia (Oslo) =&lt;/div&gt;</summary>
		<author><name>Figint</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Corpora/home&amp;diff=623</id>
		<title>Corpora/home</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Corpora/home&amp;diff=623"/>
		<updated>2019-02-04T15:59:11Z</updated>

		<summary type="html">&lt;p&gt;Figint: /* Many Languages: The CoNLL 2017 Text Collection (Turku) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Background =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
NLPL creates and makes available various very large collections of textual data, for example drawing on Wikipedia and the Common Crawl.&lt;br /&gt;
&lt;br /&gt;
These data resources are available from the connected infrastructure, below the project directory&lt;br /&gt;
&amp;lt;tt&amp;gt;/projects/nlpl/data/corpora/&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;/proj/nlpl/data/corpora/&amp;lt;/tt&amp;gt; on Abel and Taito, respectively.&lt;br /&gt;
In early 2018, the NLPL corpora resources are not yet replicated across the two systems; please see&lt;br /&gt;
the summary table below to determine which resource is available on which system.&lt;br /&gt;
&lt;br /&gt;
= Corpus Catalogue =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Directory !! Description !! System !! Install Date !! Maintainer&lt;br /&gt;
|-&lt;br /&gt;
| conll17 || [http://hdl.handle.net/11234/1-1989 Corpora for 46 Languages (from Wikipedia and Common Crawl)] || Abel, Taito || November 2017 || Jenna Kanerva&lt;br /&gt;
|-&lt;br /&gt;
| EngC3 || [http://urn.nb.no/URN:NBN:no-60569 130 Billion Tokens of ‘Clean’ Text from the Common Crawl] || Abel, Taito || April 2018 || Stephan Oepen&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Many Languages: The Collection of Open Parallel Corpora (Helsinki) =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Corpora/OPUS|OPUS]] provides parallel texts in many languages extracted from&lt;br /&gt;
a broad range of freely available sources.&lt;br /&gt;
In terms of the NLPL-internal&lt;br /&gt;
project structure, it is actually a task of its own (because there are&lt;br /&gt;
additional on-line services connected to OPUS), hence there is a&lt;br /&gt;
[[Corpora/OPUS|separate wiki page]] with additional information on OPUS.&lt;br /&gt;
&lt;br /&gt;
= Many Languages: The CoNLL 2017 Text Collection (Turku) =&lt;br /&gt;
&lt;br /&gt;
The collection consists of 90B words in 45 languages gathered from Common Crawl and Wikipedia dumps. The data ranges from around 9B words for English down to 28K words of Old Church Slavonic. A number of filtering and deduplication steps have been applied to arrive at relatively clean extracted text. These are described in greater detail in Section 2.2 of this paper: [http://universaldependencies.org/conll17/proceedings/pdf/K17-3001.pdf].&lt;br /&gt;
&lt;br /&gt;
= English: 130 Billion Words Extracted from the Common Crawl (Oslo) =&lt;br /&gt;
&lt;br /&gt;
= English: Two Variants of Text Extraction from Wikipedia (Oslo) =&lt;/div&gt;</summary>
		<author><name>Figint</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Corpora/home&amp;diff=622</id>
		<title>Corpora/home</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Corpora/home&amp;diff=622"/>
		<updated>2019-02-04T15:56:25Z</updated>

		<summary type="html">&lt;p&gt;Figint: /* Many Languages: The CoNLL 2017 Text Collection (Turku) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Background =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
NLPL creates and makes available various very large collections of textual data, for example drawing on Wikipedia and the Common Crawl.&lt;br /&gt;
&lt;br /&gt;
These data resources are available from the connected infrastructure, below the project directory&lt;br /&gt;
&amp;lt;tt&amp;gt;/projects/nlpl/data/corpora/&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;/proj/nlpl/data/corpora/&amp;lt;/tt&amp;gt; on Abel and Taito, respectively.&lt;br /&gt;
In early 2018, the NLPL corpora resources are not yet replicated across the two systems; please see&lt;br /&gt;
the summary table below to determine which resource is available on which system.&lt;br /&gt;
&lt;br /&gt;
= Corpus Catalogue =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Directory !! Description !! System !! Install Date !! Maintainer&lt;br /&gt;
|-&lt;br /&gt;
| conll17 || [http://hdl.handle.net/11234/1-1989 Corpora for 46 Languages (from Wikipedia and Common Crawl)] || Abel, Taito || November 2017 || Jenna Kanerva&lt;br /&gt;
|-&lt;br /&gt;
| EngC3 || [http://urn.nb.no/URN:NBN:no-60569 130 Billion Tokens of ‘Clean’ Text from the Common Crawl] || Abel, Taito || April 2018 || Stephan Oepen&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Many Languages: The Collection of Open Parallel Corpora (Helsinki) =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Corpora/OPUS|OPUS]] provides parallel texts in many languages extracted from&lt;br /&gt;
a broad range of freely available sources.&lt;br /&gt;
In terms of the NLPL-internal&lt;br /&gt;
project structure, it is actually a task of its own (because there are&lt;br /&gt;
additional on-line services connected to OPUS), hence there is a&lt;br /&gt;
[[Corpora/OPUS|separate wiki page]] with additional information on OPUS.&lt;br /&gt;
&lt;br /&gt;
= Many Languages: The CoNLL 2017 Text Collection (Turku) =&lt;br /&gt;
&lt;br /&gt;
The collection consists of 90B words in 45 languages gathered from CommonCrawl and Wikipedia dumps. The data ranges from around 9B words for English to 28K words of Old Church Slavonic.&lt;br /&gt;
&lt;br /&gt;
= English: 130 Billion Words Extracted from the Common Crawl (Oslo) =&lt;br /&gt;
&lt;br /&gt;
= English: Two Variants of Text Extraction from Wikipedia (Oslo) =&lt;/div&gt;</summary>
		<author><name>Figint</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Corpora/home&amp;diff=621</id>
		<title>Corpora/home</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Corpora/home&amp;diff=621"/>
		<updated>2019-02-04T15:55:02Z</updated>

		<summary type="html">&lt;p&gt;Figint: /* Many Languages: The CoNLL 2017 Text Collection (Turku) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Background =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
NLPL creates and makes available various very large collections of textual data, for example drawing on Wikipedia and the Common Crawl.&lt;br /&gt;
&lt;br /&gt;
These data resources are available from the connected infrastructure, below the project directory&lt;br /&gt;
&amp;lt;tt&amp;gt;/projects/nlpl/data/corpora/&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;/proj/nlpl/data/corpora/&amp;lt;/tt&amp;gt; on Abel and Taito, respectively.&lt;br /&gt;
In early 2018, the NLPL corpora resources are not yet replicated across the two systems; please see&lt;br /&gt;
the summary table below to determine which resource is available on which system.&lt;br /&gt;
&lt;br /&gt;
= Corpus Catalogue =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Directory !! Description !! System !! Install Date !! Maintainer&lt;br /&gt;
|-&lt;br /&gt;
| conll17 || [http://hdl.handle.net/11234/1-1989 Corpora for 46 Languages (from Wikipedia and Common Crawl)] || Abel, Taito || November 2017 || Jenna Kanerva&lt;br /&gt;
|-&lt;br /&gt;
| EngC3 || [http://urn.nb.no/URN:NBN:no-60569 130 Billion Tokens of ‘Clean’ Text from the Common Crawl] || Abel, Taito || April 2018 || Stephan Oepen&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Many Languages: The Collection of Open Parallel Corpora (Helsinki) =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Corpora/OPUS|OPUS]] provides parallel texts in many languages extracted from&lt;br /&gt;
a broad range of freely available sources.&lt;br /&gt;
In terms of the NLPL-internal&lt;br /&gt;
project structure, it is actually a task of its own (because there are&lt;br /&gt;
additional on-line services connected to OPUS), hence there is a&lt;br /&gt;
[[Corpora/OPUS|separate wiki page]] with additional information on OPUS.&lt;br /&gt;
&lt;br /&gt;
= Many Languages: The CoNLL 2017 Text Collection (Turku) =&lt;br /&gt;
&lt;br /&gt;
The collection consists of 90B words in 45 languages.&lt;br /&gt;
&lt;br /&gt;
= English: 130 Billion Words Extracted from the Common Crawl (Oslo) =&lt;br /&gt;
&lt;br /&gt;
= English: Two Variants of Text Extraction from Wikipedia (Oslo) =&lt;/div&gt;</summary>
		<author><name>Figint</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Corpora/home&amp;diff=620</id>
		<title>Corpora/home</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Corpora/home&amp;diff=620"/>
		<updated>2019-02-04T15:53:39Z</updated>

		<summary type="html">&lt;p&gt;Figint: /* Many Languages: The CoNLL 2017 Text Collection (Turku) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Background =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
NLPL creates and makes available various very large collections of textual data, for example drawing on Wikipedia and the Common Crawl.&lt;br /&gt;
&lt;br /&gt;
These data resources are available from the connected infrastructure, below the project directory&lt;br /&gt;
&amp;lt;tt&amp;gt;/projects/nlpl/data/corpora/&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;/proj/nlpl/data/corpora/&amp;lt;/tt&amp;gt; on Abel and Taito, respectively.&lt;br /&gt;
In early 2018, the NLPL corpora resources are not yet replicated across the two systems; please see&lt;br /&gt;
the summary table below to determine which resource is available on which system.&lt;br /&gt;
&lt;br /&gt;
= Corpus Catalogue =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Directory !! Description !! System !! Install Date !! Maintainer&lt;br /&gt;
|-&lt;br /&gt;
| conll17 || [http://hdl.handle.net/11234/1-1989 Corpora for 46 Languages (from Wikipedia and Common Crawl)] || Abel, Taito || November 2017 || Jenna Kanerva&lt;br /&gt;
|-&lt;br /&gt;
| EngC3 || [http://urn.nb.no/URN:NBN:no-60569 130 Billion Tokens of ‘Clean’ Text from the Common Crawl] || Abel, Taito || April 2018 || Stephan Oepen&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Many Languages: The Collection of Open Parallel Corpora (Helsinki) =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Corpora/OPUS|OPUS]] provides parallel texts in many languages extracted from&lt;br /&gt;
a broad range of freely available sources.&lt;br /&gt;
In terms of the NLPL-internal&lt;br /&gt;
project structure, it is actually a task of its own (because there are&lt;br /&gt;
additional on-line services connected to OPUS), hence there is a&lt;br /&gt;
[[Corpora/OPUS|separate wiki page]] with additional information on OPUS.&lt;br /&gt;
&lt;br /&gt;
= Many Languages: The CoNLL 2017 Text Collection (Turku) =&lt;br /&gt;
&lt;br /&gt;
xxx&lt;br /&gt;
&lt;br /&gt;
= English: 130 Billion Words Extracted from the Common Crawl (Oslo) =&lt;br /&gt;
&lt;br /&gt;
= English: Two Variants of Text Extraction from Wikipedia (Oslo) =&lt;/div&gt;</summary>
		<author><name>Figint</name></author>
		
	</entry>
</feed>