Difference between revisions of "Corpora/OPUS"
Latest revision as of 12:52, 2 March 2020
http://opus.nlpl.eu
OPUS is a collection of open parallel corpora in many languages. It provides bilingually aligned data sets, interfaces, tools and more. The data sets are available in various common formats and are provided for download and for use within the NLPL infrastructure. The service is hosted at CSC in Finland and the core of the data is also available from sigma2.
For instructions on how to access the data and use the tools, check:
- Information for NLPL Users
More detailed information can be found on the OPUS Wiki (http://opus.nlpl.eu/trac):
- Information about the OPUS API for finding resources
- Information about data formats
- Information about tools
- Information about on-line interfaces
- Information about word alignment and the alignment lexicon
The on-line search interface is available from http://opus.nlpl.eu/bin/opuscqp.pl and the word-alignment-based lexicon is accessible from http://opus.nlpl.eu/lex.php
Contact: Jörg Tiedemann via e-mail - firstname.lastname at helsinki.fi (first name without dots)
NLPL Users
The OPUS corpus is now hosted at CSC (https://www.csc.fi/), the national scientific infrastructure provider of Finland, and the resources are directly available to users of CSC services. The OPUS server runs in that environment, but the data sets and tools are also directly available from the puhti shell. The core data is also available on the Norwegian cluster saga, provided by sigma2 (https://www.sigma2.no/).
If you have access to those systems, you can read the data directly from the file system:
    on puhti: /projappl/nlpl/data/OPUS/
    on saga:  /projects/nlpl/data/OPUS/ (only raw XML data)
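Because the data lives under a different path on each cluster, a script that should run on both may want to detect the location first. The following is a minimal sketch, assuming a bash shell; the function name find_opus_dir is hypothetical and not part of the NLPL tools, and only the two default paths come from the text above.

```shell
#!/usr/bin/env bash
# Sketch: print the first OPUS data directory that exists on this machine.
# find_opus_dir is a hypothetical helper, not an NLPL or OPUS command.
find_opus_dir() {
    local dir
    # Use the caller's candidate directories if given,
    # otherwise fall back to the two documented locations.
    if [ "$#" -eq 0 ]; then
        set -- /projappl/nlpl/data/OPUS /projects/nlpl/data/OPUS
    fi
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    echo "no OPUS data directory found" >&2
    return 1
}
```

A typical call would be `OPUS_DIR=$(find_opus_dir) && ls "$OPUS_DIR"`, which lists the available corpora on whichever cluster you are logged in to.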
On both systems, you can also use tools that are packaged for working with the data (and for other NLPL-related activities). The basic tools for working with OPUS data are loaded with the module nlpl-opus:
- Activate the NLPL module repository:
    module use -a /projappl/nlpl/software/modules/etc        # Puhti
    module use -a /cluster/shared/nlpl/software/modules/etc  # Saga
- Load the OPUS module:
module load nlpl-opus
With this, you will have access to essential tools that make it easier to read and process the data sets.
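The two steps above can be combined into one helper for a login script. This is a sketch only, assuming bash and an environment-modules setup that defines the `module` command; load_nlpl_opus is a hypothetical name, and the repository paths are the ones listed above.

```shell
#!/usr/bin/env bash
# Sketch: activate the NLPL module repository that exists on this
# cluster, then load nlpl-opus. load_nlpl_opus is a hypothetical helper.
load_nlpl_opus() {
    local root
    # Default to the documented Puhti and Saga module repositories.
    if [ "$#" -eq 0 ]; then
        set -- /projappl/nlpl/software/modules/etc \
               /cluster/shared/nlpl/software/modules/etc
    fi
    # 'module' is normally a shell function set up by the cluster.
    if ! command -v module >/dev/null 2>&1; then
        echo "the 'module' command is not available in this shell" >&2
        return 1
    fi
    for root in "$@"; do
        if [ -d "$root" ]; then
            module use -a "$root" && module load nlpl-opus
            return
        fi
    done
    echo "no NLPL module repository found" >&2
    return 1
}
```

Calling `load_nlpl_opus` without arguments tries the Puhti path first and the Saga path second; the function returns a non-zero status if neither directory exists or if `module` is not defined.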