Nordic Language Processing Laboratory - User contributions [en]

Parsing/home

2020-01-15T07:18:55Z

Sara: /* Parsing Systems */

= Background =

An experimentation environment for data-driven dependency parsing is maintained for NLPL under the coordination of Uppsala University (UU).
The data is available on the Norwegian Saga cluster and on the Finnish Puhti cluster.
The software is available on the Norwegian Saga cluster

Initially, software and data were commissioned on the Norwegian Abel supercluster, see [http://wiki.nlpl.eu/index.php/Parsing/abel The Abel page] for legacy information.

= Preprocessing Tools =

* [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe]

Additionally, a variety of tools for sentence splitting, tokenization, lemmatization, et al.
are available through the NLPL installations of the
[http://nltk.org Natural Language Processing Toolkit (NLTK)] and the
[https://spacy.io spaCy: Natural Language Processing in Python] tools.

= Parsing Systems =

* [http://wiki.nlpl.eu/index.php/Parsing/uuparser The Uppsala Parser]
* [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe]
* [http://wiki.nlpl.eu/index.php/Parsing/turboparser TurboParser]

Additionallly, parsers are available in several toolkits installed by nlpl: [http://wiki.nlpl.eu/index.php/Parsing/stanfordnlp StanfordNLP], [https://www.nltk.org/ NLTK], [https://spacy.io/ spaCy].

= Training and Evaluation Data =

* [http://wiki.nlpl.eu/index.php/Parsing/ud Universal Dependencies v2.0–2.5]
* [http://wiki.nlpl.eu/index.php/Parsing/sdp Semantic Dependency Parsing]

Parsing/home

2020-01-15T07:17:50Z

Sara: /* Parsing Systems */

= Background =

An experimentation environment for data-driven dependency parsing is maintained for NLPL under the coordination of Uppsala University (UU).
The data is available on the Norwegian Saga cluster and on the Finnish Puhti cluster.
The software is available on the Norwegian Saga cluster

Initially, software and data were commissioned on the Norwegian Abel supercluster, see [http://wiki.nlpl.eu/index.php/Parsing/abel The Abel page] for legacy information.

= Preprocessing Tools =

* [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe]

Additionally, a variety of tools for sentence splitting, tokenization, lemmatization, et al.
are available through the NLPL installations of the
[http://nltk.org Natural Language Processing Toolkit (NLTK)] and the
[https://spacy.io spaCy: Natural Language Processing in Python] tools.

= Parsing Systems =

* [http://wiki.nlpl.eu/index.php/Parsing/uuparser The Uppsala Parser]
* [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe]
* [http://wiki.nlpl.eu/index.php/Parsing/turboparser TurboParser]
* Additional parsers: [http://wiki.nlpl.eu/index.php/Parsing/stanfordnlp StanfordNLP], [https://www.nltk.org/ NLTK], [https://spacy.io/ spaCy]. For the parsers in these toolkits we refer to the official documentation.

= Training and Evaluation Data =

* [http://wiki.nlpl.eu/index.php/Parsing/ud Universal Dependencies v2.0–2.5]
* [http://wiki.nlpl.eu/index.php/Parsing/sdp Semantic Dependency Parsing]

Infrastructure/software/catalogue

2020-01-14T21:34:50Z

Sara: /* Activity C: Data-Driven Parsing */

= Background =

This page provides a high-level summary of NLPL-specific software installed on either of our two systems.
As a rule of thumb, NLPL aims to build on generic software installations provided by the
system maintainers (e.g. development tools and libraries that are not discipline-specific),
using the [http://modules.sourceforge.net/ <tt>module</tt>s infrastructure].
For example, an environment like OpenNMT is unlikely to be used by other disciplines,
and NLPL stands to gain from in-house, shared expertise that comes with maintaining
a project-specific installation.
On the other hand, the CUDA libraries are general extensions to the operating system
that most users of deep learning frameworks on gpus will want to use; hence, CUDA is
most appropriately installed by the core system maintainers.
Frameworks like PyTorch and TensorFlow, arguably, present a middle ground to this
rule of thumb:
In principle, they are not discipline-specific, but in mid-2018 at least the demand for
installations of these frameworks is strong within NLPL, and the project will likely
benefit from growing its competencies in this area.

= Module Catalogue =

The discipline-specific modules maintained by NLPL are not activated by default.
To make available the NLPL community directory of software modules, on top of the
pre-configured, system-wide modules, one needs to execute the following
(on Abel, Puhti, or Taito):

<pre>
module use -a /proj*/nlpl/software/modules/etc
</pre>

For Saga, the NLPL community directory is in a different location:

<pre>
module use -a /cluster/shared/nlpl/software/modules/etc
</pre>

We will at times assume a shell variable <tt>$NLPLROOT</tt> that points to the
top-level project directory, i.e. <tt>/projects/nlpl/</tt> (on Abel),
<tt>/proj/nlpl/</tt> (on Taito),
<tt>/projappl/nlpl/</tt> (on Puhti), and
<tt>/cluster/shared/nlpl/</tt> (on Saga).

For NLPL users, we recommend that one adds the above <tt>module use</tt> command
to the shell start-up script, e.g. <tt>.bashrc</tt> in the user home directory.

To inspect what is available, one can use the <tt>avail</tt> sub-command
(on Abel), e.g.
<pre>
module avail 2>&1 | grep nlpl
</pre>

= User-Installed Software =

Even if NLPL strives to make available a comprehensive set of ready-to-run sofware modules,
users will at times want to install their own add-on components.
For Python add-on components, some
[http://wiki.nlpl.eu/index.php/Infrastructure/software/user emerging instructions] are available.

= Activity A: Basic Infrastructure =

Interoperability of NLPL installations with each other, as well as with system-wide
software that is maintained by the core operations teams for Abel and Taito, is no
small challenge; neither is parallelism across the two systems, for example in
available software (and versions) and techniques for ‘mixing and matching’.
These challenges are discussed in some more detail with regard to the
[http://wiki.nlpl.eu/index.php/Infrastructure/software/python Python programming environment]
and with regard to
[http://wiki.nlpl.eu/index.php/Infrastructure/software/frameworks common Deep Learning frameworks].

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cupy/5.4.0 || Matrix Library Accelerated by CUDA || Abel (3.7) || May 2018 || Stephan Oepen
|-
| nlpl-cython/0.29.3 || C Extensions for Python || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-dynet/2.1 || DyNet Dynamic Neural Network Toolkit (CPU) || Abel (3.5, 3.7) || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/nltk nlpl-nltk/3.3] || Natural Language Toolkit (NLTK) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/0.4.1] || PyTorch Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.0.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.1.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/spacy nlpl-spacy/2.0.12] || spaCy: Natural Language Processing in Python || Abel, Taito || October 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/python nlpl-scipy/201901] || SciPy Ecosystem of Python Add-Ons || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/tensorflow nlpl-tensorflow/1.11] || TensorFlow Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|}

= Activity B: Statistical and Neural Machine Translation =

=== On Saga and Puhti ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Moses_module nlpl-moses/4.0-a89691f] || Moses SMT system, including GIZA++, MGIZA, fast_align || Puhti, Saga || December 2019 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Efmaral_module nlpl-efmaral/0.1_20191218] || efmaral and eflomal word alignment tools || Puhti, Saga || December 2019 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/mttools nlpl-mttools/20191218] || A collection of preprocessing and evaluation scripts for machine translation || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-opennmt-py/1.0.0rc2/3.7 || OpenNMT Python Library || Saga || October 2019 || Stephan Oepen
|-
| nlpl-opennmt-py/1.0.0 || OpenNMT Python Library || Puhti || December 2019 || Yves Scherrer
|-
| nlpl-marian-nmt/1.8.0-eba7aed || Marian neural machine translation system || Puhti, Saga || December 2019 || Jörg Tiedemann
|-
|}

=== On Abel and Taito ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb] || Moses SMT system, including GIZA++, MGIZA, fast_align || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/4.0-65c75ff] || Moses SMT System Release 4.0, including GIZA++, MGIZA, fast_align, SALM Some minor fixes added to existing install 2/2018. Should not break compatibility except when using tokenizer.perl for Finnish or Swedish. || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_07_20] || efmaral and eflomal word alignment tools || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_11_24] || efmaral and eflomal word alignment tools || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2018_12_13/17] || efmaral and eflomal word alignment tools || Taito, Abel || December 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_HNMT_module nlpl-hnmt/1.0.1] || HNMT neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/opennmt-py nlpl-opennmt-py/0.2.1] || OpenNMT Python Library || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Marian_module nlpl-marian/1.2.0] || Marian neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| marian/1.5 || Marian neural machine translation system || Taito || June 2018 || CSC staff
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_mttools_module nlpl-mttools/2018_12_23] || A collection of preprocessing and evaluation scripts for machine translation || Taito, Abel || December 2018 || Yves Scherrer
|}

= Activity C: Data-Driven Parsing =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-corenlp/3.9.2 || Stanford CoreNLP Suite (Including All Models) || Abel || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/dozat nlpl-dozat/201812] || Stanford Graph-Based Parser by Tim Dozat (v3) || Abel || December 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/stanfordnlp nlpl-stanfordnlp/0.1.1] || Stanford NLP Neural Pipeline || Abel || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/stanfordnlp nlpl-stanfordnlp/0.2.0] || Stanford NLP Neural Pipeline || Saga || ? || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/uuparser nlpl-uuparser/2.3.1] || Uppsala Parser || Saga,Abel || December 2019 || Sara Stymne ||
|-
| [http://wiki.nlpl.eu/index.php/Parsing/turboparser nlpl-turboparser/2.3.0] || TurboParser || Saga|| January 2020 || Sara Stymne ||
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe/1.2.1-devel] || UDPipe 1.2 with Pre-Trained Models || Saga, Puhti,Taito, Abel || November 2017 || Jörg Tiedemann
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe_future/3.7] || UDPipe Future || Abel || June 2019 || Andrey Kutuzov
|-
| [http://wiki.nlpl.eu/index.php/Parsing/repp nlpl-repp/201812] || REPP Tokenizer (and Sentence Splitter) || Abel || December 2018 || Stephan Oepen
|}

= Activity E: Pre-Trained Word Embeddings =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-gensim/3.6.0 || Topic Modeling and Word Vectors Library || Taito, Abel || October 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.0 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.3 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || May 2018 || Stephan Oepen
|}

= Activity G: OPUS Parallel Corpus =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cwb/3.4.12 || Corpus Work Bench (CWB) || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.1 || Various OPUS Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.2 || Various OPUS Tools || Taito, Abel || 2018 || Jörg Tiedemann
|-
| nlpl-opus/201901 || Various OPUS Tools || Taito, Abel || January 2019 || Jörg Tiedemann
|-
| nlpl-uplug/0.3.8dev || UPlug Parallel Corpus Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|}

Parsing/home

2020-01-14T21:29:54Z

Sara: /* Parsing Systems */

= Background =

An experimentation environment for data-driven dependency parsing is maintained for NLPL under the coordination of Uppsala University (UU).
The data is available on the Norwegian Saga cluster and on the Finnish Puhti cluster.
The software is available on the Norwegian Saga cluster

Initially, software and data were commissioned on the Norwegian Abel supercluster, see [http://wiki.nlpl.eu/index.php/Parsing/abel The Abel page] for legacy information.

= Preprocessing Tools =

* [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe]

Additionally, a variety of tools for sentence splitting, tokenization, lemmatization, et al.
are available through the NLPL installations of the
[http://nltk.org Natural Language Processing Toolkit (NLTK)] and the
[https://spacy.io spaCy: Natural Language Processing in Python] tools.

= Parsing Systems =

* [http://wiki.nlpl.eu/index.php/Parsing/uuparser The Uppsala Parser]
* [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe]
* [http://wiki.nlpl.eu/index.php/Parsing/turboparser TurboParser]
* Additional parsers: StanfordNLP, NLTK, spaCy

= Training and Evaluation Data =

* [http://wiki.nlpl.eu/index.php/Parsing/ud Universal Dependencies v2.0–2.5]
* [http://wiki.nlpl.eu/index.php/Parsing/sdp Semantic Dependency Parsing]

Parsing/udpipe

2020-01-14T21:03:18Z

Sara: /* Running UDPipe */

= UDPipe =

UDPipe is an end-to-end system for morphosyntactic parsing in the UD framework developed by Milan Straka. It was used as the baseline system in the CoNLL shared tasks on universal dependency parsing in 2017 and 2018. This page only contains a brief introduction, for the full capability of UDPipe, please see the official [http://ufal.mff.cuni.cz/udpipe UDPipe Web page] including the [http://ufal.mff.cuni.cz/udpipe/users-manual UDPipe User's Manual] and

== Using UDPipe on Puhti and Saga ==

UDPipe is available as a module on Puhti and Saga. It was installed as part of the OPUS activity.

How to use UDPipe on Saga:
* Log into Saga
* Activate the NLPL module repository:
module use -a /custer/shared/nlpl/software/modules/etc
* Load the most recent version of the udpipe module:
module load nlpl-udpipe/1.2.1-devel

How to use UDPipe on Puhti:
* Log into Puhti
* Activate the NLPL module repository:
module use -a /projappl/nlpl/software/modules/etc
* Load the most recent version of the udpipe module:
module load nlpl-udpipe/1.2.1-devel

== Pre-trained models ==

There are pre-trained models available for all languages in Universal Dependencies 2.4:
/cluster/shared/nlpl/software/modules/udpipe/latest/models (Saga)
/projappl/nlpl/software/modules/udpipe/latest/models (Puhti)

== Running UDPipe ==

To run UDPipe on raw text (in TESTFILE) run the following command:
udpipe --tokenize --tag --parse MODEL_DIR/MODEL TESTFILE

where MODEL_DIR is specified above, and MODEL is the model for the language (treebank) in question, e.g. swedich_talbanken-ud-2.4-190531.udpipe, for Swedish, tained on Talbanken.

The above command performs segmentation, tokenization, POS-tagging and parsing. If only a subset of these tasks are needed, remove some of the flags. The output file is in CoNLLU format as default.

If you want to parse with uuparser, run the following command (--tag is optional, depending on if you want POS-tags). Then you can run uuparser on the resulting file, for parsing.
udpipe --tokenize --tag MODEL_DIR/MODEL TESTFILE

UDPipe can handle several input and output formats and other variations. It is also possible to train new models. See the official manual for more details.

Parsing/udpipe

2020-01-14T20:51:30Z

Sara:

= UDPipe =

UDPipe is an end-to-end system for morphosyntactic parsing in the UD framework developed by Milan Straka. It was used as the baseline system in the CoNLL shared tasks on universal dependency parsing in 2017 and 2018. This page only contains a brief introduction, for the full capability of UDPipe, please see the official [http://ufal.mff.cuni.cz/udpipe UDPipe Web page] including the [http://ufal.mff.cuni.cz/udpipe/users-manual UDPipe User's Manual] and

== Using UDPipe on Puhti and Saga ==

UDPipe is available as a module on Puhti and Saga. It was installed as part of the OPUS activity.

How to use UDPipe on Saga:
* Log into Saga
* Activate the NLPL module repository:
module use -a /custer/shared/nlpl/software/modules/etc
* Load the most recent version of the udpipe module:
module load nlpl-udpipe/1.2.1-devel

How to use UDPipe on Puhti:
* Log into Puhti
* Activate the NLPL module repository:
module use -a /projappl/nlpl/software/modules/etc
* Load the most recent version of the udpipe module:
module load nlpl-udpipe/1.2.1-devel

== Pre-trained models ==

There are pre-trained models available for all languages in Universal Dependencies 2.4:
/cluster/shared/nlpl/software/modules/udpipe/latest/models (Saga)
/projappl/nlpl/software/modules/udpipe/latest/models (Puhti)

== Running UDPipe ==

To run UDPipe on raw text (in TESTFILE) run the following command:
udpipe --tokenize --tag --parse MODEL_DIR/MODEL TESTFILE

where MODEL_DIR is specified above, and MODEL is the model for the language (treebank) in question, e.g. swedich_talbanken-ud-2.4-190531.udpipe, for Swedish, tained on Talbanken.

The above command performs segmentation, tokenization, POS-tagging and parsing. If only a subset of these tasks are needed, remove some of the flags. The output file is in CoNLLU format as default.

UDPipe can handle several input and output formats and other variations. It is also possible to train new models. See the official manual for more details.

Parsing/uuparser

2020-01-14T20:41:08Z

Sara:

= The Uppsala Parser =

The Uppsala Parser is a neural transition-based dependency parser based on bist-parser by Eli Kiperwasser and Yoav Goldberg and developed primarily in the context of the CoNLL shared tasks on universal dependency parsing in 2017 and 2018. The Uppsala Parser is publicly available at https://github.com/UppsalaNLP/uuparser.
Note that the version installed here may exhibit some slight differences, designed to improve ease of use.

For the full documentation, see the github page.

== Using the Uppsala Parser on Saga ==

* Log into Saga
* Activate the NLPL module repository:
module use -a /cluster/shared/nlpl/software/modules/etc
* Load the most recent version of the uuparser module:
module load nlpl-uuparser/2.3.1

== Training a parsing model ==

To train a set of parsing models on treebanks from Universal Dependencies (v2.2 or later):

uuparser --include [languages to include denoted by their treebank id] --outdir my-output-directory --datadir ud-treebank-dir

For example:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

will train separate models on UD Swedish-Talbanken, UD English-ParTUT and UD Russian-SynTagRus and store the results in the <code>experiments</code> folder in your home directory. Model selection is included in the training process by default; that is, at each epoch the current model is evaluated on the UD dev data, and at the end of training the best performing model for each language is selected.

== Predicting with a pre-trained parsing model ==

To predict on UD test data with the models trained above:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

To predict on other texts, prepare the data into the CoNLLU format (see below) and run the following command, assuming the model trained above, for the file TESTFILE:

uuparser --testfile TESTFILE --outdir ~/experiments --modeldir ~/experiments/sv_talbanken --predict

== Options ==

The parser has numerous options to allow you to fine-control its behaviour. For a full list, type

uuparser --help | less

We recommend you set the <code>--dynet-mem</code> option to a larger number when running the full training procedure on larger treebanks. Commonly used values are 5000 and 10000 (in MB). Dynet is the neural network library on which the parser is built.

Note that due to random initialization and other non-deterministic elements in the training process, you will not obtain the same results even when training twice under exactly the same circumstances (e.g. languages, number of epochs etc.). To ensure identical results between two runs, we recommend setting the <code>--dynet-seed</code> option to the same value both times (e.g. <code> --dynet-seed 123456789</code>) and adding the <code>--use-default-seed</code> flag. This ensures that Python's random number generator and Dynet both produce the same sequence of random numbers.

== Training a multitreebank parsing model ==

Our parser supports multitreebank parsing, that is a single parsing model for one or more treebanks, which could be from the same or form different languages. To train a ''multitreebank'' model for the three languages in the examples above, we simply add the <code>--multiling</code> flag when training (note, though, that these models have so far been found to work best for groups of related languages)

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --multiling --dynet-mem 5000 --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

In this case, instead of creating three separate models in the language-specific subdirectories within ~/experiments, a single model will be created directly in this folder. Predicting on UD test data is then:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

Note that if you want to have different output directories for training and predicting, the <code>--modeldir</code> option can be specified when predicting to tell the parser where the pre-trained model can be found.

If you want to use the parser for a language or treebank that was not among the treebanks that the parser was trained for, this can be done in two ways:

uuparser --include "sv_pud:sv_talbanken" --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

uuparser --include "sv_pud" --forced-tbank-emb sv_talbanken --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

Where in both cases, sv_pud will be parsed using sv_talbanken as a proxy (i.e. parsing sv_pud as if it is is sv_talbanken). With the first option, multiple treebanks and proxies can be defined separated by space.

To predict on non-UD texts, the command is similar to the single treebank case, the mian difference being the use of the --multiling flag, and specifying a proxy treebank:

uuparser --testfile TESTFILE --outdir ~/experiments --modeldir MODELDIR --predict --multiling --forced-tbank-emb sv_talbanken

This commands runs the parser on TESTFILE (in CoNLLU format), using a model found in MODELDIR, and prints the resulting parse in a file in ~/experiments. In the example sv_talbanken is used as the proxy treebank.

== Pre-trained models ==

There are three multilingual pre-trained models available, covering English and most of the Nordic languages. The models can be found in /cluster/shared/nlpl/software/modules/uuparser/2.3.1/models ($MODEL_DIR)

The English model is found in the subdirectory $MODEL_DIR/en. It is trained on the four English UD treebanks: en_gum en_partut en_ewt en_lines. For English web texts it is recommended to use en_ewt as a proxy. For more formal texts, en_partut typically works well.

The Scandinavian model is found in the subdirectory $MODEL_DIR/scandinavian. It is trained on six treebanks in Danish, Norwegian and Swedish: sv_talbanken sv_lines no_bokmaal no_nynorsk no_nynorsklia da_ddt. For Swedish, using sv_talbanken typically gives the best results, sv_lines might be better for fiction. For Norwegian, use no_bokmaal for Bokmål, no_nynorsk for general Nynorsk and no_nynorsklia for spoken Nynorsk. We have also had reasonable results when parsing Faroese with no_nynorsk as proxy.

The Uralic model is found in the subdirectory $MODEL_DIR/uralic. It is trained on five treebanks in Estonian, Finnish and North Sàmi: fi_ftb fi_tdt et_edt et_ewt sme_giella. For Finnish we recommend using fi_tdt as a proxy. For Estonian et_edt is probably most useful for general texts and et_ewt might be good for web texts.

To run with any of these models, use the same commands as above, exemplified with the Scandinavian model and no_nynorsk as a proxy treebank.

uuparser --testfile TESTFILE --outdir ~/experiments --modeldir MODEL_DIR/scandinavian --predict --multiling --forced-tbank-emb no_nynorsk

== POS-tags ==

The uuparser gives good results without POS-tags, mainly thanks to the use of character embeddings, and this is the default setting. It can be used with predicted POS-tags. If you wish to do this, we recommend using UDPipe to predict tags. To activate a POS-tag embedding, use the flag --pos-emb-size N, where N is the size of the embedding (12 has been a useful value).

== Segmentation ==

In the above examples, we assume pre-segmented input data already in the [http://universaldependencies.org/format.html CONLL-U] format. If your input is raw text, we recommend using UDPipe to segment and convert the format first. The UDPipe module can be loaded using <code>module load nlpl-udpipe</code> and then run by typing <code>udpipe</code> at the command line, see [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe].

Parsing/udpipe

2020-01-14T20:23:59Z

Sara:

Parsing/uuparser

2020-01-14T20:16:50Z

Sara: /* Using the Uppsala Parser on Saga */

= The Uppsala Parser =

The Uppsala Parser is a neural transition-based dependency parser based on bist-parser by Eli Kiperwasser and Yoav Goldberg and developed primarily in the context of the CoNLL shared tasks on universal dependency parsing in 2017 and 2018. The Uppsala Parser is publicly available at https://github.com/UppsalaNLP/uuparser.
Note that the version installed here may exhibit some slight differences, designed to improve ease of use.

For the full documentation, see the github page.

== Using the Uppsala Parser on Saga ==

* Log into Saga
* Activate the NLPL module repository:
module use -a /cluster/shared/nlpl/software/modules/etc
* Load the most recent version of the uuparser module:
module load nlpl-uuparser/2.3.1

== Training a parsing model ==

To train a set of parsing models on treebanks from Universal Dependencies (v2.2 or later):

uuparser --include [languages to include denoted by their treebank id] --outdir my-output-directory --datadir ud-treebank-dir

For example:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

will train separate models on UD Swedish-Talbanken, UD English-ParTUT and UD Russian-SynTagRus and store the results in the <code>experiments</code> folder in your home directory. Model selection is included in the training process by default; that is, at each epoch the current model is evaluated on the UD dev data, and at the end of training the best performing model for each language is selected.

== Predicting with a pre-trained parsing model ==

To predict on UD test data with the models trained above:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

To predict on other texts, prepare the data into the CoNLLU format (see below) and run the following command, assuming the model trained above, for the file TESTFILE:

uuparser --testfile TESTFILE --outdir ~/experiments --modeldir ~/experiments/sv_talbanken --predict

== Options ==

The parser has numerous options to allow you to fine-control its behaviour. For a full list, type

uuparser --help | less

We recommend you set the <code>--dynet-mem</code> option to a larger number when running the full training procedure on larger treebanks. Commonly used values are 5000 and 10000 (in MB). Dynet is the neural network library on which the parser is built.

Note that due to random initialization and other non-deterministic elements in the training process, you will not obtain the same results even when training twice under exactly the same circumstances (e.g. languages, number of epochs etc.). To ensure identical results between two runs, we recommend setting the <code>--dynet-seed</code> option to the same value both times (e.g. <code> --dynet-seed 123456789</code>) and adding the <code>--use-default-seed</code> flag. This ensures that Python's random number generator and Dynet both produce the same sequence of random numbers.

== Training a multitreebank parsing model ==

Our parser supports multitreebank parsing, that is a single parsing model for one or more treebanks, which could be from the same or form different languages. To train a ''multitreebank'' model for the three languages in the examples above, we simply add the <code>--multiling</code> flag when training (note, though, that these models have so far been found to work best for groups of related languages)

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --multiling --dynet-mem 5000 --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

In this case, instead of creating three separate models in the language-specific subdirectories within ~/experiments, a single model will be created directly in this folder. Predicting on UD test data is then:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

Note that if you want to have different output directories for training and predicting, the <code>--modeldir</code> option can be specified when predicting to tell the parser where the pre-trained model can be found.

If you want to use the parser for a language or treebank that was not among the treebanks that the parser was trained for, this can be done in two ways:

uuparser --include "sv_pud:sv_talbanken" --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

uuparser --include "sv_pud" --forced-tbank-emb sv_talbanken --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

Where in both cases, sv_pud will be parsed using sv_talbanken as a proxy (i.e. parsing sv_pud as if it is is sv_talbanken). With the first option, multiple treebanks and proxies can be defined separated by space.

To predict on non-UD texts, the command is similar to the single treebank case, the mian difference being the use of the --multiling flag, and specifying a proxy treebank:

uuparser --testfile TESTFILE --outdir ~/experiments --modeldir MODELDIR --predict --multiling --forced-tbank-emb sv_talbanken

This commands runs the parser on TESTFILE (in CoNLLU format), using a model found in MODELDIR, and prints the resulting parse in a file in ~/experiments. In the example sv_talbanken is used as the proxy treebank.

== Pre-trained models ==

There are three multilingual pre-trained models available, covering English and most of the Nordic languages. The models can be found in /cluster/shared/nlpl/software/modules/uuparser/2.3.1/models ($MODEL_DIR)

The English model is found in the subdirectory $MODEL_DIR/en. It is trained on the four English UD treebanks: en_gum en_partut en_ewt en_lines. For English web texts it is recommended to use en_ewt as a proxy. For more formal texts, en_partut typically works well.

The Scandinavian model is found in the subdirectory $MODEL_DIR/scandinavian. It is trained on six treebanks in Danish, Norwegian and Swedish: sv_talbanken sv_lines no_bokmaal no_nynorsk no_nynorsklia da_ddt. For Swedish, using sv_talbanken typically gives the best results, sv_lines might be better for fiction. For Norwegian, use no_bokmaal for Bokmål, no_nynorsk for general Nynorsk and no_nynorsklia for spoken Nynorsk. We have also had reasonable results when parsing Faroese with no_nynorsk as proxy.

The Uralic model is found in the subdirectory $MODEL_DIR/uralic. It is trained on five treebanks in Estonian, Finnish and North Sàmi: fi_ftb fi_tdt et_edt et_ewt sme_giella. For Finnish we recommend using fi_tdt as a proxy. For Estonian et_edt is probably most useful for general texts and et_ewt might be good for web texts.

To run with any of these models, use the same commands as above, exemplified with the Scandinavian model and no_nynorsk as a proxy treebank.

uuparser --testfile TESTFILE --outdir ~/experiments --modeldir MODEL_DIR/scandinavian --predict --multiling --forced-tbank-emb no_nynorsk

== Segmentation ==

In the above examples, we assume pre-segmented input data already in the [http://universaldependencies.org/format.html CONLL-U] format. If your input is raw text, we recommend using UDPipe to segment and convert the format first. The UDPipe module can be loaded using <code>module load nlpl-udpipe</code> and then run by typing <code>udpipe</code> at the command line, see [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe].

Parsing/uuparser

2020-01-14T20:09:25Z

Sara:

= The Uppsala Parser =

The Uppsala Parser is a neural transition-based dependency parser based on bist-parser by Eli Kiperwasser and Yoav Goldberg and developed primarily in the context of the CoNLL shared tasks on universal dependency parsing in 2017 and 2018. The Uppsala Parser is publicly available at https://github.com/UppsalaNLP/uuparser.
Note that the version installed here may exhibit some slight differences, designed to improve ease of use.

For the full documentation, see the github page.

== Using the Uppsala Parser on Saga ==

* Log into Saga
* Activate the NLPL module repository:
module use -a /cluster/shared/nlpl/software/modules/etc
* Load the most recent version of the uuparser module:
module load nlpl-uuparser/2.3.0

== Training a parsing model ==

To train a set of parsing models on treebanks from Universal Dependencies (v2.2 or later):

uuparser --include [languages to include denoted by their treebank id] --outdir my-output-directory --datadir ud-treebank-dir

For example:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

will train separate models on UD Swedish-Talbanken, UD English-ParTUT and UD Russian-SynTagRus and store the results in the <code>experiments</code> folder in your home directory. Model selection is included in the training process by default; that is, at each epoch the current model is evaluated on the UD dev data, and at the end of training the best performing model for each language is selected.

== Predicting with a pre-trained parsing model ==

To predict on UD test data with the models trained above:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

To predict on other texts, prepare the data into the CoNLLU format (see below) and run the following command, assuming the model trained above, for the file TESTFILE:

uuparser --testfile TESTFILE --outdir ~/experiments --modeldir ~/experiments/sv_talbanken --predict

== Options ==

The parser has numerous options to allow you to fine-control its behaviour. For a full list, type

uuparser --help | less

We recommend you set the <code>--dynet-mem</code> option to a larger number when running the full training procedure on larger treebanks. Commonly used values are 5000 and 10000 (in MB). Dynet is the neural network library on which the parser is built.

Note that due to random initialization and other non-deterministic elements in the training process, you will not obtain the same results even when training twice under exactly the same circumstances (e.g. languages, number of epochs etc.). To ensure identical results between two runs, we recommend setting the <code>--dynet-seed</code> option to the same value both times (e.g. <code> --dynet-seed 123456789</code>) and adding the <code>--use-default-seed</code> flag. This ensures that Python's random number generator and Dynet both produce the same sequence of random numbers.

== Training a multitreebank parsing model ==

Our parser supports multitreebank parsing, that is a single parsing model for one or more treebanks, which could be from the same or form different languages. To train a ''multitreebank'' model for the three languages in the examples above, we simply add the <code>--multiling</code> flag when training (note, though, that these models have so far been found to work best for groups of related languages)

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --multiling --dynet-mem 5000 --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

In this case, instead of creating three separate models in the language-specific subdirectories within ~/experiments, a single model will be created directly in this folder. Predicting on UD test data is then:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

Note that if you want to have different output directories for training and predicting, the <code>--modeldir</code> option can be specified when predicting to tell the parser where the pre-trained model can be found.

If you want to use the parser for a language or treebank that was not among the treebanks that the parser was trained for, this can be done in two ways:

uuparser --include "sv_pud:sv_talbanken" --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

uuparser --include "sv_pud" --forced-tbank-emb sv_talbanken --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

Where in both cases, sv_pud will be parsed using sv_talbanken as a proxy (i.e. parsing sv_pud as if it is is sv_talbanken). With the first option, multiple treebanks and proxies can be defined separated by space.

To predict on non-UD texts, the command is similar to the single treebank case, the mian difference being the use of the --multiling flag, and specifying a proxy treebank:

uuparser --testfile TESTFILE --outdir ~/experiments --modeldir MODELDIR --predict --multiling --forced-tbank-emb sv_talbanken

This commands runs the parser on TESTFILE (in CoNLLU format), using a model found in MODELDIR, and prints the resulting parse in a file in ~/experiments. In the example sv_talbanken is used as the proxy treebank.

== Pre-trained models ==

There are three multilingual pre-trained models available, covering English and most of the Nordic languages. The models can be found in /cluster/shared/nlpl/software/modules/uuparser/2.3.1/models ($MODEL_DIR)

The English model is found in the subdirectory $MODEL_DIR/en. It is trained on the four English UD treebanks: en_gum en_partut en_ewt en_lines. For English web texts it is recommended to use en_ewt as a proxy. For more formal texts, en_partut typically works well.

The Scandinavian model is found in the subdirectory $MODEL_DIR/scandinavian. It is trained on six treebanks in Danish, Norwegian and Swedish: sv_talbanken sv_lines no_bokmaal no_nynorsk no_nynorsklia da_ddt. For Swedish, using sv_talbanken typically gives the best results, sv_lines might be better for fiction. For Norwegian, use no_bokmaal for Bokmål, no_nynorsk for general Nynorsk and no_nynorsklia for spoken Nynorsk. We have also had reasonable results when parsing Faroese with no_nynorsk as proxy.

The Uralic model is found in the subdirectory $MODEL_DIR/uralic. It is trained on five treebanks in Estonian, Finnish and North Sàmi: fi_ftb fi_tdt et_edt et_ewt sme_giella. For Finnish we recommend using fi_tdt as a proxy. For Estonian et_edt is probably most useful for general texts and et_ewt might be good for web texts.

To run with any of these models, use the same commands as above, exemplified with the Scandinavian model and no_nynorsk as a proxy treebank.

uuparser --testfile TESTFILE --outdir ~/experiments --modeldir MODEL_DIR/scandinavian --predict --multiling --forced-tbank-emb no_nynorsk

== Segmentation ==

In the above examples, we assume pre-segmented input data already in the [http://universaldependencies.org/format.html CONLL-U] format. If your input is raw text, we recommend using UDPipe to segment and convert the format first. The UDPipe module can be loaded using <code>module load nlpl-udpipe</code> and then run by typing <code>udpipe</code> at the command line, see [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe].

Parsing/uuparser

2020-01-14T20:08:55Z

Sara:

= The Uppsala Parser =

The Uppsala Parser is a neural transition-based dependency parser based on bist-parser by Eli Kiperwasser and Yoav Goldberg and developed primarily in the context of the CoNLL shared tasks on universal dependency parsing in 2017 and 2018. The Uppsala Parser is publicly available at https://github.com/UppsalaNLP/uuparser.
Note that the version installed here may exhibit some slight differences, designed to improve ease of use.

For the full documentation, see the github page.

== Using the Uppsala Parser on Saga ==

* Log into Saga
* Activate the NLPL module repository:
module use -a /cluster/shared/nlpl/software/modules/etc
* Load the most recent version of the uuparser module:
module load nlpl-uuparser/2.3.0

== Training a parsing model ==

To train a set of parsing models on treebanks from Universal Dependencies (v2.2 or later):

uuparser --include [languages to include denoted by their treebank id] --outdir my-output-directory --datadir ud-treebank-dir

For example:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

will train separate models on UD Swedish-Talbanken, UD English-ParTUT and UD Russian-SynTagRus and store the results in the <code>experiments</code> folder in your home directory. Model selection is included in the training process by default; that is, at each epoch the current model is evaluated on the UD dev data, and at the end of training the best performing model for each language is selected.

== Predicting with a pre-trained parsing model ==

To predict on UD test data with the models trained above:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

To predict on other texts, prepare the data into the CoNLLU format (see below) and run the following command, assuming the model trained above, for the file TESTFILE:

uuparser --testfile TESTFILE --outdir ~/experiments --modeldir ~/experiments/sv_talbanken --predict

== Options ==

The parser has numerous options to allow you to fine-control its behaviour. For a full list, type

uuparser --help | less

We recommend you set the <code>--dynet-mem</code> option to a larger number when running the full training procedure on larger treebanks. Commonly used values are 5000 and 10000 (in MB). Dynet is the neural network library on which the parser is built.

Note that due to random initialization and other non-deterministic elements in the training process, you will not obtain the same results even when training twice under exactly the same circumstances (e.g. languages, number of epochs etc.). To ensure identical results between two runs, we recommend setting the <code>--dynet-seed</code> option to the same value both times (e.g. <code> --dynet-seed 123456789</code>) and adding the <code>--use-default-seed</code> flag. This ensures that Python's random number generator and Dynet both produce the same sequence of random numbers.

== Training a multitreebank parsing model ==

Our parser supports multitreebank parsing, that is a single parsing model for one or more treebanks, which could be from the same or form different languages. To train a ''multitreebank'' model for the three languages in the examples above, we simply add the <code>--multiling</code> flag when training (note, though, that these models have so far been found to work best for groups of related languages)

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --multiling --dynet-mem 5000 --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

In this case, instead of creating three separate models in the language-specific subdirectories within ~/experiments, a single model will be created directly in this folder. Predicting on UD test data is then:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

Note that if you want to have different output directories for training and predicting, the <code>--modeldir</code> option can be specified when predicting to tell the parser where the pre-trained model can be found.

If you want to use the parser for a language or treebank that was not among the treebanks that the parser was trained for, this can be done in two ways:

uuparser --include "sv_pud:sv_talbanken" --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

uuparser --include "sv_pud" --forced-tbank-emb sv_talbanken --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

Where in both cases, sv_pud will be parsed using sv_talbanken as a proxy (i.e. parsing sv_pud as if it is is sv_talbanken). With the first option, multiple treebanks and proxies can be defined separated by space.

To predict on non-UD texts, the command is similar to the single treebank case, the mian difference being the use of the --multiling flag, and specifying a proxy treebank:

uuparser --testfile TESTFILE --outdir ~/experiments --modeldir MODELDIR --predict --multiling --forced-tbank-emb sv_talbanken

This commands runs the parser on TESTFILE (in CoNLLU format), using a model found in MODELDIR, and prints the resulting parse in a file in ~/experiments. In the example sv_talbanken is used as the proxy treebank.

== Pre-trained models

There are three multilingual pre-trained models available, covering English and most of the Nordic languages. The models can be found in /cluster/shared/nlpl/software/modules/uuparser/2.3.1/models ($MODEL_DIR)

The English model is found in the subdirectory $MODEL_DIR/en. It is trained on the four English UD treebanks: en_gum en_partut en_ewt en_lines. For English web texts it is recommended to use en_ewt as a proxy. For more formal texts, en_partut typically works well.

The Scandinavian model is found in the subdirectory $MODEL_DIR/scandinavian. It is trained on six treebanks in Danish, Norwegian and Swedish: sv_talbanken sv_lines no_bokmaal no_nynorsk no_nynorsklia da_ddt. For Swedish, using sv_talbanken typically gives the best results, sv_lines might be better for fiction. For Norwegian, use no_bokmaal for Bokmål, no_nynorsk for general Nynorsk and no_nynorsklia for spoken Nynorsk. We have also had reasonable results when parsing Faroese with no_nynorsk as proxy.

The Uralic model is found in the subdirectory $MODEL_DIR/uralic. It is trained on five treebanks in Estonian, Finnish and North Sàmi: fi_ftb fi_tdt et_edt et_ewt sme_giella. For Finnish we recommend using fi_tdt as a proxy. For Estonian et_edt is probably most useful for general texts and et_ewt might be good for web texts.

To run with any of these models, use the same commands as above, exemplified with the Scandinavian model and no_nynorsk as a proxy treebank.

uuparser --testfile TESTFILE --outdir ~/experiments --modeldir MODEL_DIR/scandinavian --predict --multiling --forced-tbank-emb no_nynorsk

== Segmentation ==

In the above examples, we assume pre-segmented input data already in the [http://universaldependencies.org/format.html CONLL-U] format. If your input is raw text, we recommend using UDPipe to segment and convert the format first. The UDPipe module can be loaded using <code>module load nlpl-udpipe</code> and then run by typing <code>udpipe</code> at the command line, see [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe].

Parsing/uuparser

2020-01-14T19:58:17Z

Sara: /* Segmentation */

= The Uppsala Parser =

The Uppsala Parser is a neural transition-based dependency parser based on bist-parser by Eli Kiperwasser and Yoav Goldberg and developed primarily in the context of the CoNLL shared tasks on universal dependency parsing in 2017 and 2018. The Uppsala Parser is publicly available at https://github.com/UppsalaNLP/uuparser.
Note that the version installed here may exhibit some slight differences, designed to improve ease of use.

For the full documentation, see the github page.

== Using the Uppsala Parser on Saga ==

* Log into Saga
* Activate the NLPL module repository:
module use -a /cluster/shared/nlpl/software/modules/etc
* Load the most recent version of the uuparser module:
module load nlpl-uuparser/2.3.0

== Training a parsing model ==

To train a set of parsing models on treebanks from Universal Dependencies (v2.2 or later):

uuparser --include [languages to include denoted by their treebank id] --outdir my-output-directory --datadir ud-treebank-dir

For example:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

will train separate models on UD Swedish-Talbanken, UD English-ParTUT and UD Russian-SynTagRus and store the results in the <code>experiments</code> folder in your home directory. Model selection is included in the training process by default; that is, at each epoch the current model is evaluated on the UD dev data, and at the end of training the best performing model for each language is selected.

== Predicting with a pre-trained parsing model ==

To predict on UD test data with the models trained above:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

To predict on other texts, prepare the data into the CoNLLU format (see below) and run the following command, assuming the model trained above, for the file TESTFILE:

uuparser --testfile TESTFILE --outdir ~/experiments --modeldir ~/experiments/sv_talbanken --predict

== Options ==

The parser has numerous options to allow you to fine-control its behaviour. For a full list, type

uuparser --help | less

We recommend you set the <code>--dynet-mem</code> option to a larger number when running the full training procedure on larger treebanks. Commonly used values are 5000 and 10000 (in MB). Dynet is the neural network library on which the parser is built.

Note that due to random initialization and other non-deterministic elements in the training process, you will not obtain the same results even when training twice under exactly the same circumstances (e.g. languages, number of epochs etc.). To ensure identical results between two runs, we recommend setting the <code>--dynet-seed</code> option to the same value both times (e.g. <code> --dynet-seed 123456789</code>) and adding the <code>--use-default-seed</code> flag. This ensures that Python's random number generator and Dynet both produce the same sequence of random numbers.

== Training a multitreebank parsing model ==

Our parser supports multitreebank parsing, that is a single parsing model for one or more treebanks, which could be from the same or form different languages. To train a ''multitreebank'' model for the three languages in the examples above, we simply add the <code>--multiling</code> flag when training (note, though, that these models have so far been found to work best for groups of related languages)

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --multiling --dynet-mem 5000 --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

In this case, instead of creating three separate models in the language-specific subdirectories within ~/experiments, a single model will be created directly in this folder. Predicting on UD test data is then:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

Note that if you want to have different output directories for training and predicting, the <code>--modeldir</code> option can be specified when predicting to tell the parser where the pre-trained model can be found.

If you want to use the parser for a language or treebank that was not among the treebanks that the parser was trained for, this can be done in two ways:

uuparser --include "sv_pud:sv_talbanken" --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

uuparser --include "sv_pud" --forced-tbank-emb sv_talbanken --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

Where in both cases, sv_pud will be parsed using sv_talbanken as a proxy (i.e. parsing sv_pud as if it is is sv_talbanken). With the first option, multiple treebanks and proxies can be defined separated by space.

To predict on non-UD texts, the command is similar to the single treebank case, the mian difference being the use of the --multiling flag, and specifying a proxy treebank:

uuparser --testfile TESTFILE --outdir ~/experiments --modeldir MODELDIR --predict --multiling --forced-tbank-emb sv_talbanken

This commands runs the parser on TESTFILE (in CoNLLU format), using a model found in MODELDIR, and prints the resulting parse in a file in ~/experiments. In the example sv_talbanken is used as the proxy treebank.

== Segmentation ==

In the above examples, we assume pre-segmented input data already in the [http://universaldependencies.org/format.html CONLL-U] format. If your input is raw text, we recommend using UDPipe to segment and convert the format first. The UDPipe module can be loaded using <code>module load nlpl-udpipe</code> and then run by typing <code>udpipe</code> at the command line, see [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe].

Parsing/uuparser

2020-01-14T19:52:32Z

Sara: /* Training a multitreebank parsing model */

= The Uppsala Parser =

The Uppsala Parser is a neural transition-based dependency parser based on bist-parser by Eli Kiperwasser and Yoav Goldberg and developed primarily in the context of the CoNLL shared tasks on universal dependency parsing in 2017 and 2018. The Uppsala Parser is publicly available at https://github.com/UppsalaNLP/uuparser.
Note that the version installed here may exhibit some slight differences, designed to improve ease of use.

For the full documentation, see the github page.

== Using the Uppsala Parser on Saga ==

* Log into Saga
* Activate the NLPL module repository:
module use -a /cluster/shared/nlpl/software/modules/etc
* Load the most recent version of the uuparser module:
module load nlpl-uuparser/2.3.0

== Training a parsing model ==

To train a set of parsing models on treebanks from Universal Dependencies (v2.2 or later):

uuparser --include [languages to include denoted by their treebank id] --outdir my-output-directory --datadir ud-treebank-dir

For example:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

will train separate models on UD Swedish-Talbanken, UD English-ParTUT and UD Russian-SynTagRus and store the results in the <code>experiments</code> folder in your home directory. Model selection is included in the training process by default; that is, at each epoch the current model is evaluated on the UD dev data, and at the end of training the best performing model for each language is selected.

== Predicting with a pre-trained parsing model ==

To predict on UD test data with the models trained above:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

To predict on other texts, prepare the data into the CoNLLU format (see below) and run the following command, assuming the model trained above, for the file TESTFILE:

uuparser --testfile TESTFILE --outdir ~/experiments --modeldir ~/experiments/sv_talbanken --predict

== Options ==

The parser has numerous options to allow you to fine-control its behaviour. For a full list, type

uuparser --help | less

We recommend you set the <code>--dynet-mem</code> option to a larger number when running the full training procedure on larger treebanks. Commonly used values are 5000 and 10000 (in MB). Dynet is the neural network library on which the parser is built.

Note that due to random initialization and other non-deterministic elements in the training process, you will not obtain the same results even when training twice under exactly the same circumstances (e.g. languages, number of epochs etc.). To ensure identical results between two runs, we recommend setting the <code>--dynet-seed</code> option to the same value both times (e.g. <code> --dynet-seed 123456789</code>) and adding the <code>--use-default-seed</code> flag. This ensures that Python's random number generator and Dynet both produce the same sequence of random numbers.

== Training a multitreebank parsing model ==

Our parser supports multitreebank parsing, that is a single parsing model for one or more treebanks, which could be from the same or form different languages. To train a ''multitreebank'' model for the three languages in the examples above, we simply add the <code>--multiling</code> flag when training (note, though, that these models have so far been found to work best for groups of related languages)

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --multiling --dynet-mem 5000 --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

In this case, instead of creating three separate models in the language-specific subdirectories within ~/experiments, a single model will be created directly in this folder. Predicting on UD test data is then:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

Note that if you want to have different output directories for training and predicting, the <code>--modeldir</code> option can be specified when predicting to tell the parser where the pre-trained model can be found.

If you want to use the parser for a language or treebank that was not among the treebanks that the parser was trained for, this can be done in two ways:

uuparser --include "sv_pud:sv_talbanken" --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

uuparser --include "sv_pud" --forced-tbank-emb sv_talbanken --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5

Where in both cases, sv_pud will be parsed using sv_talbanken as a proxy (i.e. parsing sv_pud as if it is is sv_talbanken). With the first option, multiple treebanks and proxies can be defined separated by space.

To predict on non-UD texts, the command is similar to the single treebank case, the mian difference being the use of the --multiling flag, and specifying a proxy treebank:

uuparser --testfile TESTFILE --outdir ~/experiments --modeldir MODELDIR --predict --multiling --forced-tbank-emb sv_talbanken

This commands runs the parser on TESTFILE (in CoNLLU format), using a model found in MODELDIR, and prints the resulting parse in a file in ~/experiments. In the example sv_talbanken is used as the proxy treebank.

== Segmentation ==

In the above examples, we assume pre-segmented input data already in the [http://universaldependencies.org/format.html CONLL-U] format. If your input is raw text, we recommend using UDPipe to segment first. The UDPipe module can be loaded using <code>module load nlpl-udpipe</code> and then run by typing <code>udpipe</code> at the command line, see [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe].

Parsing/uuparser

2020-01-14T19:48:26Z

Sara: /* Predicting with a pre-trained parsing model */

Parsing/uuparser

2020-01-14T19:48:00Z

Sara:

Parsing/uuparser

2020-01-14T19:46:45Z

Sara: /* Predicting with a pre-trained parsing model */

Parsing/uuparser

2020-01-14T13:07:04Z

Sara: /* Training a parsing model */

Parsing/uuparser

2020-01-14T13:06:20Z

Sara: /* Training a multitreebank parsing model */

Parsing/uuparser

2020-01-14T12:51:15Z

Sara: /* Segmentation */

Parsing/uuparser

2020-01-14T12:50:28Z

Sara: /* Training a multitreebank parsing model */

Parsing/uuparser

2020-01-14T12:19:17Z

Sara: /* The Uppsala Parser */

Parsing/turboparser

2020-01-14T12:15:57Z

Sara: Created page with "= TurboParser = TurboParser is a fast and accurate pre-neural dependency parser with linear programming. The package also contains a POS tagger, a semantic role labeler, a en..."

= TurboParser =

TurboParser is a fast and accurate pre-neural dependency parser with linear programming. The package also contains a POS tagger, a semantic role labeler, a entity tagger, a coreference resolver, and a constituent (phrase-based) parser. For full documentation, see [http://www.cs.cmu.edu/~ark/TurboParser/ http://www.cs.cmu.edu/~ark/TurboParser/] and [https://github.com/andre-martins/TurboParser https://github.com/andre-martins/TurboParser]. This document will only describe how to use the tagger and parser.

== Using TurboParser on Saga ==

* Log into Saga
* Activate the NLPL module repository:
module use -a /cluster/shared/nlpl/software/modules/etc
* Load the most recent version of the uuparser module:
module load nlpl-turboparser/2.3.0

== Data formats and conversion ==

TurboParser takes CoNLL-X files as input. Most dcurrent data, including the universal dependencies data available on Saga is in CoNLLU format. If you want to use such data, you first need to convert between these formats, using the provided script (this script comes from Universal dependenices tools by Dan Zeman, and is included in the TurboParser module):

conllu_to_conllx.pl < INPUT_FILE.conllu > OUTPUT_FILE.conll

TurboTagger accepts data in native format with one word per line, and two columns, the words and the tags. There is a script for conversion between CoNLL-X and this format:

create_tagging_corpus.sh INPUT_FILE.conll

which will create the file "INPUT_FILE.conll.tagging"

== Using the parser ==

To train a parsing models on a treebank:

TurboParser --train --file_train=$res_dir/TRAINING_DATA.conll --file_model=MODEL --logtostderr

To predict using the model trained above:

TurboParser --test --evaluate --file_model=MODEL --file_test=TEST_INPUT_FILE --file_prediction=RESULT_FILE --logtostderr

This command writes the aprsed data in RESULT_FILE, and prints the accuracy (since the --evaluate flag is given)

The parser has numerous options to allow you to fine-control its behaviour. For a full list, type

TurboParser --help

== Using the tagger ==

To train a tagging model on a treebank:

TurboTagger --train --file_train=TRAINING_DATA.conll.tagging --file_model=MODEL --form_cutoff=1 --logtostderr

To predict using the model trained above:

TurboTagger --test --evaluate --file_model=MODEL --file_test=TEST_INPUT_FILE --file_prediction=RESULT_FILE --logtostderr

This command writes the tagged data in RESULT_FILE, and prints the accuracy (since the --evaluate flag is given)

The tagger has numerous options to allow you to fine-control its behaviour. For a full list, type

TurboTagger --help

== Segmentation ==

In the above examples, we assume pre-segmented input data already in the appropriate format (see above). If your input is raw text, we recommend using UDPipe to segment first. The UDPipe module can be loaded using <code>module load nlpl-udpipe</code> and then run by typing <code>udpipe</code> at the command line, see [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe]

Parsing/abel

2020-01-14T10:22:32Z

Sara:

= Background =

'''This page is outdated and kept for documentation purposes only! It reflects the status of the translation activity mid-2019, before the launch of Puhti and Saga.'''

This page describes resources previously installed on the Abel cluster. For unlinked resources, see pages for currently available software/data on the main parsing page.

= Preprocessing Tools =

* [http://wiki.nlpl.eu/index.php/Parsing/repp REPP Tokenizer (English and Norwegian)]

Additionally, a variety of tools for sentence splitting, tokenization, lemmatization, et al.
are available through the NLPL installations of the
[http://nltk.org Natural Language Processing Toolkit (NLTK)] and the
[https://spacy.io spaCy: Natural Language Processing in Python] tools.

= Parsing Systems =

* [http://wiki.nlpl.eu/index.php/Parsing/dozat Stanford Graph-Based Parser by Tim Dozat]
* uuparser
* UDPipe

= Data =

* Universal Dependencies treebanks
* Semantic Dependency parsing

Parsing/home

2020-01-14T10:20:42Z

Sara:

= Background =

An experimentation environment for data-driven dependency parsing is maintained for NLPL under the coordination of Uppsala University (UU).
The data is available on the Norwegian Saga cluster and on the Finnish Puhti cluster.
The software is available on the Norwegian Saga cluster

Initially, software and data were commissioned on the Norwegian Abel supercluster, see [http://wiki.nlpl.eu/index.php/Parsing/abel The Abel page] for legacy information.

= Preprocessing Tools =

* [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe]

Additionally, a variety of tools for sentence splitting, tokenization, lemmatization, et al.
are available through the NLPL installations of the
[http://nltk.org Natural Language Processing Toolkit (NLTK)] and the
[https://spacy.io spaCy: Natural Language Processing in Python] tools.

= Parsing Systems =

* [http://wiki.nlpl.eu/index.php/Parsing/uuparser The Uppsala Parser]
* [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe]
* [http://wiki.nlpl.eu/index.php/Parsing/turboparser TurboParser]

= Training and Evaluation Data =

* [http://wiki.nlpl.eu/index.php/Parsing/ud Universal Dependencies v2.0–2.5]
* [http://wiki.nlpl.eu/index.php/Parsing/sdp Semantic Dependency Parsing]

Parsing/home

2020-01-14T10:18:01Z

Sara:

= Background =

An experimentation environment for data-driven dependency parsing is maintained for NLPL under the coordination of Uppsala University (UU).
Initially, the software and data are commissioned on the Norwegian Abel supercluster.

= Preprocessing Tools =

* [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe]

Additionally, a variety of tools for sentence splitting, tokenization, lemmatization, et al.
are available through the NLPL installations of the
[http://nltk.org Natural Language Processing Toolkit (NLTK)] and the
[https://spacy.io spaCy: Natural Language Processing in Python] tools.

= Parsing Systems =

* [http://wiki.nlpl.eu/index.php/Parsing/uuparser The Uppsala Parser]
* [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe]
* [http://wiki.nlpl.eu/index.php/Parsing/turboparser TurboParser]

= Training and Evaluation Data =

* [http://wiki.nlpl.eu/index.php/Parsing/ud Universal Dependencies v2.0–2.5]
* [http://wiki.nlpl.eu/index.php/Parsing/sdp Semantic Dependency Parsing]

Parsing/abel

2020-01-14T10:16:39Z

Sara: Created page with "= Background = '''This page is outdated and kept for documentation purposes only! It reflects the status of the translation activity mid-2019, before the launch of Puhti and ..."

= Background =

'''This page is outdated and kept for documentation purposes only! It reflects the status of the translation activity mid-2019, before the launch of Puhti and Saga.'''

This page describes resources previously installed on the Abel cluster.

= Preprocessing Tools =

* [http://wiki.nlpl.eu/index.php/Parsing/repp REPP Tokenizer (English and Norwegian)]

Additionally, a variety of tools for sentence splitting, tokenization, lemmatization, et al.
are available through the NLPL installations of the
[http://nltk.org Natural Language Processing Toolkit (NLTK)] and the
[https://spacy.io spaCy: Natural Language Processing in Python] tools.

= Parsing Systems =

* [http://wiki.nlpl.eu/index.php/Parsing/dozat Stanford Graph-Based Parser by Tim Dozat]

Parsing/uuparser

2020-01-13T13:03:00Z

Sara:

Parsing/home

2020-01-13T12:54:23Z

Sara:

= Background =

An experimentation environment for data-driven dependency parsing is maintained for NLPL under the coordination of Uppsala University (UU).
Initially, the software and data are commissioned on the Norwegian Abel supercluster.

= Preprocessing Tools =

* [http://wiki.nlpl.eu/index.php/Parsing/repp REPP Tokenizer (English and Norwegian)]

Additionally, a variety of tools for sentence splitting, tokenization, lemmatization, et al.
are available through the NLPL installations of the
[http://nltk.org Natural Language Processing Toolkit (NLTK)] and the
[https://spacy.io spaCy: Natural Language Processing in Python] tools.

= Parsing Systems =

* [http://wiki.nlpl.eu/index.php/Parsing/uuparser The Uppsala Parser]
* [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe]
* [http://wiki.nlpl.eu/index.php/Parsing/dozat Stanford Graph-Based Parser by Tim Dozat]

= Training and Evaluation Data =

* [http://wiki.nlpl.eu/index.php/Parsing/ud Universal Dependencies v2.0–2.5]
* [http://wiki.nlpl.eu/index.php/Parsing/sdp Semantic Dependency Parsing]

Parsing/ud

2019-12-17T10:23:15Z

Sara:

= Universal Dependencies =

For syntactic parsing experiments we provide data from the [http://universaldependencies.org/ Universal Dependencies (UD) project] for a high number of languages. The data is provided in v2.0, which was used for the CoNLL shared task 2017, v2.1, v2.2, which was used for the CoNLL shared task 2018, v2.3, v2.4, and v2.5.

All data is available on Saga at <code>/cluster/shared/nlpl/data/parsing/ud</code> and automatically
[http://wiki.nlpl.eu/index.php/Infrastructure/replication replicated] to Puhti into <code>/projappl/nlpl/data/parsing/ud<code>.

== UD version 2.0 ==

folders: 
<code>/cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.0-conll2017</code> 
<code>/cluster/shared/nlpl/data/parsing/ud/ud-test-v2.0-conll2017</code>

info: 
Version 2.0 treebanks, archived at http://hdl.handle.net/11234/1-1983. 
70 treebanks, 50 languages, released March 1, 2017. 
Test data 2.0 are archived at http://hdl.handle.net/11234/1-2184. 
81 treebanks, 49 languages, released May 18, 2017.

Release 2.0 has test data released separately from the test data,
which is reflected in our folder structure. This data was released for
the CoNLL 2017 shared task.

== UD version 2.1 ==

folders: 
<code>/cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.1</code>

info: 
Version 2.1 treebanks are available at http://hdl.handle.net/11234/1-2515. 
102 treebanks, 60 languages, released November 15, 2017.

== UD version 2.2 ==

folders: 
<code>/cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.2</code>

info: 
Version 2.2 treebanks are available at http://hdl.handle.net/11234/1-2837. 
122 treebanks, 71 languages, released July 1, 2018.

== UD version 2.3 ==

folders: 
<code>/cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.3</code>

info: 
Version 2.2 treebanks are available at http://hdl.handle.net/11234/1-2895. 
129 treebanks, 76 languages, released November 15, 2018.

== UD version 2.4 ==

folders: 
<code>/cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.4</code>

info: 
Version 2.4 treebanks are available at http://hdl.handle.net/11234/1-2988. 
146 treebanks, 83 languages, released May 15, 2019. 

== UD version 2.5 ==

folders: 
<code>/cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5</code>

info: 
Version 2.5 treebanks are available at http://hdl.handle.net/11234/1-3105. 
157 treebanks, 90 languages, released November 15, 2019. 

= Contact =
Joakim Nivre, Uppsala University 
Sara Stymne, Uppsala University 
firstname.lastname@lingfil.uu.se

Parsing/home

2017-12-08T09:10:18Z

Sara: /* Segmentation */

= Background =

An experimentation environment for data-driven dependency parsing is maintained for NLPL under the coordination of Uppsala University (UU).
Initially, the software and data are commissioned on the Norwegian Abel supercluster.
The Uppsala Parser is publicly available at https://github.com/UppsalaNLP/uuparser.
Note that the version installed here may exhibit some slight differences, designed to improve ease of use.

= Data =

For parsing experiments data from the [Univeral Dependencies (UD) project](http://universaldependencies.org/), for a high number of languages. We provide the data in version 2.0, which was also used for the CoNLL shared task 2017, and version 2.1.

All data is available on Abel at <code>/projects/nlpl/data/parsing/universal_dependencies</code>

== UD version 2.0 ==

folders: 
<code>/projects/nlpl/data/parsing/universal_dependencies/ud-treebanks-v2.0-conll2017</code> 
<code>/projects/nlpl/data/parsing/universal_dependencies/ud-test-v2.0-conll2017</code>

info: 
Version 2.0 treebanks, archived at http://hdl.handle.net/11234/1-1983. 
70 treebanks, 50 languages, released March 1, 2017. 
Test data 2.0 are archived at http://hdl.handle.net/11234/1-2184. 
81 treebanks, 49 languages, released May 18, 2017.

Release 2.0 has test data released separately from the test data,
which is reflected in our folder structure. This data was released for
the CoNLL 2017 shared task.

== UD version 2.1 ==

folders: 
<code>/projects/nlpl/data/parsing/universal_dependencies/ud-treebanks-v2.1</code>

info: 
Version 2.1 treebanks are available at http://hdl.handle.net/11234/1-2515. 
102 treebanks, 60 languages, released November 15, 2017.

= Using the Uppsala Parser =

* Log into Abel
* Activate the NLPL module repository:
module use -a /projects/nlpl/software/modulefiles/
* Load the most recent version of the uuparser module:
module load uuparser

== Training a parsing model ==

To train a set of parsing models on treebanks from Universal Dependencies (v2.1):

uuparser --include [languages to include denoted by their ISO id] --outdir my-output-directory

If you want to quickly test the parser is correctly loaded and running without waiting for the full training procedure, add the <code>--mini</code> flag.
For example:

uuparser --include "sv en ru" --outdir ~/experiments --mini

will train separate models for UD Swedish, English and Russian and store the results in the <code>experiments</code> folder in your home directory. The <code>--mini</code> flag tells the parser to train on just the first 150 sentences of each language, and evaluate on the first 100 sentences of development data. It also tells the parser to train for just 3 epochs, as opposed to the default 30 (see more below under "Options")

Model selection is included in the training process by default; that is, at each epoch the current model is evaluated on the UD dev data, and at the end of training the best performing model for each language is selected.

== Predicting with a pre-trained parsing model ==

To predict on UD test data with the models trained above:

uuparser --include "sv en ru" --outdir ~/experiments --predict

You may again include the <code>--mini</code> flag if you prefer to test on a subset of 50 test sentences.

== Options ==

The parser has numerous options to allow you to fine-control its behaviour. For a full list, type

uuparser --help | less

We recommend you set the <code>--dynet-mem</code> option to a larger number when running the full training procedure on larger treebanks. Commonly used values are 5000 and 10000 (in MB). Dynet is the neural network library on which the parser is built.

Note that due to random initialization and other non-deterministic elements in the training process, you will not obtain the same results even when training twice under exactly the same circumstances (e.g. languages, number of epochs etc.). To ensure identical results between two runs, we recommend setting the <code>--dynet-seed</code> option to the same value both times (e.g. <code> --dynet-seed 123456789</code>) and adding the <code>--use-default-seed</code> flag. This ensures that Python's random number generator and Dynet both produce the same sequence of random numbers.

== Training a multilingual parsing model ==

Our parser supports multilingual parsing, that is a single parsing model for one or more languages. To train a ''multilingual'' model for the three languages in the examples above, we simply add the <code>--multilingual</code> flag when training

uuparser --include "sv en ru" --outdir ~/experiments --multiling --dynet-mem 5000

In this case, instead of creating three separate models in the language-specific subdirectories within ~/experiments, a single model will be created directly in this folder. Predicting on test data is then as easy as:

uuparser --include "sv en ru" --outdir ~/experiments --multiling --dynet-mem 5000 --predict

Note that if you want to have different output directories for training and predicting, the <code>--modeldir</code> option can be specified when predicting to tell the parser where the pre-trained model can be found.

== Segmentation ==

In the above examples, we assume pre-segmented input data already in the [http://universaldependencies.org/format.html CONLL-U] format. If your input is raw text, we recommend using UDPipe to segment first. The UDPipe module can be loaded using <code>module load nlpl-udpipe</code> and then run by typing <code>udpipe</code> at the command line, see below.

= Using UDPipe =

UDPipe is available as a module on Abel. It was installed as part of the OPUS activity.

How to use UDPipe:
* Log into Abel
* Activate the NLPL module repository:
module use -a /projects/nlpl/software/modulefiles/
* Load the most recent version of the uuparser module:
module load nlpl-udpipe

To learn more about using UDPipe, check the official [UDPipe User's Manual](http://ufal.mff.cuni.cz/udpipe/users-manual)

= Contact =
Aaron Smith, Uppsala University, firstname.lastname@lingfil.uu.se 
Sara Stymne, Uppsala University, firstname.lastname@lingfil.uu.se

Parsing/home

2017-12-08T09:09:48Z

Sara:

= Background =

An experimentation environment for data-driven dependency parsing is maintained for NLPL under the coordination of Uppsala University (UU).
Initially, the software and data are commissioned on the Norwegian Abel supercluster.
The Uppsala Parser is publicly available at https://github.com/UppsalaNLP/uuparser.
Note that the version installed here may exhibit some slight differences, designed to improve ease of use.

= Data =

For parsing experiments data from the [Univeral Dependencies (UD) project](http://universaldependencies.org/), for a high number of languages. We provide the data in version 2.0, which was also used for the CoNLL shared task 2017, and version 2.1.

All data is available on Abel at <code>/projects/nlpl/data/parsing/universal_dependencies</code>

== UD version 2.0 ==

folders: 
<code>/projects/nlpl/data/parsing/universal_dependencies/ud-treebanks-v2.0-conll2017</code> 
<code>/projects/nlpl/data/parsing/universal_dependencies/ud-test-v2.0-conll2017</code>

info: 
Version 2.0 treebanks, archived at http://hdl.handle.net/11234/1-1983. 
70 treebanks, 50 languages, released March 1, 2017. 
Test data 2.0 are archived at http://hdl.handle.net/11234/1-2184. 
81 treebanks, 49 languages, released May 18, 2017.

Release 2.0 has test data released separately from the test data,
which is reflected in our folder structure. This data was released for
the CoNLL 2017 shared task.

== UD version 2.1 ==

folders: 
<code>/projects/nlpl/data/parsing/universal_dependencies/ud-treebanks-v2.1</code>

info: 
Version 2.1 treebanks are available at http://hdl.handle.net/11234/1-2515. 
102 treebanks, 60 languages, released November 15, 2017.

= Using the Uppsala Parser =

* Log into Abel
* Activate the NLPL module repository:
module use -a /projects/nlpl/software/modulefiles/
* Load the most recent version of the uuparser module:
module load uuparser

== Training a parsing model ==

To train a set of parsing models on treebanks from Universal Dependencies (v2.1):

uuparser --include [languages to include denoted by their ISO id] --outdir my-output-directory

If you want to quickly test the parser is correctly loaded and running without waiting for the full training procedure, add the <code>--mini</code> flag.
For example:

uuparser --include "sv en ru" --outdir ~/experiments --mini

will train separate models for UD Swedish, English and Russian and store the results in the <code>experiments</code> folder in your home directory. The <code>--mini</code> flag tells the parser to train on just the first 150 sentences of each language, and evaluate on the first 100 sentences of development data. It also tells the parser to train for just 3 epochs, as opposed to the default 30 (see more below under "Options")

Model selection is included in the training process by default; that is, at each epoch the current model is evaluated on the UD dev data, and at the end of training the best performing model for each language is selected.

== Predicting with a pre-trained parsing model ==

To predict on UD test data with the models trained above:

uuparser --include "sv en ru" --outdir ~/experiments --predict

You may again include the <code>--mini</code> flag if you prefer to test on a subset of 50 test sentences.

== Options ==

The parser has numerous options to allow you to fine-control its behaviour. For a full list, type

uuparser --help | less

We recommend you set the <code>--dynet-mem</code> option to a larger number when running the full training procedure on larger treebanks. Commonly used values are 5000 and 10000 (in MB). Dynet is the neural network library on which the parser is built.

Note that due to random initialization and other non-deterministic elements in the training process, you will not obtain the same results even when training twice under exactly the same circumstances (e.g. languages, number of epochs etc.). To ensure identical results between two runs, we recommend setting the <code>--dynet-seed</code> option to the same value both times (e.g. <code> --dynet-seed 123456789</code>) and adding the <code>--use-default-seed</code> flag. This ensures that Python's random number generator and Dynet both produce the same sequence of random numbers.

== Training a multilingual parsing model ==

Our parser supports multilingual parsing, that is a single parsing model for one or more languages. To train a ''multilingual'' model for the three languages in the examples above, we simply add the <code>--multilingual</code> flag when training

uuparser --include "sv en ru" --outdir ~/experiments --multiling --dynet-mem 5000

In this case, instead of creating three separate models in the language-specific subdirectories within ~/experiments, a single model will be created directly in this folder. Predicting on test data is then as easy as:

uuparser --include "sv en ru" --outdir ~/experiments --multiling --dynet-mem 5000 --predict

Note that if you want to have different output directories for training and predicting, the <code>--modeldir</code> option can be specified when predicting to tell the parser where the pre-trained model can be found.

== Segmentation ==

In the above examples, we assume pre-segmented input data already in the [http://universaldependencies.org/format.html CONLL-U] format. If your input is raw text, we recommend using UDPipe to segment first. The UDPipe module can be loaded using <code>module load nlpl-udpipe</code> and then run by typing <code>udpipe</code> at the command line.

= Using UDPipe =

UDPipe is available as a module on Abel. It was installed as part of the OPUS activity.

How to use UDPipe:
* Log into Abel
* Activate the NLPL module repository:
module use -a /projects/nlpl/software/modulefiles/
* Load the most recent version of the uuparser module:
module load nlpl-udpipe

To learn more about using UDPipe, check the official [UDPipe User's Manual](http://ufal.mff.cuni.cz/udpipe/users-manual)

= Contact =
Aaron Smith, Uppsala University, firstname.lastname@lingfil.uu.se 
Sara Stymne, Uppsala University, firstname.lastname@lingfil.uu.se

Parsing/home

2017-12-08T08:57:37Z

Sara:

= Background =

An experimentation environment for data-driven dependency parsing is maintained for NLPL under the coordination of Uppsala University (UU).
Initially, the software and data are commissioned on the Norwegian Abel supercluster.
The Uppsala Parser is publicly available at https://github.com/UppsalaNLP/uuparser.
Note that the version installed here may exhibit some slight differences, designed to improve ease of use.

= Data =

For parsing experiments data from the Univeral Dependencies (UD) project, for a high number of languages. We provide the data in version 2.0, which was also used for the CoNLL shared task 2017, and version 2.1.

All data is available on Abel at <code>/projects/nlpl/data/parsing/universal_dependencies</code>

== UD version 2.0 ==

folders: 
<code>/projects/nlpl/data/parsing/universal_dependencies/ud-treebanks-v2.0-conll2017</code> 
<code>/projects/nlpl/data/parsing/universal_dependencies/ud-test-v2.0-conll2017</code>

info: 
Version 2.0 treebanks, archived at http://hdl.handle.net/11234/1-1983. 
70 treebanks, 50 languages, released March 1, 2017. 
Test data 2.0 are archived at http://hdl.handle.net/11234/1-2184. 
81 treebanks, 49 languages, released May 18, 2017.

Release 2.0 has test data released separately from the test data,
which is reflected in our folder structure. This data was released for
the CoNLL 2017 shared task.

== UD version 2.1 ==

folders: 
<code>/projects/nlpl/data/parsing/universal_dependencies/ud-treebanks-v2.1</code>

info: 
Version 2.1 treebanks are available at http://hdl.handle.net/11234/1-2515. 
102 treebanks, 60 languages, released November 15, 2017.

= Using the Uppsala Parser =

* Log into Abel
* Activate the NLPL module repository:
module use -a /projects/nlpl/software/modulefiles/
* Load the most recent version of the uuparser module:
module load uuparser

== Training a parsing model ==

To train a set of parsing models on treebanks from Universal Dependencies (v2.1):

uuparser --include [languages to include denoted by their ISO id] --outdir my-output-directory

If you want to quickly test the parser is correctly loaded and running without waiting for the full training procedure, add the <code>--mini</code> flag.
For example:

uuparser --include "sv en ru" --outdir ~/experiments --mini

will train separate models for UD Swedish, English and Russian and store the results in the <code>experiments</code> folder in your home directory. The <code>--mini</code> flag tells the parser to train on just the first 150 sentences of each language, and evaluate on the first 100 sentences of development data. It also tells the parser to train for just 3 epochs, as opposed to the default 30 (see more below under "Options")

Model selection is included in the training process by default; that is, at each epoch the current model is evaluated on the UD dev data, and at the end of training the best performing model for each language is selected.

== Predicting with a pre-trained parsing model ==

To predict on UD test data with the models trained above:

uuparser --include "sv en ru" --outdir ~/experiments --predict

You may again include the <code>--mini</code> flag if you prefer to test on a subset of 50 test sentences.

== Options ==

The parser has numerous options to allow you to fine-control its behaviour. For a full list, type

uuparser --help | less

We recommend you set the <code>--dynet-mem</code> option to a larger number when training on larger treebanks. Commonly used values are 5000 and 10000 (in MB). Dynet is the neural network library on which the parser is built.

Note that due to random initialization and other non-deterministic elements in the training process, you will not obtain the same results even when training twice under exactly the same circumstances (e.g. languages, number of epochs etc.). To ensure identical results between two runs, we recommend setting the <code>--dynet-seed</code> option to the same value both times (e.g. <code> --dynet-seed 123456789</code>) and adding the <code>--use-default-seed</code> flag. This ensures that Python's random number generator and Dynet both produce the same sequence of random numbers.

== Training a multilingual parsing model ==

Our parser supports multilingual parsing, that is a single parsing model for one or more languages. To train a ''multilingual'' model for the three languages in the examples above, we simply add the <code>--multilingual</code> flag when training

uuparser --include "sv en ru" --outdir ~/experiments --multiling --dynet-mem 5000

In this case, instead of creating three separate models in the language-specific subdirectories created within ~/experiments, a single model will be created directly in this folder. Predicting on test data is then as easy as:

uuparser --include "sv en ru" --outdir ~/experiments --multiling --dynet-mem 5000 --predict

Note that if you want to have different output directories for training and predicting, the <code>--modeldir</code> option can be specified when predicting to tell the parser where the pre-trained model can be found.

'''Contact:'''
Aaron Smith, Uppsala University, firstname.lastname@lingfil.uu.se