Difference between revisions of "Parsing/uuparser"

From Nordic Language Processing Laboratory
(Training a multilingual parsing model)
Line 4:
  Note that the version installed here may exhibit some slight differences, designed to improve ease of use.
- == Using the Uppsala Parser on Abel ==
+ == Using the Uppsala Parser on Saga ==
- * Log into Abel
+ * Log into Saga
  * Activate the NLPL module repository:
-   module use -a /projects/nlpl/software/modulefiles/
+   module use -a /cluster/shared/nlpl/software/modules/etc
  * Load the most recent version of the uuparser module:
-   module load nlpl-uuparser
+   module load nlpl-uuparser/2.3.0
  == Training a parsing model ==
Line 40:
  Note that due to random initialization and other non-deterministic elements in the training process, you will not obtain the same results even when training twice under exactly the same circumstances (e.g. languages, number of epochs etc.). To ensure identical results between two runs, we recommend setting the <code>--dynet-seed</code> option to the same value both times (e.g. <code> --dynet-seed 123456789</code>) and adding the <code>--use-default-seed</code> flag. This ensures that Python's random number generator and Dynet both produce the same sequence of random numbers.
- == Training a multilingual parsing model ==
+ == Training a multitreebank parsing model ==
- Our parser supports multilingual parsing, that is a single parsing model for one or more languages. To train a ''multilingual'' model for the three languages in the examples above, we simply add the <code>--multiling</code> flag when training
+ Our parser supports multitreebank parsing, that is a single parsing model for one or more treebanks, which could be from the same or from different languages. To train a ''multitreebank'' model for the three languages in the examples above, we simply add the <code>--multiling</code> flag when training (note, though, that these models have so far been found to work best for groups of related languages)
    uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --multiling --dynet-mem 5000

Revision as of 13:03, 13 January 2020

The Uppsala Parser

The Uppsala Parser is a neural transition-based dependency parser based on bist-parser by Eli Kiperwasser and Yoav Goldberg and developed primarily in the context of the CoNLL shared tasks on universal dependency parsing in 2017 and 2018. The Uppsala Parser is publicly available at https://github.com/UppsalaNLP/uuparser. Note that the version installed here may exhibit some slight differences, designed to improve ease of use.

Using the Uppsala Parser on Saga

  • Log into Saga
  • Activate the NLPL module repository:
module use -a /cluster/shared/nlpl/software/modules/etc
  • Load the most recent version of the uuparser module:
module load nlpl-uuparser/2.3.0

Training a parsing model

To train a set of parsing models on treebanks from Universal Dependencies (v2.2 or later):

uuparser --include [languages to include denoted by their treebank id] --outdir my-output-directory

For example:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments

will train separate models on UD Swedish-Talbanken, UD English-ParTUT and UD Russian-SynTagRus and store the results in the experiments folder in your home directory. Model selection is included in the training process by default; that is, at each epoch the current model is evaluated on the UD dev data, and at the end of training the best performing model for each language is selected.
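As a rough illustration of where the separate models end up, the sketch below prints one directory per treebank. It assumes that each model is stored in a subdirectory of the output directory named after its treebank id; that naming is an assumption made here for illustration, and model_dirs is a hypothetical helper, not part of uuparser.

```shell
# Hypothetical helper (not part of uuparser): print the assumed
# per-treebank model directory for each id in a space-separated list.
# Assumption: subdirectories are named after the treebank id.
model_dirs() {
    outdir=$1
    treebanks=$2
    for tb in $treebanks; do
        printf '%s/%s\n' "$outdir" "$tb"
    done
}

model_dirs "$HOME/experiments" "sv_talbanken en_partut ru_syntagrus"
```

If the layout on your installation differs, listing the output directory after training shows the actual structure.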

Predicting with a pre-trained parsing model

To predict on UD test data with the models trained above:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --predict

Options

The parser has numerous options that let you fine-tune its behaviour. For a full list, type:

uuparser --help | less

We recommend setting the --dynet-mem option to a larger value when running the full training procedure on larger treebanks. Commonly used values are 5000 and 10000 (in MB). DyNet is the neural network library on which the parser is built.

Because of random initialization and other non-deterministic elements in the training process, two training runs will generally not produce the same results, even under exactly the same settings (e.g. languages, number of epochs). To obtain identical results between two runs, we recommend setting the --dynet-seed option to the same value both times (e.g. --dynet-seed 123456789) and adding the --use-default-seed flag. This ensures that Python's random number generator and DyNet both produce the same sequence of random numbers.
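The seeding advice can be sketched as a small script. The run wrapper and the SEED variable are illustrative names introduced here, not part of uuparser; the flags are the ones described above. By default the wrapper only prints the command, so the script can be checked on a machine where the module is not loaded.

```shell
# Dry-run wrapper (illustrative, not part of uuparser): prints the command
# unless DRY_RUN=0 is set, in which case the command is executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Any fixed value works; reuse it for every run you want to compare.
SEED=123456789

run uuparser --include "sv_talbanken en_partut ru_syntagrus" \
    --outdir "$HOME/experiments" \
    --dynet-seed "$SEED" --use-default-seed
```

Setting DRY_RUN=0 in the environment makes the same script start the actual training. Note that the printed form drops the quotes around the treebank list, so it is meant for inspection and not for copy and paste.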

Training a multitreebank parsing model

Our parser supports multitreebank parsing, that is, a single parsing model for one or more treebanks, which may come from the same language or from different languages. To train a multitreebank model for the three languages in the examples above, we simply add the --multiling flag when training (note, though, that these models have so far been found to work best for groups of related languages):

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --multiling --dynet-mem 5000

In this case, instead of creating three separate models in the language-specific subdirectories within ~/experiments, a single model will be created directly in this folder. Predicting on test data is then as easy as:

uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --multiling --dynet-mem 5000 --predict

Note that if you want to have different output directories for training and predicting, the --modeldir option can be specified when predicting to tell the parser where the pre-trained model can be found.
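A minimal sketch of predicting with separate directories follows. The two paths and the predict_cmd helper are hypothetical and chosen only for illustration; the flags are those mentioned above. The helper prints the command so it can be inspected before running it.

```shell
# Hypothetical paths: where the multitreebank model was trained and
# where the predictions should be written.
MODEL_DIR="$HOME/experiments"
PRED_DIR="$HOME/predictions"

# Illustrative helper (not part of uuparser): print a prediction command
# that reads the model from MODEL_DIR and writes output to PRED_DIR.
predict_cmd() {
    # $1: space-separated treebank ids
    printf 'uuparser --include "%s" --modeldir %s --outdir %s --multiling --predict\n' \
        "$1" "$MODEL_DIR" "$PRED_DIR"
}

predict_cmd "sv_talbanken en_partut ru_syntagrus"
```

Run the printed command on the cluster after loading the uuparser module.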

Segmentation

In the above examples, we assume pre-segmented input data already in the CoNLL-U format. If your input is raw text, we recommend segmenting it with UDPipe first. The UDPipe module can be loaded with module load nlpl-udpipe and then run by typing udpipe at the command line (see below).