Parsing/home

From Nordic Language Processing Laboratory

Background

An experimentation environment for data-driven dependency parsing is maintained for NLPL under the coordination of Uppsala University (UU). Initially, the software and data are commissioned on the Norwegian Abel supercluster. The Uppsala Parser is publicly available at https://github.com/UppsalaNLP/uuparser. Note that the version installed here may exhibit some slight differences, designed to improve ease of use.

Using the Uppsala Parser

  • Log into Abel.
  • Activate the NLPL module repository:
      module use -a /projects/nlpl/software/modulefiles/
  • Load the most recent version of the uuparser module:
      module load uuparser
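
Taken together, a typical first session on Abel might look like the following sketch; the module path is the one given above, and the final command simply prints the parser's help text (see "Options" below) to confirm that the module is loaded correctly.

  # activate the NLPL module repository (once per shell session)
  module use -a /projects/nlpl/software/modulefiles/

  # load the most recent version of the parser
  module load uuparser

  # sanity check: print the parser's command-line help
  uuparser --help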

Training a parsing model

To train a set of parsing models on treebanks from Universal Dependencies (v2.1):

uuparser --include [languages to include, denoted by their ISO codes] --outdir my-output-directory

If you want to quickly test that the parser is correctly loaded and running, without waiting for the full training procedure, add the --mini flag. For example:

uuparser --include "sv en ru" --outdir ~/experiments --mini

will train separate models for UD Swedish, English, and Russian and store the results in the experiments folder in your home directory. The --mini flag tells the parser to train on just the first 150 sentences of each language and to evaluate on the first 100 sentences of development data. It also tells the parser to train for just 3 epochs instead of the default 30 (see "Options" below).
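
Once the smoke test succeeds, running the same command without --mini starts the full training procedure (30 epochs by default); this is simply the generic command above instantiated for the three example languages:

  # full training run for Swedish, English and Russian
  uuparser --include "sv en ru" --outdir ~/experiments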

Model selection is included in the training process by default; that is, at each epoch the current model is evaluated on the UD dev data, and at the end of training the best performing model for each language is selected.

Predicting with a pre-trained parsing model

To predict on UD test data with the models trained above:

uuparser --include "sv en ru" --outdir ~/experiments --predict

You may again include the --mini flag if you prefer to test on a subset of 50 test sentences.
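
Putting the two steps together, a quick end-to-end smoke test (training and then predicting on the reduced --mini data) might look like this sketch:

  # train small models: first 150 sentences per language, 3 epochs
  uuparser --include "sv en ru" --outdir ~/experiments --mini

  # predict with those models on the first 50 test sentences per language
  uuparser --include "sv en ru" --outdir ~/experiments --predict --mini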

Options

The parser has numerous options that allow fine-grained control of its behaviour. For a full list, type:

uuparser --help | less

We recommend setting the --dynet-mem option to a larger value when training on larger treebanks; commonly used values are 5000 and 10000 (in MB). DyNet is the neural network library on which the parser is built.
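
For example, a training run that reserves 10 GB of memory for DyNet (the value is illustrative; adjust it to your treebanks and to the memory available on your node) might look like:

  uuparser --include "sv en ru" --outdir ~/experiments --dynet-mem 10000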

Note that due to random initialization and other non-deterministic elements in the training process, you will not obtain the same results even when training twice under exactly the same circumstances (same languages, same number of epochs, and so on). To ensure identical results between two runs, we recommend setting the --dynet-seed option to the same value both times (e.g. --dynet-seed 123456789) and adding the --use-default-seed flag. This ensures that Python's random number generator and DyNet both produce the same sequence of random numbers.
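
As a sketch, two runs like the following should then produce identical results (the seed value and the output directory names are illustrative; any seed works as long as it is identical in both runs):

  # first run
  uuparser --include "sv en ru" --outdir ~/run1 --dynet-seed 123456789 --use-default-seed

  # second run with the same seed; results should match the first run
  uuparser --include "sv en ru" --outdir ~/run2 --dynet-seed 123456789 --use-default-seed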

Training a multilingual parsing model

Our parser supports multilingual parsing, that is, a single parsing model for one or more languages. To train a multilingual model for the three languages in the examples above, we simply add the --multiling flag when training:

uuparser --include "sv en ru" --outdir ~/experiments --multiling --dynet-mem 5000

In this case, instead of creating three separate models in language-specific subdirectories within ~/experiments, a single model is created directly in that folder. Predicting on test data is then as easy as:

uuparser --include "sv en ru" --outdir ~/experiments --multiling --dynet-mem 5000 --predict

Note that if you want to have different output directories for training and predicting, the --modeldir option can be specified when predicting to tell the parser where the pre-trained model can be found.
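
As a sketch of that setup (assuming --modeldir takes the directory containing the pre-trained model; check uuparser --help for the exact semantics), you might train into one directory and write predictions into another:

  # train a multilingual model into ~/models
  uuparser --include "sv en ru" --outdir ~/models --multiling --dynet-mem 5000

  # predict into a separate directory, pointing the parser at the trained model
  uuparser --include "sv en ru" --outdir ~/predictions --multiling --predict --modeldir ~/models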

Contact: Aaron Smith, Uppsala University, firstname.lastname@lingfil.uu.se