Latest revision as of 20:41, 14 January 2020
The Uppsala Parser
The Uppsala Parser is a neural transition-based dependency parser based on bist-parser by Eli Kiperwasser and Yoav Goldberg and developed primarily in the context of the CoNLL shared tasks on universal dependency parsing in 2017 and 2018. The Uppsala Parser is publicly available at https://github.com/UppsalaNLP/uuparser. Note that the version installed here may exhibit some slight differences, designed to improve ease of use.
For the full documentation, see the GitHub page.
Using the Uppsala Parser on Saga
- Log into Saga
- Activate the NLPL module repository:
module use -a /cluster/shared/nlpl/software/modules/etc
- Load the most recent version of the uuparser module:
module load nlpl-uuparser/2.3.1
Training a parsing model
To train a set of parsing models on treebanks from Universal Dependencies (v2.2 or later):
uuparser --include [languages to include denoted by their treebank id] --outdir my-output-directory --datadir ud-treebank-dir
For example:
uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5
will train separate models on UD Swedish-Talbanken, UD English-ParTUT and UD Russian-SynTagRus and store the results in the experiments folder in your home directory. Model selection is included in the training process by default; that is, at each epoch the current model is evaluated on the UD dev data, and at the end of training the best performing model for each language is selected.
Predicting with a pre-trained parsing model
To predict on UD test data with the models trained above:
uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5
To predict on other texts, prepare the data into the CoNLLU format (see below) and run the following command, assuming the model trained above, for the file TESTFILE:
uuparser --testfile TESTFILE --outdir ~/experiments --modeldir ~/experiments/sv_talbanken --predict
Options
The parser has numerous options to allow you to fine-control its behaviour. For a full list, type
uuparser --help | less
We recommend you set the --dynet-mem option to a larger number when running the full training procedure on larger treebanks. Commonly used values are 5000 and 10000 (in MB). Dynet is the neural network library on which the parser is built.
Note that due to random initialization and other non-deterministic elements in the training process, you will not obtain the same results even when training twice under exactly the same circumstances (e.g. languages, number of epochs etc.). To ensure identical results between two runs, we recommend setting the --dynet-seed option to the same value both times (e.g. --dynet-seed 123456789) and adding the --use-default-seed flag. This ensures that Python's random number generator and Dynet both produce the same sequence of random numbers.
Training a multitreebank parsing model
Our parser supports multitreebank parsing, that is, a single parsing model for one or more treebanks, which may come from the same language or from different languages. To train a multitreebank model for the three languages in the examples above, we simply add the --multiling flag when training (note, though, that these models have so far been found to work best for groups of related languages):
uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --multiling --dynet-mem 5000 --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5
In this case, instead of creating three separate models in the language-specific subdirectories within ~/experiments, a single model will be created directly in this folder. Predicting on UD test data is then:
uuparser --include "sv_talbanken en_partut ru_syntagrus" --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5
Note that if you want to have different output directories for training and predicting, the --modeldir option can be specified when predicting to tell the parser where the pre-trained model can be found.
If you want to use the parser on a language or treebank that it was not trained on, this can be done in two ways:
uuparser --include "sv_pud:sv_talbanken" --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5
uuparser --include "sv_pud" --forced-tbank-emb sv_talbanken --outdir ~/experiments --multiling --dynet-mem 5000 --predict --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5
In both cases, sv_pud will be parsed using sv_talbanken as a proxy (i.e. parsing sv_pud as if it were sv_talbanken). With the first option, multiple treebank:proxy pairs can be given, separated by spaces.
To predict on non-UD texts, the command is similar to the single-treebank case, the main differences being the use of the --multiling flag and the specification of a proxy treebank:
uuparser --testfile TESTFILE --outdir ~/experiments --modeldir MODELDIR --predict --multiling --forced-tbank-emb sv_talbanken
This command runs the parser on TESTFILE (in CoNLL-U format), using the model found in MODELDIR, and writes the resulting parse to a file in ~/experiments. In this example, sv_talbanken is used as the proxy treebank.
Pre-trained models
There are three multilingual pre-trained models available, covering English and most of the Nordic languages. The models can be found in /cluster/shared/nlpl/software/modules/uuparser/2.3.1/models (referred to below as $MODEL_DIR).
The English model is found in the subdirectory $MODEL_DIR/en. It is trained on the four English UD treebanks: en_gum en_partut en_ewt en_lines. For English web texts it is recommended to use en_ewt as a proxy. For more formal texts, en_partut typically works well.
The Scandinavian model is found in the subdirectory $MODEL_DIR/scandinavian. It is trained on six treebanks in Danish, Norwegian and Swedish: sv_talbanken sv_lines no_bokmaal no_nynorsk no_nynorsklia da_ddt. For Swedish, using sv_talbanken typically gives the best results, while sv_lines might be better for fiction. For Norwegian, use no_bokmaal for Bokmål, no_nynorsk for general Nynorsk and no_nynorsklia for spoken Nynorsk. We have also had reasonable results when parsing Faroese with no_nynorsk as proxy.
The Uralic model is found in the subdirectory $MODEL_DIR/uralic. It is trained on five treebanks in Estonian, Finnish and North Sámi: fi_ftb fi_tdt et_edt et_ewt sme_giella. For Finnish we recommend using fi_tdt as a proxy. For Estonian, et_edt is probably most useful for general texts, and et_ewt might be good for web texts.
To run with any of these models, use the same commands as above, here exemplified with the Scandinavian model and no_nynorsk as the proxy treebank:
uuparser --testfile TESTFILE --outdir ~/experiments --modeldir $MODEL_DIR/scandinavian --predict --multiling --forced-tbank-emb no_nynorsk
POS-tags
The uuparser gives good results without POS-tags, mainly thanks to its use of character embeddings, and parsing without POS-tags is the default setting. The parser can also use predicted POS-tags; if you wish to do this, we recommend using UDPipe to predict the tags. To activate a POS-tag embedding, use the flag --pos-emb-size N, where N is the size of the embedding (12 has proven a useful value).
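For instance, a training run that activates a POS-tag embedding of size 12 might look as follows (a sketch combining the flags described above with the earlier training example):

```shell
# Train on UD Swedish-Talbanken with a POS-tag embedding of size 12
uuparser --include "sv_talbanken" --outdir ~/experiments \
    --datadir /cluster/shared/nlpl/data/parsing/ud/ud-treebanks-v2.5 \
    --pos-emb-size 12
```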
Segmentation
In the above examples, we assume pre-segmented input data already in the CONLL-U format. If your input is raw text, we recommend using UDPipe to segment and convert the format first. The UDPipe module can be loaded using module load nlpl-udpipe and then run by typing udpipe at the command line; see UDPipe.
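As a sketch, raw text can be tokenized (and optionally tagged) into CoNLL-U with a UDPipe 1 model before parsing; the model file name below is a placeholder, not a path that exists on the cluster:

```shell
# Tokenize and tag raw text into CoNLL-U with UDPipe
# (swedish.udpipe is a hypothetical model file; substitute a real UDPipe model)
udpipe --tokenize --tag swedish.udpipe raw.txt > input.conllu
```

The resulting input.conllu can then be passed to uuparser via --testfile as in the prediction examples above.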