  
 
= Background =

An experimentation environment for data-driven dependency parsing is maintained for NLPL under the coordination of Uppsala University (UU).
The data is available on the Norwegian Saga cluster and on the Finnish Puhti cluster; the software is available on the Norwegian Saga cluster.
The Uppsala Parser is publicly available at https://github.com/UppsalaNLP/uuparser.
Note that the version installed here may exhibit some slight differences, designed to improve ease of use.
 
  
Initially, software and data were commissioned on the Norwegian Abel supercluster; see [http://wiki.nlpl.eu/index.php/Parsing/abel the Abel page] for legacy information.

= Preprocessing Tools =

* [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe]

Additionally, a variety of tools for sentence splitting, tokenization, lemmatization, and so on are available through the NLPL installations of the [http://nltk.org Natural Language Toolkit (NLTK)] and [https://spacy.io spaCy] tools.

= Parsing Systems =

* [http://wiki.nlpl.eu/index.php/Parsing/uuparser The Uppsala Parser]
* [http://wiki.nlpl.eu/index.php/Parsing/udpipe UDPipe]
* [http://wiki.nlpl.eu/index.php/Parsing/turboparser TurboParser]

Additionally, parsers are available in several toolkits installed by NLPL: [http://wiki.nlpl.eu/index.php/Parsing/stanfordnlp StanfordNLP], [https://www.nltk.org/ NLTK], and [https://spacy.io/ spaCy].

= Training and Evaluation Data =

* [http://wiki.nlpl.eu/index.php/Parsing/ud Universal Dependencies v2.0–2.5]
* [http://wiki.nlpl.eu/index.php/Parsing/sdp Semantic Dependency Parsing]

For parsing experiments we use data from the [http://universaldependencies.org/ Universal Dependencies (UD) project], which covers a large number of languages. We provide the data in version 2.0, which was also used for the CoNLL 2017 shared task, and in version 2.1.

All data is available on Abel at <code>/projects/nlpl/data/parsing/universal_dependencies</code>.

== UD version 2.0 ==

Folders:<br>
<code>/projects/nlpl/data/parsing/universal_dependencies/ud-treebanks-v2.0-conll2017</code><br>
<code>/projects/nlpl/data/parsing/universal_dependencies/ud-test-v2.0-conll2017</code>

Info:<br>
Version 2.0 treebanks, archived at http://hdl.handle.net/11234/1-1983:<br>
70 treebanks, 50 languages, released March 1, 2017.<br>
Version 2.0 test data, archived at http://hdl.handle.net/11234/1-2184:<br>
81 treebanks, 49 languages, released May 18, 2017.

In release 2.0 the test data was released separately from the training and development data, which is reflected in our folder structure. This data was released for the CoNLL 2017 shared task.

== UD version 2.1 ==

Folders:<br>
<code>/projects/nlpl/data/parsing/universal_dependencies/ud-treebanks-v2.1</code>

Info:<br>
Version 2.1 treebanks are available at http://hdl.handle.net/11234/1-2515.<br>
102 treebanks, 60 languages, released November 15, 2017.
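
As a quick orientation, the shell commands below list the available treebanks and peek at a training file. The <code>UD_Swedish</code> subdirectory and file name are examples of the v2.1 naming scheme; substitute the treebank you need:

 # List the treebank directories shipped with UD v2.1.
 ls /projects/nlpl/data/parsing/universal_dependencies/ud-treebanks-v2.1/
 # Inspect the first lines of one training file (CoNLL-U format).
 head -n 15 /projects/nlpl/data/parsing/universal_dependencies/ud-treebanks-v2.1/UD_Swedish/sv-ud-train.conllu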
 
= Using the Uppsala Parser =
 
 
 
* Log into Abel
* Activate the NLPL module repository:
 module use -a /projects/nlpl/software/modulefiles/
* Load the most recent version of the uuparser module:
 module load uuparser
 
 
 
== Training a parsing model ==
 
 
 
To train a set of parsing models on treebanks from Universal Dependencies (v2.1):

 uuparser --include [languages to include, denoted by their ISO codes] --outdir my-output-directory
 
 
 
If you want to quickly test that the parser is correctly loaded and running, without waiting for the full training procedure, add the <code>--mini</code> flag.
For example:

 uuparser --include "sv en ru" --outdir ~/experiments --mini

will train separate models for UD Swedish, English, and Russian and store the results in the <code>experiments</code> folder in your home directory. The <code>--mini</code> flag tells the parser to train on just the first 150 sentences of each language and to evaluate on the first 100 sentences of development data. It also tells the parser to train for just 3 epochs, as opposed to the default 30 (see more below under "Options").
 
 
 
Model selection is included in the training process by default; that is, at each epoch the current model is evaluated on the UD dev data, and at the end of training the best performing model for each language is selected.
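
For full-scale training it is usually better to submit a batch job than to run the parser interactively on a login node. Below is a minimal sketch of a SLURM job script; the account name, memory, and time limits are placeholders that you will need to adapt to your own allocation:

 #!/bin/bash
 #SBATCH --job-name=uuparser-train
 #SBATCH --account=YOUR_PROJECT   # placeholder: your project allocation
 #SBATCH --time=24:00:00          # adjust to treebank size
 #SBATCH --mem-per-cpu=8G
 # Make the NLPL modules visible and load the parser.
 module use -a /projects/nlpl/software/modulefiles/
 module load uuparser
 # Train models for Swedish, English, and Russian with extra Dynet memory.
 uuparser --include "sv en ru" --outdir ~/experiments --dynet-mem 5000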
 
 
 
== Predicting with a pre-trained parsing model ==
 
 
 
To predict on UD test data with the models trained above:

 uuparser --include "sv en ru" --outdir ~/experiments --predict
 
 
 
You may again include the <code>--mini</code> flag if you prefer to test on a subset of 50 test sentences.
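
As a quick sketch, a mini prediction run followed by a look at the output directory might be (the exact names of the files the parser writes depend on your languages and settings):

 uuparser --include "sv" --outdir ~/experiments --predict --mini
 # Predictions are written in CoNLL-U format under the output directory.
 ls ~/experiments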
 
 
 
== Options ==
 
 
 
The parser has numerous options that allow you to fine-tune its behaviour. For a full list, type:

 uuparser --help | less
 
 
 
We recommend you set the <code>--dynet-mem</code> option to a larger number when running the full training procedure on larger treebanks. Commonly used values are 5000 and 10000 (in MB). Dynet is the neural network library on which the parser is built.
 
 
 
Note that due to random initialization and other non-deterministic elements in the training process, you will not obtain the same results even when training twice under exactly the same circumstances (e.g. languages, number of epochs etc.). To ensure identical results between two runs, we recommend setting the <code>--dynet-seed</code> option to the same value both times (e.g. <code> --dynet-seed 123456789</code>) and adding the <code>--use-default-seed</code> flag. This ensures that Python's random number generator and Dynet both produce the same sequence of random numbers.
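
For example, two runs along these lines should produce identical models (the seed value and output directories are arbitrary examples):

 # First run, with both random number generators pinned down.
 uuparser --include "sv" --outdir ~/exp-run1 --dynet-seed 123456789 --use-default-seed
 # Second run with the same seed: the results should match run 1 exactly.
 uuparser --include "sv" --outdir ~/exp-run2 --dynet-seed 123456789 --use-default-seed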
 
 
 
== Training a multilingual parsing model ==
 
 
 
Our parser supports multilingual parsing, that is, a single parsing model for one or more languages. To train a ''multilingual'' model for the three languages in the examples above, we simply add the <code>--multiling</code> flag when training:

 uuparser --include "sv en ru" --outdir ~/experiments --multiling --dynet-mem 5000
 
 
 
In this case, instead of creating three separate models in the language-specific subdirectories within <code>~/experiments</code>, a single model will be created directly in this folder. Predicting on test data is then as easy as:

 uuparser --include "sv en ru" --outdir ~/experiments --multiling --dynet-mem 5000 --predict
 
 
 
Note that if you want to have different output directories for training and predicting, the <code>--modeldir</code> option can be specified when predicting to tell the parser where the pre-trained model can be found.
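
A minimal sketch of this split setup (the directory names are arbitrary examples):

 # Train a multilingual model into one directory...
 uuparser --include "sv en ru" --outdir ~/models --multiling
 # ...then write predictions elsewhere, pointing the parser at the trained model.
 uuparser --include "sv en ru" --outdir ~/predictions --modeldir ~/models --multiling --predict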
 
 
 
== Segmentation ==
 
 
 
In the above examples, we assume pre-segmented input data already in the [http://universaldependencies.org/format.html CoNLL-U] format. If your input is raw text, we recommend using UDPipe to segment it first. The UDPipe module can be loaded using <code>module load nlpl-udpipe</code> and then run by typing <code>udpipe</code> at the command line; see below.
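
As a sketch, tokenizing and sentence-splitting raw text into CoNLL-U with UDPipe looks roughly like this; the model file name is only an example, so substitute the model for your language:

 module load nlpl-udpipe
 # Tokenize and sentence-split raw text; CoNLL-U is written to stdout.
 udpipe --tokenize english-ud-2.0-170801.udpipe raw.txt > segmented.conllu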
 
 
 
= Using UDPipe =
 
 
 
UDPipe is available as a module on Abel. It was installed as part of the OPUS activity.
 
 
 
 
 
How to use UDPipe:
* Log into Abel
* Activate the NLPL module repository:
 module use -a /projects/nlpl/software/modulefiles/
* Load the most recent version of the udpipe module:
 module load nlpl-udpipe
 
 
 
To learn more about using UDPipe, check the official [http://ufal.mff.cuni.cz/udpipe/users-manual UDPipe User's Manual].
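
For example, the full pipeline (tokenization, tagging, and parsing) over raw text can be run in one call; the model file name is again just an example:

 udpipe --tokenize --tag --parse english-ud-2.0-170801.udpipe raw.txt > parsed.conllu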
 
 
 
= Contact =
 
Aaron Smith, Uppsala University, firstname.lastname@lingfil.uu.se<br>
 
Sara Stymne, Uppsala University, firstname.lastname@lingfil.uu.se<br>
 
