TurboParser
TurboParser is a fast and accurate pre-neural dependency parser based on linear programming. The package also contains a POS tagger, a semantic role labeler, an entity tagger, a coreference resolver, and a constituent (phrase-based) parser. For full documentation, see http://www.cs.cmu.edu/~ark/TurboParser/ and https://github.com/andre-martins/TurboParser. This document describes only how to use the tagger and the parser.
Using TurboParser on Saga
- Log into Saga
- Activate the NLPL module repository:
module use -a /cluster/shared/nlpl/software/modules/etc
- Load the most recent version of the TurboParser module:
module load nlpl-turboparser/2.3.0
Data formats and conversion
TurboParser takes CoNLL-X files as input. Most current data, including the Universal Dependencies data available on Saga, is in CoNLL-U format. To use such data, first convert it to CoNLL-X with the provided script (it comes from the Universal Dependencies tools by Dan Zeman and is included in the TurboParser module):
conllu_to_conllx.pl < INPUT_FILE.conllu > OUTPUT_FILE.conll
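After conversion, it can be useful to check that a file no longer contains constructs that exist only in CoNLL-U. The sketch below is not part of TurboParser or the conversion script; the function name count_conllu_only_lines is made up for illustration. It counts comment lines (starting with #), multiword-token ranges (IDs such as 1-2) and empty nodes (IDs such as 1.1), none of which are allowed in CoNLL-X.

```shell
# Hypothetical helper, not shipped with TurboParser: count lines that are
# valid in CoNLL-U but have no counterpart in CoNLL-X.
# Usage: count_conllu_only_lines [FILE...]   (reads stdin if no file is given)
count_conllu_only_lines() {
    awk -F'\t' '
        /^#/                    { n++; next }   # sentence-level comment
        $1 ~ /^[0-9]+-[0-9]+$/  { n++; next }   # multiword token range, e.g. 1-2
        $1 ~ /^[0-9]+\.[0-9]+$/ { n++; next }   # empty node, e.g. 2.1
        END { print n + 0 }                     # print 0 when nothing matched
    ' "$@"
}
```

Running count_conllu_only_lines OUTPUT_FILE.conll on a converted file and getting a non-zero count suggests the file is still in CoNLL-U rather than CoNLL-X.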
TurboTagger accepts data in its native format: one word per line, in two columns holding the word and its tag. A script is provided to convert CoNLL-X to this format:
create_tagging_corpus.sh INPUT_FILE.conll
which creates the file "INPUT_FILE.conll.tagging".
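To make the two-column format concrete, here is a minimal sketch of such a conversion. It is not the bundled create_tagging_corpus.sh, and the function name conllx_to_tagging is hypothetical. Two things are assumptions: that sentences are separated by blank lines, and which tag column is wanted. Since this page does not say which column the bundled script reads, the sketch takes it as a parameter (4 is CPOSTAG and 5 is POSTAG in CoNLL-X).

```shell
# Illustrative only: this is NOT the bundled create_tagging_corpus.sh.
# Usage: conllx_to_tagging TAG_COLUMN [FILE...]
# TAG_COLUMN is 4 (CPOSTAG) or 5 (POSTAG) in CoNLL-X.
conllx_to_tagging() {
    tag_col=$1
    shift
    awk -F'\t' -v col="$tag_col" '
        NF == 0 { print ""; next }   # keep blank lines as sentence breaks (assumption)
        { print $2 "\t" $col }       # FORM, then the chosen tag column
    ' "$@"
}
```

For example, conllx_to_tagging 5 INPUT_FILE.conll > my_copy.tagging writes the word form and the fine-grained tag of every token, one token per line.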
Using the parser
To train a parsing model on a treebank:
TurboParser --train --file_train=$res_dir/TRAINING_DATA.conll --file_model=MODEL --logtostderr
To predict using the model trained above:
TurboParser --test --evaluate --file_model=MODEL --file_test=TEST_INPUT_FILE --file_prediction=RESULT_FILE --logtostderr
This command writes the parsed data to RESULT_FILE and prints the accuracy (since the --evaluate flag is given).
The parser has numerous options that give you fine-grained control over its behaviour. For a full list, type
TurboParser --help
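If you want to recompute attachment scores yourself from a gold file and RESULT_FILE, the following sketch shows one way to do it. It is not part of TurboParser, and the function name attachment_scores is hypothetical. It assumes that both files are token-aligned CoNLL-X files with 10 tab-separated columns, where column 7 is HEAD and column 8 is DEPREL. Its numbers may differ from what --evaluate prints, for instance if the parser treats punctuation differently.

```shell
# Hypothetical helper: unlabeled and labeled attachment scores from two
# token-aligned CoNLL-X files (10 tab-separated columns each, an assumption).
# Usage: attachment_scores GOLD.conll PREDICTED.conll
attachment_scores() {
    paste "$1" "$2" | awk -F'\t' '
        NF < 20 { next }                 # blank sentence separators
        {
            total++
            if ($7 == $17) {             # HEAD columns agree
                uas++
                if ($8 == $18) las++     # DEPREL agrees as well
            }
        }
        END {
            if (total == 0) { print "no tokens"; exit 1 }
            printf "UAS %.2f LAS %.2f (%d tokens)\n", 100 * uas / total, 100 * las / total, total
        }
    '
}
```

Because the two files are pasted side by side, the gold columns are fields 1 to 10 and the predicted columns are fields 11 to 20.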
Using the tagger
To train a tagging model on a treebank:
TurboTagger --train --file_train=TRAINING_DATA.conll.tagging --file_model=MODEL --form_cutoff=1 --logtostderr
To predict using the model trained above:
TurboTagger --test --evaluate --file_model=MODEL --file_test=TEST_INPUT_FILE --file_prediction=RESULT_FILE --logtostderr
This command writes the tagged data to RESULT_FILE and prints the accuracy (since the --evaluate flag is given).
The tagger has numerous options that give you fine-grained control over its behaviour. For a full list, type
TurboTagger --help
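Tagging accuracy can be recomputed from a gold file and RESULT_FILE in a similar way. The sketch below is not part of TurboTagger, and the function name tagging_accuracy is hypothetical. It assumes that the prediction file uses the same two-column format as the input (word, then tag) and that both files are aligned line by line.

```shell
# Hypothetical helper: token-level tagging accuracy from two aligned
# two-column files (word TAB tag); the output format is an assumption.
# Usage: tagging_accuracy GOLD.tagging PREDICTED.tagging
tagging_accuracy() {
    paste "$1" "$2" | awk -F'\t' '
        NF < 4 { next }                          # blank sentence separators
        { total++; if ($2 == $4) correct++ }     # compare gold and predicted tags
        END {
            if (total == 0) { print "no tokens"; exit 1 }
            printf "accuracy %.2f (%d/%d)\n", 100 * correct / total, correct, total
        }
    '
}
```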
Segmentation
In the examples above, we assume pre-segmented input data in the appropriate format (see Data formats and conversion). If your input is raw text, we recommend segmenting it with UDPipe first. The UDPipe module can be loaded using
module load nlpl-udpipe
and then run by typing
udpipe
at the command line. For details, see UDPipe.