Parsing/turboparser

From Nordic Language Processing Laboratory

TurboParser

TurboParser is a fast and accurate pre-neural dependency parser based on linear programming relaxations. The package also contains a POS tagger, a semantic role labeler, an entity tagger, a coreference resolver, and a constituency (phrase-structure) parser. For full documentation, see http://www.cs.cmu.edu/~ark/TurboParser/ and https://github.com/andre-martins/TurboParser. This page only describes how to use the tagger and the parser.

Using TurboParser on Saga

  • Log into Saga
  • Activate the NLPL module repository:
module use -a /cluster/shared/nlpl/software/modules/etc
  • Load the most recent version of the turboparser module (a combined sketch of these steps follows below):
module load nlpl-turboparser/2.3.0
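
Taken together, a minimal setup sketch looks like this (it assumes the module places the TurboParser and TurboTagger binaries on your PATH; the which call is just a sanity check):

# make the NLPL module repository visible and load TurboParser
module use -a /cluster/shared/nlpl/software/modules/etc
module load nlpl-turboparser/2.3.0

# verify that the binaries are now on the PATH
which TurboParser TurboTagger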


Data formats and conversion

TurboParser takes CoNLL-X files as input. Most current data, including the Universal Dependencies treebanks available on Saga, is in CoNLL-U format. If you want to use such data, you first need to convert it to CoNLL-X with the provided script (the script comes from the Universal Dependencies tools by Dan Zeman and is included in the TurboParser module):

conllu_to_conllx.pl < INPUT_FILE.conllu > OUTPUT_FILE.conll

TurboTagger accepts data in its own native format: one word per line, with two columns containing the word and its tag. There is a script for converting from CoNLL-X to this format:

create_tagging_corpus.sh INPUT_FILE.conll

which creates the file "INPUT_FILE.conll.tagging".
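
As a rough end-to-end sketch, suppose you start from a hypothetical UD treebank file en_ewt-ud-train.conllu (the file name is only an example, not a file shipped with the module); the two scripts above are then chained like this:

# CoNLL-U -> CoNLL-X for TurboParser
conllu_to_conllx.pl < en_ewt-ud-train.conllu > en_ewt-ud-train.conll

# CoNLL-X -> two-column tagging format for TurboTagger
create_tagging_corpus.sh en_ewt-ud-train.conll
# this writes en_ewt-ud-train.conll.tagging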


Using the parser

To train a parsing model on a treebank:

TurboParser --train --file_train=TRAINING_DATA.conll --file_model=MODEL --logtostderr


To predict using the model trained above:

TurboParser --test --evaluate --file_model=MODEL --file_test=TEST_INPUT_FILE --file_prediction=RESULT_FILE --logtostderr 

This command writes the parsed data to RESULT_FILE and prints the parsing accuracy (since the --evaluate flag is given).
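
Training can take a while on a full treebank, so you will typically run it as a batch job rather than on a login node. Below is a minimal sketch of a Slurm script that trains and then evaluates a model; the account, time limit, memory, and file names are placeholders you need to adapt, not values prescribed by the module:

#!/bin/bash
#SBATCH --job-name=turboparser-train
#SBATCH --account=nnXXXXk        # replace with your own Saga project account
#SBATCH --time=04:00:00          # adjust to the size of your treebank
#SBATCH --mem-per-cpu=8G

# load TurboParser as described above
module use -a /cluster/shared/nlpl/software/modules/etc
module load nlpl-turboparser/2.3.0

# train a parsing model on CoNLL-X training data
TurboParser --train --file_train=train.conll --file_model=english.model --logtostderr

# parse the held-out data with the trained model and report accuracy
TurboParser --test --evaluate --file_model=english.model --file_test=test.conll --file_prediction=test-parsed.conll --logtostderr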

The parser has numerous options for fine-grained control of its behaviour. For a full list, type

TurboParser --help 


Using the tagger

To train a tagging model on a treebank:

TurboTagger --train --file_train=TRAINING_DATA.conll.tagging --file_model=MODEL --form_cutoff=1 --logtostderr

To predict using the model trained above:

TurboTagger --test --evaluate --file_model=MODEL --file_test=TEST_INPUT_FILE --file_prediction=RESULT_FILE --logtostderr 

This command writes the tagged data to RESULT_FILE and prints the tagging accuracy (since the --evaluate flag is given).

The tagger has numerous options for fine-grained control of its behaviour. For a full list, type

TurboTagger --help 
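
Putting the pieces together, a complete tagging run might look like the following sketch; the file names are placeholders and continue the hypothetical example from the conversion section above:

# train a tagging model on the two-column .tagging file
TurboTagger --train --file_train=en_ewt-ud-train.conll.tagging --file_model=english-tagger.model --form_cutoff=1 --logtostderr

# tag the (converted) test data and print the accuracy
TurboTagger --test --evaluate --file_model=english-tagger.model --file_test=en_ewt-ud-test.conll.tagging --file_prediction=en_ewt-ud-test.tagged --logtostderr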

Segmentation

In the examples above, we assume pre-segmented input data already in the appropriate format (see above). If your input is raw text, we recommend segmenting it with UDPipe first. The UDPipe module can be loaded with module load nlpl-udpipe and run by typing udpipe on the command line; see UDPipe for details.
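
A rough sketch of that pipeline, assuming you have downloaded a UDPipe model file (english-ud.udpipe and the other file names here are only examples, not files shipped with the module):

module load nlpl-udpipe

# tokenize and sentence-segment raw text into CoNLL-U with a pre-trained UDPipe model
udpipe --tokenize english-ud.udpipe raw_text.txt > segmented.conllu

# convert to CoNLL-X so the TurboParser tools can read it
conllu_to_conllx.pl < segmented.conllu > segmented.conll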