Background

When parsing naturally occurring text, it is often necessary to preprocess input documents to varying degrees, depending on what kinds of input a specific parsing system supports. Sentence splitting and tokenization, for example, are traditionally viewed as mandatory preprocessing steps prior to parsing. While fully integrated ‘document parsing’ systems have become increasingly common in recent years, some older parsing systems further expect their pre-segmented input to be tagged with parts of speech and possibly even lemmatized.

This NLPL module bundles a selection of preprocessing tools that the UiO team have repeatedly found to provide a good balance of accuracy and simplicity over a relatively broad range of input document types.

Tokenization

The Regular Expression–Based PreProcessor (REPP) implements a cascade of string-level rewrite rules for input normalization and segmentation. REPP defines a simple declarative specification language and supports modularization and parameterization of collections of rewrite rules, as well as limited grouping of rules and iteration. Furthermore, REPP keeps track of character offsets into the original document for its output segments, including across one-to-many substitutions, insertions, and deletions.
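The idea of a rewrite cascade that preserves character offsets can be sketched in a few lines of Python. This is only an illustration of the general technique, not REPP's implementation or its rule syntax: the three rules, the template format (a list of literal strings and capture-group numbers), and the function names `apply_rule` and `tokenize` are all hypothetical. Each character of the working string carries the span of the original document it came from, so the final tokens can be mapped back even after insertions and deletions.

```python
import re

# Hypothetical rewrite rules, for illustration only (not the ERG/PTB rule set).
# Each template item is either a literal string or a capture-group number.
RULES = [
    (re.compile(r"([,.!?])"), [" ", 1, " "]),  # isolate punctuation marks
    (re.compile(r"(n't)\b"), [" ", 1]),        # split off the negation clitic
    (re.compile(r"\s+"), [" "]),               # collapse runs of whitespace
]


def apply_rule(text, spans, pattern, template):
    """Apply one rule; spans[i] is the original (start, end) of text[i]."""
    out_text, out_spans, pos = [], [], 0
    # Assumption: patterns never match the empty string.
    for m in pattern.finditer(text):
        # Copy the unmatched stretch together with its spans.
        out_text.append(text[pos:m.start()])
        out_spans.extend(spans[pos:m.start()])
        # Original span covered by the whole match.
        whole = (spans[m.start()][0], spans[m.end() - 1][1])
        for item in template:
            if isinstance(item, int):
                # Copied group: its characters keep their own spans.
                out_text.append(m.group(item))
                out_spans.extend(spans[m.start(item):m.end(item)])
            else:
                # Inserted literal: attribute it to the whole match.
                out_text.append(item)
                out_spans.extend([whole] * len(item))
        pos = m.end()
    out_text.append(text[pos:])
    out_spans.extend(spans[pos:])
    return "".join(out_text), out_spans


def tokenize(document):
    """Run the cascade and return (form, start, end) triples."""
    text = document
    spans = [(i, i + 1) for i in range(len(document))]
    for pattern, template in RULES:
        text, spans = apply_rule(text, spans, pattern, template)
    return [
        (m.group(), spans[m.start()][0], spans[m.end() - 1][1])
        for m in re.finditer(r"\S+", text)
    ]


if __name__ == "__main__":
    for form, start, end in tokenize("Don't panic, it isn't  hard."):
        print(form, start, end)
```

In this sketch every token is built only from copied characters, so slicing the original document with the reported offsets recovers the token form exactly; a rule that rewrites characters (for example, quote normalization) would still report the span of the text it replaced.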

Dridan & Oepen (2012; ACL) evaluate the performance of the English REPP tokenizer (developed by Stephan Oepen as part of the English Resource Grammar, ERG) against a handful of widely used tokenizers and observe that REPP delivers state-of-the-art performance in terms of the segmentation conventions defined by the venerable Penn Treebank (PTB).

module load nlpl-repp
cat /etc/bashrc | repp -c $LAPROOT/repp/erg/ptb.set
