When parsing naturally occurring text, it is often necessary to pre-process input documents to varying degrees, depending on what kinds of input a specific parsing system supports. Sentence splitting and tokenization, for example, are traditionally viewed as mandatory preprocessing steps prior to parsing. While fully integrated ‘document parsing’ systems have become more and more common in recent years, some older parsing systems may further expect their pre-segmented input to be tagged with parts of speech and maybe even lemmatized.
This NLPL module bundles a selection of preprocessing tools that the UiO team have repeatedly found to provide a good balance of accuracy and simplicity over a relatively broad range of different types of input documents.
The Regular Expression–Based PreProcessor (REPP) implements a cascade of string-level rewrite rules for input normalization and segmentation. REPP defines a simple declarative specification language and supports modularization and parameterization of collections of rewrite rules, as well as limited grouping of rules and iteration. Furthermore, REPP keeps track of character offsets into the original document for its output segments, including across one-to-many substitutions, insertions, and deletions.
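To make the idea of an offset-preserving rewrite cascade concrete, the following is a minimal Python sketch. It is not REPP itself and does not use the REPP specification language: the rule set, the function names `apply_rule` and `tokenize`, and the span bookkeeping are all assumptions made for illustration. Each character of the working string carries the (start, end) span of the original text it derives from, so tokens can still be mapped back to the input after substitutions and insertions.

```python
import re

# Hypothetical rules for illustration only; NOT the ERG/PTB REPP rule set.
RULES = [
    (re.compile("[\u201c\u201d]"), '"'),       # normalize curly double quotes
    (re.compile(r'([,;:!?"()])'), r" \1 "),    # set punctuation off with spaces
    (re.compile(r"\.$"), " ."),                # split off a sentence-final period
    (re.compile(r"n't\b"), " n't"),            # PTB-style contraction split
]


def apply_rule(text, spans, pattern, template):
    """Apply one rewrite rule; spans[i] is the original (start, end) of text[i]."""
    out_text, out_spans, pos = [], [], 0
    for m in pattern.finditer(text):
        # Copy the unmatched stretch unchanged, together with its spans.
        out_text.append(text[pos:m.start()])
        out_spans.extend(spans[pos:m.start()])
        if m.end() > m.start():
            # Every replacement character maps to the whole matched region.
            region = (spans[m.start()][0], spans[m.end() - 1][1])
        else:
            # Pure insertion: anchor at the position where the match occurred.
            if m.start() < len(spans):
                anchor = spans[m.start()][0]
            else:
                anchor = spans[-1][1] if spans else 0
            region = (anchor, anchor)
        replacement = m.expand(template)
        out_text.append(replacement)
        out_spans.extend([region] * len(replacement))
        pos = m.end()
    out_text.append(text[pos:])
    out_spans.extend(spans[pos:])
    return "".join(out_text), out_spans


def tokenize(sentence):
    """Run the rule cascade, then split on whitespace, keeping original offsets."""
    text = sentence
    spans = [(i, i + 1) for i in range(len(sentence))]
    for pattern, template in RULES:
        text, spans = apply_rule(text, spans, pattern, template)
    return [
        (m.group(), spans[m.start()][0], spans[m.end() - 1][1])
        for m in re.finditer(r"\S+", text)
    ]


if __name__ == "__main__":
    for token, start, end in tokenize("Don't panic, please."):
        print(token, start, end)
```

For the input "Don't panic, please." this sketch yields the tokens Do, n't, panic, the comma, please, and the period, each paired with the character range it occupies in the unmodified input. Keeping one span per character is simple but memory-hungry; it is chosen here for clarity rather than efficiency.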
Dridan & Oepen (2012; ACL) evaluate the performance of the English REPP tokenizer (developed by Stephan Oepen as part of the English Resource Grammar, ERG) against a handful of widely used tokenizers for English and observe that REPP delivers state-of-the-art performance in terms of the segmentation conventions defined by the venerable Penn Treebank (PTB).
module load nlpl-repp
cat /etc/bashrc | repp -c $LAPROOT/repp/erg/ptb.set
The standard REPP tokenizers (for PTB- and NDT-style segmentation)
assume isolated sentences as their input, i.e. presuppose sentence
splitting (some tokenization rules are, in fact, sensitive to adjacent
nlpl-repp module includes a sentence splitter (based
on a combination of regular expressions and manually curated lists of
abbreviations) that has been found to work well for English and
Read et al. (2012; COLING)
evaluate an array of sentence splitting tools over a range of
different document collections and find that the tokenizer tool
provides the overall best sentence splitting results.
Bjerke-Lindstrøm (2017; UiO)
echoes these observations in a contrastive study of Norwegian sentence
cat /etc/bashrc | tokenizer -L en-u8 -S -N -p -P -x -E ''
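
The general approach mentioned above, combining regular expressions with a curated abbreviation list, can be sketched in a few lines of Python. This is only an illustration of the technique and makes no claim about how the tokenizer tool works internally: the abbreviation list, the boundary pattern, and the function name `split_sentences` are all assumptions.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; a real list would be
# curated per language and be far larger.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc.", "vs."}

# Candidate boundary: terminal punctuation, optional closing quotes or
# brackets, whitespace, and then (as lookahead) an optional opening quote or
# bracket followed by an uppercase ASCII letter.
BOUNDARY = re.compile(r'[.!?]["\')\]]*\s+(?=["\'(\[]?[A-Z])')


def split_sentences(text):
    """Split text at candidate boundaries unless the preceding word is a known abbreviation."""
    sentences, start = [], 0
    for m in BOUNDARY.finditer(text):
        chunk = text[start:m.end()]
        last_word = chunk.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # e.g. "Dr. Smith": not a sentence boundary
        sentences.append(chunk.strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Dr. Smith arrived late. He apologized, i.e. he said sorry. Was it enough?"
    for sentence in split_sentences(sample):
        print(sentence)
```

On the sample text the sketch keeps "Dr. Smith" together and produces three sentences. The uppercase lookahead only covers ASCII letters, so a language such as Norwegian would need a wider character class and its own abbreviation list.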