Parsing/repp
Revision as of 20:21, 7 January 2019
Background
When parsing naturally occurring text, it will often be necessary to pre-process input documents to various degrees, depending on what kinds of inputs a specific parsing system supports. Sentence splitting and tokenization, for example, are traditionally viewed as mandatory preprocessing steps prior to parsing. While fully integrated ‘document parsing’ systems have become more and more common in recent years, some older parsing systems may further expect their pre-segmented input to be tagged with parts of speech and perhaps even lemmatized.
This NLPL module bundles a selection of preprocessing tools that the UiO team has repeatedly found to provide a good balance of accuracy and simplicity over a relatively broad range of different types of input documents.
Tokenization
The Regular Expression–Based PreProcessor (REPP) implements a cascade of string-level rewrite rules for input normalization and segmentation. REPP defines a simple declarative specification language and provides support for modularization and parameterization of collections of rewrite rules, as well as limited support for grouping of rules and iteration. Furthermore, REPP keeps track of character offsets into the original document for its output segments, including across one-to-many substitutions, insertions, and deletions.
Dridan & Oepen (2012; ACL) evaluate the performance of the English REPP tokenizer (developed by Stephan Oepen as part of the English Resource Grammar, ERG) against a handful of widely used tokenizers for English and observe that REPP delivers state-of-the-art performance in terms of the segmentation conventions defined by the venerable Penn Treebank (PTB).
module load nlpl-repp
cat /etc/bashrc | repp -c $LAPROOT/repp/erg/ptb.set
Sentence Splitting
The standard REPP tokenizers (for PTB- and NDT-style segmentation)
assume isolated sentences as their input, i.e. they presuppose sentence
splitting (some tokenization rules are, in fact, sensitive to adjacent
sentence boundaries).
The nlpl-repp
module includes a sentence splitter (based
on a combination of regular expressions and manually curated lists of
abbreviations) that has been found to work well for English and
Norwegian.
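The general approach, regular expressions for candidate boundaries plus an abbreviation list to suppress false splits, can be sketched as follows. This is an illustrative Python sketch, not the splitter shipped in the module; the abbreviation list is a tiny invented sample, where the real tool uses much larger, manually curated, language-specific lists.

```python
import re

# A tiny, invented abbreviation list (lower-cased, with trailing period).
ABBREVIATIONS = {"dr.", "prof.", "e.g.", "i.e.", "etc."}

# Candidate boundary: sentence-final punctuation, whitespace, then an
# upper-case letter.
CANDIDATE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    """Split at candidate boundaries, unless the word before the
    boundary is a known abbreviation."""
    sentences, start = [], 0
    for match in CANDIDATE.finditer(text):
        last_word = text[start:match.start()].rsplit(None, 1)[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # e.g. "Dr. Smith" is not a sentence boundary
        sentences.append(text[start:match.start()])
        start = match.end()
    sentences.append(text[start:])
    return sentences

print(split_sentences("Dr. Smith arrived late. He apologized."))
# → ['Dr. Smith arrived late.', 'He apologized.']
```

The abbreviation check is what distinguishes this family of splitters from purely regex-based ones: the regular expression over-generates candidate boundaries, and the curated list filters them.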
Read et al. (2012; COLING)
evaluate an array of sentence splitting tools over a range of
different document collections and find that the tokenizer tool
provides the overall best sentence splitting results.
Bjerke-Lindstrøm (2017; UiO)
echoes these observations in a contrastive study of Norwegian sentence
splitters.
cat /etc/bashrc | tokenizer -L en-u8 -S -N -p -P -x -E ''