CLIN 2005 Abstracts
  • Anagram-Key based Tokenizer Evaluation
    Martin Reynaert (Induction of Linguistic Knowledge, Universiteit van Tilburg)
    A tokenizer has to deal with widely differing phenomena, depending on the type and original format of the electronic text. Within the framework of the STEVIN project D-CoI we aim to build a pilot corpus of 50 million words of written Dutch text from widely diverging sources. Sentence splitting and tokenization are among the first pre-processing steps these texts need to undergo, and mistakes made in this step compromise all subsequent annotation layers. We therefore evaluate five existing tokenizers for Dutch with the aim of combining their strengths. In the absence of prior work on automatic tokenizer evaluation methodology, we present our own solution and the actual evaluation results.

    The evaluator is based on anagram-key hashing, which provides a cheap and suitable abstraction from the actual character strings. In a first step, the evaluator determines, on the basis of the input text and the gold-standard tokenized text, how many end-of-line characters and spaces are to be added or removed. It then determines for each tokenizer's output, first, how many of these changes have been made and, second, whether they were made at the correct locations. It is in this second step that the anagram-key hash values come into their own: without any actual string comparisons, they make it possible to determine whether one or more lines of tokenizer output match the expected gold-standard line.
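
    The matching step can be illustrated with a minimal sketch. The abstract does not specify the key function, so the one used here (the sum of each non-whitespace character's code point raised to the fifth power) is an assumption, as are the names anagram_key and align_units. Because such a key is additive and ignores whitespace, the key of a gold-standard line equals the sum of the keys of the output lines that cover the same characters.

```python
from typing import List, Tuple

POWER = 5  # assumed exponent; the abstract does not state the key function


def anagram_key(text: str) -> int:
    """Order-insensitive numeric key over the non-whitespace characters."""
    return sum(ord(ch) ** POWER for ch in text if not ch.isspace())


def align_units(gold: List[str], output: List[str]) -> List[Tuple[int, int]]:
    """Align gold units with output units using only their numeric keys.

    Returns blocks (n_gold, n_output): (1, 1) means the unit boundaries agree,
    (1, k) means the tokenizer split one gold unit into k units, and (k, 1)
    means it merged k gold units into one. Units with key 0 (blank) are skipped.
    """
    gold_keys = [k for k in map(anagram_key, gold) if k]
    out_keys = [k for k in map(anagram_key, output) if k]
    blocks: List[Tuple[int, int]] = []
    i = j = 0
    while i < len(gold_keys) and j < len(out_keys):
        g_sum, o_sum = gold_keys[i], out_keys[j]
        g_n = o_n = 1
        # Extend whichever side has the smaller running sum until both agree.
        while g_sum != o_sum:
            if g_sum < o_sum and i + g_n < len(gold_keys):
                g_sum += gold_keys[i + g_n]
                g_n += 1
            elif g_sum > o_sum and j + o_n < len(out_keys):
                o_sum += out_keys[j + o_n]
                o_n += 1
            else:
                raise ValueError("texts differ in more than whitespace")
        blocks.append((g_n, o_n))
        i += g_n
        j += o_n
    if i < len(gold_keys) or j < len(out_keys):
        raise ValueError("texts differ in more than whitespace")
    return blocks


gold_lines = ["Hello , world .", "How are you ?"]
tokenizer_lines = ["Hello , world .", "How are", "you ?"]
blocks = align_units(gold_lines, tokenizer_lines)
```

    In this example the second gold line is covered by two output lines, so the tokenizer inserted a spurious line break. The same routine could be applied to whitespace-separated tokens instead of lines. Note that the key is an abstraction: anagrams such as "dog" and "god" receive the same key, so a match is strong evidence rather than proof of string identity.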

    In the absence of gold standards covering all possible tokenization phenomena, the module allows for inter-tokenizer output comparison. This enables the automatic identification of inter-tokenizer disagreements and thereby facilitates the guided compilation of more comprehensive tokenizer benchmark evaluation sets.
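
    A rough sketch of how such disagreements might be collected follows. It assumes, purely for illustration, that every tokenizer returns one tokenized string per input line; the function name disagreements and the tokenizer names tok_a, tok_b and tok_c are hypothetical and do not refer to the five tokenizers evaluated.

```python
from collections import defaultdict
from typing import Dict, List


def disagreements(outputs: Dict[str, List[str]]) -> Dict[int, Dict[str, List[str]]]:
    """Map each input line index on which tokenizers differ to the
    competing tokenizations and the tokenizers that produced them."""
    lengths = {len(lines) for lines in outputs.values()}
    if len(lengths) != 1:
        raise ValueError("all tokenizers must return one output per input line")
    result: Dict[int, Dict[str, List[str]]] = {}
    for idx in range(lengths.pop()):
        groups: Dict[str, List[str]] = defaultdict(list)
        for name, lines in outputs.items():
            groups[lines[idx]].append(name)
        if len(groups) > 1:  # more than one distinct tokenization
            result[idx] = dict(groups)
    return result


outputs = {
    "tok_a": ["Dr. Smith arrived .", "Ok ."],
    "tok_b": ["Dr . Smith arrived .", "Ok ."],
    "tok_c": ["Dr. Smith arrived .", "Ok ."],
}
conflicts = disagreements(outputs)
```

    Lines collected this way are candidates for manual inspection and, once adjudicated, for inclusion in a benchmark set.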