CLIN 2005 Abstracts
  • Diffusing the Dutch d/dt Confusion
    Marco Tillemans (Tilburg University)
    Antal van den Bosch (Tilburg University)
    One of the most notorious typos in Dutch is the incorrect inflection of a singular present-tense verb form whose stem ends in "d", e.g., writing "word" where "wordt" is intended, or vice versa. This typo is a confusible error, i.e., the contextually inappropriate use of an existing word that is a close neighbor of the intended word. Such errors can hinder downstream NLP and are generally a nuisance.

    We treat d/dt confusible disambiguation as a classification task: each example represents the left and right context of a verb form (as words only) and is labeled with the "d" or "dt" outcome appropriate for that context. We develop classifiers in two variants. In the first, we train a separate classifier for each pair of confusibles, forming an ensemble of confusible experts. In the second, we train a single classifier to predict whether the verb form in focus (which can be any stem ending in "d") receives a "d" or "dt" ending in the given context.
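
    As an illustration only, the example construction described above might be sketched as follows. The window size, the padding symbol, the function names, and the set of known verb stems are all assumptions made for this sketch and are not taken from the abstract; the actual features and learner used in the experiments are not specified here.

```python
from collections import defaultdict

PAD = "_"  # assumed padding symbol for context positions outside the sentence


def stem_and_label(token, stems):
    """Return (stem, label) if token is a d/dt form of a known stem, else None.

    `stems` is a hypothetical set of verb stems ending in "d", e.g. {"word"}.
    """
    if token.endswith("dt") and token[:-1] in stems:
        return token[:-1], "dt"
    if token in stems:
        return token, "d"
    return None


def make_examples(tokens, stems, window=2):
    """Build (stem, context, label) triples from one tokenized sentence.

    The context holds `window` words to the left and right of the verb form;
    the verb form itself is left out, since its ending is what is predicted.
    """
    examples = []
    for i, tok in enumerate(tokens):
        parsed = stem_and_label(tok.lower(), stems)
        if parsed is None:
            continue
        stem, label = parsed
        left = [PAD] * max(0, window - i) + tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        right = right + [PAD] * (window - len(right))
        examples.append((stem, tuple(left + right), label))
    return examples


def group_by_stem(examples):
    """Split examples per confusible pair (first variant: one expert per pair)."""
    experts = defaultdict(list)
    for stem, context, label in examples:
        experts[stem].append((context, label))
    return dict(experts)
```

    In this sketch, the first variant would train one classifier on each list returned by group_by_stem, while the second variant would train a single classifier on all (context, label) pairs pooled together, regardless of stem.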

    We report on learning curve experiments based on up to several million examples drawn from large corpora of Dutch newspaper text, which yield very high accuracies. We compare the performance of the confusible experts with that of the monolithic classifier, and discuss the prospects for fully automatic d/dt confusible correction.