CLIN 2005 Abstracts
  • Error Detection and Automated Lexical Acquisition for Open Text Processing
    Yi Zhang (Dept. of Computational Linguistics, Saarland University)
    Valia Kordoni (Dept. of Computational Linguistics, Saarland University)
    This paper presents a corpus-driven approach to unknown-word processing for deep grammars. Our motivation is to make deep processing robust enough to handle open text. Closer investigation has shown that a large portion of parsing failures are due to incomplete lexical information, so the coverage problem of the grammar can be largely alleviated with a better lexicon. Instead of building a larger static lexicon, we propose a statistical model that generates new lexical entries on the fly. The approach consists of two main parts: (i) missing lexical entries are identified with the error mining techniques described in (van Noord 2004); (ii) new lexical entries are generated using a classifier based on a maximum entropy model. The classifier is trained on a corpus annotated with atomic lexical types and predicts the type of each new lexical entry. Various features are evaluated for their contributions. In addition, the full parsing and disambiguation results are used as feedback to improve the precision of the model. Experiments are carried out with the Dutch Alpino Grammar and the LinGO English Resource Grammar. The promising results show that the approach can be adapted to different deep grammars for various languages.
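    As an illustration of step (i) only, the following is a minimal sketch of the kind of statistic error mining relies on: for each word, the share of the sentences containing it that the grammar parsed successfully. Words with a low share and enough occurrences are candidates for missing lexical entries. The function names, the thresholds, and the toy corpus are assumptions made for this example and are not taken from the paper; the sketch looks at single words only and does not cover the classifier of step (ii).

```python
from collections import Counter


def parsability(sentences):
    """Map each word to (ratio, count).

    `sentences` is a list of (tokens, parsed_ok) pairs. `count` is the number
    of sentences containing the word; `ratio` is the fraction of those
    sentences that received a parse.
    """
    total = Counter()
    ok = Counter()
    for tokens, parsed_ok in sentences:
        for word in set(tokens):  # count each word once per sentence
            total[word] += 1
            if parsed_ok:
                ok[word] += 1
    return {word: (ok[word] / total[word], total[word]) for word in total}


def suspicious_words(sentences, max_ratio=0.25, min_count=2):
    """Return words that rarely occur in parsed sentences (hypothetical thresholds)."""
    stats = parsability(sentences)
    return sorted(
        word
        for word, (ratio, count) in stats.items()
        if ratio <= max_ratio and count >= min_count
    )


# Toy data: "blorts" stands in for a word missing from the lexicon.
corpus = [
    (["the", "dog", "barks"], True),
    (["the", "dog", "blorts"], False),
    (["a", "cat", "blorts"], False),
    (["a", "cat", "sleeps"], True),
]

candidates = suspicious_words(corpus)
```

    On the toy corpus, only the invented word "blorts" falls below the threshold, since it never occurs in a parsed sentence. The minimum count keeps rare words from being flagged on the evidence of a single failed sentence.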