CLIN 2005 Abstracts
  • Techniques for a hybrid MT system
    Peter Dirix (Centre for Computational Linguistics, KU Leuven)
    Vincent Vandeghinste (Centre for Computational Linguistics, KU Leuven)
    Ineke Schuurman (Centre for Computational Linguistics, KU Leuven)
    In machine translation (MT), rule-based systems still make errors, in spite of years of work on hand-crafted rules. Newer systems, such as statistical and example-based MT, produce a different kind of error, and would need ever-growing parallel corpora to improve their output. In many cases, parallel corpora are not available, which rules out these newer techniques as well, calling for another approach that uses only a monolingual target-language corpus.

    Statistical and example-based systems usually do not involve linguistic notions. Cutting up sentences into linguistically sound subunits improves the quality of the translation. Demarcating clauses, verb groups, noun phrases, and prepositional phrases limits the number of possible translations and hence also the search space.
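    The search-space reduction from chunking can be sketched with a toy calculation (the alternative counts and chunk boundaries below are hypothetical, not from the authors' system). Without chunk boundaries, every combination of word-level translation options is a candidate; with chunks, each chunk is resolved independently:

    ```python
    from math import prod

    # Hypothetical number of translation alternatives per word
    # for a six-word sentence.
    alternatives = [3, 2, 4, 2, 3, 2]

    # Without chunking: every combination of word translations
    # is a candidate, so the search space is the full product.
    unconstrained = prod(alternatives)

    # With chunk boundaries (say an NP, a verb group, and a PP),
    # each chunk is translated independently, so the cost becomes
    # a sum of per-chunk products instead of one global product.
    chunks = [alternatives[0:2], alternatives[2:4], alternatives[4:6]]
    chunked = sum(prod(c) for c in chunks)

    print(unconstrained, chunked)  # 288 vs 20 candidates
    ```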

    Our system represents each level of the sentence structure (sentence, clauses, phrases) with bags, i.e. unordered sets of tokens. After tokenizing, lemmatizing, and chunking the source-language sentence, all lemmas are translated using a bilingual dictionary. The target-language corpus is preprocessed in the same way as the source language, and templates of each level together with collocation statistics are stored in a database, thus creating a language model for the target language. By matching the translated items and higher-level structures bottom-up against the database information, one or more plausible translated sentences are constructed. Probabilities are generated according to the occurrence of lemmas and collocations in the corpus.
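    The bag-based matching step described above can be sketched as follows; the dictionary entries, template bags, and counts are invented for illustration and do not reflect the authors' actual resources. Each source chunk's lemmas are translated, and only those candidate bags that occur as templates in the target-language database are kept, ranked by corpus frequency:

    ```python
    from itertools import product

    # Toy bilingual lemma dictionary (Dutch -> English); each source
    # lemma may have several candidate translations.
    dictionary = {
        "de": ["the"],
        "hond": ["dog", "hound"],
    }

    # Toy target-language model: chunk templates stored as bags
    # (frozensets of lemmas) with their corpus occurrence counts.
    target_bags = {
        frozenset(["the", "dog"]): 120,   # frequent NP template
        frozenset(["the", "hound"]): 3,   # rare NP template
    }

    def translate_chunk(source_lemmas):
        """Translate each lemma, then keep only candidate bags that
        occur in the target-language database, most frequent first."""
        options = [dictionary[lemma] for lemma in source_lemmas]
        candidates = []
        for combo in product(*options):
            bag = frozenset(combo)
            count = target_bags.get(bag, 0)
            if count > 0:                 # bag attested in the corpus
                candidates.append((bag, count))
        # Higher corpus frequency -> more plausible translation.
        return sorted(candidates, key=lambda c: -c[1])

    np_candidates = translate_chunk(["de", "hond"])
    ```

    A full system would apply the same matching bottom-up, combining chunk-level bags into clause- and sentence-level templates and multiplying the resulting frequencies into sentence probabilities.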