CLIN 2005 Abstracts
  • Automatic Diacritic Placement using Memory-Based Learning: a Gikuyu Language Case Study
    Guy De Pauw (CNTS Language Technology Group, University of Antwerp)
    Peter W. Wagacha (School of Computing & Informatics, University of Nairobi)
    Pauline W. Githinji (School of Computing & Informatics, University of Nairobi)
    Like most Bantu languages, Gikuyu has a straightforward orthography with a one-to-one phoneme-to-grapheme mapping. To represent all seven of its vowels, however, its alphabet includes two diacritic-marked graphemes, i-tilde (ĩ) and u-tilde (ũ), which represent phonemes distinct from those written with the unmarked 'i' and 'u'. These characters are not readily available on standard computer keyboards and are usually typed as the nearest available character. This common practice of using the unmarked graphemes for both variants can make written text harder to read and understand. We investigated a system that automatically places these diacritics in Gikuyu text on the basis of local graphemic context.

    For other languages that use diacritics, such as German, this task can typically be handled by a simple lexicon lookup that maps words without diacritics to their properly marked forms. Such a lexicon is, however, not digitally available for most Bantu languages that use diacritics, including Gikuyu. We propose a grapheme-based approach that predicts the placement of diacritics using a memory-based learning classifier. The experiments show that this approach achieves very high accuracy even with a limited amount of digitally available text. The experiments on Gikuyu are contrasted with experiments on French, German and Dutch.
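
    The grapheme-based idea can be illustrated with a minimal sketch. It is not the authors' system: it assumes a plain nearest-neighbour lookup with a simple overlap metric as a stand-in for the memory-based learner, a context window of three graphemes on each side, and a toy training string invented for illustration rather than taken from a Gikuyu corpus. All names (`strip_diacritics`, `windows`, `train`, `classify`, `restore`, `TOY_TEXT`) are hypothetical, and only lowercase text is handled.

```python
from collections import Counter

# Precomposed code points: U+0129 (i-tilde) and U+0169 (u-tilde).
MARKED = {"\u0129": "i", "\u0169": "u"}


def strip_diacritics(text):
    """Simulate keyboard-typed text: map each marked vowel to its unmarked form."""
    return "".join(MARKED.get(ch, ch) for ch in text)


def windows(plain, width=3):
    """Yield (index, context) for every ambiguous grapheme in unmarked text.

    The context is the grapheme itself plus `width` graphemes on each side,
    padded with '_' at the text boundaries.
    """
    padded = "_" * width + plain + "_" * width
    for i, ch in enumerate(plain):
        if ch in ("i", "u"):
            j = i + width
            yield i, tuple(padded[j - width : j + width + 1])


def train(marked_text, width=3):
    """Store every (context, correct grapheme) pair; this is the whole 'memory'."""
    plain = strip_diacritics(marked_text)
    return [(ctx, marked_text[i]) for i, ctx in windows(plain, width)]


def classify(memory, ctx, k=1):
    """Return the majority label among the k stored contexts that overlap most."""
    if not memory:
        return ctx[len(ctx) // 2]  # nothing stored: keep the unmarked grapheme
    ranked = sorted(
        memory,
        key=lambda item: sum(a == b for a, b in zip(item[0], ctx)),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def restore(memory, plain, width=3, k=1):
    """Re-insert diacritics into unmarked text using the stored examples."""
    out = list(plain)
    for i, ctx in windows(plain, width):
        out[i] = classify(memory, ctx, k)
    return "".join(out)


# Toy training string (illustrative assumption, not vetted corpus data).
TOY_TEXT = "m\u0169nd\u0169 m\u0169t\u0129 ny\u0169mba thi\u0129"
memory = train(TOY_TEXT)
restored = restore(memory, strip_diacritics(TOY_TEXT))
```

    Because every training instance is kept verbatim, no lexicon is required: an unseen word is handled by whichever stored contexts resemble it most closely, which is why such an approach can work with limited text.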