CLIN 2005 Abstracts
  • Lexical representation in supervised machine learning
    Bart Decadt (CNTS - Language Technology Group, University of Antwerp)
    Walter Daelemans (CNTS - Language Technology Group, University of Antwerp)
    The representation chosen for words in supervised learning of language processing tasks can have a large impact on generalization accuracy because of sparse data problems. In this study we systematically investigate two dimensions of lexical representation, abstract versus concrete and internal versus external, in two supervised Natural Language Processing tasks: prepositional phrase (PP) attachment and word sense disambiguation (WSD).

    The endpoints of the abstractness dimension are the words themselves as their only representation (concrete) and part-of-speech (POS) tags (abstract), with in-between representations such as attenuated words, lemmatized words, WordNet synsets, modified value difference metric (MVDM) values, and class labels obtained by clustering vector-based representations of words extracted from large corpora. Some of these representations are internal, i.e., they are constructed from the training data (words, lemmatized words, POS tags, MVDM values, attenuated representations), while others are external, i.e., they add information from external sources (WordNet synsets and class labels from clustering).
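
    As an illustration of one internal representation, the sketch below computes MVDM distances between feature values in its usual formulation: the distance between two values is the sum, over classes, of the absolute differences between their conditional class probabilities, estimated from the training data alone. This is a minimal sketch and not the authors' implementation; the toy PP attachment quadruples, the labels "V" and "N", and the function names are hypothetical and chosen only for the example.

```python
from collections import Counter, defaultdict


def class_distributions(examples, feature_index):
    """Estimate P(class | value) for one feature from labelled examples."""
    counts = defaultdict(Counter)
    for features, label in examples:
        counts[features[feature_index]][label] += 1
    return {
        value: {label: n / sum(c.values()) for label, n in c.items()}
        for value, c in counts.items()
    }


def mvdm(dist, v1, v2):
    """Sum over classes of |P(class | v1) - P(class | v2)|."""
    labels = set(dist[v1]) | set(dist[v2])
    return sum(abs(dist[v1].get(l, 0.0) - dist[v2].get(l, 0.0)) for l in labels)


# Hypothetical toy data: (verb, noun1, preposition, noun2) -> attachment site.
# "V" = attaches to the verb, "N" = attaches to the noun.
TOY = [
    (("eat", "pizza", "with", "fork"), "V"),
    (("eat", "pizza", "with", "anchovies"), "N"),
    (("see", "man", "with", "telescope"), "V"),
    (("buy", "book", "of", "poems"), "N"),
    (("read", "review", "of", "film"), "N"),
    (("put", "book", "on", "table"), "V"),
    (("leave", "key", "on", "desk"), "V"),
    (("like", "house", "on", "hill"), "N"),
]

# Distances between values of the preposition feature (index 2).
prep = class_distributions(TOY, 2)
print("with vs. on:", mvdm(prep, "with", "on"))
print("with vs. of:", mvdm(prep, "with", "of"))
```

    In this toy sample, "with" and "on" have the same class distribution and therefore a distance of zero, while "with" and "of" are further apart. Because the distances are derived only from the labelled training examples, the representation counts as internal in the sense used above.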

    We show the effect of these two fundamental dimensions of lexical representation on supervised learning of the two disambiguation tasks (PP attachment and word sense disambiguation) and discuss how they interact with the optimization of algorithm parameters during learning.

    Our experiments in PP attachment, using Ratnaparkhi's (1994) English training and test set, show two tendencies: (i) after optimization, an internal and minimally abstract representation (words, lemmas) yields the highest classification accuracy; and (ii) a more abstract and external representation (synsets, class labels from clustering) is less sensitive to the optimization of algorithm parameters. We are currently running word sense disambiguation (WSD) experiments with the training and test data of the English Senseval-3 lexical sample task to check whether these tendencies hold there as well. The outcome of these experiments will be presented during our talk.