CLIN 2005 Abstracts
  • Combining Supervised and Unsupervised Natural Language Processing
    Rens Bod (Institute for Logic, Language and Computation, University of Amsterdam)
    Arguably the main reason for the dramatic improvement in natural language parsing during the last decade was the increasing availability of pre-annotated corpus data from which probabilistic grammars could be inferred. Yet, it has also become increasingly clear that all supervised corpus-based learning methods will reach an asymptote because of the imminent shortage of human annotations.

    At the same time, considerable progress has been made in unsupervised language learning, i.e. the induction of structured representations from unlabeled data. Although the performance of unsupervised methods still lags behind that of supervised ones, they do not suffer from a shortage of annotated corpus data.

    In this talk I shall discuss how the two approaches of supervised and unsupervised language processing can be properly combined so as to overcome the problem of data sparseness. I will show how the all-subtrees approach known as data-oriented parsing (DOP) can be integrated with an unsupervised learning technique known as alignment-based learning (ABL). The resulting model employs both an unparsed and a parsed corpus to analyze new input: ABL is used to induce structures and subtrees from the unparsed corpus, while DOP is used to extract subtrees from the parsed corpus. By treating ABL-subtrees as previously unseen events, we can employ a discounted frequency estimator to divide probability mass over the two sets of subtrees. This effectively results in a model that parses new data with ABL-subtrees only if the sentence cannot be parsed with DOP-subtrees, or if the total probability of generating a DOP-subtree from ABL-subtrees is higher than the observed relative frequency of that DOP-subtree.
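
    To make the division of probability mass concrete, the following is a minimal Python sketch of one way such a scheme could look. The abstract does not specify the estimator, so the absolute-discounting scheme, the string encoding of subtrees, and the names discounted_estimator and prefer_abl are illustrative assumptions, not the model's actual implementation.

        from collections import Counter

        def discounted_estimator(dop_subtrees, abl_subtrees, discount=0.5):
            """Toy discounted relative-frequency estimator: subtract a fixed
            discount from each observed DOP-subtree count and hand the
            reserved mass to the ABL-subtrees, which are treated as
            previously unseen events. Absolute discounting is an assumption
            made for illustration; the abstract does not fix the estimator."""
            counts = Counter(dop_subtrees)
            total = sum(counts.values())

            # Observed DOP-subtrees keep their discounted relative frequency.
            probs = {t: (c - discount) / total for t, c in counts.items()}

            # The reserved mass is spread uniformly over novel ABL-subtrees.
            reserved = discount * len(counts) / total
            novel = sorted(set(abl_subtrees) - set(counts))
            if novel:
                share = reserved / len(novel)
                for t in novel:
                    probs[t] = share
            return probs

        def prefer_abl(dop_parse_exists, p_dop_via_abl, dop_rel_freq):
            """Decision rule stated in the abstract: use ABL-subtrees only if
            the sentence cannot be parsed with DOP-subtrees, or if the total
            probability of generating the DOP-subtree from ABL-subtrees
            exceeds the DOP-subtree's observed relative frequency."""
            return (not dop_parse_exists) or (p_dop_via_abl > dop_rel_freq)

        # Subtrees are encoded as plain strings here purely for illustration.
        dop = ["NP->det n", "NP->det n", "VP->v NP"]
        abl = ["NP->det adj n", "NP->det n"]
        print(discounted_estimator(dop, abl))
        # ~ {'NP->det n': 0.5, 'VP->v NP': 0.167, 'NP->det adj n': 0.333}

    On the toy data, the ABL-subtree unseen in the parsed corpus receives exactly the mass discounted from the observed DOP counts, so the resulting distribution still sums to one.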

    I will show how this integrated model can be defined in Bayesian terms, maximizing the conditional probability of a tree T given a sentence W, an unparsed corpus UC and a parsed corpus PC. Finally, I will discuss some preliminary experiments with this model.
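
    Written out in LaTeX notation, this objective amounts to selecting

        \hat{T} = \arg\max_{T} P(T \mid W, UC, PC),

    where the decomposition of this conditional probability over the two corpora UC and PC is developed in the talk.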