CLIN 2005 Abstracts
  • Baseline Models for Modern Hebrew Parsing
    Reut Tsarfaty (ILLC, University of Amsterdam)
    Khalil Simaan (ILLC, University of Amsterdam)
    Hebrew, Arabic, and other Semitic languages have a rich morphology. Many affixes attached to a word carry substantial information and can belong to different syntactic categories. A first step towards utterance understanding is therefore to extract the constituents that exist at the word level so that further processing (such as part-of-speech tagging and syntactic processing) can take place. This step is typically called morphological analysis. Because morphological ambiguity in Semitic languages is pervasive already at the word level, each word form may have multiple possible morphological analyses. Picking the correct analysis depends largely on contextual information, which is best expressed in terms of syntactic structure. However, syntactic analysis (parsing) can proceed only insofar as the sequence of morphological segments is in place. Thus, a suitable treatment of morphological analysis in Semitic languages demands a treatment of syntactic analysis, and vice versa. In this talk we describe probabilistic baseline models for Modern Hebrew (MH) parsing that incorporate one form of morphological processing, namely segmentation (i.e., the task of identifying the different functional constituents that were concatenated to form an MH word). The baseline models allow different layers of processing (morphological analysis, part-of-speech tagging, and parsing) to interact in order to identify the fine-grained segments that are put together to form words, phrases, and sentences, and the syntactic structures that they form. In particular, we show that the span of words is not aligned with the span of syntactic phrases, and we draw attention to the specific challenges that this phenomenon poses for standard statistical parsing evaluation techniques.
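
    The evaluation problem mentioned at the end of the abstract can be illustrated with a small sketch. It is not the authors' model or metric. It only shows, on a toy transliterated word form, why bracket matching over segment indices breaks down when the predicted segmentation differs from the gold one, and how anchoring brackets to character offsets is one possible workaround. The segmentations, the labels "PP" and "NP", and the helper names char_spans and f1 are all assumptions made for illustration.

```python
def char_spans(segments, brackets):
    """Map (label, start_seg, end_seg) brackets to character offsets.

    `segments` is a list of morphological segments in surface order;
    each bracket covers segments[start_seg:end_seg].
    """
    offsets = [0]
    for seg in segments:
        offsets.append(offsets[-1] + len(seg))
    return {(label, offsets[i], offsets[j]) for label, i, j in brackets}


def f1(gold, pred):
    """Labeled bracket F1 over two sets of (label, start, end) tuples."""
    matched = len(gold & pred)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical analyses of one transliterated word form).
gold_segments = ["w", "kf", "m", "h", "bit"]   # five segments
pred_segments = ["w", "kf", "m", "hbit"]       # "h" left unsplit: four segments

# Brackets indexed by segment position, as a standard evaluator would see them.
gold_brackets = {("PP", 2, 5), ("NP", 3, 5)}
pred_brackets = {("PP", 2, 4), ("NP", 3, 4)}

# Segment-indexed comparison: indices no longer line up, so nothing matches.
index_f1 = f1(gold_brackets, pred_brackets)

# Character-anchored comparison: both analyses cover the same stretches of text.
char_f1 = f1(char_spans(gold_segments, gold_brackets),
             char_spans(pred_segments, pred_brackets))

print(index_f1, char_f1)
```

    In this toy case the two analyses bracket the same stretches of the surface string, yet the segment-indexed score treats every bracket as wrong because the segment counts differ. Whether character anchoring is the right remedy for a real treebank is a separate question that the sketch does not settle.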