CLIN 2005 Abstracts
  • PHASAR for phrase-based retrieval
    Cornelis H.A. Koster (Radboud University Nijmegen)
    We describe the rationale of the PHASAR system (Phrase-based Accurate Search And Retrieval), an Information Retrieval and Text Mining system based on linguistic principles. The PHASAR system provides a new way of searching in large document collections.

    The search engine is based on the use of Dependency Triples as terms. Both the documents and the queries are parsed, transduced to Dependency Triples and lemmatized. A query consists of a set of Dependency Triples, whose elements can be generalized or specialized in order to achieve the desired precision and recall. To help tentative exploration, the search process is supported by frequency information about co-occurring words and by ontological information about synonyms and about broader and narrower concepts.
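
    As an illustration only, the idea of matching generalized or specialized Dependency Triples can be sketched as below. This is a minimal sketch and not the PHASAR implementation: the (head, relation, dependent) layout, the relation labels SUBJ and OBJ, the names Triple, Pattern, matches and search, and the toy index are all assumptions made for this example. Each pattern element is either a set of accepted lemmas or None (a wildcard); widening a set with synonyms generalizes the query, and narrowing it specializes the query.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Optional


class Triple(NamedTuple):
    """A lemmatized dependency triple (hypothetical layout)."""
    head: str
    relation: str
    dependent: str


class Pattern(NamedTuple):
    """A query triple: each slot is a set of accepted lemmas, or None for 'any'."""
    head: Optional[FrozenSet[str]]
    relation: Optional[FrozenSet[str]]
    dependent: Optional[FrozenSet[str]]


def matches(pattern: Pattern, triple: Triple) -> bool:
    """True if every slot of the triple is accepted by the pattern."""
    return all(
        allowed is None or value in allowed
        for allowed, value in zip(pattern, triple)
    )


def search(pattern: Pattern, index: Dict[str, List[Triple]]) -> List[str]:
    """Return the sorted ids of documents containing at least one matching triple."""
    return sorted(
        doc_id
        for doc_id, triples in index.items()
        if any(matches(pattern, t) for t in triples)
    )


# Toy index (invented data): document id -> triples extracted from that document.
INDEX: Dict[str, List[Triple]] = {
    "d1": [Triple("inhibit", "SUBJ", "aspirin"),
           Triple("inhibit", "OBJ", "cyclooxygenase")],
    "d2": [Triple("block", "SUBJ", "ibuprofen"),
           Triple("block", "OBJ", "cyclooxygenase")],
    "d3": [Triple("inhibit", "OBJ", "growth")],
}

# A specific query: high precision, low recall.
specific = Pattern(frozenset({"inhibit"}), frozenset({"OBJ"}),
                   frozenset({"cyclooxygenase"}))

# Generalized by adding a synonym of the head (as an ontology might suggest).
with_synonym = specific._replace(head=frozenset({"inhibit", "block"}))

# Generalized by turning the dependent into a wildcard.
any_object = specific._replace(dependent=None)

print(search(specific, INDEX))
print(search(with_synonym, INDEX))
print(search(any_object, INDEX))
```

    In this sketch, adding a synonym or a wildcard trades precision for recall, which is the kind of adjustment the abstract describes; a real system would match against precomputed indices rather than scanning every document.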

    The mining of information about metabolites from the medical literature provides a difficult test. To that end we have parsed a snapshot of the Medline collection of abstracts (more than 16 million documents, 2.4 billion words) into dependency triples and generated various indices.

    In the course of an interactive analysis of selected documents mentioning a given metabolite, a profile is generated, consisting of (possibly weighted) query patterns joined by discourse operators. This profile can then be used for an exhaustive automatic analysis of a large body of literature. The profiles can afterwards be re-used for other metabolites or even to find (as yet unknown) candidate metabolite names. For each metabolite, all passages referring to its source, effects and interactions can be retrieved.

    The resulting system is generic in nature, being applicable (given suitable thesauri and ontologies) to many other forms of professional search.