CLIN 2005 Abstracts
  • Traditional Corpora and the Web as Corpus: the Italian Newspapers Case Study
    Mirko Tavosanis (Dipartimento di Studi Italianistici, Pisa University)
    The emergence of the "Web as corpus" theme has prompted linguists to reconsider even traditional corpora and their value. Linguists have known since the beginning of the statistical analysis of corpora that many linguistic indicators, including the frequency of the most common words of a given language, can fluctuate widely between corpora (for the Italian situation see Bortolini, Tagliavini and Zampolli 1971; Voghera 1993). They also know that "the Web is not representative of anything else. But neither are other corpora, in any well-understood sense" (Kilgarriff and Grefenstette 2003). However, the quantity of material published on the Web and the growing capabilities of search engines make it possible to query a significant, and growing, percentage of all the texts of a given type. In a future scenario, representativeness could then be replaced by real-time monitoring of every single utterance, even in widely used text types.

    To evaluate how far we have gone in this direction, and which linguistic hypotheses can be tested at present, the paper:

    a. describes the range of Italian newspapers currently searchable on the Web;
    b. shows similarities and differences in basic linguistic indicators (relative frequency of demonstrative pronouns and adjectives) between the Coris/Codis subcorpus "stampa" and the widest selection of newspapers currently available through the Web;
    c. draws provisional conclusions and proposes a list of indicators for future developments.