CLIN 2005 Abstracts
  • SVMs As A Data Selection Method
    Francisco Borges (Alfa-Informatica, RuG)
    In this paper we show the use of SVMs (Support Vector Machines) as a data pre-selection method for cases where the amount of available data is too large to be used in its entirety by the machine learning algorithm of choice. This issue is normally dealt with by selecting data at random; we propose the use of SVMs as a more 'principled' method for the task.

    The specific problem we analyze is the selection of parses generated by the Alpino parser and treebank. The amount of data available for training parse selection models is prohibitively large for both SVMs and Maximum Entropy methods: for SVMs due to their training times, and for Maximum Entropy due to its memory consumption.

    SVM training works by estimating a hyperplane-based decision function using only a fraction of the training data. The instances used in the hyperplane estimation are referred to as "support vectors". In this paper we show that by pre-training smaller models with SVMs, thereby identifying support vectors, and then giving preference to these instances when building the final (SVM) model, we can boost its performance.
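    The pre-selection scheme described above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation; the chunking strategy, kernel, and training budget are all assumptions made for the example.

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # Toy data standing in for a training set too large to use in full.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # 1. Pre-train smaller SVMs on disjoint chunks of the data,
    #    collecting the indices of the support vectors they identify.
    support_idx = []
    for chunk in np.array_split(np.arange(len(X)), 4):
        svm = SVC(kernel="linear").fit(X[chunk], y[chunk])
        # svm.support_ holds the (chunk-local) indices of the support vectors.
        support_idx.extend(chunk[svm.support_])

    # 2. Fill a fixed-size training budget, giving preference to the
    #    identified support vectors over randomly ordered remaining data.
    budget = 500
    preferred = np.array(support_idx)
    rest = np.setdiff1d(np.arange(len(X)), preferred)
    selected = np.concatenate([preferred, rest])[:budget]

    # 3. Train the final SVM model on the pre-selected subset only.
    final_model = SVC(kernel="linear").fit(X[selected], y[selected])
    print(len(preferred), final_model.score(X, y))
    ```

    The intuition is that support vectors are exactly the instances that determine the decision boundary, so retaining them should lose less information than uniform random sampling under the same budget.
    
    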

    It remains to be checked whether the performance of Maximum Entropy models can also be improved by pre-selecting data in this way.