Mining Subcategorization Information by Using Multiple Feature Loglinear Models

Nuno M. C. Marques (New University of Lisbon)
J. Gabriel P. Lopes (New University of Lisbon)
C. A. Coelho (New University of Lisbon)


In this paper we show how several non-independent features can be
combined to characterize the subcategorization of words. Elsewhere
[Marques et al. 1998a,b] we have shown that loglinear models are able
to learn a set of clusters based on the occurrence of a single
relevant feature extracted from corpora. There, huge collections of
texts, tagged fully automatically with part-of-speech information,
were used to build a system capable of clustering (in an unsupervised
fashion) the words that subcategorize the same type of syntactic
structures, that is, words sharing the same subcategorization frame.

Here we show how several interacting features may be conjoined in a single loglinear statistical model. We focus on the problem of verbal transitivity, since it was there that we found the need to introduce more features: the main problem was the change in the traditional Subject-Verb-Object order that occurs with some Portuguese verbs. After a brief description of the framework for learning subcategorization by using loglinear models, we present a comparative study of three possible feature sets. Several loglinear models, with and without interactions among features and scores, are used. The results of the clustering process are evaluated against the independent classification presented in commercial dictionaries. Improvements were particularly noticeable for intransitive verbs, where precision rose from 82% to 91%. Conclusions are drawn regarding the advantages and problems of introducing more features in a particular loglinear model.

The results obtained are presently paving the way to newer parsing mechanisms capable of automatically overcoming the lack of subcategorization information in lexica.
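
As an illustration of the distinction drawn above between loglinear models with and without interactions among features, the following is a minimal sketch and not the models used in this work. The two binary features and all counts are hypothetical. It fits the no-interaction (independence) model to a 2x2 table of co-occurrence counts, compares it with the saturated model that includes the interaction term, and measures the lack of fit of each with the likelihood-ratio statistic G^2.

```python
import math

# Hypothetical counts (NOT from the paper). Rows: feature A present/absent,
# columns: feature B present/absent, e.g. two contextual cues observed
# around occurrences of a verb.
observed = [[30.0, 10.0],
            [10.0, 50.0]]


def independence_fit(table):
    """Expected counts under log m_ij = u + u_A(i) + u_B(j) (no interaction)."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    return [[r * c / total for c in col_sums] for r in row_sums]


def g_squared(table, fitted):
    """Likelihood-ratio statistic G^2 = 2 * sum(obs * ln(obs / fitted))."""
    return 2.0 * sum(o * math.log(o / e)
                     for obs_row, fit_row in zip(table, fitted)
                     for o, e in zip(obs_row, fit_row)
                     if o > 0)


# Model without interaction: cell counts depend only on the two margins.
no_interaction = independence_fit(observed)

# Model with the A x B interaction term (saturated for a 2x2 table):
# it has one parameter per cell, so it reproduces the observed counts.
with_interaction = [row[:] for row in observed]

g2_no_interaction = g_squared(observed, no_interaction)
g2_with_interaction = g_squared(observed, with_interaction)

print("fitted counts without interaction:", no_interaction)
print("G^2 without interaction:", g2_no_interaction)
print("G^2 with interaction:", g2_with_interaction)
```

A large G^2 for the no-interaction model relative to its single degree of freedom suggests that the two features are not independent, which is the situation in which an interaction term earns its place in the model.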