Feature Merging for Maximum Entropy-based Parse Selection

Tony Mullen (Rijksuniversiteit Groningen)
Miles Osborne (Rijksuniversiteit Groningen)

Ambiguity is a pervasive problem in NLP.  It is well accepted that as
the coverage of computational grammars increases, the inherent
ambiguity of natural language words and sentences becomes increasingly
difficult to reckon with.  Wide-coverage grammars generally assign far
too many parses to sentences, many of which would never even occur to
human speakers of the language.  A common and effective way of dealing
with this problem has been through statistical means.

We discuss an approach based on the maximum entropy technique: we
build a model distribution over parses that is as uniform as possible
subject to constraints derived from statistical features collected
from the training data and weighted according to their frequency in
the data.

The major difficulty with this approach is overfitting, which arises
when too many features are considered: parses actually seen in the
training data receive unduly high probability, while parses not
specifically trained upon receive unduly low probability.

Our strategy of feature merging combines multiple features that
commonly occur together into a single, more general feature.  We
discuss methods of doing this that may be used to reduce the size of
the model, resulting in increased generality and reduced overfitting.
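For concreteness, such a model typically takes the standard log-linear
form (the notation below is ours, sketched as an illustration rather
than taken from the paper): the conditional probability of a parse t
for a sentence s, with feature functions f_i and weights lambda_i, is

```latex
p(t \mid s) = \frac{1}{Z(s)} \exp\Big(\sum_i \lambda_i f_i(t)\Big),
\qquad
Z(s) = \sum_{t' \in T(s)} \exp\Big(\sum_i \lambda_i f_i(t')\Big)
```

where T(s) denotes the set of parses the grammar assigns to s.  Each
feature contributes one weight to be estimated, so reducing the number
of distinct features directly reduces the number of free parameters.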
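To illustrate the merging idea, the following is a minimal sketch, not
the authors' actual method: the function name, the co-occurrence
threshold, and the "+"-joined naming scheme are all our assumptions.
It collapses pairs of features that nearly always appear together in
the same parses into a single combined feature.

```python
from collections import Counter
from itertools import combinations

def merge_cooccurring_features(parses, threshold=0.9):
    """Merge pairs of features that co-occur in at least `threshold`
    of the parses containing either one.

    `parses` is a list of parses, each given as a list of feature
    names.  Returns the rewritten parses and the merge mapping.
    Illustrative sketch only, not the paper's algorithm.
    """
    feat_counts = Counter()
    pair_counts = Counter()
    for feats in parses:
        fs = set(feats)
        feat_counts.update(fs)
        # count each unordered pair of features occurring in one parse
        pair_counts.update(combinations(sorted(fs), 2))

    merge_map = {}
    for (a, b), n in pair_counts.items():
        # merge a and b if they co-occur in nearly every parse
        # that contains either feature
        if n / max(feat_counts[a], feat_counts[b]) >= threshold:
            merge_map[a] = merge_map[b] = a + "+" + b

    merged = [sorted({merge_map.get(f, f) for f in feats})
              for feats in parses]
    return merged, merge_map
```

With three toy parses [["A", "B", "C"], ["A", "B"], ["C"]], features
A and B always co-occur and are replaced by the single feature "A+B",
shrinking the feature set from three entries to two.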