Processing Natural Language Queries using NP-Chunking

An De Sitter (University of Antwerp)

In recent years, the use of natural language queries (NLQs) has become
an important topic in Information Retrieval (as shown e.g. by
contributions to conferences such as TREC).  With the growth of the
Internet, an increasing number of people not familiar with Boolean
queries have started using search engines. These users would be helped
by search engines that accept NLQs. When a user enters an English
sentence as an NLQ, not all of the sentence is important
to the search engine. Example queries suggest that the
important parts of the sentence are the nouns and their associated
words, which makes NP-chunking a promising method for
selecting the appropriate words.

We developed a system that uses NP-chunking followed by a postprocessing module that filters out non-relevant NPs (e.g. I, what, me, where, ...). The system also assigns a relevance weight to each remaining NP and splits Boolean forms. The result is a vector of query terms, each accompanied by a weight and possibly separated by Boolean operators. This vector is then passed to the search engine.
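The pipeline described above can be illustrated with a minimal sketch. It is not the system itself: it assumes the query is already part-of-speech tagged with Penn Treebank-style tags, it treats any maximal run of determiner, adjective, noun and pronoun tags as an NP, and its weighting rule (1.0 for chunks containing a proper noun, 0.5 otherwise) is a placeholder, since the actual relevance scheme is not specified here. The names NON_RELEVANT, NP_TAGS, chunk_nps and build_query are hypothetical.

```python
from typing import List, Tuple, Union

Tagged = Tuple[str, str]   # (word, part-of-speech tag)
Term = Tuple[str, float]   # (query term, weight)

# Assumption: a stop list built from the examples given in the text.
NON_RELEVANT = {"i", "what", "me", "where"}
# Assumption: tags that may occur inside a noun phrase.
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS", "PRP", "WP"}
OPERATORS = {"and": "AND", "or": "OR"}


def chunk_nps(tagged: List[Tagged]) -> List[Union[List[Tagged], str]]:
    """Group maximal runs of NP tags; keep 'and'/'or' as Boolean operators."""
    items: List[Union[List[Tagged], str]] = []
    current: List[Tagged] = []
    for word, tag in tagged:
        if tag in NP_TAGS:
            current.append((word, tag))
            continue
        if current:                      # a non-NP token closes the chunk
            items.append(current)
            current = []
        if tag == "CC" and word.lower() in OPERATORS:
            items.append(OPERATORS[word.lower()])
    if current:
        items.append(current)
    return items


def build_query(tagged: List[Tagged]) -> List[Union[Term, str]]:
    """Filter non-relevant chunks and attach a placeholder weight to the rest."""
    result: List[Union[Term, str]] = []
    for item in chunk_nps(tagged):
        if isinstance(item, str):
            # Keep an operator only if it directly follows a kept term.
            if result and not isinstance(result[-1], str):
                result.append(item)
            continue
        words = [w for w, t in item
                 if t != "DT" and w.lower() not in NON_RELEVANT]
        if not words:                    # nothing useful left: drop the chunk
            continue
        has_proper = any(t in ("NNP", "NNPS") for _, t in item)
        result.append((" ".join(words), 1.0 if has_proper else 0.5))
    if result and isinstance(result[-1], str):
        result.pop()                     # no dangling operator at the end
    return result


# "Where can I find cheap flights to Paris or London", tagged by hand.
example = [("Where", "WRB"), ("can", "MD"), ("I", "PRP"), ("find", "VB"),
           ("cheap", "JJ"), ("flights", "NNS"), ("to", "TO"),
           ("Paris", "NNP"), ("or", "CC"), ("London", "NNP")]
print(build_query(example))
```

For the hand-tagged example, the sketch keeps "cheap flights", "Paris" and "London", drops "I", and preserves the OR between the two city names. Negation is not handled.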

In a first informal evaluation we processed a set of queries taken from the Ask Jeeves homepage. We looked at which terms our system passed to the search engine and compared them with the terms AltaVista treated as important: the "word count" and "ignored" features of AltaVista show which words and word groups it considered important in a query. The first results are promising for the chunking approach. The most important remaining problem is the recognition of negation, for which we propose a hybrid system with templates. Finally, we suggest adding a module that lets the user alter the query visually.