Bootstrapping Morphosyntactic Annotation for the Corpus of Spoken Dutch

Frank Van Eynde (K.U.Leuven)
Jakub Zavrel (University of Antwerp)
Walter Daelemans (University of Antwerp)


The Dutch-Flemish project `Corpus Gesproken Nederlands' (1998-2003) aims at
the construction of a corpus of 10 million words of spoken Dutch. The
corpus will include several layers of transcription (orthographic,
phonetic) and annotation (part-of-speech tags, syntactic analysis, prosodic
analysis). 

The talk will focus on the POS tagging of the data. In the first part we will present the general characteristics of POS tagging in CGN (word-by-word, automatic with human correction, conforming to the EAGLES recommendations) and give an overview of the tagset (10 parts of speech with associated morphosyntactic features).

In the second part we will survey the automatic taggers and lemmatizers currently available for Dutch (XEROX, TnT, MBT, MXPOST, Brill, KEPER, CORRie, D-Tale). None of these taggers uses the desired CGN tagset. We will present the results of a comparative evaluation of these taggers with respect to their usefulness for bootstrapping and facilitating the annotation of the CGN. We report the accuracy of the taggers in terms of their own tagsets, the results of mapping from a native tagset to the CGN tagset, the results of training on a very small initial sample of the CGN, and the results of combining all participating taggers. The combination of taggers shows higher accuracy than all other approaches. Among the single taggers, TnT was found to be the most useful and accurate in general.
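
To make the mapping and combination steps concrete, the following is a minimal sketch in Python. It is not the procedure used in the evaluation: the abstract does not say how the taggers were combined, so simple majority voting is assumed here. The tagger names (tagger_a, tagger_b, tagger_c), the native tags, the mapping tables and the three-word example are all hypothetical and serve only as illustration.

```python
from collections import Counter

# Hypothetical mapping tables from each tagger's native tagset to
# CGN-style part-of-speech labels. Real tables would be far larger.
NATIVE_TO_CGN = {
    "tagger_a": {"NOUN": "N", "VERB": "WW", "DET": "LID"},
    "tagger_b": {"nn": "N", "vb": "WW", "art": "LID"},
    "tagger_c": {"N": "N", "V": "WW", "Art": "LID"},
}


def map_tags(tagger, native_tags):
    """Translate one tagger's native tags into the target tagset."""
    table = NATIVE_TO_CGN[tagger]
    return [table.get(tag, "UNKNOWN") for tag in native_tags]


def combine_by_vote(tag_sequences):
    """Pick, for each token, the tag proposed by the most taggers.

    Assumed combination scheme: plain majority voting. On a tie, the tag
    proposed by the earliest-listed tagger wins.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*tag_sequences)]


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Invented example: three tokens with hand-assigned gold tags.
gold = ["LID", "N", "WW"]
mapped = [
    map_tags("tagger_a", ["DET", "NOUN", "NOUN"]),
    map_tags("tagger_b", ["art", "vb", "vb"]),
    map_tags("tagger_c", ["Art", "N", "V"]),
]
combined = combine_by_vote(mapped)
```

In this invented example each of the first two taggers makes one error on a different token, and the vote recovers the gold tag in both positions. This only illustrates why combining taggers can outperform any single one; it says nothing about the figures measured on the CGN data.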