A Machine Learning Approach to Phonemic Corpus Annotation
Veronique Hoste (University of Antwerp)
Steven Gillis (University of Antwerp)
Walter Daelemans (University of Antwerp)
In this paper we describe results of the application of Machine Learning
Techniques to automatic phonemic transcription of Dutch texts. The task to
be accomplished in the context of the "Corpus Gesproken Nederlands" (CGN)
consists of generating a phonemic transcription starting from an
orthographic transcription of recordings of spoken Dutch. The texts
represent both the Northern Dutch and the Flemish variants, and range from
highly formal speech through an intermediate level to sloppy speech.
Since both variants of Dutch are present, the CELEX lexical database and
the FONILEX database were selected as training material. The first part of
the paper consists of a comprehensive qualitative and quantitative
comparison of the phonemic transcriptions in CELEX and FONILEX. More
specifically, the relationships between CELEX and FONILEX will be explored
on the basis of the output of rule-induction techniques such as
transformation-based error-driven learning (Brill) and C5.0 (Quinlan).
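As a rough illustration of the quantitative side of such a comparison, the sketch below aligns the two transcriptions of every word that occurs in both lexicons and counts the segments on which they disagree. This is not the rule-induction procedure named above; it only produces the kind of difference counts that a rule learner would generalise over. The function names, the dictionary format and the toy entries are assumptions made for illustration and are not actual CELEX or FONILEX data.

```python
from collections import Counter
from difflib import SequenceMatcher


def phoneme_differences(source, target):
    """Return (source_segment, target_segment) pairs where two transcriptions differ."""
    diffs = []
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((source[i1:i2], target[j1:j2]))
    return diffs


def compare_lexicons(lex_a, lex_b):
    """Compare the transcriptions of all words present in both lexicons.

    Returns (number of identical transcriptions, number of shared words,
    Counter of differing segment pairs).
    """
    shared = sorted(set(lex_a) & set(lex_b))
    identical = 0
    counts = Counter()
    for word in shared:
        diffs = phoneme_differences(lex_a[word], lex_b[word])
        if not diffs:
            identical += 1
        counts.update(diffs)
    return identical, len(shared), counts


# Made-up placeholder entries (hypothetical, one character per phoneme).
toy_a = {"w1": "abcd", "w2": "xyz", "w3": "pq"}
toy_b = {"w1": "abed", "w2": "xyz", "w4": "rs"}

identical, total, counts = compare_lexicons(toy_a, toy_b)
print(f"{identical} of {total} shared words have identical transcriptions")
for (seg_a, seg_b), n in counts.most_common():
    print(f"{seg_a!r} -> {seg_b!r}: {n}")
```

Sorting the resulting counts by frequency gives a first impression of which systematic differences between the two databases a rule learner would have to capture.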
The second part of the paper deals with the actual generation of phonemic
transcriptions. In generating the transcriptions, different approaches were
used:
(1) training on CELEX spelling-transcription pairs for Dutch, and on
    FONILEX spelling-transcription pairs for Flemish;
(2) training on CELEX spelling-transcription pairs for Dutch, and on CELEX
    transcription - FONILEX transcription pairs for Flemish;
(3) training on FONILEX spelling-transcription pairs for Flemish, and on
    FONILEX transcription - CELEX transcription pairs for Dutch.
The success rates of these approaches will be compared and discussed.
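The three set-ups listed above differ only in which pairs each component is trained on and in whether the two components are chained. The sketch below makes that wiring explicit. It is a minimal illustration under stated assumptions: `train_lookup` is a stand-in learner that merely memorises whole-word pairs (a real system would learn from letter or phoneme contexts and generalise to unseen words), and the helper names, the dictionary format and the toy entries are all hypothetical.

```python
def train_lookup(pairs):
    """Stand-in learner (assumption): memorise input -> output pairs."""
    table = dict(pairs)
    return lambda item: table.get(item)


def cascade(first, second):
    """Chain two converters: the output of `first` is the input of `second`."""
    def run(item):
        intermediate = first(item)
        return None if intermediate is None else second(intermediate)
    return run


def build_systems(celex, fonilex):
    """Build the three set-ups from two dicts mapping spelling -> transcription."""
    shared = set(celex) & set(fonilex)
    spelling_to_celex = train_lookup(celex.items())
    spelling_to_fonilex = train_lookup(fonilex.items())
    celex_to_fonilex = train_lookup((celex[w], fonilex[w]) for w in shared)
    fonilex_to_celex = train_lookup((fonilex[w], celex[w]) for w in shared)
    return {
        # Each variant learned directly from spelling.
        "direct": {
            "dutch": spelling_to_celex,
            "flemish": spelling_to_fonilex,
        },
        # Dutch from spelling; Flemish derived from the Dutch transcription.
        "celex_first": {
            "dutch": spelling_to_celex,
            "flemish": cascade(spelling_to_celex, celex_to_fonilex),
        },
        # Flemish from spelling; Dutch derived from the Flemish transcription.
        "fonilex_first": {
            "dutch": cascade(spelling_to_fonilex, fonilex_to_celex),
            "flemish": spelling_to_fonilex,
        },
    }


# Made-up placeholder entries (hypothetical, not real database content).
toy_celex = {"w1": "A1", "w2": "A2"}
toy_fonilex = {"w1": "B1", "w2": "B2"}

systems = build_systems(toy_celex, toy_fonilex)
for name, system in systems.items():
    print(name, system["dutch"]("w1"), system["flemish"]("w1"))
```

With a memorising learner all three set-ups trivially agree on known words; the comparison only becomes informative once the stand-in is replaced by a learner that generalises, because errors in the first stage of a cascade then propagate to the second.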