A Machine Learning Approach to Phonemic Corpus Annotation

Veronique Hoste (University of Antwerp)
Steven Gillis (University of Antwerp)
Walter Daelemans (University of Antwerp)

In this paper we describe the results of applying machine learning
techniques to the automatic phonemic transcription of Dutch texts. The
task, set in the context of the "Corpus Gesproken Nederlands" (CGN),
consists of generating a phonemic transcription from an orthographic
transcription of recordings of spoken Dutch. The texts represent both
the Northern Dutch and the Flemish variants and cover highly formal
speech, sloppy speech, and an intermediate level.

Since both variants of Dutch are present, the CELEX lexical database and the FONILEX database were selected as training material. The first part of the paper consists of a comprehensive qualitative and quantitative comparison of the phonemic transcriptions in CELEX and FONILEX. More specifically, the relationships between CELEX and FONILEX will be explored on the basis of the output of rule-induction techniques such as transformation-based error-driven learning (Brill) and C5.0 (Quinlan).
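
As a rough illustration of the quantitative side of such a comparison, the sketch below tallies position-wise symbol differences between two transcriptions of the same word. It is a much simpler stand-in for the rule-induction techniques named above, not an implementation of them. The function name, the assumption that the two transcriptions are already aligned symbol by symbol, and the toy strings are all hypothetical; the strings are placeholders, not real CELEX or FONILEX entries.

```python
from collections import Counter


def substitution_counts(pairs):
    """Count position-wise symbol differences between aligned transcriptions.

    Assumption: each (source, target) pair has already been aligned
    symbol by symbol, so both strings have the same length.
    """
    counts = Counter()
    for source, target in pairs:
        if len(source) != len(target):
            raise ValueError("transcriptions must be aligned to equal length")
        for s, t in zip(source, target):
            if s != t:
                counts[(s, t)] += 1
    return counts


# Placeholder strings for illustration only (not real lexicon entries).
toy_pairs = [("abca", "dbca"), ("aef", "deg"), ("bcb", "bcb")]
diffs = substitution_counts(toy_pairs)
for (s, t), n in diffs.most_common():
    print(f"{s} -> {t}: {n}")
```

Frequent substitution pairs in such a tally are candidates for systematic differences between the two lexica; a rule learner would additionally capture the contexts in which each difference occurs.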

The second part of the paper deals with the actual generation of phonemic transcriptions. In generating the transcriptions, different approaches were adopted:

  1. training on CELEX spelling-transcription pairs for Dutch, and on FONILEX spelling-transcription pairs for Flemish;
  2. training on CELEX spelling-transcription pairs for Dutch, and on pairs mapping CELEX transcriptions to FONILEX transcriptions for Flemish;
  3. training on FONILEX spelling-transcription pairs for Flemish, and on pairs mapping FONILEX transcriptions to CELEX transcriptions for Dutch.

The success rates of these approaches will be compared and discussed.
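
The difference between the direct setup (approach 1) and the cascaded setups (approaches 2 and 3) can be sketched as follows. This is a minimal illustration under strong assumptions, not the learner used in the paper: the `train` and `convert` helpers are hypothetical, source and target strings are assumed to be aligned symbol by symbol, and the learner is a simple context lookup with backoff. The toy triples are placeholders (spelling, variant A transcription, variant B transcription), not real lexicon data.

```python
from collections import Counter, defaultdict


def train(pairs):
    """Learn symbol mappings from aligned (source, target) string pairs.

    Each source symbol is stored with its left and right neighbour as
    context; counts for the bare symbol are kept for backoff.
    """
    by_context = defaultdict(Counter)
    by_symbol = defaultdict(Counter)
    for source, target in pairs:
        if len(source) != len(target):
            raise ValueError("pairs must be aligned to equal length")
        padded = "_" + source + "_"  # "_" marks the word boundary
        for i, t in enumerate(target):
            by_context[padded[i:i + 3]][t] += 1
            by_symbol[source[i]][t] += 1
    return by_context, by_symbol


def convert(model, source):
    """Map a source string to a target string with a trained model."""
    by_context, by_symbol = model
    padded = "_" + source + "_"
    out = []
    for i, s in enumerate(source):
        context = padded[i:i + 3]
        if context in by_context:      # seen three-symbol context
            out.append(by_context[context].most_common(1)[0][0])
        elif s in by_symbol:           # back off to the symbol alone
            out.append(by_symbol[s].most_common(1)[0][0])
        else:                          # unseen symbol: copy it through
            out.append(s)
    return "".join(out)


# Placeholder triples: (spelling, variant A, variant B).
toy = [("abc", "ABC", "ABD"), ("cab", "CAB", "DAB")]

# Approach 1 style: spelling -> variant B directly.
direct_b = train([(sp, b) for sp, _, b in toy])

# Approach 2 style: spelling -> variant A, then variant A -> variant B.
spelling_to_a = train([(sp, a) for sp, a, _ in toy])
a_to_b = train([(a, b) for _, a, b in toy])


def cascade_b(spelling):
    return convert(a_to_b, convert(spelling_to_a, spelling))


print(convert(direct_b, "bca"), cascade_b("bca"))
```

Approach 3 is the mirror image of the cascade, with the roles of the two variants swapped. On real data the direct and cascaded routes would generally not agree on every word, which is what makes comparing their success rates informative.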