Neural-Network-Based Grapheme-to-Phoneme Conversion

Ivelin Stoianov (Rijksuniversiteit Groningen)
John Nerbonne (Rijksuniversiteit Groningen)


The problem of converting graphemes to phonemes is part of the larger
text-to-speech conversion problem, and it has traditionally been solved
with rule-based approaches. An alternative explored in recent years is
to train neural networks (NNs) to convert sequences of letters into
sequences of phonemes for a given language. The first known
connectionist implementation of this task - the NETtalk model (Sejnowski
& Rosenberg, 1987) - uses a static Multilayer Perceptron (MLP) and
encodes, with a shifting window, the context that the system needs in
order to compute the corresponding phoneme(s). Other reported systems
also employed the MLP, but with different encoding schemes (e.g.,
Seidenberg & McClelland, 1989). Most subsequent connectionist research
continued to use the MLP (e.g., Plaut, McClelland, Seidenberg &
Patterson, 1996; Zorzi & Houghton, 1998), until recently, when another
NN model - Simple Recurrent Networks (SRNs) - was used to map
orthographic to phonological representations of 6100 monosyllabic words
(Stoianov, Stowe & Nerbonne, 1999). The important difference between
these two NN models is that the latter is dynamic and retains
information about past events in its internal, distributed memory. This
enables SRNs to learn the sequential mapping from text to phonemes
properly, while observing and producing only single graphemes and
phonemes.
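As an illustration of this architecture, the following Python sketch
shows the forward pass of an Elman-style SRN. It is a minimal sketch
under our own assumptions (layer sizes, logistic activations, random
initialization); it is not the exact configuration or training
procedure used in the experiments.

    import numpy as np

    class SimpleRecurrentNetwork:
        """Minimal Elman-style SRN: at each time step the hidden layer
        receives the current input plus a copy of its own previous
        state (the context layer), acting as a distributed memory."""

        def __init__(self, n_in, n_hidden, n_out, seed=0):
            rng = np.random.default_rng(seed)
            self.W_in = rng.normal(0.0, 0.1, (n_hidden, n_in))       # input -> hidden
            self.W_ctx = rng.normal(0.0, 0.1, (n_hidden, n_hidden))  # context -> hidden
            self.W_out = rng.normal(0.0, 0.1, (n_out, n_hidden))     # hidden -> output
            self.context = np.zeros(n_hidden)                        # memory of past events

        @staticmethod
        def _sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def step(self, grapheme_vec):
            """Read one grapheme vector, emit one phoneme vector."""
            hidden = self._sigmoid(self.W_in @ grapheme_vec +
                                   self.W_ctx @ self.context)
            self.context = hidden.copy()  # keep the state for the next step
            return self._sigmoid(self.W_out @ hidden)

        def reset(self):
            """Clear the memory between words."""
            self.context[:] = 0.0

Because the context layer accumulates the history of the word processed
so far, the network can emit one phoneme per step while still using the
preceding letters for disambiguation, which is what allows it to
dispense with NETtalk's shifting input window.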

In the current report, we present a further exploration of the SRN's
capacity to learn such a complex mapping, using feature-based grapheme
and phoneme representations. This encoding provides the neural network
with background knowledge about the similarities among phonemes and
graphemes, which makes learning easier, as opposed to the orthogonal
encoding in Stoianov, Stowe & Nerbonne (1999), where every phoneme is
encoded with a single output neuron. With feature-based encoding, NNs
with four times fewer parameters learned the same task, which means
four times faster training. This also opened the possibility of
learning larger corpora consisting of polysyllabic words.
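The contrast between the two encodings can be sketched as follows. The
phoneme inventory and the articulatory feature set below are
hypothetical and heavily abbreviated, chosen only to illustrate the
idea; they are not the feature sets used in the experiments.

    import numpy as np

    # Orthogonal (one-hot) encoding: one output neuron per phoneme, so
    # the output layer grows with the size of the phoneme inventory.
    PHONEMES = ['p', 'b', 't', 'd', 'k', 'g', 'm', 'n']  # abbreviated inventory

    def one_hot(phoneme):
        vec = np.zeros(len(PHONEMES))
        vec[PHONEMES.index(phoneme)] = 1.0
        return vec

    # Feature-based encoding: each phoneme is a bundle of articulatory
    # features, so similar phonemes receive similar codes and the
    # output layer stays small regardless of the inventory size.
    FEATURES = ['voiced', 'nasal', 'labial', 'alveolar', 'velar']

    PHONEME_FEATURES = {
        'p': {'labial'},
        'b': {'voiced', 'labial'},
        't': {'alveolar'},
        'd': {'voiced', 'alveolar'},
        'k': {'velar'},
        'g': {'voiced', 'velar'},
        'm': {'voiced', 'nasal', 'labial'},
        'n': {'voiced', 'nasal', 'alveolar'},
    }

    def feature_vector(phoneme):
        return np.array([1.0 if f in PHONEME_FEATURES[phoneme] else 0.0
                         for f in FEATURES])

    # /p/ and /b/ are orthogonal (maximally distant) under the one-hot
    # encoding, but differ in only one feature bit under the feature
    # encoding.
    print(one_hot('p'), one_hot('b'))
    print(feature_vector('p'), feature_vector('b'))

Under the one-hot scheme any two phonemes are equally distant, while
under the feature scheme phonetically similar phonemes receive similar
codes, and far fewer output neurons are needed. The smaller input and
output layers shrink the weight matrices, which accounts for the
reduction in network parameters mentioned above.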

In the presentation, we will provide more details about the model and
compare its performance to that of other methods.