Databestanden in de Taaltechnologie
[ Top | Course | Literature | Seminar | Exam ]

Period 3: February-April 2003

Lecturer: Paola Monachesi
tel: 030-2536653
Computational Linguistics
Trans 10, room 2.18
3512 JK Utrecht
Office hours: Wednesday 13.00-14.00


The course provides a basic introduction to repositories of linguistic data, focusing on two kinds: annotated corpora and linguistic databases.

A corpus is a collection of texts, usually annotated with part-of-speech and morphological information; many corpora also encode syntactic and phonological information. Linguistic databases are usually oriented towards the description of specific phenomena, either by means of glossed examples or by means of variables.
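As a rough illustration of what annotation at the word level can look like, the sketch below stores each token together with a lemma and a part-of-speech tag. The `Token` class, the tag names, and the toy sentence are assumptions made for this example only; they do not reproduce the format or tag set of any corpus used in the course.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Token:
    form: str   # word as it appears in the text
    lemma: str  # dictionary form (morphological annotation)
    pos: str    # part-of-speech tag (illustrative tag set, not a real standard)


# A one-sentence toy corpus: "the dogs bark"
sentence = [
    Token("the", "the", "DET"),
    Token("dogs", "dog", "NOUN"),
    Token("bark", "bark", "VERB"),
]


def forms_with_pos(tokens, pos):
    """Return the word forms that carry the given part-of-speech tag."""
    return [t.form for t in tokens if t.pos == pos]


def pos_counts(tokens):
    """Count how often each part-of-speech tag occurs."""
    return Counter(t.pos for t in tokens)


print(forms_with_pos(sentence, "NOUN"))
print(pos_counts(sentence))
```

Even this small structure shows why annotation matters: a query can ask for all nouns, or for all forms of one lemma, rather than for literal strings only.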

Corpora and databases are crucial tools in the formulation of linguistic generalizations and analyses. In addition, corpora are often used for machine learning purposes.

The course will look at different types of corpora (e.g. written and spoken) and will focus on the different levels of annotation, with hands-on sessions on the Corpus of Spoken Dutch. Different types of databases will be discussed, and various techniques for designing linguistic databases will be presented.
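To give an idea of what designing a small linguistic database might involve, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, the column names, and the three example records are hypothetical and chosen only for illustration: each record pairs an example sentence with a gloss and one descriptive variable (basic word order).

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one glossed example per row plus one variable.
conn.execute(
    "CREATE TABLE example ("
    " id INTEGER PRIMARY KEY,"
    " language TEXT NOT NULL,"
    " text TEXT NOT NULL,"
    " gloss TEXT NOT NULL,"
    " word_order TEXT NOT NULL)"
)

rows = [
    ("English", "I see the dog", "I see the dog", "SVO"),
    ("French", "je vois le chien", "I see the dog", "SVO"),
    ("Japanese", "watashi wa inu o miru", "I TOP dog ACC see", "SOV"),
]
conn.executemany(
    "INSERT INTO example (language, text, gloss, word_order)"
    " VALUES (?, ?, ?, ?)",
    rows,
)


def languages_with_order(connection, order):
    """Return the languages whose examples are coded with the given word order."""
    cur = connection.execute(
        "SELECT DISTINCT language FROM example"
        " WHERE word_order = ? ORDER BY language",
        (order,),
    )
    return [row[0] for row in cur.fetchall()]


print(languages_with_order(conn, "SVO"))
```

Coding the phenomenon as a variable makes it possible to query across languages, while the glossed example keeps the evidence for each value visible.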

The course will provide the necessary basis for possible assistantships and internships within the related projects carried out at Utrecht University.


General topics:

Corpora, databases, annotation tools, standards for linguistic resources, metadata.


Nerbonne, J. (1998). Linguistic Databases. Stanford: CSLI.

We will read the introduction, which provides a good overview of the state of the art in the field of linguistic databases.

Van Eynde, F. (2000). Part of Speech Tagging en Lemmatisering. Leuven.

It describes the part-of-speech tagging adopted in the Corpus of Spoken Dutch project.

Moortgat, M. et al. (2000). Syntactische Annotatie. Utrecht.

It describes the syntactic annotation adopted in the Corpus of Spoken Dutch project.

Hoekstra et al. (2001). Syntactic Annotation for the Spoken Dutch Corpus Project. In CLIN 2000 Proceedings, edited by Walter Daelemans et al. Amsterdam: Rodopi, 73-87.

It provides a short description (in English) of the syntactic annotation adopted in the Corpus of Spoken Dutch project.

Oostdijk N. et al. (2002). Experiences from the Spoken Dutch Corpus project. Proceedings of LREC 2002.

It provides a general description of the Corpus of Spoken Dutch project.

Boves, L. & N. Oostdijk (2003). Spontaneous Speech in the Spoken Dutch Corpus. In Proceedings of the ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR). 14-16 April 2003, Tokyo, Japan.

Oostdijk, N. & D. Broeder (2003). The Spoken Dutch Corpus and Its Exploitation Environment. In Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-03). 14 April 2003, Budapest, Hungary.

Monachesi, P. (2000). Syntactic annotation for spontaneous speech fragments. Utrecht.

It provides an inventory of the problems related to the annotation of spontaneous speech fragments.

Monachesi, P. et al. (2001). The Typological Database System. Proceedings of the IRCS Workshop on Linguistic Databases. Philadelphia.

It provides a description of the Typological Database System project.

Relevant Links

Corpora: general

Various corpora

Corpus searches

The CHILDES corpus

The Corpus of Spoken Dutch

Annotation Tools



Linguistic Databases

Database software

Linguistic Archives


Useful software


Day                              Location           Time
Wednesday                        KNG80, Room 009    13.00 - 17.00
Tuesday 18/3 (instead of 19/3)   KNG80, Room 009    10.00 - 12.00


Week 1: Tuesday 4/2 Linguistic Corpora and Linguistic Databases: similarities and differences. Read the article by John Nerbonne. Write a summary.
Wednesday 5/2 Overview of different types of corpora (e.g. written, spoken) via the LDC web page and the University of Helsinki web page, to get acquainted with the various corpora available. Make an inventory.
Week 2: Tuesday 11/2 Searching a corpus for linguistic analyses: Brown corpus.
Wednesday 12/2 Continuation of corpus search.
Week 3: Tuesday 18/2 Corpus search: CHILDES database. Theory.
Wednesday 19/2 Searching a corpus for linguistic analyses: CHILDES database.
Week 4: Tuesday 25/2 The Corpus Gesproken Nederlands project. Discussion of the different levels of annotation. Read the articles (Oostdijk et al. and Moortgat et al.). Write a summary.
Wednesday 26/2 Different annotation tools. Search the LDC webpage.
Week 5: Tuesday 4/3 Morphological annotation.
Wednesday 5/3 Syntactic annotation in the CGN. Learning Annotate. Annotation of sentences. Write a report.
Week 6: Tuesday 11/3 Using CGN for linguistic research.
Wednesday 12/3 Using CGN for linguistic research.
Week 7: Tuesday 18/3 Databases to encode linguistic information. Getting acquainted with various linguistic databases.
Wednesday 19/3 Read the papers by D. Brown (Linguistic Databases workshop; also have a look at the webpage) and by S. Musgrave (LREC conference). Write a summary.
Week 8: Tuesday 25/3 The Typological Database System project. Have a look at the webpage. Read the paper. Write a summary.
Wednesday 26/3 Linguistic Archives and Web publishing. Standards. Metadata. The OLAC initiative. Read the paper. Write a summary.

Handing in exercises:

Exercises are due on Friday of each week by 12.00.
Exercises must be submitted in HTML.


You can find the evaluation of your tasks here.


Final project: building a linguistic database (Weeks 9-11)