Databestanden in de Taaltechnologie
Period 3: February-April 2004

Lecturer: Paola Monachesi
tel: 030-2536065
Computationele Linguistiek
Trans 10, kamer 2.13
3512 JK Utrecht

The aim of the course is to provide an introduction to repositories of linguistic data, focusing on annotated corpora (both written and spoken), linguistic databases, and computational lexicons.

Corpora are collections of texts, usually annotated with part-of-speech and morphological information; they often encode syntactic and phonological information as well. Linguistic databases are usually oriented towards the description of specific phenomena, either by means of examples with glosses or by means of variables. Computational lexicons are repositories of linguistic information for use in language processing applications.
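
As a rough illustration of what "annotated with part-of-speech and morphological information" can mean in practice, the sketch below stores each token with its word form, lemma, and tag. The Token class, the tag names, and the Dutch example sentence are assumptions made for illustration; they do not follow the tagset of any particular corpus.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # the word as it appears in the text
    lemma: str  # the dictionary form
    pos: str    # part-of-speech tag (hypothetical tagset)


# A one-sentence toy corpus; the sentence and its tags are made up.
sentence = [
    Token("de", "de", "DET"),
    Token("kinderen", "kind", "NOUN"),
    Token("slapen", "slapen", "VERB"),
]


def pos_counts(tokens):
    """Count how often each part-of-speech tag occurs."""
    return Counter(t.pos for t in tokens)


def lemmas_with_pos(tokens, pos):
    """Return the lemmas of all tokens carrying the given tag."""
    return [t.lemma for t in tokens if t.pos == pos]
```

A query such as lemmas_with_pos(sentence, "NOUN") is the kind of search that annotation makes possible and that plain text does not support directly.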

Special attention will be given to the techniques and standards that should be adopted when developing linguistic resources. In particular, the course will look at different types of corpora and will focus on the various levels of annotation, with hands-on sessions on the Corpus of Spoken Dutch. Different types of databases will be discussed, and various techniques for designing linguistic databases will be presented. Furthermore, attention will be given to WordNet, an alternative approach to organizing lexical information whose design is inspired by current psycholinguistic theories of human lexical memory.

Issues concerning the archiving of linguistic resources on the web, including metadata standards (which serve as finding aids), will be addressed, and current initiatives will be presented. Issues related to the digital encoding of language data, including standards such as XML, will also be discussed.
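
To give a concrete impression of XML as an encoding for language data, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The element and attribute names (s, w, lemma, pos) are assumptions chosen for illustration and do not reproduce the format of any specific corpus or standard.

```python
import xml.etree.ElementTree as ET


def encode_sentence(sent_id, tokens):
    """Build an XML element for one sentence from (form, lemma, pos) triples."""
    s = ET.Element("s", {"id": sent_id})
    for form, lemma, pos in tokens:
        w = ET.SubElement(s, "w", {"lemma": lemma, "pos": pos})
        w.text = form
    return s


def decode_sentence(xml_text):
    """Read the (form, lemma, pos) triples back from the XML text."""
    s = ET.fromstring(xml_text)
    return [(w.text, w.get("lemma"), w.get("pos")) for w in s.iter("w")]


# Toy data; the sentence and its tags are made up.
tokens = [
    ("de", "de", "DET"),
    ("kinderen", "kind", "NOUN"),
    ("slapen", "slapen", "VERB"),
]
xml_text = ET.tostring(encode_sentence("s1", tokens), encoding="unicode")
```

Because the annotation lives in named elements and attributes, any XML-aware tool can read the data back, which is the main argument for a shared encoding standard.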

The course will also focus on possible uses of language resources, including linguistic research, the development of new human language technologies, and teaching.

General topics:

Corpora, databases, WordNet, annotation tools, standards for linguistic resources, metadata.

Nerbonne, J. (1998). Linguistic Databases. Stanford: CSLI.

We will read the introduction which provides a good overview of the state of the art in the field of linguistic databases.

Van Eynde, F. (2000). Part of Speech Tagging en Lemmatisering. Leuven.

It provides a description of the part-of-speech tagging adopted in the Corpus of Spoken Dutch project.

Moortgat, M. et al. (2000). Syntactische Annotatie. Utrecht.

It provides a description of the syntactic annotation adopted in the Corpus of Spoken Dutch project.

Hoekstra et al. (2001). Syntactic Annotation for the Spoken Dutch Corpus Project. In CLIN 2000 Proceedings, edited by Walter Daelemans et al. Amsterdam: Rodopi, 73-87.

It provides a short description (in English) of the syntactic annotation adopted in the Corpus of Spoken Dutch project.

Oostdijk, N. et al. (2002). Experiences from the Spoken Dutch Corpus project. Proceedings of LREC 2002.

It provides a general description of the Corpus of Spoken Dutch project.

Boves, L. & N. Oostdijk (2003). Spontaneous Speech in the Spoken Dutch Corpus. In Proceedings of the ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR), 14-16 April 2003, Tokyo, Japan.

Oostdijk, N. & D. Broeder (2003). The Spoken Dutch Corpus and Its Exploitation Environment. In Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-03), 14 April 2003, Budapest, Hungary.

Monachesi, P. et al. (2001). The Typological Database System. Proceedings of LREC 2002.

It provides a description of the Typological Database System project.

Relevant Links

Corpora: general

Various corpora

Corpus searches

The CHILDES corpus

The Corpus of Spoken Dutch

Annotation Tools


The Annotate tool can be used to annotate CGN sentences on Syntax. The command to start it is ./start_annotate. The student accounts are an300-an309.


To access TIGERSearch on syntax: /usr/local/TIGERSearch-2.1/bin/TIGERSearch

Linguistic Databases

MySQL: database software


Linguistic Archives


Useful software



Day                                Location            Time
Wednesday                          KNG80, Room 1.08    13.00 - 17.00
Thursday 18-3 (instead of 17-3)    KNG80, Room 1.08    13.00 - 17.00

There is only one weekly meeting for this course, but you are expected to work at home more than for an average course. The meeting is intended for discussion, presentations, and clarification of the reading material.


Week 1: Wednesday 11/2 Linguistic Corpora and Linguistic Databases: similarities and differences. Read the article by John Nerbonne. Write a summary.
Databases to encode linguistic information. Getting acquainted with various linguistic databases. Read the paper by D. Brown from the Linguistic Databases workshop (and have a look at the webpage) and the paper by S. Musgrave from the LREC conference. Write a summary.
Week 2: Wednesday 18/2 The Typological Database System project. Have a look at the webpage. Read the paper.
Write a summary.
Read the MySQL documentation.
Overview of different types of corpora (written, spoken) via LDC and the various links on the course web page.
Week 3: Wednesday 25/2 Searching a corpus for linguistic analyses: Brown corpus.
Read the article and write a report.
Week 4: Wednesday 3/3 Searching a corpus for linguistic analyses: CHILDES database.
Read the article and write a report.
Week 5: Wednesday 10/3 The Corpus Gesproken Nederlands project. Discussion of the different levels of annotation. Read the articles (Oostdijk et al. and Moortgat et al.). Write a summary.
Different annotation tools. Search the LDC webpage and make an inventory.
Week 6: Wednesday 17/3 Morphological and syntactic annotation in the CGN. Learning Annotate. Read articles. Annotation of sentences. Write a report.
Week 7: Wednesday 24/3 Using CGN for linguistic research. Read the thesis. Write a report.
Week 8: Tuesday 31/3 WordNet. Read the first and last paper. Write a report.
Linguistic Archives and Web publishing. Standards. Metadata. The OLAC initiative. Read the paper and make a report.
Week 9: Monday 5/4 Wrapping up.

Hand in exercises:

Friday of each week.
Exercises have to be in HTML.


  • Weekly tasks: 30%
  • Presentations and class participation: 20%
  • Final project: 50%

Final Project: building a linguistic database (Weeks 3-10)
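
As a possible starting point for the final project, the sketch below stores glossed examples in a small relational database and retrieves them by phenomenon. It uses Python's built-in sqlite3 module instead of the MySQL software listed above, only so that the example is self-contained. The table names, column names, and the sample record are assumptions made for illustration.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE language (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE example (
    id          INTEGER PRIMARY KEY,
    language_id INTEGER NOT NULL REFERENCES language(id),
    text        TEXT NOT NULL,
    gloss       TEXT NOT NULL,
    translation TEXT NOT NULL,
    phenomenon  TEXT NOT NULL
);
""")

# One made-up record, for illustration only.
conn.execute("INSERT INTO language (id, name) VALUES (1, 'Dutch')")
conn.execute(
    "INSERT INTO example (language_id, text, gloss, translation, phenomenon) "
    "VALUES (?, ?, ?, ?, ?)",
    (1, "de kinderen slapen", "the child.PL sleep.PL",
     "the children are sleeping", "agreement"),
)


def examples_for(conn, phenomenon):
    """Return (language, text, gloss) rows illustrating one phenomenon."""
    rows = conn.execute(
        "SELECT l.name, e.text, e.gloss "
        "FROM example e JOIN language l ON l.id = e.language_id "
        "WHERE e.phenomenon = ? ORDER BY e.id",
        (phenomenon,),
    )
    return rows.fetchall()
```

Keeping languages in their own table avoids repeating the language name in every example, and the phenomenon column is what lets the database be queried by linguistic phenomenon rather than by text.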