Databestanden in de Taaltechnologie
[ Top | Course | Literature | Seminar | Exam ]

Period 2: November 2004 - February 2005

Lecturer: Paola Monachesi
tel: 030-2536065
Computationele Linguistiek
Trans 10, kamer 2.13
3512 JK Utrecht

The aim of the course is to provide an introduction to repositories of linguistic data, focusing on annotated corpora (both written and spoken), linguistic databases and computational lexicons.

Corpora are collections of texts, usually annotated with part-of-speech and morphological information. They often encode syntactic and phonological information as well. Linguistic databases are usually oriented towards the description of specific phenomena, either by means of glossed examples or by means of variables. Computational lexicons are repositories of linguistic information for use in language processing applications.

Special attention will be given to the techniques and standards that should be adopted when developing linguistic resources. In particular, the course will look at different types of corpora and focus on the various levels of annotation, with hands-on sessions on the Corpus of Spoken Dutch. Different types of databases will be discussed, and various techniques for designing linguistic databases will be presented. Furthermore, attention will be given to WordNet, a different approach to lexical annotation whose design is inspired by current psycholinguistic theories of human lexical memory.
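To make the idea of a linguistic database more concrete, here is a minimal sketch of one that stores both glossed examples and variables. Everything in it is an assumption made for illustration: the table and column names, the single Dutch example, and the variable name are hypothetical, and Python's built-in sqlite3 module stands in for a database server such as MySQL so that the sketch is self-contained.

```python
import sqlite3

# In-memory database; sqlite3 is used here only as a stand-in for a server
# such as MySQL. All table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE example (
    id          INTEGER PRIMARY KEY,
    language    TEXT NOT NULL,
    text        TEXT NOT NULL,   -- the example sentence
    gloss       TEXT NOT NULL,   -- word-by-word gloss
    translation TEXT NOT NULL    -- free translation
);
CREATE TABLE variable (
    language TEXT NOT NULL,
    name     TEXT NOT NULL,      -- name of the described property
    value    TEXT NOT NULL,
    PRIMARY KEY (language, name)
);
""")

# One illustrative glossed example and one illustrative variable.
conn.execute(
    "INSERT INTO example (language, text, gloss, translation) VALUES (?, ?, ?, ?)",
    ("Dutch", "de kinderen lopen", "the child.PL walk.PL", "the children walk"),
)
conn.execute(
    "INSERT INTO variable (language, name, value) VALUES (?, ?, ?)",
    ("Dutch", "main_clause_word_order", "V2"),
)

# Retrieve the glossed examples for every language with a given variable value.
rows = conn.execute(
    """
    SELECT e.text, e.gloss
    FROM example AS e
    JOIN variable AS v ON v.language = e.language
    WHERE v.name = ? AND v.value = ?
    """,
    ("main_clause_word_order", "V2"),
).fetchall()
```

Keeping examples and variables in separate tables linked by language is only one possible design. It lets a query start from a property and arrive at the supporting examples.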

Issues concerning the archiving of linguistic resources on the web, including metadata standards (which serve as finding aids), will be addressed, and current initiatives will be presented. Issues related to the digital encoding of language data, including standards such as XML, will also be discussed.
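As a minimal sketch of what encoding annotated language data in XML can look like, the following Python fragment writes one sentence with lemma and part-of-speech attributes and reads it back. The element names `s` and `w`, the attribute names, and the tag values are assumptions chosen for illustration; they are not the format of the Corpus of Spoken Dutch or of any particular standard.

```python
import xml.etree.ElementTree as ET


def encode_sentence(tokens):
    """Build an <s> element holding one <w> element per (word, lemma, pos) triple.

    The element and attribute names are hypothetical.
    """
    sentence = ET.Element("s")
    for word, lemma, pos in tokens:
        w = ET.SubElement(sentence, "w", {"lemma": lemma, "pos": pos})
        w.text = word
    return sentence


# Illustrative tokens with illustrative tag names.
tokens = [("de", "de", "DET"), ("kinderen", "kind", "N"), ("lopen", "lopen", "V")]

# Serialise to a string, then parse it again to show the round trip.
xml_text = ET.tostring(encode_sentence(tokens), encoding="unicode")
reparsed = ET.fromstring(xml_text)
lemmas = [w.get("lemma") for w in reparsed.iter("w")]
nouns = [w.text for w in reparsed.iter("w") if w.get("pos") == "N"]
```

Because the annotation lives in attributes, a query such as "all nouns" becomes a simple filter over the `w` elements, without touching the running text.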

The course will also consider possible uses of language resources, including linguistic research, the development of new human language technologies, and use as teaching aids.

General topics:

Corpora, databases, WordNet, annotation tools, standards for linguistic resources, metadata.

Nerbonne, J. (1998). Linguistic Databases. Stanford: CSLI.

We will read the introduction, which provides a good overview of the state of the art in the field of linguistic databases.

Van Eynde, F. (2000). Part of Speech Tagging en Lemmatisering. Leuven.

It provides a description of the part of speech tagging adopted within the project Corpus of Spoken Dutch.

Moortgat, M. et al. (2000). Syntactische Annotatie. Utrecht.

It provides a description of the syntactic annotation adopted within the project Corpus of Spoken Dutch.

Hoekstra et al. (2001). Syntactic Annotation for the Spoken Dutch Corpus Project. In CLIN 2000 Proceedings, edited by Walter Daelemans et al. Amsterdam: Rodopi, 73-87.

It provides a short description (in English) of the syntactic annotation adopted within the project Corpus of Spoken Dutch.

Oostdijk, N. et al. (2002). Experiences from the Spoken Dutch Corpus project. Proceedings of LREC 2002.

It provides a general description of the Corpus of Spoken Dutch project.

Boves, L. & N. Oostdijk (2003). Spontaneous Speech in the Spoken Dutch Corpus. In Proceedings of the ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR). 14-16 April 2003, Tokyo, Japan.

Oostdijk, N. & D. Broeder (2003). The Spoken Dutch Corpus and Its Exploitation Environment. In Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-03). 14 April 2003, Budapest, Hungary.

Monachesi, P. et al. (2001). The Typological Database System. Proceedings of LREC 2002.

It provides a description of the Typological Database System project.

Relevant Links

Corpora: general

Language resources associations and networks

Various corpora

Corpus searches

The CHILDES corpus

The Corpus of Spoken Dutch

Annotation Tools


The tool annotate can be used to annotate CGN sentences on Syntax. The command to start it is: ./start_annotate. The student accounts are an300-an309.


To access TIGERSearch on syntax: /usr/local/TIGERSearch-2.1/bin/TIGERSearch

Linguistic Databases

MySQL (database software)


Linguistic Archives


Useful software



Day: Wednesday
Place: KNG80, Room 1.08
Time: 10.00 - 13.00, 15.00 - 16.00

There is only one weekly meeting for this course, but you are expected to work at home more than for an average course. The meeting is intended for discussion, presentations and clarification of the reading material.


Week 1: Wednesday 24/11 Language resources: their retrieval and the industrial need for them. Standards. Metadata. The OLAC initiative.
Read the following papers in this order:
  1. Industrial needs for language resources: paper
  2. Technical, strategic and political issues of language resources: paper
  3. European Language Resources Association: paper
  4. The OLAC initiative: paper
  5. Retrieval of language resources: paper
Have a look at the links about ELRA and LDC, as well as all the links about Metadata.
Write a comprehensive report that ties together the papers and links, points out possible problems and shortcomings, and identifies points for future investigation and research. Write it in HTML (check the links for a tutorial).
Prepare a short presentation of the main points, including points for discussion.
Week 2: Wednesday 1/12 Databases to encode linguistic information. Getting acquainted with various linguistic databases. Read the following papers:
  1. The Surrey databases
  2. The Spinoza database
  3. The MedTyp database
  4. The typological database project: architecture
  5. The typological database project: data integration
Have a look at the links under the Linguistic Databases section and try out the various databases. Also have a look at the programme and the papers of the Linguistic Database workshop.

Write a comprehensive report that ties together the papers and links, points out possible problems and shortcomings, and identifies points for future investigation and research. Write it in HTML.
Prepare a short presentation of the main points, including points for discussion.
From the previous session, check:
  1. validation criteria of LDC, ELRA
  2. selling and acquisition policies of both ELRA, LDC
  3. Language resources for Arabic (NEMLAR); S. Krauwer will come to talk about it.
Week 3: Wednesday 8/12 Evaluate the different papers on typological databases and all the databases online. Identify their shortcomings and their positive features. Define the criteria for an 'ideal typological database': if you had to build your own typological database, what would it look like? Which features would it have? Think about software, theory, standards, interface, data, accessibility, etc.
Week 4: Wednesday 15/12 Overview of different types of corpora (written and spoken) via LDC and the various links on the course web page. Inventory. Searching a corpus for linguistic analyses: BNC corpus.
Read the article and write a report.
Week 5: Wednesday 22/12
XML and XSLT and their use within language resources. Work through the tutorials (cf. links) and do the exercises.
Various annotation tools. Check the links under Annotation Tools. Inventory.
Check the various links under Corpora and Corpus searches and prepare points for discussion.
Write a proposal for the final project database: decide on a topic, choose the software, and state the requirements.
Week 6: Wednesday 12/1 Searching a corpus for linguistic analyses: CHILDES database.
Read the article and write a report.
Week 7: Wednesday 19/1 The Corpus Gesproken Nederlands project. Discussion of the different levels of annotation. Read the articles (Oostdijk et al. (2 papers), Boves et al. and Hoekstra et al.). Write a summary.
Semantic annotation. Read the papers I have sent. Evaluation of possibilities. Possible annotation level.
XML exercises.
Week 8: Tuesday 26/1 WordNet. Read the first and last paper. Write a report.
Week 9: Monday 2/2
Wrapping up.

Hand in exercises:

Friday of each week.
Exercises have to be in HTML.


  • Weekly tasks: 50%
  • Presentations and class participation: 20%
  • Final project: 30%

The evaluation of your tasks can be found here.

Final Project: building a linguistic database (Week 3-10)