Block 4: March-April 2002
The course provides a basic introduction to repositories of linguistic data, by focussing on two of them: annotated corpora and linguistic databases.
A corpus is a collection of texts, usually annotated with part-of-speech and morphological information; many corpora also encode syntactic and phonological information. Linguistic databases are usually oriented towards the description of specific phenomena, either by means of glossed examples or by means of variables.
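To make the distinction concrete, here is a minimal sketch of how an annotated corpus token and a glossed database example might be represented. The class names, fields, tag labels, and the sample sentence are all hypothetical illustrations; they are not the tagset or format of the Corpus of Spoken Dutch or of any database mentioned on this page.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Token:
    """One corpus token with illustrative annotation layers."""
    form: str   # the word as it appears in the text
    lemma: str  # dictionary form
    pos: str    # part-of-speech tag (hypothetical label set)
    morph: str  # morphological information (hypothetical label set)


@dataclass
class GlossedExample:
    """One database record describing a phenomenon via a glossed example."""
    language: str
    text: str
    gloss: str
    translation: str


# A hypothetical annotated sentence; the tags are illustrative only.
sentence = [
    Token("de", "de", "DET", "definite"),
    Token("katten", "kat", "NOUN", "plural"),
    Token("slapen", "slapen", "VERB", "present.plural"),
]

# A hypothetical database record with a word-by-word gloss.
example = GlossedExample(
    language="Dutch",
    text="de katten slapen",
    gloss="the cat.PL sleep.PL",
    translation="the cats are sleeping",
)


def pos_counts(tokens):
    """Count how often each part-of-speech tag occurs in a token list."""
    return Counter(token.pos for token in tokens)


def is_aligned(record):
    """Check that the text and its gloss contain the same number of words."""
    return len(record.text.split()) == len(record.gloss.split())
```

The corpus side stores annotation per token, so it can be queried across running text (for instance with `pos_counts(sentence)`), while the database side stores one record per illustrated phenomenon.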
Corpora and databases are crucial tools in the formulation of linguistic generalizations and analyses. In addition, corpora are often used for machine learning purposes.
The course will look at different types of corpora (e.g. written and spoken) and will focus on the different levels of annotation, with hands-on sessions on the Corpus of Spoken Dutch. Different types of databases will be discussed, and various techniques for designing linguistic databases will be presented.
The course will provide the necessary basis for possible assistantships and internships within the related projects carried out at Utrecht University.
Corpora, databases, annotation tools, standards for linguistic resources, metadata.
We will read the introduction which provides a good overview of the state of the art in the field of linguistic databases.
Van Eynde, F. (2000). Part of Speech Tagging en Lemmatisering. Leuven.
It describes the part-of-speech tagging adopted in the Corpus of Spoken Dutch project.
Moortgat, M. et al. (2000). Syntactische Annotatie. Utrecht.
It describes the syntactic annotation adopted in the Corpus of Spoken Dutch project.
Oostdijk, N. et al. (2002). Experiences from the Spoken Dutch Corpus project. Proceedings of LREC 2002.
It provides a general description of the Corpus of Spoken Dutch project.
Monachesi, P. (2000). Syntactic annotation for spontaneous speech fragments. Utrecht.
It provides an inventory of the problems related to the annotation of spontaneous speech fragments.
Monachesi, P. et al. (2001). The Typological Database System. Proceedings of the IRCS Workshop on Linguistic Databases. Philadelphia.
It provides a description of the Typological Database System project.
Hand in exercises: Monday of each week, by 12.00.
Exercises have to be in HTML.
Discussion: Wednesday of each week.
Evaluation: The evaluation of your tasks can be found here.
Final Project: building a linguistic database