Databestanden in de Taaltechnologie
[ Top | Course | Literature | Lectures | Tutorial | Exam | Resit ]

Block 4: March-April 2001

Lecturer: Paola Monachesi
Paola.Monachesi@let.uu.nl
tel: 030-2536653
Computationele Linguistiek
Trans 10, room 2.18
3512 JK Utrecht
Office hours: Wednesday 14.00-15.00


Course description

The course gives a basic introduction to repositories of linguistic data, focusing on two kinds: annotated corpora and linguistic databases.

Corpora are collections of texts, usually annotated with part-of-speech and morphological information; they often encode syntactic and phonological information as well. Linguistic databases are usually oriented towards describing specific phenomena, either through glossed examples or through variables.
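
To make the notion of part-of-speech annotation concrete, here is a minimal sketch in Python. It assumes a "word/TAG" text layout with Brown-style tags (RB for adverbs, IN for prepositions); the sentence is invented for illustration, and the names Token, parse_tagged and find_tag_bigrams are hypothetical helpers, not part of any corpus tool used in the course.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    """One annotated corpus token: the word form and its part-of-speech tag."""
    word: str
    tag: str


def parse_tagged(line: str) -> List[Token]:
    """Split a 'word/TAG word/TAG ...' line into Token objects."""
    tokens = []
    for item in line.split():
        # rpartition splits on the LAST slash, so words containing '/' survive.
        word, _, tag = item.rpartition("/")
        tokens.append(Token(word, tag))
    return tokens


def find_tag_bigrams(tokens: List[Token], first: str, second: str) -> List[Tuple[str, str]]:
    """Return word pairs where a token tagged `first` is directly followed by one tagged `second`."""
    return [
        (a.word, b.word)
        for a, b in zip(tokens, tokens[1:])
        if a.tag == first and b.tag == second
    ]


# Invented example sentence with Brown-style tags (an assumption, not corpus data).
sentence = "she/PPS walked/VBD slowly/RB across/IN the/AT room/NN"
tokens = parse_tagged(sentence)

# Search for an adverb immediately followed by a preposition.
matches = find_tag_bigrams(tokens, "RB", "IN")
print(matches)  # [('slowly', 'across')]
```

Because the annotation is stored per token, searches over tag sequences reduce to simple comparisons over neighbouring tokens.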

Corpora and databases are crucial tools in the formulation of linguistic generalizations and analyses. In addition, corpora are often used for machine learning purposes.

The course will look at different types of corpora (i.e. written, spoken) and will focus on the different levels of annotation with hands-on sessions on the Corpus of Spoken Dutch. Different types of databases will be discussed and various techniques to design linguistic databases will be presented.

The course will provide the necessary basis for possible assistantships and internships within the related projects carried out at Utrecht University.


General topics:

Corpora, databases, part-of-speech tagging, syntactic annotation, machine learning.



Literature

Nerbonne, J. (1998). Linguistic Databases. Stanford: CSLI.

We will read the introduction, which provides a good overview of the state of the art in the field of linguistic databases. We will also read a selection of the articles.

Van Eynde, F. (2000). Part of Speech Tagging en Lemmatisering. Leuven.

It provides a description of the part of speech tagging adopted within the project Corpus of Spoken Dutch.

Moortgat, M. et al. (2000). Syntactische Annotatie. Utrecht.

It provides a description of the syntactic annotation adopted within the project Corpus of Spoken Dutch.

Oostdijk, N. (2000). Building a corpus of spoken Dutch. Proceedings of the 10th CLIN Meeting. Utrecht.

It provides a general description of the Corpus of Spoken Dutch project.

Monachesi, P. (2000). Syntactic annotation for spontaneous speech fragments. Utrecht.

It provides an inventory of the problems related to the annotation of spontaneous speech fragments.

Monachesi, P. (2001). Data retrieval by means of a Typological Database System. Utrecht.

It provides a description of the Typological Database System project and an overview of different approaches to data retrieval from databases.

Relevant Links

Various corpora

Annotation Tools

Linguistic Databases

Linguistic Archives

Grammatical resources

Annotate



Tutorial
Day         Date        Location      Time
Wednesday   from 14/3   KNG80, 0.09   15.00-17.00
Thursday                KNG80, 0.09   13.00-17.00

Overview:

Week 1: Wednesday 14/3 Linguistic Corpora and Linguistic Databases: similarities and differences. Overview of different types of corpora (i.e., written, spoken). Applications using corpora: learning, speech recognition, language teaching, linguistic research.
Thursday 15/3 Overview of annotated corpora and tools via the LDC web page, to get acquainted with the various corpora available. Inventory.
Week 2: Wednesday 21/3 Searching a corpus for linguistic analyses: Brown corpus (searching for adverb and preposition patterns) and CHILDES database.
Thursday 22/3 Continuation of corpus search.
Presentation of the Corpus Gesproken Nederlands project. Discussion on the different levels of annotation.
Week 3: Wednesday 28/3 Syntactic annotation in the CGN. Learning Annotate.
Thursday 29/3 Annotation of sentences.
Week 4: Wednesday 4/4 Continuation of sentence annotations. Discussion.
Problems of annotation related to spoken corpora.
Thursday 5/4 Different annotation tools.
Week 5: Wednesday 11/4 Databases to encode linguistic information. The Typological Database System project. Description. Aims. Problems. Getting acquainted with various linguistic databases.
Thursday 12/4 Linguistic Archives and Web publishing. Standards. Current initiatives. Problems.
Project: building a linguistic database.
Proposal.
Week 6: Wednesday 18/4 Continuation of building a linguistic database.
Thursday 19/4 Continuation of building a linguistic database.
Results.
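
As a starting point for the database project in weeks 5 and 6, the sketch below shows one possible way to store glossed examples together with a typological variable, using Python's built-in sqlite3 module. The schema, the table and column names, and the choice of pro-drop as the variable are all assumptions made for illustration; they do not reflect the design of the Typological Database System.

```python
import sqlite3

# In-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

conn.executescript("""
CREATE TABLE language (
    name     TEXT PRIMARY KEY,
    pro_drop INTEGER NOT NULL  -- variable: 1 if subject pronouns may be omitted
);
CREATE TABLE example (
    id          INTEGER PRIMARY KEY,
    language    TEXT NOT NULL REFERENCES language(name),
    text        TEXT NOT NULL,  -- the example sentence
    gloss       TEXT NOT NULL,  -- word-by-word gloss
    translation TEXT NOT NULL   -- free translation
);
""")

# Illustrative rows: one value of the variable per language.
conn.executemany(
    "INSERT INTO language (name, pro_drop) VALUES (?, ?)",
    [("Dutch", 0), ("Italian", 1)],
)

# Illustrative glossed examples.
conn.executemany(
    "INSERT INTO example (language, text, gloss, translation) VALUES (?, ?, ?, ?)",
    [
        ("Dutch", "Ik zie de hond", "I see.1SG the dog", "I see the dog"),
        ("Italian", "Vedo il cane", "see.1SG the dog", "I see the dog"),
    ],
)

# Retrieve the glossed examples for every language where the variable is set.
rows = conn.execute("""
    SELECT e.language, e.text, e.gloss
    FROM example AS e
    JOIN language AS l ON l.name = e.language
    WHERE l.pro_drop = 1
    ORDER BY e.id
""").fetchall()

print(rows)  # [('Italian', 'Vedo il cane', 'see.1SG the dog')]
```

Keeping variables and examples in separate tables lets a query select languages by variable value and then return the examples that illustrate it.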