Period 3: February-April 2004
The aim of the course is to provide an introduction to repositories of linguistic data, focussing on annotated corpora (both written and spoken), linguistic databases, and computational lexicons.
Corpora are collections of texts, usually annotated with part-of-speech and morphological information; they often encode syntactic and phonological information as well. Linguistic databases are usually oriented towards the description of specific phenomena, either by means of examples with glosses or by means of variables. Computational lexicons are repositories of linguistic information for use in language processing applications.
Special attention will be given to the techniques and standards that should be adopted when developing linguistic resources. In particular, the course will look at different types of corpora and will focus on the various levels of annotation, with hands-on sessions on the Corpus of Spoken Dutch. Different types of databases will be discussed and various techniques for designing linguistic databases will be presented. Furthermore, attention will be given to WordNet, a different approach to lexical annotation whose design is inspired by current psycholinguistic theories of human lexical memory.
Issues concerning the archiving of linguistic resources on the web, including metadata standards (which serve as finding aids), will be addressed and current initiatives will be presented. Furthermore, issues related to the digital encoding of language data will be discussed, including standards such as XML.
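To make the idea of encoding annotated language data in XML concrete, here is a minimal sketch in Python using only the standard library. The element names (s, w), the attribute names (n, pos, lemma), and the tags are assumptions chosen for illustration; they do not reproduce the format or tagset of the Corpus of Spoken Dutch or of any particular standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence: (word form, part-of-speech tag, lemma).
# The tags are illustrative only and do not follow the CGN tagset.
tokens = [
    ("the", "DET", "the"),
    ("dogs", "NOUN", "dog"),
    ("bark", "VERB", "bark"),
]


def encode_sentence(sent_id, tokens):
    """Build an XML element that holds one annotated sentence."""
    sentence = ET.Element("s", id=sent_id)
    for position, (word, pos, lemma) in enumerate(tokens, start=1):
        # One <w> element per token; the annotation lives in attributes.
        w = ET.SubElement(sentence, "w", n=str(position), pos=pos, lemma=lemma)
        w.text = word
    return sentence


def decode_sentence(xml_string):
    """Read the (word, pos, lemma) triples back from the XML string."""
    element = ET.fromstring(xml_string)
    return [(w.text, w.get("pos"), w.get("lemma")) for w in element.iter("w")]


xml_string = ET.tostring(encode_sentence("s1", tokens), encoding="unicode")
print(xml_string)
print(decode_sentence(xml_string))
```

Keeping the word form as element text and the annotation layers as attributes is only one possible design; stand-off annotation, where each layer is stored in a separate file that points back to the tokens, is a common alternative.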
The course will also focus on possible uses of language resources, including linguistic research, the development of new human language technologies, and teaching.
Corpora, databases, wordnet, annotation tools, standards for linguistic resources, metadata.
We will read the introduction, which provides a good overview of the state of the art in the field of linguistic databases.
Van Eynde, F. (2000). Part of Speech Tagging en Lemmatisering. Leuven.
It provides a description of the part of speech tagging adopted within the project Corpus of Spoken Dutch.
Moortgat, M. et al. (2000). Syntactische Annotatie. Utrecht.
It provides a description of the syntactic annotation adopted within the project Corpus of Spoken Dutch.
Hoekstra et al. (2001). Syntactic Annotation for the Spoken Dutch Corpus Project. In W. Daelemans et al. (eds.), CLIN 2000 Proceedings. Amsterdam: Rodopi, 73-87.
It provides a short description (in English) of the syntactic annotation adopted within the project Corpus of Spoken Dutch.
Oostdijk N. et al. (2002). Experiences from the Spoken Dutch Corpus project. Proceedings of LREC 2002.
It provides a general description of the Corpus of Spoken Dutch project.
Boves, L. & N. Oostdijk (2003). Spontaneous Speech in the Spoken Dutch Corpus. Proceedings of the ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR), 14-16 April 2003, Tokyo, Japan.
Oostdijk, N. & D. Broeder (2003). The Spoken Dutch Corpus and Its Exploitation Environment. Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-03), 14 April 2003, Budapest, Hungary.
Monachesi, P. et al. (2001). The Typological Database System. Proceedings of LREC 2002.
It provides a description of the Typological Database System project.
The CHILDES corpus
The Corpus of Spoken Dutch
Annotate: The tool annotate can be used to annotate CGN sentences on Syntax. The command to start it is: ./start_annotate. The student accounts are an300-an309.
TigerSearch: To access TigerSearch on syntax: /usr/local/TIGERSearch-2.1/bin/TIGERSearch
MySQL --- Database software
There is only one weekly meeting for this course, but you are expected to work at home more than for an average course. The meeting is intended for discussion, presentations, and clarification of the reading material.
Hand in exercises: Friday of each week.
Exercises have to be in HTML.
You can find here the evaluation of your tasks.
Final Project: building a linguistic database (Weeks 3-10)
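As a starting point for thinking about the design, the following is a minimal sketch of a database that describes a phenomenon by means of glossed examples. It assumes Python's built-in sqlite3 module as a stand-in for the MySQL software listed above; the table names, column names, and the sample record are hypothetical and are not part of the course material.

```python
import sqlite3

# An in-memory SQLite database stands in for a MySQL server here.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE language (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE example (
        id          INTEGER PRIMARY KEY,
        language_id INTEGER NOT NULL REFERENCES language(id),
        text        TEXT NOT NULL,   -- the example sentence
        gloss       TEXT NOT NULL,   -- word-by-word gloss
        translation TEXT NOT NULL,   -- free translation
        phenomenon  TEXT NOT NULL    -- the phenomenon it illustrates
    );
    """
)

# Hypothetical sample record, for illustration only.
conn.execute("INSERT INTO language (id, name) VALUES (1, 'Dutch')")
conn.execute(
    "INSERT INTO example (language_id, text, gloss, translation, phenomenon) "
    "VALUES (?, ?, ?, ?, ?)",
    (1, "Ik zie de hond", "I see the dog", "I see the dog", "transitive clause"),
)
conn.commit()


def examples_for(conn, phenomenon):
    """Return (language, text, gloss, translation) rows for one phenomenon."""
    query = (
        "SELECT language.name, example.text, example.gloss, example.translation "
        "FROM example JOIN language ON language.id = example.language_id "
        "WHERE example.phenomenon = ? ORDER BY example.id"
    )
    return conn.execute(query, (phenomenon,)).fetchall()


for row in examples_for(conn, "transitive clause"):
    print(row)
```

Storing languages in their own table avoids repeating language names in every example; a fuller design would probably also normalise phenomena and add tables for sources and variables.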