Identifying and Integrating Terminologically Relevant Multiword Units in the IJS-ELAN Slovene-English Parallel Corpus

Gaël Dias (New University of Lisbon)
Špela Vintar (University of Ljubljana)
Sylvie Guilloré (Université d'Orléans)
José Gabriel Pereira Lopes (New University of Lisbon)


The need for multilingual terminology resources has become
particularly acute owing to the globalization of scientific and
technical exchanges and the concurrent development of international
communication networks. As a consequence, various efforts have
been made to develop tools and methods for the automatic
processing of multilingual terminological databases. However, the
approaches presented so far show two shortcomings: they only deal
with specific groups of languages, and they hardly handle the
specific task of identifying terminologically relevant multilingual
multiword units. In order to overcome both problems, we propose a
three-step architecture:
  1. Statistical identification of multiword lexical unit candidates in parallel corpora,
  2. Tokenization of multiword lexical units (i.e. multiword lexical units are represented as single words; see the sketch below) in parallel corpora,
  3. Statistical extraction of multilingual terminology from parallel corpora.
In this paper, we present the first two steps of this architecture.
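
To illustrate what step 2 amounts to in practice, the following sketch rewrites identified multiword lexical units as single tokens. The underscore-joined representation and the function name are illustrative assumptions made for this example only; in the corpus itself the units are encoded with SGML markup, as described below.

    # Minimal sketch of step 2: rewriting multiword lexical units as
    # single tokens. The underscore-joined form is an illustrative
    # assumption, not the actual corpus encoding.
    def tokenize_multiword_units(tokens, units):
        units = sorted(units, key=len, reverse=True)  # longest match first
        out, i = [], 0
        while i < len(tokens):
            for unit in units:
                if tokens[i:i + len(unit)] == list(unit):
                    out.append("_".join(unit))  # one token for the whole unit
                    i += len(unit)
                    break
            else:
                out.append(tokens[i])  # no unit starts at this position
                i += 1
        return out

    sentence = "the European Union funds terminology work".split()
    units = [("European", "Union"), ("terminology", "work")]
    print(tokenize_multiword_units(sentence, units))
    # -> ['the', 'European_Union', 'funds', 'terminology_work']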

A new system (SENTA: Software for the Extraction of N-ary Textual Associations), which requires neither a global threshold nor other ad hoc techniques for the extraction of multiword lexical units, identifies and extracts multiword lexical unit candidates from the tokenized IJS-ELAN Slovene-English parallel corpus (i.e. the texts are neither lemmatised nor morpho-syntactically tagged, nor pruned with stop-word lists). The system has been developed by the authors; it achieves high precision rates across languages and domains, and it is fully flexible: since it relies only on the information available in the texts themselves, it can be applied to any kind of text, in any domain and any language, without external intervention.

Once the multiword lexical units have been extracted, they are represented in the corpus with SGML markup following the Text Encoding Initiative (TEI) guidelines.

Finally, in order to build the multilingual terminological database, we intend in the near future to experiment with the Twente word alignment software, which should identify term pairs (multiword units or single words). The global architecture would thus allow the construction of multilingual terminologies without any restriction on language or domain.
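
To give a flavour of the statistical identification performed in step 1 by SENTA, the following deliberately simplified sketch scores every contiguous n-gram with a crude association ("glue") measure and keeps those that score at least as well as their immediate sub- and super-grams, a toy local-maximum criterion. It is written for illustration only and is not SENTA's actual measure or algorithm.

    from collections import Counter
    from itertools import chain

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def extract_candidates(tokens, max_n=4):
        # Count every n-gram up to one order above max_n so that the
        # glue of (max_n + 1)-grams can also be computed.
        counts = Counter(chain.from_iterable(
            ngrams(tokens, n) for n in range(1, max_n + 2)))

        def glue(gram):
            # Crude glue: squared n-gram frequency over the product of
            # the frequencies of its two longest proper sub-grams.
            return counts[gram] ** 2 / (counts[gram[:-1]] * counts[gram[1:]])

        kept = []
        for n in range(2, max_n + 1):
            for gram in set(ngrams(tokens, n)):
                subs = [gram[:-1], gram[1:]] if n > 2 else []
                supers = [g for g in set(ngrams(tokens, n + 1))
                          if g[:-1] == gram or g[1:] == gram]
                # Toy local-maximum criterion: the n-gram must "glue"
                # better than its parts and its extensions.
                if all(glue(gram) > glue(s) for s in subs) and \
                   all(glue(gram) >= glue(s) for s in supers):
                    kept.append(gram)
        return kept

    text = ("the european union and the european union and "
            "the council of the european union").split()
    print(extract_candidates(text))  # ('european', 'union') is among the results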
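
As for the SGML encoding of the extracted units mentioned above, the exact tags and attributes depend on the chosen TEI customisation; the snippet below merely illustrates one plausible inline encoding and should not be read as the actual IJS-ELAN markup.

    # Hypothetical illustration of TEI-style inline markup for an
    # extracted unit; the actual IJS-ELAN tags and attributes may differ.
    def markup_unit(words):
        return '<seg type="mwu">' + " ".join(words) + "</seg>"

    print(markup_unit(["European", "Union"]))
    # -> <seg type="mwu">European Union</seg>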