The Romance Languages Database

Introduction

In our database we compare several Romance languages to see how similar they are. The languages we chose are:

  • Latin: this is used as the reference language. All other languages are compared to this one.
  • Italian: we chose this language because it is a Romance language and we thought that it would compare well to the other languages. For this language we found a phonetic transcription (IPA).
  • Spanish: this was chosen for the same reasons as we chose Italian. We did not find a phonetic transcription.
  • Portuguese: this was chosen for the same reasons as above. We found a phonetic transcription (IPA).
  • French: this was partly chosen for the same reasons as Italian and Spanish and partly because we know some French. We found a phonetic transcription (IPA).
  • Romanian: we chose this language because it is a Romance language with Slavic influences, which might be interesting for our Russian colleague.
  • English: we also added the English Swadesh list because we do not actually speak any of the other languages.

    For the languages other than French, Portuguese and Italian we could not find any phonetic transcriptions in the libraries we checked (Letterenbibliotheek Utrecht, Centrale Bibliotheek Utrecht, Centrale Bibliotheek Apeldoorn), and the internet is too vast for us to have found any helpful information on phonetic transcriptions there. To be able to compare the languages properly, we decided to use the Swadesh list that we found in another database made last year. The words in the Swadesh list are basic words that occur in every language.

    All the languages we compare are, as we already said, Romance languages. What we want to find out is how similar they are, which languages are most similar, which word groups are most similar, and so on.

    Collecting data

    The Swadesh list we used can be found at http://www.trussel.com/kir/tip.htm section 1. We also added this list to our database. The next thing we did was go to the library and look for dictionaries.
  • In the Letterenbibliotheek Utrecht we found a few very old dictionaries for Latin. We used the Nederlands-Latijns woordenboek / [door] J. F. L. Montijn (1939).
  • For Spanish and French we used Van Dale Woordenboeken which we also found in the Letterenbibliotheek Utrecht.
  • For Romanian we used an on-line dictionary (can be found at http://www.dictionare.com/english/dictionary.htm).
  • For Italian and Portuguese we used the Wolter's Miniwoordenboek editions, which we found in the Centrale Bibliotheek Apeldoorn.
    After finding the dictionaries, it still took a lot of time to translate the Swadesh list into the different languages. Especially the phonetic transcription took a lot of time.

    Creating the database

    Albert mainly designed the form of the database. All of us helped enter the data. We typed the word lists in Excel and Albert added them to the Access database. The biggest problem was the phonetic transcriptions. After a lot of trouble we eventually found a Unicode font (Lucida Sans Unicode) in Word, which meant we had to create a table in Word and insert the phonetic symbols by opening the Insert Symbols window and clicking each symbol separately. As you will understand, this took a lot of time.

    Creating the actual database

    We used the structure of the database from last year, which met our requirements. The database was created in Microsoft Access 2000. Each table in the database holds the available information for one language. Because of that, it is possible to add other languages or maybe even dialects. Each table contains at least an orthographic transcription of the Swadesh list and, where we found one, a phonetic transcription as well.

    The core of the database is the table with the orthographic transcription of Latin. This table contains the Latin translation of the standard Swadesh list. Every entry in this table has its own identification number (ID). The table also has columns for the syntactic tagging and for the original Swadesh list in English. This table is the pivot of the database because the identification number is unique and is used throughout the entire database to relate the tables to each other.

    There is also a table called Commentaar. This table holds the comments that accompany the original Swadesh list. These comments make the original words unambiguous. Throughout the database the entries are linked by the ID from the Latin table. We chose a separate table because there are not many comments, and otherwise there would be a lot of empty fields in the Latin table. Another reason for a separate table is that it makes it possible to add comments for a different language. The table Commentaar can be seen as metadata, just like the table Talen. The table Talen has nothing to do with the current research but gives additional information about the translator and the sources used for the transcription.

    The other tables in the database each represent the transcriptions of one language. The design of these tables is the same: a column for the ID, one for the orthographic transcription and, when it was available, a column for the phonetic transcription. Microsoft Access also needs an extra column, which is named RealID. The idea behind this structure is that the translator looks in the column ID for a Latin word and translates it. The data can be entered directly in the right-hand table, but it is also possible to use a form.
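
    The layout described above can be sketched roughly as follows. This is only an illustration in Python, with SQLite standing in for Microsoft Access 2000. Apart from ID, RealID and the table name Commentaar, every name here (the table names Latin and Italian, the columns Word, Tag, English, Comment and Phonetic, and the tag value "noun") is an assumption and not necessarily what the actual database uses.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the Access file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Latin (
    ID      INTEGER PRIMARY KEY,  -- unique number shared by all tables
    Word    TEXT NOT NULL,        -- orthographic Latin form (assumed name)
    Tag     TEXT,                 -- syntactic tag (assumed name)
    English TEXT                  -- original Swadesh entry (assumed name)
);
CREATE TABLE Commentaar (
    ID      INTEGER REFERENCES Latin(ID),
    Comment TEXT                  -- disambiguating comment (assumed name)
);
CREATE TABLE Italian (
    RealID   INTEGER PRIMARY KEY, -- extra key column of the language table
    ID       INTEGER REFERENCES Latin(ID),
    Word     TEXT,                -- orthographic transcription
    Phonetic TEXT                 -- phonetic transcription, may be missing
);
""")

# One entry taken from the word lists in this report; the tag is hypothetical.
conn.execute("INSERT INTO Latin VALUES (1, 'terra', 'noun', 'earth (soil)')")
conn.execute("INSERT INTO Italian (ID, Word) VALUES (1, 'terra')")

# The shared ID relates a language table back to the Latin table.
linked = conn.execute("""
    SELECT l.Word, i.Word, i.Phonetic
    FROM Latin AS l JOIN Italian AS i ON i.ID = l.ID
""").fetchall()
print(linked)
```

    Because every language table only refers to the Latin ID, a new language or dialect would need just one more table of the same shape.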

    With the collected data it is possible to run queries about similarities or differences between the Romance languages.

    The results

    If we compare the tables of all the languages to each other we see many similarities.

    We made some queries to see which words match Latin exactly in which languages. Italian is the language that has the most exact matches and Romanian has no exact matches at all. The table below gives 3 of the words that match. We leave out Romanian because none of its words match.

    Latin    French   Italian  Spanish  Portuguese
    animal   animal   ...      animal   animal
    terra    ...      terra    ...      terra
    quando   ...      quando   quando   quando
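
    An exact-match query of this kind could look roughly like the sketch below. It is written in Python with SQLite rather than Access, so the SQL dialect and the table and column names (Latin, Italian, Word) are assumptions. The three word pairs come from the tables in this report, and the percentage it prints applies only to this three-word sample, not to the full list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Latin   (ID INTEGER PRIMARY KEY, Word TEXT);
CREATE TABLE Italian (ID INTEGER PRIMARY KEY, Word TEXT);
""")

# Three entries from this report, keyed by a shared (illustrative) ID.
conn.executemany("INSERT INTO Latin VALUES (?, ?)",
                 [(1, "animal"), (2, "terra"), (3, "quando")])
conn.executemany("INSERT INTO Italian VALUES (?, ?)",
                 [(1, "animale"), (2, "terra"), (3, "quando")])

# Keep only the rows whose spelling is identical in both tables.
exact = [row[0] for row in conn.execute("""
    SELECT l.Word
    FROM Latin AS l JOIN Italian AS i ON i.ID = l.ID
    WHERE l.Word = i.Word
    ORDER BY l.ID
""")]

# Share of exact matches within this small sample.
total = conn.execute("SELECT COUNT(*) FROM Latin").fetchone()[0]
share = 100 * len(exact) / total
print(exact, f"{share:.1f}%")
```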

    The next table shows how many words in each language match Latin exactly.

    Language    Number of matches  Percentage of total
    French      3                  1.4%
    Italian     15                 7%
    Spanish     4                  1.9%
    Portuguese  7                  3.3%

    We expected that more words would be exactly the same. The words do look a lot alike, but not many are identical. We think that the languages clearly belong to one family but have each undergone their own changes throughout the years.
    We then looked at the table with all the languages side by side to see if we could find similarities. The first thing that struck us is that Portuguese and Spanish are very similar, probably more so than the others. Here are some examples of words that look similar in Portuguese and Spanish and less so in the other languages.

    English   French     Italian    Portuguese  Romanian  Spanish
    left      de gauche  sinistrol  esquerdo    stânga    izquerda
    to turn   tourner    giro       rodar       roti      rotar
    to eat    manger     mangiare   comer       mânka     comer
    to walk   marcher    camminare  andar       umbia     andar
    to swim   nager      nuotare    nadar       înotar    nadar
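
    The impression that Portuguese and Spanish stand closest can be checked mechanically for the five rows above. The following is a minimal sketch in Python, not something the database itself does: it simply counts, for every pair of languages, how many of these five words are spelled identically. The variable names are our own.

```python
from itertools import combinations

# Rows copied from the table above; column order follows its header.
languages = ["French", "Italian", "Portuguese", "Romanian", "Spanish"]
rows = [
    ["de gauche", "sinistrol", "esquerdo", "stânga", "izquerda"],
    ["tourner", "giro", "rodar", "roti", "rotar"],
    ["manger", "mangiare", "comer", "mânka", "comer"],
    ["marcher", "camminare", "andar", "umbia", "andar"],
    ["nager", "nuotare", "nadar", "înotar", "nadar"],
]

# Count identically spelled words for every pair of languages.
identical = {}
for a, b in combinations(range(len(languages)), 2):
    pair = (languages[a], languages[b])
    identical[pair] = sum(row[a] == row[b] for row in rows)

# The pair with the most identical spellings in this sample.
best = max(identical, key=identical.get)
print(best, identical[best])
```

    On these five rows only the Portuguese and Spanish columns share identical spellings (comer, andar, nadar); every other pair has none.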

    Another conclusion we reached was that Latin influenced all languages. Although there are not a lot of perfect matches, all the languages show some similarities with Latin. We will show this by giving a table with some perfect matches and very similar words.

    Latin    French    Italian   Portuguese   Spanish   Romanian  English
    animal   animal    animale   animal       animal    animal    animal
    lingua   langue    lingua    língua       lengua    limba     tongue
    odor     odeur     annusare  odor         oler      miros     smell
    sol      soleil    sole      sol          sol       soare     sun
    stella   astre     stella    estrella     estrella  stea      star
    terra    terre     terra     terra        tierra    pamânt    earth (soil)
    via      route     strada    rua          via       cale      road (trail, path)
    et       et        e         e            y         si        and
    quando   quand     quando    quando       quando    când      when?
    tu       vous      si        tu           vos       da        ye
    tu       tu        tu        tu           tu                  thou, you
    dare     donner    dare      dar          dar       da        to give
    dormire  dormir    dormire   dormir       dormir    dormi     to sleep
    lavare   lavar     lavare    lavar        lavar     spala     to wash
    non      ne…pas    non       não          no        nu        not
    spirare  souffler  spirare   fazer vento  soplar    bate      to blow (of wind)
    venire   arriver   venire    vir          venir     veni      to come
    volare   voler     volare    voar         volar     zbura     to fly

    It was difficult to compare the phonetic transcriptions with each other because you need similarly spelled words to see the differences. We did see that the French words hardly ever look similar to the Italian and Portuguese words. The Italian and Portuguese words look more alike, but there are hardly any perfect matches either. We chose the words that looked most alike and compared the phonetic transcriptions. This is the table:

    English   French  Italian     Portuguese
    you       ty      tu          tu
    to live   vivr    'vi:vere    vi'ver
    smoke     fyme    'fuimo      'fumu
    and       e       e           i

    We could not figure out how to use the Lucida Sans Unicode font in the HTML file, so these are the only examples we can give in the table. What did strike us is that Italian seems to have longer vowels than the others, and its words have more /e/ sounds in medial and final position. Portuguese words have a schwa or nothing where Italian has the /e/ sound.

    Thoughts on the database

    We want to give our thoughts about how the database works and what it looks like. What we considered a problem was that it was difficult to compare the results because the database gives a choice between the exact matches in the different languages or the entire lists of words in one table. It would have been convenient if it had been possible to view words that are very similar in one table.
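
    One way such a view of "very similar" words could be produced is with a string-distance measure. The sketch below is not part of our database; it is a hypothetical Python illustration that uses the Levenshtein edit distance and an arbitrarily chosen threshold of 0.6 to list the languages whose word is close to the Latin one. The function names and the threshold are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """1.0 for identical words, lower as more edits are needed."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def near_matches(latin: str, words: dict, threshold: float = 0.6) -> dict:
    """Languages whose word reaches the (assumed) similarity threshold."""
    scores = {lang: round(similarity(latin, w), 2) for lang, w in words.items()}
    return {lang: s for lang, s in scores.items() if s >= threshold}


# The row for "sun" from the results section.
sun = {"French": "soleil", "Italian": "sole", "Portuguese": "sol",
       "Spanish": "sol", "Romanian": "soare"}
near = near_matches("sol", sun)
print(near)
```

    With this threshold the Italian, Portuguese and Spanish words count as near matches for Latin sol, while French soleil and Romanian soare fall below it. A different threshold or distance measure would draw the line elsewhere.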

    Apart from this we think making a database to compare languages is useful. It gives you the opportunity to find information about languages that look alike. This could be an enormous advantage if you are doing research into certain phenomena in languages.

    In our database we also included phonetic transcriptions. As we already showed in our results, it is very difficult to draw conclusions just by looking at the phonetic transcriptions. It would be time-consuming to draw relevant conclusions, because we can only compare words that are exactly the same or very similar. This was beyond the time we had available.

    Conclusions about the course

    We have now come to the end of our course Databestanden. We enjoyed doing the work even though it took us a fair amount of time. The exercises we had to do in the first 5 weeks of the course took quite some time. The final assignment, making the database, was very time-consuming, but it was quite nice. Translating the Swadesh list and developing the database took most of the time. Viewing the results was rather difficult, so it took us far more time than we expected.

    Overall we can say that we learned a lot in this course. We did not know anything about corpora and databases and now we do know something about them. What we liked as well is that we worked with HTML a lot and learned a lot about that too.

    We hope we did what you expected for the course. We spent a lot of time on it and enjoyed it.

    Ekatherina, Albert, Khiet and Gina.