CLUVI: The Linguistic Corpus of the University of Vigo

April 29, 2008 at 9:30 am | Posted in Language Resources

CLUVI Parallel Corpus

CLUVI is an open collection of parallel text corpora covering specialized registers of contemporary Galician, developed by the SLI (the Computational Linguistics Group of the University of Vigo) and publicly available on its website since September 2003. It contains over 22 million words, and its main components are:

  • the TECTRA Corpus of English-Galician literary texts
  • the FEGA Corpus of French-Galician literary texts
  • the LEGA Corpus of Galician-Spanish legal texts
  • the UNESCO Corpus of English-Galician-French-Spanish popular science and technology texts
  • the LOGALIZA Corpus of English-Galician software localization
  • and the CONSUMER Corpus of Spanish-Galician-Catalan-Basque consumer information.

The public search and browsing tool designed by the SLI is available on the CLUVI website.

This web application supports both simple and very complex searches for single words or word sequences, and shows the multilingual equivalents of the terms in context, as found in real, referenced translations. The search terms can belong to any of the languages of the translation, and true multilingual searches are also possible: a user can search simultaneously for one term from each of the languages involved.
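To make the idea of a simultaneous multilingual search concrete, here is a minimal Python sketch. It is not the CLUVI implementation: the `SEGMENTS` data, the `search` function and the language codes are all invented for illustration, and the real tool surely works differently under the hood.

```python
import re
from typing import Dict, List

# Hypothetical toy data: each aligned unit maps a language code to one segment.
SEGMENTS: List[Dict[str, str]] = [
    {"en": "The house is near the sea.", "gl": "A casa está preto do mar."},
    {"en": "She bought a new house.", "gl": "Ela mercou unha casa nova."},
    {"en": "The sea was calm.", "gl": "O mar estaba calmo."},
]


def search(segments: List[Dict[str, str]], **terms: str) -> List[Dict[str, str]]:
    """Return the aligned units in which every term occurs, as a whole
    word and ignoring case, in the segment of its own language."""
    hits = []
    for unit in segments:
        if all(
            re.search(rf"\b{re.escape(term)}\b", unit.get(lang, ""), re.IGNORECASE)
            for lang, term in terms.items()
        ):
            hits.append(unit)
    return hits


# Search in one language only: every unit whose English side contains "house".
single = search(SEGMENTS, en="house")

# Simultaneous search: English "house" AND Galician "mar" in the same unit.
both = search(SEGMENTS, en="house", gl="mar")
```

Because each hit is returned as the whole aligned unit, the translation of the matched term is always shown in context alongside it.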

The number of aligned works and language pairs available on the website grows regularly, since CLUVI is an active, ongoing academic research project.

At the moment, the CLUVI Parallel Corpus webpage allows searching five major corpora:

  • FEGA
  • LEGA

The CLUVI interface also allows browsing the TURIGAL Corpus of Portuguese-English tourism texts and the Legebiduna Corpus of Basque-Spanish administrative texts developed by the DELi group at the University of Deusto.


On-line Documents

Research projects about the CLUVI Corpus

Corpus typologies

April 23, 2008 at 1:14 pm | Posted in Language Resources

The main parameters for establishing corpus typologies are:

  • Language modality: written or spoken
  • The number of languages the texts belong to
  • The size of the corpus, that is, the amount of text it contains
  • Whether the corpus is open or closed
  • The linguistic variety or degree of specialization of the texts
  • The time period the texts cover
  • The processing applied to the corpus: information added to the texts

With regard to language, there are:

  • Monolingual corpora: these consist of texts in a single language. They are compiled in order to describe a language or a linguistic variety.
  • Bilingual or multilingual corpora: these consist of texts in two languages (bilingual) or more (multilingual) which are not, in principle, translations of one another and do not share selection criteria.
  • Comparable corpora (“paired texts”): these consist of a selection of texts in more than one language or linguistic variety that are similar in their characteristics and share selection criteria. They are used mainly to compare language varieties in contrastive studies.
  • Parallel corpora (“bi-texts”): these contain texts in more than one language (bilingual or multilingual) but, unlike the previous types, they consist of the same text translated into one or more languages. The simplest one consists of an original and its translation. They are especially useful in machine translation and in bilingual or multilingual settings.
  • Aligned corpora: these are parallel corpora in which, to make them easier to exploit, the texts are arranged side by side in paragraphs or sentences, so that translation equivalences (the elements that are translations of each other) are easier to extract. They are used as training data for statistical machine translation systems.

How a corpus is classified depends on its texts: their length, their degree of specialization, their quantity, and the processing applied to them.
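As an illustration of how translation equivalences can be drawn out of an aligned corpus, the Python sketch below scores word pairs by how often they appear together in aligned sentences. It uses the Dice coefficient, a deliberately simple association measure chosen for this example. Real statistical machine translation systems use far more elaborate alignment models. The `ALIGNED` data and the `dice_scores` function are invented for illustration.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical sentence-aligned pairs (Spanish, English), invented for this sketch.
ALIGNED: List[Tuple[str, str]] = [
    ("la casa", "the house"),
    ("la mesa", "the table"),
    ("una casa", "a house"),
]


def dice_scores(pairs: List[Tuple[str, str]]) -> Dict[Tuple[str, str], float]:
    """Score each (source word, target word) pair with the Dice coefficient:
    2 * joint occurrences / (source occurrences + target occurrences),
    counting each word at most once per aligned sentence."""
    src_count: Counter = Counter()
    tgt_count: Counter = Counter()
    joint: Counter = Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


scores = dice_scores(ALIGNED)

# Best-scoring English candidate for the Spanish word "mesa".
best_for_mesa = max(
    (t for (s, t) in scores if s == "mesa"),
    key=lambda t: scores[("mesa", t)],
)
```

With only three sentence pairs the scores are crude, but the principle is the same at scale: words that keep turning up in the same aligned units are likely to be translations of each other.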


What is a “corpus”?

April 22, 2008 at 10:52 am | Posted in Language Resources

According to An Encyclopedic Dictionary of Language and Languages (Crystal, David. 1992. Oxford, 85), a corpus is a collection of linguistic data, either compiled as written texts or as a transcription of recorded speech. The main purpose of a corpus is to verify a hypothesis about language – for example, to determine how the usage of a particular sound, word, or syntactic construction varies.

Corpus linguistics deals with the principles and practice of using corpora in language study. A computer corpus is a large body of machine-readable texts.

In the EAGLES recommendations on corpus typology (EAGLES, 1996e), a corpus is defined as:

Corpus: A collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.

Words such as ‘collection’ and ‘archive’ refer to sets of texts that do not need to be selected, or do not need to be ordered, or the selection and/or ordering do not need to be on linguistic criteria. They are therefore quite unlike corpora.

Linguistic criteria to be applied to the selection and ordering may be:

  • external, in that they concern the participants, the occasion, the social setting or the communicative function of the pieces of language;

  • internal, in that they concern the recurrence of language patterns within the pieces of language.

These criteria are reviewed in more detail in the recommendations on corpus typology (EAGLES, 1996e) where a classification of different types of corpora can also be found.

Since this document is devoted to computer corpora, it is appropriate to start by the definition also proposed in the above document:

Computer corpus: a corpus which is encoded in a standardised and homogeneous way for open-ended retrieval tasks.

Source: ILC

Vermeer’s technique

April 17, 2008 at 1:36 pm | Posted in ESP

Vermeer produced transparent colours by applying paint onto the canvas in loosely granular layers, a technique called pointillé (not to be confused with pointillism). No drawings have been securely attributed to Vermeer, and his paintings offer few clues to preparatory methods. David Hockney, among other historians and advocates of the Hockney-Falco thesis, has speculated that Vermeer used a camera obscura to achieve precise positioning in his compositions, and this view seems to be supported by certain light and perspective effects which would result from the use of such lenses and not the naked eye alone; however, the extent of Vermeer’s dependence upon the camera obscura is disputed by historians.

There is no other seventeenth century artist who from very early on in his career employed, in the most lavish way, the exorbitantly expensive pigment lapis lazuli, natural ultramarine. Not only used in elements that are intended to be shown as appearance: the earth colours umber and ochre should be understood as warm light from the strongly-lit interior, reflecting its multiple colours back onto the wall.

This working method most probably was inspired by Vermeer’s understanding of Leonardo’s observations that the surface of every object partakes of the colour of the adjacent object. This means that no object is ever seen entirely in its natural colour.

A comparable but even more remarkable yet effectual use of natural ultramarine is in The Girl with a Wineglass (Braunschweig). The shadows of the red satin dress are underpainted in natural ultramarine, and due to this underlying blue paint layer, the red lake and vermilion mixture applied over it acquires a slightly purple, cool and crisp appearance that is most powerful.

Even after Vermeer’s supposed financial breakdown following the so-called rampjaar (year of disaster) in 1672, he continued to employ natural ultramarine most generously, such as in the above-mentioned “Lady Seated at a Virginal.” This could suggest that Vermeer was supplied with materials by a collector, and would coincide with John Michael Montias’ theory of Pieter Claesz. van Ruijven being Vermeer’s patron.

(Taken from the Wiki)


April 9, 2008 at 12:13 pm | Posted in ESP

Claire told us that this semester we will be working on the fabulous artist Johannes Vermeer and his paintings.

After watching a documentary about Johannes Vermeer’s paintings, I have chosen Young Woman with a Water Pitcher. I find this painting very interesting to work on.

It was a surprise for the whole class to learn that all our work on Vermeer is going to be published in a book, so we will try to do our best.

Blog Post II – Outline for the report

April 1, 2008 at 12:25 pm | Posted in Language Resources

Our topic is: Multilingual Corpus Resources.

We will take the information from this link: Joseba Abaitua – wiki. This is, more or less, what we are going to do with the topic. We still have to choose the pages, because there are quite a lot of them and some are quite interesting. For each page, we want to look at:

  • all the possible searches they offer
  • which results they offer
  • how they can be improved
  • their history, when they were created, for what goal, who were the creators
  • how we can compare them with similar sites: if it is better, worse, what they offer, if they offer more or less, the tools, etc.

We can use pages like this one:

