What is BwanaNet?

May 13, 2008 at 11:27 am | Posted in Language Resources

BwanaNet is an interface developed at the IULA that allows users to query the Institut's Technical Corpus (CT) over the Internet.
With BwanaNet, users can consult the CT-IULA documents by following these steps:

  • 1. Select the document language.
  • 2. Choose between a monolingual and a multilingual query.
  • 3. Select the documents.
  • 4. Define the type of query.
  • 5. Define the query.
  • 6. View the results.

CLUVI: The Linguistic Corpus of the University of Vigo

April 29, 2008 at 9:30 am | Posted in Language Resources

CLUVI Parallel Corpus

CLUVI is an open collection of parallel text corpora covering specialized registers of contemporary Galician. It was developed by the SLI (Computational Linguistics Group of the University of Vigo) and has been publicly available on its website since September 2003. It contains over 22 million words, and its main components are:

  • the TECTRA Corpus of English-Galician literary texts
  • the FEGA Corpus of French-Galician literary texts
  • the LEGA Corpus of Galician-Spanish legal texts
  • the UNESCO Corpus of English-Galician-French-Spanish popular science and technology texts
  • the LOGALIZA Corpus of English-Galician software localization
  • and the CONSUMER Corpus of Spanish-Galician-Catalan-Basque consumer information.

The public search and browsing tool designed by the SLI is available at http://sli.uvigo.es/CLUVI/.

This web application supports both simple and very complex searches for single words or word sequences, and shows the multilingual equivalents of the terms in context, as found in real, referenced translations. Search terms can be in any of the languages of a translation, and true multilingual searches are also possible: a user can search simultaneously for one term in each of the languages of the translation.
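The kind of search described above can be illustrated with a minimal sketch. It is not the CLUVI implementation: the `TranslationUnit` class, the `search` function and the three toy English-Galician units are all invented for illustration, and matching is a plain case-insensitive substring test.

```python
from dataclasses import dataclass


@dataclass
class TranslationUnit:
    """One aligned segment: the same sentence in several languages (hypothetical structure)."""
    ref: str        # where the segment comes from
    segments: dict  # language code -> text in that language


# Toy data invented for this example; not taken from CLUVI.
CORPUS = [
    TranslationUnit("toy:1", {"en": "The house is white.", "gl": "A casa é branca."}),
    TranslationUnit("toy:2", {"en": "The white dog sleeps.", "gl": "O can branco dorme."}),
    TranslationUnit("toy:3", {"en": "The contract is valid.", "gl": "O contrato é válido."}),
]


def search(corpus, terms):
    """Return the units where every (language, term) pair matches.

    `terms` maps a language code to a search string. One entry gives a
    search in a single language; several entries give a simultaneous
    multilingual search, where all terms must occur in the same unit.
    """
    hits = []
    for unit in corpus:
        if all(
            term.lower() in unit.segments.get(lang, "").lower()
            for lang, term in terms.items()
        ):
            hits.append(unit)
    return hits


# Search in one language, then show each hit with its translation in context.
for unit in search(CORPUS, {"en": "white"}):
    print(unit.ref, "|", unit.segments["en"], "|", unit.segments["gl"])

# Multilingual search: "white" in English AND "casa" in Galician.
for unit in search(CORPUS, {"en": "white", "gl": "casa"}):
    print(unit.ref, "|", unit.segments["en"], "|", unit.segments["gl"])
```

Because each unit keeps all language versions together, a hit in one language immediately gives the equivalent passage in the others, which is what makes a parallel corpus useful for looking up translations in context.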

The number of aligned works and language pairs available on the website increases regularly, since CLUVI is an ongoing and very active academic research project.

At the moment, the CLUVI Parallel Corpus web page allows searching five major corpora:

  • FEGA
  • LEGA

The CLUVI interface also allows browsing the TURIGAL Corpus of Portuguese-English tourism texts and the Legebiduna Corpus of Basque-Spanish administrative texts, developed by the DELi group at the University of Deusto.


On-line Documents

Research projects about the CLUVI Corpus

Corpus typologies

April 23, 2008 at 1:14 pm | Posted in Language Resources

The main parameters used to establish corpus typologies are:

  • The language mode: written or spoken
  • The number of languages the texts belong to
  • The size of the corpus, that is, the number of texts it contains
  • Whether the corpus is open or closed
  • The language variety or the degree of specialization of the texts
  • The time period the texts cover
  • The processing applied to the corpus: information added to the texts

With regard to language, there are:

  • Monolingual corpora: made up of texts in a single language. They are compiled to document a language or language variety.
  • Bilingual or multilingual corpora: made up of texts in two (bilingual) or more (multilingual) languages which are not, in principle, translations of one another and do not share selection criteria.
  • Comparable corpora (“paired texts”): a selection of texts in more than one language or language variety that have similar characteristics and share selection criteria. They are used mainly to compare language varieties in contrastive studies.
  • Parallel corpora (“bi-texts”): texts in more than one language (bilingual or multilingual) which, unlike the previous type, are the same text translated into one or more languages. The simplest case consists of an original and its translation. They are especially useful in machine translation and in bilingual or multilingual settings.
  • Aligned corpora: parallel corpora in which, to make them easier to exploit, the texts are arranged side by side by paragraph or sentence, so that translation equivalences (elements that are translations of each other) are easier to extract. They are used as training data for statistical machine translation systems.
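As a rough illustration of how translation equivalences can be extracted from a sentence-aligned corpus, the sketch below scores word pairs by how often they occur in the same aligned pair. The Dice coefficient is used here only as one simple association measure; it is an assumption for the example, not the method of any particular system. The four English-Galician sentence pairs and the function names are likewise invented.

```python
from collections import Counter
from itertools import product

# Toy sentence-aligned corpus (source, target), invented for this example.
ALIGNED = [
    ("the house is white", "a casa é branca"),
    ("the dog is white", "o can é branco"),
    ("the house is big", "a casa é grande"),
    ("the street is long", "a rúa é longa"),
]


def dice_scores(pairs):
    """Score every (source word, target word) pair with the Dice coefficient.

    Dice = 2 * (aligned pairs containing both words)
           / (pairs containing the source word + pairs containing the target word)
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


def best_equivalent(scores, word):
    """Return the target word with the highest score for a given source word."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get)


scores = dice_scores(ALIGNED)
print(best_equivalent(scores, "house"))
print(best_equivalent(scores, "big"))
```

With so little data many words end up with tied scores, which is why real systems need large aligned corpora as training material.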

Everything depends on the texts: their length, their specificity, their quantity, and the processing applied to them.


What is a “corpus”?

April 22, 2008 at 10:52 am | Posted in Language Resources

According to An Encyclopedic Dictionary of Language and Languages (Crystal, David. 1992. Oxford, 85), a corpus is a collection of linguistic data, compiled either as written texts or as a transcription of recorded speech. The main purpose of a corpus is to verify a hypothesis about language, for example to determine how the usage of a particular sound, word, or syntactic construction varies.

Corpus linguistics deals with the principles and practice of using corpora in language study. A computer corpus is a large body of machine-readable texts.

In the EAGLES recommendations on corpus typology (EAGLES, 1996e), a corpus is defined as:

Corpus: A collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.

Words such as 'collection' and 'archive' refer to sets of texts that do not need to be selected, or do not need to be ordered, or whose selection and/or ordering do not need to be based on linguistic criteria. They are therefore quite unlike corpora.

Linguistic criteria to be applied to the selection and ordering may be:

  • external, in that they concern the participants, the occasion, the social setting or the communicative function of the pieces of language;

  • internal, in that they concern the recurrence of language patterns within the pieces of language.

These criteria are reviewed in more detail in the recommendations on corpus typology (EAGLES, 1996e) where a classification of different types of corpora can also be found.

Since this document is devoted to computer corpora, it is appropriate to start with the definition also proposed in the above document:

Computer corpus: a corpus which is encoded in a standardised and homogeneous way for open-ended retrieval tasks.

Source: ILC

Blog Post II – Outline for the report

April 1, 2008 at 12:25 pm | Posted in Language Resources

Our topic is: Multilingual Corpus Resources.

We will take the information from this link: Joseba Abaitua – wiki. This is, more or less, what we are going to do with the topic. We still have to choose the pages, because there are quite a lot of them and some are quite interesting. For each page, we could look at:

  • all the search options they offer
  • what results they return
  • how they could be improved
  • their history: when they were created, for what purpose, and who created them
  • how they compare with similar sites: whether they are better or worse, whether they offer more or less, which tools they provide, etc.

We could use pages like this one:


The European Language Resources Association

February 26, 2008 at 12:37 pm | Posted in Language Resources

The European Language Resources Association (ELRA) was established as a non-profit organisation in Luxembourg in February 1995.

ELRA is the driving force in making language resources available for language engineering and in evaluating language engineering technologies.

In order to achieve this goal, ELRA is active in:

  • identification
  • distribution
  • collection
  • validation
  • standardisation
  • improvement
  • promoting the production of language resources
  • supporting the infrastructure to perform evaluation campaigns and in developing a scientific field of language resources
  • evaluation
