Difference between revisions of "Курсы машинного перевода для языков России/Session 8"

Revision as of 16:09, 8 January 2012

Case studies

The following case studies highlight different development styles.

Spanish and Catalan

The Spanish to Catalan translator is the oldest translator in Apertium. It is a rewrite and expansion of the interNOSTRUM translator that was developed at the Universitat d'Alacant. In total, it has been developed over a period of around 12 years. The original interNOSTRUM was released in early 2000 and took around 72 person-months (four people, 18 months) to develop, covering both the engine and the linguistic data. It was widely used, with thousands of requests per day.

In 2004, the Apertium project was started with funding from the Ministry of Science, Industry and Commerce of the Spanish State to rewrite the code as open source and to convert the linguistic data. After one person-year, the first version of the Spanish-Catalan translator was released.

The quality of the translator is very high and is comparable with commercial systems: coverage is over 95% (fewer than 5 unknown words per 100 words), and the word-error rate is between 3% and 7% (out of 100 words, between 3 and 7 need to be changed in order to get an adequate translation). It is the second most used translator on the Apertium website, and is the main machine translation system of a number of universities in Spain for this language pair.
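The coverage figure can be read as the share of tokens in a text that the system recognises. The following is a minimal sketch in Python, assuming whitespace tokenisation and assuming that unknown words are flagged with a leading asterisk in the output; the sample sentence and the `coverage` function are hypothetical and only illustrate the arithmetic.

```python
def coverage(tokens, unknown_marker="*"):
    """Return the fraction of tokens that are not flagged as unknown.

    Assumption: an unknown word carries `unknown_marker` as a prefix.
    """
    if not tokens:
        raise ValueError("cannot compute coverage of an empty text")
    known = sum(1 for tok in tokens if not tok.startswith(unknown_marker))
    return known / len(tokens)


# Hypothetical output: 20 tokens, one of which is flagged as unknown.
sample = ("la casa es muy grande y el *xyzzy esta en la calle "
          "cerca de la plaza del pueblo viejo hoy").split()

print(f"coverage: {coverage(sample):.0%}")
```

In practice coverage is measured over a large corpus rather than a single sentence, so that the figure is not dominated by a handful of words.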

Norwegian Nynorsk and Norwegian Bokmål

The Nynorsk to Bokmål translator is the most-used translator on our webpage. It was started in 2008 by Francis Tyers and Trond Trosterud, using existing resources such as the Norsk Ordbank (a large full-form list of words in Nynorsk and Bokmål) and the Oslo-Bergen tagger (a constraint grammar based disambiguator for both varieties of Norwegian).

The original implementation was never completed, but the project was taken up again in 2009 during the Google Summer of Code competition by Kevin Unhammer, a master's student in computational linguistics at the University of Bergen. Over a period of three months, Kevin completely redid the conversion of both the Ordbank and the constraint grammar, and wrote a series of transfer rules.

He spent two weeks converting the Ordbank to Apertium format, then another week converting the Oslo-Bergen tagger. He then spent three weeks on transfer rules and another three weeks expanding the dictionaries. Two weeks were then spent on "cleaning up" tasks, e.g. making sure that only words that were in all three dictionaries were included. The final week was spent on evaluation.

The final coverage of the system was around 90%, i.e. over a set of corpora, 10 words out of 100 were unknown on average. The word-error rate was around 17%, meaning that out of 100 words, 17 have to be changed in order to get an adequate translation. This is competitive with the other available system for Nynorsk-Bokmål translation. The system today accounts for over a third of all translations carried out on the Apertium website.
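The word-error rate can be sketched as a word-level edit distance between the machine output and a corrected (post-edited) version, divided by the length of the corrected version. This is a common definition and is assumed here; the text does not say which tool produced the figures. The function name and the example sentences are hypothetical.

```python
def word_error_rate(hypothesis, reference):
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: both arguments are plain strings tokenised on whitespace.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one word out of six had to be changed.
mt_output = "the cat sat in the mat"
post_edited = "the cat sat on the mat"
print(f"WER: {word_error_rate(mt_output, post_edited):.1%}")
```

A rate of 17% on this measure corresponds to roughly 17 such word-level changes per 100 words of corrected text.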

Breton and French

...

Existing resources

Objectives

It is important when starting a project to ask questions and clearly define objectives, for example:

Who is the target audience?
- Do we want the system to be used by professional translators, by lay translators, or by ordinary members of the public?
What is the system intended to be used for?
- Assimilation: For giving an idea of what a text is about
- Dissemination: For producing draft translations
- Domain: Will it be used for news texts, encyclopaedic texts, legal texts, the weather, etc.?
What existing linguistic resources can be reused?
- Are there already good, free dictionaries available?
How long do we have to build the system?
- Six months is probably not enough time to build your ideal wide-domain interlingua MT system between all the languages of the Middle Volga, ...
  - ... but it might be enough to build some prototype systems for translating the weather forecast.

If a system is being planned to translate governmental texts for dissemination, then it will necessarily have different features than if it is planned to translate Wikipedia articles for assimilation.

Making a high-coverage, open-domain system for both assimilation and dissemination is a nice idea, but in practice it is not possible given limited resources.

Time

The amount of time taken to make a new language pair based on the Apertium platform depends greatly on the objectives of the project, the existing resources available and the experience of the developers. A prototype or proof-of-concept system can be created in anywhere from 10 days to 3 months. A production system, by contrast, can take from 3 months (as in the case of Nynorsk-Bokmål) to several years.

Year	Total pairs	New pairs	Language pairs
2005	3	3	es-ca, es-gl, es-pt
2006	6	3	es-ca, es-gl, es-pt, en-ca, fr-ca, oc-ca
2007	8	2	es-ca, es-gl, es-pt, en-ca, fr-ca, oc-ca, es-ro, fr-es
2008	18	10	es-ca, es-gl, es-pt, en-ca, fr-ca, oc-ca, es-ro, fr-es, oc-es, en-gl, cy-en, eo-ca, eo-es, eu-es, pt-gl, eo-en, en-es, pt-ca
2009	21	3	es-ca, es-gl, es-pt, en-ca, fr-ca, oc-ca, es-ro, fr-es, oc-es, en-gl, cy-en, eo-ca, eo-es, eu-es, pt-gl, eo-en, en-es, pt-ca, nn-nb, sv-da, br-fr
2010	23	2	es-ca, es-gl, es-pt, en-ca, fr-ca, oc-ca, es-ro, fr-es, oc-es, en-gl, cy-en, eo-ca, eo-es, eu-es, pt-gl, eo-en, en-es, pt-ca, nn-nb, sv-da, br-fr, es-ast, mk-bg
2011	33	10	es-ca, es-gl, es-pt, en-ca, fr-ca, oc-ca, es-ro, fr-es, oc-es, en-gl, cy-en, eo-ca, eo-es, eu-es, pt-gl, eo-en, en-es, pt-ca, nn-nb, sv-da, br-fr, es-ast, mk-bg, is-en, ca-it, eo-fr, mk-en, es-an, eu-en, es-it, sh-mk, tr-az, tr-ky
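The "New pairs" column is the year-on-year difference of the "Total pairs" column. A short consistency check in Python, using only the totals from the table and assuming that no pairs had been released before the first listed year:

```python
# Cumulative number of released pairs per year, copied from the table.
totals = {2005: 3, 2006: 6, 2007: 8, 2008: 18, 2009: 21, 2010: 23, 2011: 33}


def new_pairs_per_year(totals):
    """Return the number of pairs added in each year."""
    new = {}
    previous = 0  # assumption: nothing released before the first year listed
    for year in sorted(totals):
        new[year] = totals[year] - previous
        previous = totals[year]
    return new


for year, added in new_pairs_per_year(totals).items():
    print(year, added)
```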

Funding

Apertium language pairs have been funded and created in many different ways.

The following table summarises the ways in which development of the "stable" machine translation systems in Apertium was funded. The most used systems online are highlighted in boldface.

Funder	Type	Language pair(s)
Google Summer of Code	Competition	mk-bg, nn-nb, sh-mk, sv-da, tr-az, tr-ky
—	Volunteers	eo-fr, es-an, mk-en, ca-it, eo-en
Generalitat de Catalunya	Regional government	oc-ca, oc-es, en-ca
—	Thesis / Dissertation	cy-en, fr-ca, pt-ca
Universitat d'Alacant	Educational institution	eu-es, (br-fr), es-pt
Ministry of Industry, Commerce and Tourism	National government	es-ca, es-gl
ABC Enciklopedioj	Company	eo-es, eo-ca
imaxin|software	Company	en-gl, pt-gl
Universidá d'Uviéu	Educational institution	es-ast
Prompsit	Company	es-it, (br-fr), (fr-es)
Eleka Ingenieritza Linguistikoa	Company	fr-es
Icelandic Research Council	National government	is-en
Ofis ar Brezhoneg	Quasi non-governmental organisation	br-fr
European Association for Machine Translation	Non-governmental organisation	eu-en
