Machine translation courses for the languages of Russia/Session 8

From Apertium

Latest revision as of 12:00, 31 January 2012

The main idea of this session is to show that making a new translation system with Apertium need not be a multi-million euro undertaking spanning several years. A good proportion of the systems developed with Apertium have taken only a few person-months to develop, by taking advantage of existing resources and the extensible nature of the engine.

While more time and money are often an effective way of making a better machine translation system, a lot can be said for realistic expectations, careful planning and effective contributions by volunteer developers.

Case studies

The following case studies highlight four different successful development styles.

Spanish and Catalan

Long-term public funding, several developers. Catalan has around nine million speakers.

The Spanish to Catalan translator is the oldest translator in Apertium. It is a rewrite and expansion of the interNOSTRUM translator that was developed at the Universitat d'Alacant. In total, it has been developed over a period of around 12 years. The original interNOSTRUM was released in early 2000 and took around 72 person-months (four people, 18 months) to develop, covering both the engine and the linguistic data. It was widely used, with thousands of requests per day.

In 2004, the Apertium project was started with funding from the Ministry of Science, Industry and Commerce of the Spanish State to rewrite the code as open source and to convert the linguistic data. After one person-year, the first version of the Spanish–Catalan translator was released.

The quality of the translator is very high and comparable to that of commercial systems: coverage is over 95% (at most around 5 unknown words per 100 words), and the word-error rate is between 3% and 7% (out of 100 words, between 3 and 7 need to be changed in order to get an adequate translation). It is the second most used translator on the Apertium website, and is the main machine translation system of a number of universities in Spain for this language pair.
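The two measures quoted here, coverage and word-error rate, can be sketched in a few lines of Python. This is only an illustrative sketch, not the evaluation code used by the project: the function names are made up for this example, unknown words are assumed to be marked with a leading asterisk, and the word-error rate is taken to be the word-level edit distance between the machine output and a corrected reference, divided by the length of the reference.

```python
def coverage(tokens, is_known):
    """Fraction of tokens that the analyser recognises."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if is_known(token))
    return known / len(tokens)


def word_error_rate(hypothesis, reference):
    """Word-level edit distance divided by the reference length."""
    if not reference:
        return 0.0
    # prev[j] holds the distance between the hypothesis words seen so far
    # and the first j reference words.
    prev = list(range(len(reference) + 1))
    for i, hyp_word in enumerate(hypothesis, start=1):
        curr = [i]
        for j, ref_word in enumerate(reference, start=1):
            substitution = prev[j - 1] + (0 if hyp_word == ref_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)


# Assumption for this example: unknown words carry a leading "*".
analysed = "la *xyz casa es gran".split()
example_coverage = coverage(analysed, lambda t: not t.startswith("*"))

machine_output = "the house is big".split()
post_edited = "the house is large".split()
example_wer = word_error_rate(machine_output, post_edited)
```

In the toy data, one of five tokens is unknown, and one of four words has to be changed to reach the corrected translation.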

Norwegian Nynorsk and Norwegian Bokmål

Short-term funding from a competition, one developer. Nynorsk is the "preferred standard" of approximately 580,000 Norwegians.

The Nynorsk to Bokmål translator is the most-used translator on our webpage. It was started in 2008 by Francis Tyers and Trond Trosterud, using existing resources such as the Norsk Ordbank (a large full-form list of words in Nynorsk and Bokmål) and the Oslo-Bergen tagger (a constraint-grammar-based disambiguator for both varieties of Norwegian).

The original implementation was never completed, but the project was taken up again in 2009 during the Google Summer of Code competition by Kevin Unhammer, a master's student in computational linguistics at the University of Bergen. Over a period of three months, Kevin completely redid the conversion of both the Ordbank and the constraint grammar, and wrote a series of transfer rules.

He spent two weeks converting the Ordbank to Apertium format, then another week converting the Oslo-Bergen tagger. Three weeks went on transfer rules, and another three weeks on expanding the dictionaries. Two weeks were then spent on "cleaning up" tasks, e.g. making sure that only words that were in all three dictionaries were included. The final week was spent on evaluation.

The final coverage of the system was around 90%, i.e. on average 10 unknown words out of 100 over a set of corpora. The word-error rate was around 17%, meaning that out of 100 words, 17 have to be changed in order to get an adequate translation. This is competitive with the other available system for Nynorsk–Bokmål translation. The system today accounts for over a third of all translations carried out on the Apertium website.
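As a quick check, the week counts given for each phase add up to the three months mentioned earlier. The short sketch below tallies them and shows the share of time each phase took; the task labels are shorthand introduced for this example.

```python
# Weeks per phase, as described in the text; labels are shorthand.
schedule_weeks = {
    "convert Norsk Ordbank": 2,
    "convert Oslo-Bergen tagger": 1,
    "write transfer rules": 3,
    "expand dictionaries": 3,
    "clean-up": 2,
    "evaluation": 1,
}

total_weeks = sum(schedule_weeks.values())

# Share of the total effort spent on each phase.
shares = {task: weeks / total_weeks for task, weeks in schedule_weeks.items()}

for task, share in shares.items():
    print(f"{task}: {schedule_weeks[task]} weeks ({share:.0%})")
print(f"total: {total_weeks} weeks")
```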

Breton and French

Medium-term volunteer effort with very short-term public/private funding and several developers. Breton is spoken by around 200,000 people.

Work on the Breton–French translator was started in 2008 by Francis Tyers in his spare time. After three months, a proof-of-concept system using transfer rules from the French–Spanish pair was presented at Ofis ar Brezhoneg in December. It was decided that funding would be found to support another month of development to finish a prototype system which would be useful for assimilation purposes.

In the end, funding for the travel of a Breton speaker to Alacant was arranged by Ofis ar Brezhoneg, the Universitat d'Alacant paid for a month's wages of the Breton speaker, and Prompsit Language Engineering paid for a month of Francis Tyers' time. A further two months. In total, the monetary cost was around €3,000. The first version of the translator was released in May of 2009.

The first version had a coverage of around 85% and a word-error rate which, although high, still meant that the translator was useful for assimilation. The system is today available on the homepage of Ofis ar Brezhoneg (http://www.ofis-bzh.org/fr/ressources_linguistiques/index-troerofis.php), and is updated by staff of the Ofis, including its director, Fulup Jakez.

Spanish and Aragonese

Medium-term volunteer effort with no public funding and two developers. Aragonese has around 10,000 speakers.

Work on the Spanish–Aragonese translator was begun by Apertium developer Jim O'Regan, at the request of Aragonese speaker Juan Pablo Martínez. After three weeks of initial effort, spread over the course of a year, a final week of concentrated effort led to the release of the first prototype version, translating from Aragonese to Spanish only.

The first bidirectional version was completed after another six weeks of work by Juan Pablo, spread over the course of another year. The only resources available for Aragonese at the beginning of this work were the Aragonese edition of Wikipedia and a handful of verb templates on the English edition of Wiktionary. The Aragonese–Spanish dictionary was created by hand, but the Spanish morphological analyser/generator and part-of-speech tagger were taken from the Spanish–Catalan pair. No funding was received from any source towards the creation of the system.

Contributing factors

Existing resources

When linguistic resources (for example corpora, dictionaries, grammars, morphological analysers or lists of lemmata) are available under free/open-source licences, they can be reused to save development time. However, the amount of time needed for adaptation should not be underestimated.

A morphological transducer designed for spell checking might make a great spell checker, but it might not be so easy to adapt it for analysis/generation in a machine translation system. Different applications have different requirements, and this should be taken into account when deciding to reuse an existing resource, to adapt it, or to start from scratch.

Objectives

It is important when starting a project to ask questions and clearly define objectives, for example:

  • Who is the target audience?
    • Do we want the system to be used by professional translators, by lay translators, or by ordinary members of the public?
  • What is the system intended to be used for?
    • Assimilation: For giving an idea of what a text is about
    • Dissemination: For producing draft translations
    • Domain: Will it be used for news texts, encyclopaedic texts, legal texts, the weather, etc.?
  • What existing linguistic resources can be reused?
    • Are there already good, free dictionaries available?
  • How long do we have to build the system?
    • Six months is probably not enough time to build your ideal wide-domain interlingua MT system between all the languages of the Middle Volga, ...
      • ... but it might be enough to build some prototype systems for translating the weather forecast.

If a system is being planned to translate governmental texts for dissemination, then it will necessarily have different features than if it is planned to translate Wikipedia articles for assimilation.

Making a high-coverage, open-domain system for both assimilation and dissemination is a nice idea, but in practice it is not possible with limited resources.

Time

The amount of time taken to make a new language pair based on the Apertium platform depends greatly on the objectives of the project, the existing resources available and the experience of the developers. A prototype or proof-of-concept system can be created in anywhere from 10 days to 3 months, whereas a production system can take from 3 months (as in the case of Nynorsk–Bokmål) to several years.

The following table summarises the development of language pairs in Apertium between 2005 and 2011.

{|class="wikitable"
! Year !! Total pairs !! New pairs !! Language pairs
|-
| 2005 || 3 || 3 || es-ca, es-gl, es-pt
|-
| 2006 || 6 || 3 || es-ca, es-gl, es-pt, en-ca, fr-ca, oc-ca
|-
| 2007 || 8 || 2 || es-ca, es-gl, es-pt, en-ca, fr-ca, oc-ca, es-ro, fr-es
|-
| 2008 || 18 || 10 || es-ca, es-gl, es-pt, en-ca, fr-ca, oc-ca, es-ro, fr-es, oc-es, en-gl, cy-en, eo-ca, eo-es, eu-es, pt-gl, eo-en, en-es, pt-ca
|-
| 2009 || 21 || 3 || es-ca, es-gl, es-pt, en-ca, fr-ca, oc-ca, es-ro, fr-es, oc-es, en-gl, cy-en, eo-ca, eo-es, eu-es, pt-gl, eo-en, en-es, pt-ca, nn-nb, sv-da, br-fr
|-
| 2010 || 23 || 2 || es-ca, es-gl, es-pt, en-ca, fr-ca, oc-ca, es-ro, fr-es, oc-es, en-gl, cy-en, eo-ca, eo-es, eu-es, pt-gl, eo-en, en-es, pt-ca, nn-nb, sv-da, br-fr, es-ast, mk-bg
|-
| 2011 || 33 || 10 || es-ca, es-gl, es-pt, en-ca, fr-ca, oc-ca, es-ro, fr-es, oc-es, en-gl, cy-en, eo-ca, eo-es, eu-es, pt-gl, eo-en, en-es, pt-ca, nn-nb, sv-da, br-fr, es-ast, mk-bg, is-en, ca-it, eo-fr, mk-en, es-an, eu-en, es-it, sh-mk, tr-az, tr-ky
|-
|}
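The "Total pairs" column is simply the running sum of the "New pairs" column. As a small sketch, the pairs first listed in each year can be recorded once and the totals derived from them; the variable names are chosen for this example only.

```python
from itertools import accumulate

# Pairs that appear for the first time in each year, read off the table.
new_pairs_by_year = {
    2005: ["es-ca", "es-gl", "es-pt"],
    2006: ["en-ca", "fr-ca", "oc-ca"],
    2007: ["es-ro", "fr-es"],
    2008: ["oc-es", "en-gl", "cy-en", "eo-ca", "eo-es",
           "eu-es", "pt-gl", "eo-en", "en-es", "pt-ca"],
    2009: ["nn-nb", "sv-da", "br-fr"],
    2010: ["es-ast", "mk-bg"],
    2011: ["is-en", "ca-it", "eo-fr", "mk-en", "es-an",
           "eu-en", "es-it", "sh-mk", "tr-az", "tr-ky"],
}

# Running total of released pairs at the end of each year.
new_counts = [len(pairs) for pairs in new_pairs_by_year.values()]
totals_by_year = dict(zip(new_pairs_by_year, accumulate(new_counts)))

for year, total in totals_by_year.items():
    print(year, total, len(new_pairs_by_year[year]))
```

Keeping only the new pairs per year avoids repeating the full cumulative list in every row while still reproducing the totals.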

Funding

As we saw in the case studies, Apertium language pairs have been funded and created in many different ways. The following table summarises the ways in which development of the "stable" machine translation systems in Apertium was funded. The most used systems online are highlighted in bold face.

{|class="wikitable"
! Funder !! Type !! Language pair(s)
|-
| Google Summer of Code || Competition || mk-bg, '''nn-nb''', sh-mk, sv-da, tr-az, tr-ky
|-
| Volunteers || || eo-fr, es-an, mk-en, ca-it, eo-en
|-
| Generalitat de Catalunya || Regional government || oc-ca, oc-es, en-ca
|-
| Thesis / Dissertation || || cy-en, fr-ca, pt-ca
|-
| Universitat d'Alacant || Educational institution || eu-es, (br-fr), es-pt
|-
| Ministry of Industry, Commerce and Tourism || National government || '''es-ca''', es-gl
|-
| ABC Enciklopedioj || Company || eo-es, eo-ca
|-
| imaxin<nowiki>|</nowiki>software || Company || en-gl, pt-gl
|-
| Universidá d'Uviéu || Educational institution || es-ast
|-
| Prompsit || Company || es-it, (br-fr), (fr-es)
|-
| Eleka Ingenieritza Linguistikoa || Company || fr-es
|-
| Icelandic Research Council || National government || is-en
|-
| Ofis ar Brezhoneg || Quasi non-governmental organisation || br-fr
|-
| European Assoc. Machine Translation || Non-governmental organisation || eu-en
|-
|}

Practical

  • Search for existing linguistic resources for your language pair, making a note of the licence they are under.
  • Define a realistic work/time-plan for making a new translator for a given language pair with Apertium taking into account:
    • Existing resources
    • What the translator will be used for
    • Manpower
    • Political, financial and community support