User:Darshak/Application

Contact Details

Name: Darshak Parikh
Email: darshak@openmailbox.org

SourceForge username: tenebr050
IRC nick: tenebr050
Telegram: I could share my number later.

Location: Ahmedabad, India
Time zone: UTC+05:30

Proposal: Make the English-Esperanto pair state-of-the-art

The English-Esperanto language pair in Apertium is currently in a working but mediocre state. This project aims to improve it so that translations become much more reliable than they are now, especially in the En>Eo direction.

Why this project should be sponsored

The whole idea behind Esperanto is to ease communication between people with different native languages. People traveling to countries whose national language they do not know often rely on Esperanto, for there are Esperantists in almost every country, however few. It is truly the internacia lingvo that Dr Zamenhof intended it to be.

English, on the other hand, is one of the five most spoken languages on the planet. It is arguably the lingua franca of science, gadgets, programming, the Internet, and many other things.

Currently, there is no single go-to platform for MT between these two languages. Two popular platforms are Google Translate and GramTrans, but they have their limitations:

  1. They are not available offline, and are therefore less accessible.
  2. They are not open source. Not everybody can contribute.
  3. They are not free as in freedom. GramTrans even disallows commercial usage by default.

Apertium is free from all of these limitations, and that is what makes it a viable development ground for this (or any other) language pair.

Current scenario

Currently, Apertium's En>Eo translation works reasonably well for simple sentences, but when it is given more natural text, many errors appear. They most commonly arise in the following areas:

  • Tenses. More often than not, tenses are misunderstood. For example, the simple past (-is) might be confused with the past passive participle (-ita), and vice versa.
  • Prepositions. These work very differently in the two languages, and there is no one-to-one mapping. For example, from might mean de or el depending on the context.
  • Inflection. At times, you might notice incorrectly identified case or number.
  • Part-of-speech ambiguities. Words like Mars (the planet) and mars (blots, third person singular) are not correctly distinguished.
  • Homonym ambiguities. Cases like to look (great) and to look (at something) are also mishandled.

The goal of this project is to resolve all these errors and refine the En>Eo translation so as to bring it to a state-of-the-art level.

Solutions

For the aforementioned issues, I propose the following solutions:

  • Add a constraint grammar detailed enough to handle tenses and prepositions correctly. The CG will also handle part-of-speech ambiguities.
  • Improve the structural transfer rules, for better translation of prepositions and case/number inflections.
  • Improve the lexical selection rules in order to solve homonym ambiguities.
  • Train the tagger further, which should also help with inflection issues.
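
To make the idea of a lexical selection rule concrete, here is a minimal Python sketch for the "to look" example mentioned earlier. It is only a toy: it does not use Apertium's actual rule format, and the rule table and the function name select_translation are hypothetical. It assumes that "look at" maps to rigardi and that other uses of "look" map to aspekti.

```python
# Toy illustration only: this is NOT Apertium's lexical selection format.
# The rule table and the function name are hypothetical.

# Each rule: (English lemma, required next word or None for default, Esperanto lemma)
RULES = [
    ("look", "at", "rigardi"),  # "to look at something"
    ("look", None, "aspekti"),  # "to look great" (default sense)
]


def select_translation(lemma, next_word):
    """Return the Esperanto lemma of the first rule whose context matches."""
    for source, context, target in RULES:
        if source == lemma and (context is None or context == next_word):
            return target
    return None  # no rule is defined for this lemma


print(select_translation("look", "at"))     # rigardi
print(select_translation("look", "great"))  # aspekti
```

Rules are tried in order, so more specific contexts have to be listed before the default. A real rule set would need much richer context than a single following word.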

Work Plan

Before the coding period

Play around with CG and LS rules. Keep adding whatever on earth I come across (proper names, multiwords, etc.).

Weeks 1-4*

Refine the structural transfer rules, with the prime focus on getting preposition handling right.

Week 5

Add lexical selection rules to handle the most common homonyms. (Wikipedia has a list of around 100 homonyms. Could be pretty useful.)

Deliverable 1

A much more refined translation with far better preposition mapping and homonym disambiguation. WER is expected to go down by 10-15%.
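
For reference, WER (word error rate) is commonly computed as the word-level edit distance between a reference translation and the system output, divided by the number of words in the reference. The sketch below shows that calculation in Python. The function name word_error_rate and the example sentences are illustrative assumptions, and this is not the evaluation script Apertium itself uses.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: one wrong preposition out of five words.
print(word_error_rate("mi venas el la urbo", "mi venas de la urbo"))  # 0.2
```

A single mistranslated preposition in a five-word sentence already costs 20% WER, which is one reason preposition handling comes first in the plan.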

Week 6

Start off the constraint grammar by adding rules for handling case and number inflections.

Week 7

Add CG rules to distinguish among infinitive, simple present, and imperative English verbs, and between simple past and past participle forms, since these often share the same surface form.

Weeks 8-10

Add CG rules and structural transfer rules to handle the more complex Esperanto participles (-int-, -ant-, -ont-, -it-, -at-, -ot-). Also resolve the suffix issues (-anta vs -ante vs -anto).
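
To show the size of the target space, the following Python sketch builds the participle forms from a verb root. The six suffixes combine voice (active or passive) with tense (present, past, or future), and the endings -a, -e, and -o give the adjectival, adverbial, and nominal forms. The function and table names are hypothetical, and the sketch ignores the plural and accusative endings.

```python
# Hypothetical helper tables for illustration; not part of Apertium.
PARTICIPLE_SUFFIXES = {
    ("active", "present"): "ant",
    ("active", "past"): "int",
    ("active", "future"): "ont",
    ("passive", "present"): "at",
    ("passive", "past"): "it",
    ("passive", "future"): "ot",
}

ENDINGS = {"adjective": "a", "adverb": "e", "noun": "o"}


def participle(root, voice, tense, category):
    """Build an Esperanto participle from a verb root, e.g. leg- (to read)."""
    return root + PARTICIPLE_SUFFIXES[(voice, tense)] + ENDINGS[category]


print(participle("leg", "active", "present", "adjective"))  # leganta
print(participle("leg", "active", "present", "adverb"))     # legante
print(participle("leg", "active", "present", "noun"))       # leganto
print(participle("leg", "passive", "past", "adjective"))    # legita
```

Generating the forms is the easy part. The work in these weeks is choosing the right one of the 18 combinations from the English context.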

Deliverable 2

A robust En>Eo translation with WER reduced by at least 30%.

Week 11

Thorough testing of everything done until now, complete with necessary bugfixes. Some tagger training as well.

Week 12

Documentation.

Project complete


*I have my university exams around the second half of May; the dates have not been declared yet. In total, I might be occupied for up to two weeks into the coding period. To make up for it, I have no other commitments after the exams and will be available all day until the end of GSoC, so you can expect 40-45 hours of work per week.


A bit about me

I am an IT student based in Ahmedabad, India. I've been a GNU/Linux fanboy for over two years now, and am committed to using only libre/open source software.

I'm quite interested in languages, and have learnt Spanish and Esperanto. Besides, I live in a city where almost everyone can speak English, Gujarati and Hindi. So theoretically, I'm pentalingual.

You can find more about me here: https://thedubiousdisc.wordpress.com/darshak/.