User:Darshak/Application

Revision as of 07:52, 21 March 2014

Contact Details

Name: Darshak Parikh
Email: darshak@openmailbox.org

SourceForge username: tenebr050
IRC nick: tenebr050
Telegram: I could share my number later.

Location: Ahmedabad, India
Time zone: UTC+05:30

Proposal

The English-Esperanto language pair in Apertium is currently in a working but mediocre state. This project aims to improve it so that translations become much more reliable than they are now, especially in the En>Eo direction.

Why this project should be sponsored

The whole idea behind Esperanto is to ease communication between people with different native languages. People traveling to countries whose national language they do not know often rely on Esperanto, for there are Esperantists in almost every country, however few. It is truly the internacia lingvo that Dr Zamenhof intended it to be.

English, on the other hand, is one of the five most spoken languages on the planet. It is arguably the lingua franca of science, gadgets, programming, the Internet, and many other things.

Currently, there is no single go-to platform for MT between these two languages. Two popular platforms are Google Translate and GramTrans, but they have their limitations:

  1. They are not available offline, therefore less accessible.
  2. They are not open source. Not everybody can contribute.
  3. They are not free as in freedom. GramTrans even disallows commercial usage by default.

Apertium has none of these limitations, and that is what makes it a viable development ground for this (or any other) language pair.

Current scenario

Translation errors most commonly arise in the following areas:

  • Tenses. More often than not, tenses are misunderstood. For example, the simple past tense (-is) might be confused with the past passive participle (-ita), and vice versa.
  • Prepositions. These work VERY differently in the two languages, and there is no one-to-one mapping. For example, from might mean de or el depending on the context.
  • Inflection. At times, you might notice incorrectly identified case or number.
  • Part-of-speech ambiguities. Pairs like Mars (the planet) and mars (blots, third person singular) are not resolved correctly.
  • Homonym ambiguities. Cases like to look (great) and to look (at something) are also mistranslated.

Solutions

For the aforementioned issues, I propose the following solutions:

  • Add a constraint grammar that is detailed enough to handle tenses and prepositions correctly. The CG will also handle part-of-speech ambiguities.
  • Improve the structural transfer rules, for better translation of prepositions. For instance, there could be a rule stating that when from comes before a toponym, it should translate to el.
  • Improve the lexical selection rules in order to solve homonym ambiguities.
  • Train the tagger further, which is a good way to reduce inflection errors.
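
To make the preposition rule above concrete, here is a minimal Python sketch of the idea, not of Apertium's actual transfer rule format. The TOPONYMS set and both function names are hypothetical; a real rule would rely on the toponym tags already present in the dictionaries.

```python
# Hypothetical toy gazetteer; a real rule would use the dictionary's toponym tags.
TOPONYMS = {"London", "India", "Paris"}


def translate_from(next_word: str) -> str:
    """Pick an Esperanto rendering of English 'from' based on the following word."""
    # Before a toponym use "el"; otherwise fall back to "de".
    return "el" if next_word in TOPONYMS else "de"


def translate_prepositions(words):
    """Replace each 'from' in a list of words; leave every other word untouched."""
    out = []
    for i, word in enumerate(words):
        if word.lower() == "from":
            next_word = words[i + 1] if i + 1 < len(words) else ""
            out.append(translate_from(next_word))
        else:
            out.append(word)
    return out
```

For example, translate_prepositions(["from", "London"]) picks el, while "from" followed by a word outside the toy list falls back to de.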

Work Plan

Before the coding period

  • Play around with CG and LS rules. Maybe even add a few.
  • Keep adding whatever on earth I come across (proper names, multiwords, etc.).

Weeks 1-4*

Refine the structural transfer rules related to prepositions, with an aim to achieve perfection in preposition-handling.

Week 5

Add lexical selection rules to handle the most common homonyms. (Wikipedia has a list of around 100 homonyms. Could be pretty useful.)
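
As an illustration of what such a rule decides, the look example from earlier could be sketched in Python as below. This is only a sketch of the selection logic, not Apertium's lexical selection rule syntax; the function name is hypothetical, and treating aspekti as the default is an assumption.

```python
def select_look(next_word: str) -> str:
    """Choose an Esperanto verb for English 'look' from the word that follows it."""
    # "look at something" -> rigardi; otherwise (e.g. "look great") -> aspekti.
    # Defaulting to aspekti is an assumption made for this sketch.
    return "rigardi" if next_word.lower() == "at" else "aspekti"
```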

Deliverable 1

A much refined translation with far better preposition mapping and homonym disambiguation.

Week 6

Start off the constraint grammar by adding rules for handling case and number inflections.

Week 7

Add CG rules to distinguish the infinitive, simple present, and imperative forms of English verbs, and to distinguish the simple past from the past participle, since these often share the same surface form.
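
A rough Python sketch of the kind of context test such a rule might apply to the simple past versus past participle case. It is not CG syntax; the auxiliary list and function name are assumptions, and it ignores intervening adverbs (as in "has already walked").

```python
# Assumed list of auxiliaries that typically precede a past participle.
AUXILIARIES = {"have", "has", "had", "is", "are", "was", "were", "be", "been", "being"}


def disambiguate_past_form(prev_word: str) -> str:
    """Return 'pp' (past participle) or 'past' for an ambiguous form such as 'walked'."""
    # Select the participle reading only when an auxiliary directly precedes the verb.
    return "pp" if prev_word.lower() in AUXILIARIES else "past"
```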

Weeks 8-10

Add CG rules to handle the more complex Esperanto participles (-int-, -ant-, -ont-, -it-, -at-, -ot-). Also resolve the suffix issues (-anta vs -ante vs -anto).

Deliverable 2

A robust En>Eo translation with a greatly reduced word error rate (WER).
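
For reference, WER is commonly computed as the word-level edit distance between the system output and a reference translation, divided by the reference length. The sketch below is a minimal Python version under that definition; it is not the evaluation script Apertium uses, and the function name is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

With the reference "mi venas el Londono" and the output "mi venas de Londono", one substitution over four reference words gives a WER of 0.25.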

Weeks 11-12

Thorough testing of everything done until now, complete with necessary bugfixes.

Project complete


*I have my university exams around the second half of May, dates not declared. In total, I might be occupied for up to two weeks into the coding period. However, to make up for it, I am doing nothing else after exams, and will be available all day, until the end of GSoC. So you can expect 40-45 hours of work per week.


A bit about me

I am an IT student based in Ahmedabad, India. I've been a GNU/Linux fanboy for over two years now, and am quite committed to using only libre/open source software.

I'm quite interested in languages, and have learnt Spanish and Esperanto. Besides, I live in a city where almost everyone can speak English, Gujarati and Hindi. So theoretically, I'm pentalingual.

You can find more about me here: https://thedubiousdisc.wordpress.com/darshak/