Difference between revisions of "User:Darshak/Application"
[[Category:GSoC 2014 Student proposals|Darshak]] |
Latest revision as of 08:36, 22 March 2014
Contact Details
Name: Darshak Parikh
Email: darshak@openmailbox.org
SourceForge username: tenebr050
IRC nick: tenebr050
Telegram: I could share my number later.
Location: Ahmedabad, India
Time zone: UTC+05:30
Proposal: Make the English-Esperanto pair state-of-the-art
The English-Esperanto language pair in Apertium is currently in a working but mediocre state. This project aims to further enhance it, so as to make the translations much more reliable than they are now, especially in the En>Eo direction.
Why this project should be sponsored
The whole idea behind Esperanto is to ease communication between people with different native languages. People traveling to countries whose national language they do not know often rely on Esperanto, for there are Esperantists in almost every country, however few. It is truly the internacia lingvo that Dr Zamenhof intended it to be.
English, on the other hand, is one of the five most spoken languages on the planet. It is arguably the lingua franca of science, gadgets, programming, the Internet, and many other things.
Currently, there is no single go-to platform for MT between these two languages. Two popular platforms are Google Translate and GramTrans, but they have their limitations:
- They are not available offline, therefore less accessible.
- They are not open source. Not everybody can contribute.
- They are not free as in freedom. GramTrans even disallows commercial usage by default.
Apertium is free from all of these limitations, and that is what makes it a viable development ground for this (or any other) language pair.
Current scenario
Currently, Apertium's En>Eo translation works reasonably well for simple sentences, but when something more natural is thrown at it, a lot of errors can be seen. Most translation errors fall into these areas:
- Tenses. More often than not, verb forms are misidentified. For example, the simple past (-is) might be confused with the past passive participle (-ita), and vice versa.
- Prepositions work VERY differently in the two languages, and there is no one-to-one mapping. For example, from might mean de or el depending on the context.
- Inflection. At times, you might notice incorrectly identified case or number.
- Part-of-speech ambiguities, like Mars (the planet) and mars (spoils, third person singular), are not correctly resolved.
- Homonym ambiguities, like to look (great) and to look (at something), also exist.
The goal of this project is to resolve all these errors and refine the En>Eo translation so as to bring it to a state-of-the-art level.
Solutions
For the aforementioned issues, I propose the following solutions:
- Add a constraint grammar that is detailed enough to correctly handle tenses and prepositions. Part-of-speech ambiguities will also be handled by the CG.
- Improve the structural transfer rules, for better translation of prepositions and case/number inflections.
- Improve the lexical selection rules in order to solve homonym ambiguities.
- More tagger training is also a good way to solve inflection issues.
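As a rough illustration of what a constraint grammar does, the Python sketch below removes readings from an ambiguous word based on its neighbour. The tag names (det, n.sg, vblex.inf, vblex.pres) and the single rule are hypothetical; they are not taken from the existing language pair, and a real grammar would be written as CG rules rather than Python.

```python
# Toy illustration of constraint-grammar style disambiguation.
# Tag names and the rule itself are hypothetical examples.

def disambiguate(sentence):
    """sentence is a list of (surface_form, set_of_readings) pairs.

    Returns a new list in which contextual rules have removed readings.
    As in constraint grammar, the last remaining reading is never removed.
    """
    result = []
    for i, (surface, readings) in enumerate(sentence):
        readings = set(readings)
        previous = result[i - 1][1] if i > 0 else set()
        # Rule: directly after an unambiguous determiner, drop verb readings.
        if previous == {"det"}:
            remaining = {r for r in readings if not r.startswith("vblex")}
            if remaining:
                readings = remaining
        result.append((surface, readings))
    return result


sentence = [
    ("the", {"det"}),
    ("look", {"n.sg", "vblex.inf", "vblex.pres"}),
]
for surface, readings in disambiguate(sentence):
    print(surface, sorted(readings))
```

Here "look" keeps only its noun reading because it follows a determiner. The rule removes readings instead of selecting one, so a word with no matching context stays ambiguous for later rules or for the tagger.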
Work Plan
Before the coding period
Play around with CG and LS rules. Keep adding whatever on earth I come across (proper names, multiwords, etc.).
Weeks 1-4*
Refine the structural transfer rules, with the prime focus on achieving perfection in preposition handling.
Week 5
Add lexical selection rules to handle the most common homonyms. (Wikipedia has a list of around 100 homonyms. Could be pretty useful.)
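To make the idea of a lexical selection rule concrete, here is a minimal Python sketch for the "to look" example mentioned earlier. The function name, the tiny adjective list and the default choice are all assumptions made for illustration; this is not Apertium's actual rule format.

```python
# Toy lexical selection for the English verb "look".
# ADJECTIVES and select_look are hypothetical, for illustration only.

ADJECTIVES = {"great", "tired", "happy"}


def select_look(tokens, index):
    """Choose an Esperanto lemma for 'look' at tokens[index] from the next word."""
    following = tokens[index + 1] if index + 1 < len(tokens) else None
    if following == "at":
        return "rigardi"  # to look at something
    if following in ADJECTIVES:
        return "aspekti"  # to look (appear) great
    return "rigardi"      # assumed default when no rule matches


print(select_look(["you", "look", "great"], 1))
print(select_look(["look", "at", "the", "sky"], 0))
```

Each rule inspects only the immediate right context, which is enough for this pair of senses; other homonyms would likely need wider context.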
Deliverable 1
A much-refined translation with far better preposition mapping and homonym disambiguation. Word error rate (WER) is expected to go down by 10-15%.
Week 6
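For reference, WER can be computed as the word-level edit distance between the system output and a reference translation, divided by the length of the reference. The sketch below is a plain Python version of that calculation, not the evaluation script used by the project. The relative_reduction helper assumes the percentage targets are relative reductions, which the proposal does not state explicitly.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


def relative_reduction(before, after):
    """Fraction by which WER dropped, relative to the starting value."""
    return (before - after) / before


# One wrong word out of four reference words.
print(word_error_rate("mi legas la libron", "mi legis la libron"))  # 0.25
```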
Start off the constraint grammar by adding rules for handling case and number inflections.
Week 7
Add CG rules to differentiate among infinitive, simple present and imperative English verbs, and between simple past and past participle ones, for they often share the same surface form.
Weeks 8-10
Add CG rules and structural transfer rules to handle the more complex Esperanto participles (-int-, -ant-, -ont-, -it-, -at-, -ot-). Also resolve the suffix issues (-anta vs -ante vs -anto).
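The participle system is regular, which the following Python sketch makes explicit: a tense vowel (a, i, o), a voice marker (nt for active, t for passive) and a word-class ending (a, e, o) are attached to the root. The function and dictionary names are hypothetical, and the sketch only generates forms; choosing the right one from English input is the hard part that the rules have to solve.

```python
# Compose Esperanto participle forms from their regular parts.
TENSE_VOWEL = {"present": "a", "past": "i", "future": "o"}
VOICE_MARKER = {"active": "nt", "passive": "t"}
ENDING = {"adjective": "a", "adverb": "e", "noun": "o"}


def participle(root, tense, voice, word_class):
    """Build a participle such as 'leganta' from the root 'leg' (read)."""
    return root + TENSE_VOWEL[tense] + VOICE_MARKER[voice] + ENDING[word_class]


# The three suffix variants of the present active participle.
for word_class in ("adjective", "adverb", "noun"):
    print(participle("leg", "present", "active", word_class))
```

With the root leg-, the loop prints leganta, legante and leganto, the three forms that the -anta vs -ante vs -anto distinction refers to.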
Deliverable 2
A robust En>Eo translation with WER reduced by at least 30%.
Week 11
Thorough testing of everything done until now, complete with necessary bugfixes. Some tagger training as well.
Week 12
Documentation.
Project complete
*I have my university exams around the second half of May, dates not declared. In total, I might be occupied for up to two weeks into the coding period. However, to make up for it, I am doing nothing else after exams, and will be available all day, until the end of GSoC. So you can expect 40-45 hours of work per week.
A bit about me
I am an IT student based in Ahmedabad, India. I've been a GNU/Linux fanboy for over two years now, and am committed to using only libre/open source software.
I'm quite interested in languages, and have learnt Spanish and Esperanto. Besides, I live in a city where almost everyone can speak English, Gujarati and Hindi. So theoretically, I'm pentalingual.
You can find more about me here: https://thedubiousdisc.wordpress.com/darshak/.