Difference between revisions of "User:Pmodi/GSOC 2020 proposal: Hindi-Punjabi"

From Apertium
Jump to navigation Jump to search
Line 50: Line 50:
   
 
It should be added that, although Google translations tend to be more phraseological than the ones obtained by rules, they are also much more difficult to post-edit. The reason is that, while the translation by rules often makes evident and even expected errors, the neuronal translation significantly changes the text, reordering parts of the sentence, removing or putting words, changing singular to plural or plural to singular (!), and modifying expressions. The evaluation of whether the meaning is the same as the original requires a lot more time. This has been quite clear when I have made the post-edition of both the Apertium and Google translations for the Italian and Portuguese texts.
 
It should be added that, although Google translations tend to be more phraseological than the ones obtained by rules, they are also much more difficult to post-edit. The reason is that, while the translation by rules often makes evident and even expected errors, the neuronal translation significantly changes the text, reordering parts of the sentence, removing or putting words, changing singular to plural or plural to singular (!), and modifying expressions. The evaluation of whether the meaning is the same as the original requires a lot more time. This has been quite clear when I have made the post-edition of both the Apertium and Google translations for the Italian and Portuguese texts.
  +
  +
== Skills ==
  +
I'm currently a third year student at IIIT Hyderabad where I'm studying Computational Linguistics. It is a dual degree course where we study Computer Science, Linguistics, NLP and more. I'm working on tasks relating to Summarization and knowledge graphs.
  +
  +
I've been interested in linguistics from the very beginning and due to the rigorous programming courses, I'm also adept at several programming languages like Python, C++, XML, Bash Scripting, etc. I'm skilled in writing Algorithms. Data Structures, and Machine Learning Algorithms as well.
  +
  +
I enjoy reading research papers and I have analysed several research papers in preparation for this proposal, all of which have been cited. I also have a lot of experience studying data which I feel is essential in solving any problem.
  +
  +
Due to the focused nature of our course, I have worked in several projects, such as building Translation Memory, Detecting Homographic Puns, POS Taggers, Grammar and Spell Checkers, Named Entity Recognisers, Building Chatbots, etc. all of which required a working understanding of Natural Language Processing. Most of these projects were done offline in my research lab and aren't available on GitHub because of the privacy settings but can be provided if needed.
  +
  +
I am fluent in English, Hindi and Punjabi.

Revision as of 18:41, 20 March 2020

Contact Information

Name: Priyank Modi
Email: priyankmodi99@gmail.com
Current Designation: Undergraduate Researcher in the LTRC Lab, IIIT Hyderabad (completing 6th semester/3rd year in April '20) and a Teaching Assistant for Linguistics courses
IRC: pmodi
Timezone: GMT +0530 hrs
Linkedin: https://www.linkedin.com/in/priyank-modi-81584b175/
Github: https://github.com/priyankmodiPM

Why I am interested in Apertium

Because Apertium is free/open-source software.
Because its community is strongly committed to under-resourced and minoritised/marginalised languages.
Because there is lot of good work done and being done in it.
Because it is not only machine translation, but also free resources than can be used for other purposes: e.g. dictionaries, morphological analysers, spellcheckers, etc.

Which of the published tasks are you interested in? What do you plan to do?

Adopt an unreleased language pair. I plan on developing the Hindi-Punjabi language pair in both directions i.e. hin-pan and pan-hin. This'll involve improving the monolingual dictionaries for both languages, the hin-pan bilingual dictionary and writing suitable transfer rules to bring this pair to a releasable state.

My Proposal

Why Google and Apertium should sponsor it

  • Both Hindi and Punjabi are widely spoken languages, both by number of speakers and geographic spread. Despite that, Punjabi especially has very limited online resources.
  • Services like Google Translate give unsatisfactory results when it comes to translation of this pair(see Section 2.1) On the contrary, I was able to achieve close to human translation for some sentences using minimal rules and time(see Section 3 : Coding Challenge).
  • I believe the Apertium architecture is suited perfectly for this pair and can replace the current state-of-art translator for this pair.
  • This is an important project(since it adds diversity to Apertium and translation systems in general) which requires at least 2-3 months of dedicated work and can be an important resource.

How and who it will benefit in society

As mentioned above, the Apertium community is strongly committed to under-resourced and minoritised/marginalised languages and Google helps its own way via programs like GSoC and GCI. There exist many local cultural movements in Africa with the goal of developing language and opening to the world but they generally fail to duel on a scientific basis. This project will definitely mark a starting point or proof of concept in Machine Translation in Cameroon and will greatly have a positive impact on language development.

Google Translate : Analysis and comparison

Google and Yandex propose the three languages in question, so one can translate between them using these services. I have analysed the results of the translation into Catalan from Google (my experience tells me that Yandex gives worse results for Catalan). The numerical results can be seen in the previous sections (ita-cat: 14.0% WER, por-cat: 21.6% WER).

The results are far from wonderful, especially for the Portuguese-Catalan pair. Seemingly, Google translates using both Spanish and English as bridge languages, as can be seen, for example, by words that appear in these two languages in the final text (supposedly in Catalan) and that were not in the original Italian or Portuguese text. The use of English as intermediate between Romance languages causes problems known to all users, such as the translation of p2.pl verb forms with elided subject to p2.sg, the incorrect choice of past times in the verbs and the disappearance of some pronouns. Here is an example of the last case of the Italian test text (randomly obtained):

Original text (bold mine):

altri invece ne hanno apprezzato la spontaneità, la tenacia e l'affettuosità

Google translation:

altres han apreciat la seva espontaneïtat, tenacitat i afecte

Post-edited translation:

altres n'han apreciat l'espontaneïtat, tenacitat i afecte

It should be added that, although Google translations tend to be more phraseological than the ones obtained by rules, they are also much more difficult to post-edit. The reason is that, while the translation by rules often makes evident and even expected errors, the neuronal translation significantly changes the text, reordering parts of the sentence, removing or putting words, changing singular to plural or plural to singular (!), and modifying expressions. The evaluation of whether the meaning is the same as the original requires a lot more time. This has been quite clear when I have made the post-edition of both the Apertium and Google translations for the Italian and Portuguese texts.

Skills

I'm currently a third year student at IIIT Hyderabad where I'm studying Computational Linguistics. It is a dual degree course where we study Computer Science, Linguistics, NLP and more. I'm working on tasks relating to Summarization and knowledge graphs.

I've been interested in linguistics from the very beginning and due to the rigorous programming courses, I'm also adept at several programming languages like Python, C++, XML, Bash Scripting, etc. I'm skilled in writing Algorithms. Data Structures, and Machine Learning Algorithms as well.

I enjoy reading research papers and I have analysed several research papers in preparation for this proposal, all of which have been cited. I also have a lot of experience studying data which I feel is essential in solving any problem.

Due to the focused nature of our course, I have worked in several projects, such as building Translation Memory, Detecting Homographic Puns, POS Taggers, Grammar and Spell Checkers, Named Entity Recognisers, Building Chatbots, etc. all of which required a working understanding of Natural Language Processing. Most of these projects were done offline in my research lab and aren't available on GitHub because of the privacy settings but can be provided if needed.

I am fluent in English, Hindi and Punjabi.