User:Iamas/GSoC13 Application: "Improved Bilingual Dictionary Induction"

Latest revision as of 17:50, 3 May 2013

Name

Arnav Sharma

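Contact Information

  • E-mail: arnavsharma93@gmail.com
  • GitHub: arnavsharma93
  • IRC: iamas, arnavsharma93
  • SourceForge: iamas
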
Why am I interested in Machine Translation?

Machine Translation is an important technology for localization, and it is particularly relevant in a linguistically diverse country like India, where it can help reduce the language barrier. That motivated me to study Computational Linguistics at IIIT-H, where I currently work in the Machine Translation Department.

Why am I interested in the Apertium Project?

I have been fascinated by free and open-source software (FOSS) ever since I first heard about it. As mentioned above, Machine Translation and computational linguistics interest me a lot, and Apertium combines these interests. Plus, I really like Begiak.

Which of the published tasks am I interested in? What do I plan to do?

I am interested in the project Improved Bilingual Dictionary Induction. The aim is to write a set of scripts that generate valid bidix entries from a word-aligned parallel corpus and to evaluate the reliability of the extracted translations. No existing method produces dictionaries fully automatically, so creating a completely clean lexicographical resource with appropriate coverage requires a manual post-editing phase. Accordingly, my goal is to give lexicographers resources that reduce, as far as possible, the labor required to prepare full-fledged dictionaries for use in Apertium.

Advantages of using parallel corpora in dictionary creation

  • High-quality dictionaries are based on corpora. This linguistic data decreases the role of human intuition during the lexicographic process.
  • The corpus-driven nature of this method means that human intuition is also taken out of the search for translation candidates, that is, out of establishing possible pairings of source-language and target-language expressions.
  • The method we will be using ranks the translation candidates by how likely they are, based on automatically determined translation probabilities. This in turn makes it possible to determine which sense of a given lemma is used most frequently. Thus, representative corpora guarantee that the dictionary includes not only the most important source lemmata (as in traditional corpus-based lexicography) but also the translations of their most relevant senses.
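
The ranking idea in the last point can be illustrated with a minimal sketch. It assumes the word alignments are already available as (source lemma, target lemma) pairs and estimates the translation probability as a plain relative frequency; in the actual project the probabilities would more likely come from the alignment model itself. The function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def rank_candidates(aligned_pairs):
    """Rank target candidates for each source lemma.

    aligned_pairs is a list of (source, target) tuples, one per alignment link.
    The score is the relative frequency count(source, target) / count(source),
    a simple stand-in for a translation probability.
    """
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(src for src, _ in aligned_pairs)
    ranked = defaultdict(list)
    for (src, tgt), n in pair_counts.items():
        ranked[src].append((tgt, n / source_counts[src]))
    for src in ranked:
        # Highest score first; ties are broken alphabetically for stable output.
        ranked[src].sort(key=lambda item: (-item[1], item[0]))
    return dict(ranked)


# Hypothetical alignment links for illustration only.
pairs = [("bank", "banco"), ("bank", "banco"), ("bank", "orilla"), ("house", "casa")]
ranking = rank_candidates(pairs)
```

Here the first candidate listed for "bank" is its most frequently aligned translation, which is the sense a lexicographer would be shown first.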

Proposal Title

Improved bilingual dictionary induction

Why should Apertium and Google sponsor it?

The bilingual dictionary is one of the five main dictionaries used in Apertium. This project involves generating valid and consistent Apertium bilingual dictionary entries from a word-aligned parallel corpus. Tools for this already exist, but most of the entries they generate have to be checked, which can greatly increase the time it takes to build a new translation system. Reducing that checking will benefit lexicographers and other contributors by cutting the effort and time needed to build a new translation system.

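Work Plan

Coding Challenge

The coding challenge involved:

  • Installing Apertium
  • Installing GIZA++
  • Generating a word alignment model for a parallel corpus of my choice
  • Rewriting the script generate-bidix-templates.py (https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-forms-server/scripts/generate-bidix-templates.py) to use python3/ElementTree

I have finished the coding challenge.

  • The code is on GitHub: https://github.com/arnavsharma93/CodingChallengeApertium
  • Please refer to the README (https://github.com/arnavsharma93/CodingChallengeApertium/blob/master/README.md) for further details.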
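
As a rough illustration of the python3/ElementTree style the challenge asks for, the sketch below builds a single bidix entry. It is not taken from generate-bidix-templates.py or from the challenge repository; the helper name and the sample lemmas are hypothetical, and it assumes the simplest case where both sides carry one part-of-speech symbol.

```python
import xml.etree.ElementTree as ET


def make_bidix_entry(src_lemma, tgt_lemma, pos_tag):
    """Build an <e><p><l>..</l><r>..</r></p></e> element for one lemma pair."""
    entry = ET.Element("e")
    pair = ET.SubElement(entry, "p")
    for side, lemma in (("l", src_lemma), ("r", tgt_lemma)):
        node = ET.SubElement(pair, side)
        node.text = lemma
        # The same symbol is attached to both sides in this simplified sketch.
        ET.SubElement(node, "s", {"n": pos_tag})
    return entry


# Hypothetical lemma pair for illustration only.
entry_xml = ET.tostring(make_bidix_entry("house", "casa", "n"), encoding="unicode")
```

Building the entry as an element tree rather than by string concatenation keeps the output well-formed even when a lemma contains characters that need escaping.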

Interim period and community bonding period

  • Get to know the community better.
  • Familiarize myself with the Apertium platform and project.
  • Prepare and gather the information I will need during the coding period.
  • Contribute by fixing bugs, rewriting scripts and working on the language pairs Hindi-Punjabi and Hindi-Urdu.

Week Plan

WEEK | DATE | PLANS
01 | 06.17-06.23 | Choose at least three language pairs with varying degrees of relatedness and produce word-aligned data after running a morphological analyzer on a parallel corpus for each pair. Also decide which factors, such as probability or frequency, will act as filters for the entries used to create a first-level dictionary.
02 | 06.24-06.30 | Write a script to generate the list of word mappings. Use the filters decided above to get the mappings in both directions.
03 & 04 | 07.01-07.14 | Manually evaluate the created dictionary by randomly sampling a few entries and checking them against a web dictionary. If the accuracy is satisfactory, continue with those factors for the language pair; if not, choose other factors and re-evaluate.
Deliverable #1 | - | First-level bilingual dictionary based on statistical parameters.
05 & 06 | 07.15-07.28 | Write the script that generates templates for the user to select from, using the most frequent combinations of source-language and target-language paradigms.
07 & 08 | 07.29-08.11 | Write a script that makes bidix entries incrementally by checking whether the source-language paradigm has a template with the target-language paradigm.
Deliverable #2 | - | Script that makes bidix entries.
09 | 08.12-08.18 | Create a mini-testvoc for the added words and check that the entries pass it.
10 & 11 | 08.19-09.01 | Create a mini-testvoc for the added words and check that the entries pass it. Compare the bidix against some online web dictionaries.
12 | 09.02-09.08 | Combine the scripts to automate as much of the process as possible. Improve the code, fix bugs, add plenty of comments and write wiki documentation on how to use all of the scripts.
Deliverable #3 | - | Final project
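
The week 02 step of filtering mappings in both directions could look roughly like the sketch below. The proposal leaves the actual filters to be decided in week 01, so the single probability threshold, the function name and the sample values here are all assumptions made for illustration.

```python
def filter_mappings(forward, backward, min_prob=0.5):
    """Keep only pairs that pass the threshold in both directions.

    forward maps (source, target) to p(target | source);
    backward maps (target, source) to p(source | target).
    min_prob is an assumed cut-off, not a value from the proposal.
    """
    kept = []
    for (src, tgt), p_fwd in forward.items():
        # A pair missing from the reverse table is treated as probability 0.
        p_bwd = backward.get((tgt, src), 0.0)
        if p_fwd >= min_prob and p_bwd >= min_prob:
            kept.append((src, tgt))
    return sorted(kept)


# Hypothetical probabilities for illustration only.
forward = {("house", "casa"): 0.9, ("bank", "banco"): 0.7, ("bank", "orilla"): 0.3}
backward = {("casa", "house"): 0.8, ("banco", "bank"): 0.6, ("orilla", "bank"): 0.9}
kept = filter_mappings(forward, backward)
```

Requiring agreement in both directions drops pairs such as ("bank", "orilla"), which looks strong from the target side but weak from the source side; whether that trade-off suits a given language pair is what the manual evaluation in weeks 03 and 04 would decide.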

Biography

I am currently pursuing a Bachelor of Technology in Computer Science and an MS by Research in Computational Linguistics at IIIT-H, and I have just finished my second year. I have been studying various fields of Computational Linguistics for the past two years and I cannot wait to study more. I am proficient in Python, C/C++, Bash, SQL and HTML5. I have developed an Urdu-Hindi transliterator using NLP tools; it achieved an accuracy of 75%.

Non-Summer-of-Code plans for the summer

I might have to go on a three-day social entrepreneurship trip in July. I also plan to improve my programming skills by taking part in algorithmic coding competitions. Otherwise, I have nothing else planned for the summer, and this project will be my main priority.