User:Varunshenoy

Name: Varun V Shenoy

E-mail address: shenoyvvarun@gmail.com

Other information that may be useful to contact you: phone number (+91)9738217698

URL to the Coding Challenge: https://github.com/shenoyvvarun/apertium-challenge/blob/master/challenge1.py

The tool takes the source and target sentences from the output of the morphological analyser, after the contractions have been expanded. It then finds all pairs, translates all of them in a single shot using the Apertium process, and tries to find these translations in the translated input sentence.
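
As a rough, hypothetical sketch of that pipeline (the actual submission is at the URL above; translate_all and all names here are illustrative assumptions, not the submitted code):

    # Hedged sketch of the challenge pipeline, not the submitted code.
    # `translate_all` stands in for one batched call to the apertium process.

    def locate_translated_pairs(source_words, translated_sentence, translate_all):
        """Translate every source word in one shot, then look each
        translation up in the already-translated sentence."""
        translations = translate_all(source_words)  # single apertium invocation
        located = {}
        for word, translation in zip(source_words, translations):
            index = translated_sentence.find(translation)
            if index != -1:  # keep only translations actually present
                located[word] = (translation, index)
        return located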

Why is it you are interested in machine translation?

I came across the book "Le Ton Beau de Marot" by Douglas Hofstadter about a year ago. It sparked my interest in linguistics and machine translation. I want to gain practical experience in machine translation by doing a project that would help me pursue a career in this field.

Why is it that you are interested in the Apertium project?

The heart of the organisation lies in lesser-resourced and marginalised languages, and my mother tongue (Konkani) is such a language. This is the reason Apertium impressed me. I will definitely create a language pair for my mother tongue irrespective of what happens to this proposal.

Which of the published tasks are you interested in? What do you plan to do?

Command-line translation memory fuzzy-match repair

URL: http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Command-line_translation_memory_fuzzy-match_repair


Proposal:

Title: Command-line translation memory fuzzy-match repair

Reasons why Google and Apertium should sponsor it.

Human translators use machine translation to help them translate, and where MT does not produce a correct or apt translation, they fix it manually. If these fixes are stored, they can be applied automatically the next time. These stored translations (in a translation memory) can then be used to exploit sub-segment level repetitions: if the sentence to be translated is close to one of the sentences in the TMX [1] file, the tool can patch/repair the reference translation and give the translator the correct translation. These features are essential to a good machine translation system, and they are the reasons why Google and Apertium should sponsor it.

A description of how and who it will benefit in society.

Human translators who have previously used Apertium to translate documents and later fixed the incorrect translations will benefit. The human translator can add his translations to the TMX file, and the next time there is a sentence close to one of his fixed sentences in the TMX file, that translation will be used. Apertium is used here to provide resources for the fuzzy-match repair rather than translating the phrases by the analogy translation principle with proper examples as its reference [2]. When a change is found between the reference sentence and the sentence to be translated, the clippings around the change can be translated using Apertium.

Detailed Work Plan.

Translation memory allows the human translator to store his higher-quality translations in a TMX file [1] that can be used as a cache for future translations. But this does not exploit the sub-segment level repetitions (similar sentences) that occur in texts. A tool that repairs the sentence using these sub-segment level repetitions could improve the quality of translations. First, we find the fuzzy-match score between the reference and the new source sentence. Candidates whose fuzzy-match score exceeds a threshold provided by the user are chosen from the TMX file. The best case is the minimal pair, meaning that the sentences vary by just one element. The fuzzy-match score is calculated as (number of words that are the same in place) / (total number of words). This measure has an asymptotic time complexity of O(n^2) and is easy to compute.
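
A minimal sketch of this scoring and candidate selection, assuming whitespace tokenisation; the function names and tie-breaking are illustrative, not a fixed design:

    def fuzzy_match_score(source, reference):
        # Fraction of words that are the same in place, as defined above.
        src, ref = source.split(), reference.split()
        if not src and not ref:
            return 1.0
        same_in_place = sum(1 for a, b in zip(src, ref) if a == b)
        return same_in_place / max(len(src), len(ref))

    def candidates_above_threshold(sentence, tm_pairs, threshold):
        # Keep TM entries (source, target) whose score exceeds the
        # user-provided threshold, best matches first.
        scored = [(fuzzy_match_score(sentence, s), s, t) for s, t in tm_pairs]
        return sorted([c for c in scored if c[0] > threshold], reverse=True)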

Instead of using learning techniques [2] that statistically infer translations from the translation memory, we could use a combination of TM (translation memory) and MT (machine translation): MT provides additional resources for the fuzzy-match repair. For this method to be successful, the translator's own data is assumed to be of higher quality, and the repaired translation is then expected to be better than both MT of the entire segment and the plain fuzzy match. Next, we need to cut the clippings around the changes. A left-to-right scan generates sub-segments, and we keep those segments that cover the changes. We then use the Apertium tool to translate the clippings. Heuristics are used to discard many sub-segments, avoiding many incompatibilities. Each filtered translated clipping is looked for as a whole contiguous block in the reference translated sentence; the indices where they are found are noted, and the ones not found are discarded. The change is then substituted into the clippings, and the modified clippings are translated again using the Apertium tool to form repair operators. The repair operators are substituted at the indices determined earlier to obtain t'. This method of finding an approximate translation of s is described on the wiki [4].
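
A minimal sketch of the left-to-right sub-segment generation, assuming a single changed word position and an arbitrary length cap (both the names and the cap are illustrative):

    def subsegments_covering_change(words, change_index, max_len=5):
        # Enumerate contiguous sub-segments of up to max_len words that
        # include the changed position, scanning left to right.
        segments = []
        first_start = max(0, change_index - max_len + 1)
        for start in range(first_start, change_index + 1):
            last_end = min(len(words), start + max_len)
            for end in range(change_index + 1, last_end + 1):
                segments.append((start, end, words[start:end]))
        return segments

    # Example: for words = ["the", "big", "red", "car"] and a change at
    # index 2 ("red"), this yields clippings such as ["red"], ["big", "red"],
    # ["red", "car"] and ["the", "big", "red", "car"].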


Implementation Details:

I plan to use Python (I also know Java and C++ and am open to using them). First we have to parse the TMX file and read out the information needed to find our best matches; I plan to use the lxml library, which is efficient and fast. Apertium already searches for matches in the TMX file, and the source code of this feature will be used to speed up development. Once candidates have been found, I will use the algorithm discussed above. To perform the translation, I will fork the Apertium process, minimising the number of fork calls by making all the translations in one shot.
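
A sketch of those two pieces, under my own assumptions: lxml pulls (source, target) segment pairs out of the TMX, and a single apertium subprocess translates all segments in one shot. The language pair "en-es" and all function names are illustrative.

    import subprocess
    from lxml import etree

    XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

    def read_tmx_pairs(path, src_lang, tgt_lang):
        # Yield (source, target) segment pairs from the TMX translation units.
        for tu in etree.parse(path).iter("tu"):
            segs = {}
            for tuv in tu.iter("tuv"):
                seg = tuv.find("seg")
                if seg is not None:
                    segs[tuv.get(XML_LANG)] = "".join(seg.itertext())
            if src_lang in segs and tgt_lang in segs:
                yield segs[src_lang], segs[tgt_lang]

    def translate_in_one_shot(segments, pair="en-es"):
        # One fork of the apertium pipeline for all segments, one per line.
        result = subprocess.run(["apertium", pair],
                                input="\n".join(segments),
                                capture_output=True, text=True, check=True)
        return result.stdout.splitlines()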


References

[1] http://en.wikipedia.org/wiki/Translation_Memory_eXchange
[2] http://en.wikipedia.org/wiki/Example-based_machine_translation
[3] Example-based machine translation in the Pangloss system, http://acl.ldc.upenn.edu/C/C96/C96-1030.pdf
[4] http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Command-line_translation_memory_fuzzy-match_repair


Weekly Plan:

Weeks 1 and 2: Time to think, research and familiarise myself with Apertium's feature that searches for matches in the TMX file, and with the Apertium tool.

Week 3: Search in the TMX file to find the best candidate.

Week 4: An algorithm to score and detect changes between two sentences.

Deliverable #1: A program that, given a TMX file and a sentence, finds the changes between the reference sentence and the given sentence.

Week 5: Implement an algorithm that produces all the sub-segments that contain the changes and use these segments to come up with repair operators.

Week 6: Build the best match. Think of heuristics to avoid many incompatibilities (in repair operators).

Week 7: Code heuristics to avoid many incompatibilities in repair operators.

Week 8: Obtain approximate translations by using the repair operators.

Deliverable #2: A program that, given a TMX file and a sentence, gives an approximate translation of the sentence.

Week 9: Benchmark the quality of the tool's translations. We can hold out the part of the TM that has a good fuzzy-match score against the part of the TM that remains available for translation. The held-out sentences will be used to generate translations using fuzzy-match repair, and we will measure the quality of the translations by comparing them with the reference translations already available (a sketch of such a harness follows the plan).

Week 10: Documentation, quality testing and bug fixing.

Week 11: Testing and bug fixing. Start integration with the Apertium general command.

Week 12: Integrate the fuzzy-match repair tool as part of the general Apertium command.

Project completed.
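
As a hedged sketch of the Week 9 benchmark mentioned above (repair_translate stands in for the fuzzy-match-repair tool, and positional word accuracy is only a stand-in for a proper quality metric):

    def word_accuracy(hypothesis, reference):
        # Fraction of positions where hypothesis and reference words agree.
        hyp, ref = hypothesis.split(), reference.split()
        same = sum(1 for a, b in zip(hyp, ref) if a == b)
        return same / max(len(hyp), len(ref)) if (hyp or ref) else 1.0

    def benchmark(tm_pairs, repair_translate, holdout_fraction=0.1):
        # Hold out part of the TM, translate its source side against the
        # rest, and score against the held-out reference translations.
        cut = int(len(tm_pairs) * (1 - holdout_fraction))
        memory, holdout = tm_pairs[:cut], tm_pairs[cut:]
        scores = [word_accuracy(repair_translate(src, memory), ref)
                  for src, ref in holdout]
        return sum(scores) / len(scores) if scores else 0.0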


Include time needed to think, to program, to document and to disseminate.

I would like to spend a lot of time thinking, but at the same time I want to get things started. I estimate that two weeks should be enough for research and thinking about the project.


I am currently a 4th and Final year student pursuing undergraduate studies in Computer Science. My area of interests lie in Linguistics, Machine translation and Automata theory.I have taken a special topic course on Linguistics. I have made a Language and a compiler which solves problems of deterministic finite automata. I have taken up several projects like Conversion of NFA to DFA, LL1 parser generator. Also I have completed the coding challenge which was assigned to this project. I am absolutely free for 12 weeks. I will be working from my house completely dedicated towards the project. I am not interning anywhere, neither do I have any classes to attend. I would like to spend at-least 50 hours per week for the project, to come up with a high quality, well-tested tool. I would spend at-least 8-9 hours for the project per day, and would I would like to email my mentor/ post on the wiki my progress report everyday.