
Difference between revisions of "User:Khannatanmai/GSoC2020Proposal Trimming"

 
This would help with the post-editing and comprehensibility of the output. It is important to note that this is only viable once we eliminate trimming. The idea is that we propagate the source lemma/surface form and, based on the source analysis, guess the corresponding morph in the target language using the target monodix.

=== Implementation of Morph Guessing ===
 
A naive way to implement this could be to use the pardefs in the TL monodix for an analysis-to-morph mapping, and, where the pardefs aren't so clear, take the surface forms of all words with that analysis and find a common substring to isolate the morph. This could work decently for prefixing and suffixing languages. We could expand the morphological dictionary and use an algorithm such as OSTIA [1] to learn morphological analyses for word endings.
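The common-substring idea for a suffixing language can be sketched as follows (toy data and a hypothetical helper; a real implementation would work over the compiled TL monodix and its pardefs):

```python
import os

def common_suffix(words):
    """Longest suffix shared by all words, found by taking the
    common prefix of the reversed strings."""
    reversed_words = [w[::-1] for w in words]
    return os.path.commonprefix(reversed_words)[::-1]

# TL surface forms grouped by their analysis (toy data)
tl_forms = {
    "<n><pl>": ["dogs", "cats", "hats"],
    "<n><sg>": ["dog", "cat", "hat"],
}

def guess_morph(analysis):
    """Isolate the morph that realises an analysis, by common suffix."""
    return common_suffix(tl_forms[analysis])

# Guess the plural morph and attach it to an untranslated source lemma
print(guess_morph("<n><pl>"))             # 's'
print("wombat" + guess_morph("<n><pl>"))  # 'wombats'
```

This only isolates one suffix per analysis; handling prefixes, infixes, or stem changes would need the pardef-based mapping mentioned above.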
This idea could of course be a project on its own, so I will treat it as a stretch goal for this GSoC and try to lay its foundations as part of this project.
   
== About Dictionary Trimming ([http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Eliminate_trimming Coding challenge]) ==

There are currently two ways of [[Automatically trimming a monodix]]: using lttoolbox or using hfst.

=== lttoolbox ===

Lttoolbox has a command <code>lt-trim</code>. It takes as input a compiled monodix and the corresponding compiled bidix, and gives a trimmed compiled monodix as output. Here is a small overview of how [[Lt-trim]] works:

For multiwords with <code>+</code>, one part is matched with the bidix, and then for each subsequent part the analysis is searched again from the start of the bidix, in effect searching each part of the word individually.
   
=== hfst ===

hfst is the Helsinki finite-state toolkit, which is used to build finite state transducers for processing morphologies. Since several language pairs, such as apertium-sme-nob, apertium-fin-sme, and apertium-kaz-tat, use hfst instead of lttoolbox for the morphological analysis, the trimming needs to be done in hfst as well. The following is a snippet of code which shows how trimming is done using hfst.

Then the prefixes FST and the inverted monodix FST are intersected (keeping only strings accepted by both of them), and a new FST is composed. This is finally inverted to again have the surface forms as input and lemma+analysis as output. <code>test-en.trimmed.fst</code> is the final trimmed morphological analyser.
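The effect of the prefix intersection can be modelled with toy data (illustrative Python, not hfst: the real operation, e.g. <code>hfst-compose-intersect</code>, works on transducers rather than string sets):

```python
# Toy model of trimming by prefix intersection: an analysis survives
# only if some bidix-derived prefix matches its lemma+first tags
# (conceptually, the '.*' appended to the bidix input side).

# Analyses the monodix can produce
monodix_analyses = {"dog<n><pl>", "wombat<n><pl>"}

# Bidix input side turned into prefixes that accept any trailing tags
bidix_prefixes = {"dog<n>"}

def survives(analysis):
    """Keep an analysis only if a bidix prefix matches it."""
    return any(analysis.startswith(p) for p in bidix_prefixes)

trimmed_analyses = {a for a in monodix_analyses if survives(a)}
print(sorted(trimmed_analyses))  # ['dog<n><pl>']
```

The final inversion step in the hfst recipe then restores the analyser's normal direction: surface forms in, lemma+analysis out.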
   
=== Trimming in apertium-sme-nob ===

Trimming in sme-nob is a bit different from the implementation described above. <code>apertium-sme-nob</code> has compounds whose words are separated by '+'. As mentioned earlier, these are split before a bidix lookup, so the bidix will not have any multiwords (with a +).

'''Now for the trimming: in the fourth statement we do a compose-intersect between the automorf FST (compiled monodix) and the autobil prefixes that we prepared above (modified bidix).''' Then, using <code>hfst-fst2fst</code>, the FST is converted to an optimized-lookup (weighted) implementation, and the final trimmed file is named <code>$(PREFIX1).automorf.hfst</code>. The unmodified automorf is put in the same optimized-lookup format and named <code>$(PREFIX1).automorf-untrimmed.hfst</code>. Note that <code>PREFIX1=$(LANG1)-$(LANG2)</code>.
   
= Work Plan (TODO) =

'''Community Bonding Period''' (May 6 - May 27)

*

== Deliverable #1: ==

'''Evaluation 1: June 24-28'''

*

== Deliverable #2: Trimming eliminated without regression of benefits ==

'''Evaluation 2: July 22-26'''

'''Final Evaluations: August 19-26'''

== Project Completed ==

'''NOTE''': The third phase of the project has extra time to deal with unforeseen issues and ideas.

----
   
= A description of how and who it will benefit in society =

It will definitely benefit most users of Apertium and will hopefully attract more people to the tool. By discarding the morphological analysis of a word too early, we prevent modules like lexical selection and transfer from using it, even though they would really benefit from this information. Secondly, as I've described earlier, there are several projects and ideas that would become possible once trimming is eliminated, because the developer would have the option to use this morphological information anywhere in the pipeline. The morph guessing project will make the output of translation from Apertium much more comprehensible and would help with both gisting translation and post-editing, and hence help all kinds of users of Apertium.

I’m from India, and for a lot of our languages we don’t have the data to create reliable Neural MT systems. Similarly, for all resource-poor languages, Apertium provides an easy and reliable MT system for their needs. That’s how Apertium benefits society already.
   
= Reasons why Google and Apertium should sponsor it =

I've been a regular contributor to Apertium for more than a year now, and this project aims to modify almost every part of the pipeline for the better. The funding that I receive will help me focus my time and resources on this project so that it can be adequately completed in three months.

By funding this project, Google will help improve an important Open Source tool and promote Open Source development. In a world of proprietary software, this is an invaluable resource for society and supports innovation that everyone can benefit from.
   
= Skills and Qualifications =

I'm currently a fourth-year student and an Undergraduate Researcher at IIIT Hyderabad, where I'm studying Computational Linguistics. It is a dual degree in which we study Computer Science, Linguistics, NLP, and more. I'm working on Machine Translation in the LTRC lab at IIIT Hyderabad and I'm part of the MT group in our university.

The details of my skills and work experience can be found here: [https://drive.google.com/file/d/1KGYAyH2yj4ibk5eBW0uUoeU0sjT2LrRG/view?usp=sharing CV]
   
= Non-Summer-Of-Code Plans =

I have no plans apart from GSoC in the summer and can devote 30-40 hours a week to this project.

Revision as of 08:10, 29 March 2020

'''Modifying Apertium Stream Format to include arbitrary information and eliminating monodix trimming'''

= Personal Details =

'''Name:''' Tanmai Khanna

'''E-mail address:''' khanna.tanmai@gmail.com, tanmai.khanna@research.iiit.ac.in

'''IRC:''' khannatanmai

'''GitHub:''' khannatanmai

'''LinkedIn:''' khannatanmai

'''Current Designation:''' Undergraduate Researcher in the LTRC Lab, IIIT Hyderabad (4th year student) and a Teaching Assistant for Linguistics courses

'''Time Zone:''' GMT+5:30

= About Me =

'''Open Source Software I use:''' Apertium (in the past), Ubuntu, Firefox, VLC.

'''Professional Interests:''' I’m an undergraduate researcher in Computational Linguistics and I have a particular interest in Linguistics and NLP tools, specifically Machine Translation and its components.

'''Hobbies:''' I love Parliamentary Debating, Singing, and Reading.

= What I want to get out of GSoC =

I’ve enjoyed using Apertium in various personal and academic projects and it’s amazing to me that I get an opportunity to work with them.

Computational Linguistics is my passion, and I would love to work with similarly passionate people at Apertium, to develop tools that people actually benefit from. This would be an invaluable experience that classes just can't match.

I am applying for GSoC, as the stipend would allow me to dedicate my full attention to the project during the 3 months.

= Why am I interested in Apertium and Machine Translation? =

Apertium is an Open Source Rule-based MT system. I'm a researcher in the IIIT-H LTRC lab, currently working on Machine Translation. It interests me because it’s a complex problem that tries to achieve something most people believe only humans can do. Translating data into other languages, especially low-resource languages, gives the speakers of those languages access to valuable data and can help in several domains, such as education, news, and the judiciary. Machine Translation is often called NLP-Complete by my professors, i.e. it uses most of the tools NLP has to offer, and hence if one learns to create good tools for MT, one learns most of Natural Language Processing.

Each part of Apertium's mission statement, especially its focus on low-resource languages, excites me about working with them. While recent trends lean towards Neural Networks and Deep Learning, these fall short when it comes to resource-poor languages.

A tool which is rule-based and open source really helps communities with resource-poor language pairs, giving them free translations for their needs, and that is why I want to work on improving it.

I worked with Apertium for GSoC 2019 and have continued to update and maintain the Anaphora Resolution module that I developed. I have also contributed to a paper about the recent advances in Apertium. I have enjoyed every bit of the process, and since I plan to be a long-time contributor to Apertium, I'm applying for this project: eliminating dictionary trimming would help the users of this tool, and it would help me develop a deep knowledge of the Apertium pipeline, which will serve all my future projects in Apertium as well.

= Project Proposal =

== Which of the published tasks am I interested in? What do I plan to do? ==

The task I'm interested in is '''Eliminating Dictionary Trimming''' ([[Ideas_for_Google_Summer_of_Code/Eliminate_trimming]]). Dictionary trimming is the process of removing from monolingual language models (FSTs compiled from monodixes) those words and their analyses which don't have an entry in the bidix, to avoid a lot of untranslated lemmas (marked with an @ if debugging) in the output, which lead to issues with comprehension and post-editing. The section #WhyWeTrim explains the rationale behind trimming further.
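To make the effect concrete, trimming can be modelled with toy data (illustrative Python only: real trimming operates on compiled FSTs, and the stream markers shown are simplified): only analyses whose lemma has a bidix entry survive, and everything else surfaces as an unknown word marked with *.

```python
# Toy model of monodix trimming (not Apertium's actual implementation).

# Monodix: surface form -> (lemma, analysis tags)
monodix = {
    "dogs":    ("dog", "<n><pl>"),
    "cats":    ("cat", "<n><pl>"),
    "wombats": ("wombat", "<n><pl>"),
}

# Bidix: source lemmas that have a translation entry
bidix = {"dog", "cat"}

# Trimming removes monodix entries whose lemma is not in the bidix
trimmed = {sf: (lem, tags) for sf, (lem, tags) in monodix.items()
           if lem in bidix}

def analyse(surface):
    """Analyse with the trimmed monodix; missing words come out
    as unknowns, shown here with the final-output '*' mark."""
    if surface in trimmed:
        lem, tags = trimmed[surface]
        return f"^{surface}/{lem}{tags}$"
    return f"*{surface}"

print(analyse("dogs"))     # ^dogs/dog<n><pl>$
print(analyse("wombats"))  # *wombats  (analysis was trimmed away)
```

The point of the project is to avoid the information loss in the last branch while keeping the tidy `*wombats`-style output.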

However, by trimming the dictionary, you throw away valuable analyses of source-language words which, if preserved, could be used as context for lexical selection and analysis of the input. Also, several transfer rules don't match because the word is treated as unknown. There are several other project ideas that would become viable if we didn't trim away the analyses of monodix words that aren't in the bidix, such as a morph-guessing module that outputs a source-language lemma with target-language morphology. Eliminating trimming would also help when morphological data for a language is learnt from an external source: if we continue to trim, that data is unusable until all those words are in the bidix. As a general principle, it is better not to discard useful information, but to pass it along the pipeline and act on it based on the task.

As part of this project, I plan to eliminate monodix trimming and propose a solution such that we don't lose the benefits of trimming. In the next section I describe a proposed solution, as well as how I plan to work around everything in [[Why_we_trim]].

== Proposed Workaround ==

Several solutions are possible for avoiding trimming, some of which have been discussed by Unhammer [http://wiki.apertium.org/wiki/Talk:Why_we_trim here]. These involve keeping the surface form of the source word as well as the lemma+analysis: use the analysis for as long as you need it in the pipe, then propagate the source form as an unknown word (as would be done with trimming). I have carefully evaluated these options and more, and will discuss them here.

Since the primary reason for trimming the dictionary is that without it there are lots of untranslated lemmas in the output, the solution proposed here will avoid that without actually trimming. In fact, with this project I will try to work around everything in [[Why we trim]].

=== Propagating the surface form ===

The solution that sounds the most viable is that instead of throwing away the surface form after morphological analysis, we keep the surface form in the pipeline until the bidix lookup. '''If the word is not in the bidix, we can do one of two things:'''

* We treat it as an unknown word and pass on just the surface form from then on. The final output would then be the source-language surface form with a *, as happens with trimming. This would give us the benefits of trimming, and the source analysis would still be useful for lexical selection. However, transfer rules would still suffer, as there would be no analysis in the lexical unit.
* A more interesting solution would be to propagate the source lemma and analysis and pretend that it's actually a target lexical unit. This would be useful in transfer rules, and instead of just outputting the source surface form, we could output the source lemma + target morph. This would improve the post-editability as well as the comprehensibility of the output. It would involve creating a morph-guessing module, which is discussed later.
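The two behaviours can be sketched with a hypothetical lexical-unit record (toy data structures for discussion, not an existing Apertium API):

```python
from dataclasses import dataclass

@dataclass
class LexicalUnit:
    surface: str  # source surface form, kept in the stream
    lemma: str
    tags: str

bidix = {"dog": "hund"}  # toy source->target lemma mapping

def bidix_lookup(lu: LexicalUnit, fallback: str) -> str:
    """Translate if possible; otherwise apply one of the two fallbacks."""
    if lu.lemma in bidix:
        return f"^{bidix[lu.lemma]}{lu.tags}$"
    if fallback == "unknown":
        # Option 1: behave as trimming did - emit the surface form
        # as an unknown word.
        return f"*{lu.surface}"
    # Option 2: pretend the source lemma+tags are a target lexical
    # unit, so transfer rules can still match and a morph guesser
    # can later inflect the lemma.
    return f"^{lu.lemma}{lu.tags}$"

lu = LexicalUnit("wombats", "wombat", "<n><pl>")
print(bidix_lookup(lu, "unknown"))  # *wombats
print(bidix_lookup(lu, "guess"))    # ^wombat<n><pl>$
```

Either way, the key change is that the surface form is still available at lookup time instead of having been discarded after analysis.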

It is worthwhile to discuss the possibility of propagating the surface form even past the bidix, giving the user the option to output either just the source surface form or the source lemma + target morph, based on their preference and task.

=== Modification of words in transfer based on TL tags ===

Transfer rules quite often use target-language information from the bidix to fill in tags, etc. We could just use the source tags as target tags; this could give a decent result, unless it's a word that would have changed gender/number in translation. This could be better than just treating it as an unknown word, but it would warrant a discussion with the community. If the source language doesn't have grammatical gender and the target language does, we could either choose not to translate its dependents, such as determiners, or translate them with a default gender.

Whichever option we choose will also help us decide whether to propagate the surface form only until the bidix lookup, or all the way to generation.

Compounds and multiwords

Before apertium-separable, multiwords and compounds were split into several units before bidix lookup. This caused problems, as a partially unknown multiword made the final output hard to comprehend; trimming prevented this by ensuring that a multiword or compound stays in a monodix only if it can be fully translated. This was one of the reasons trimming was done. However, with apertium-separable, multiwords aren't split anymore, and trimming becomes detrimental to their translation.

If XY is a multiword in the monodix, the trimming algorithm checks whether both X and Y are in the bidix, and if they're not, it trims the entry from the monodix. However, with separable, the bidix might have XY as one unit even if X or Y isn't there individually, yet the trimming algorithm would still trim the multiword from the monodix. This warrants the elimination of trimming.

Compounds are still split in pretransfer, so if a compound has one part in the bidix and not the other, the final translation would be odd. However, since the parts of a compound would have been translated individually anyway, translating only part of it shouldn't be a problem; in fact, it should help with post-editing.

If the community feels this is fine, then we don't really need to do much about multiwords and compounds when we remove trimming. If not, we can consider storing the multiword surface forms.

Implementation in the Apertium translation pipeline

Idea: Morph Guessing for words missing in the bidix

Once trimming is eliminated, we can maintain its benefits by outputting the source surface form as an unknown word, as explained above. However, we can do even better than that by outputting the source lemma + target morph.

For example, Translating from Basque to English:

"Andonik izarak izeki zuen" ('Andoni hung up the sheets') → "Andoni *izeki-ed the sheets".

This would help with the post-editing and comprehensibility of the output. It is important to note that this is only viable once trimming is eliminated. The idea is to propagate the source lemma/surface form and, based on the source analysis, guess the corresponding morph in the target language using the target monodix.

Implementation of Morph Guessing

A naive implementation could use the pardefs in the TL monodix as an analysis-to-morph mapping; where the pardefs aren't clear enough, we could take the surface forms of all words with that analysis and find a common substring to isolate the morph. This could work reasonably well for prefixing and suffixing languages. We could also expand the morphological dictionary and use an algorithm such as OSTIA [1] to learn morphological analyses for word endings.

Example pardef:

<pardef n="beer__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
</pardef>
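Using a pardef like the one above, the naive analysis-to-suffix mapping could be sketched as follows. This is illustrative only: the table is read off the beer__n pardef by hand, and parsing the actual monodix XML and handling ambiguous pardefs are omitted.

```python
# Naive morph guessing: map a TL analysis to a suffix via a pardef-derived table.
# The table below corresponds to the beer__n pardef shown above.

PARDEF_SUFFIXES = {
    ("n", "sg"): "",
    ("n", "pl"): "s",
}

def guess_surface(lemma, tags):
    """Attach the guessed TL morph to a (source) lemma: source lemma + target morph."""
    suffix = PARDEF_SUFFIXES.get(tuple(tags))
    if suffix is None:
        return "*" + lemma  # no guess available: fall back to unknown-word output
    return lemma + suffix

print(guess_surface("izara", ["n", "pl"]))  # izaras
```

For 'izara' (Basque 'sheet') with an &lt;n&gt;&lt;pl&gt; analysis, this yields the hybrid form 'izaras', which is more post-editable than '*izara'.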

This idea could of course be a project on its own, so I will treat this as a stretch goal for this GSoC, and try to set the foundations for it as part of this project.

About Dictionary Trimming (Coding challenge)

There are currently two ways of automatically trimming a monodix: using lttoolbox or using hfst.

lttoolbox

lttoolbox provides the command lt-trim. It takes as input a compiled monodix and the corresponding compiled bidix, and outputs a trimmed compiled monodix. Here is a brief overview of how lt-trim works:

It loads the analyser and the bidix, and loops through all the analyser states, trying to take the same steps in parallel in the bidix. Transitions that are possible in both are added to the trimmed analyser. So only those analyses that would pass through the bidix when doing lt-proc -b stay in the final trimmed analyser.

The bidix is preprocessed to match the format of the monodix. This involves taking the union of all sections into one big section. Then an effective .* is appended to the bidix entries so that if "foo&lt;vblex&gt;" is in there, it will match "foo&lt;vblex&gt;&lt;pres&gt;" in the monodix. Lastly, all lemqs (the # or &lt;g&gt; group elements) are moved to after the tags in the bidix entries, since the monodix always has the # part after the tags while the bidix has it on the lemma.

Once this is done, the intersection takes place. lt-trim also deals with #-type multiwords, i.e. multiwords with an invariable part, by converting the bidix format of a multiword (take# out&lt;vblex&gt;) to the monodix format (take&lt;vblex&gt;# out). For multiwords with +, one part is matched against the bidix, and then for each subsequent part the analysis is searched again from the start of the bidix, in effect searching each part of the word individually.
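The string-level effect of this prefix intersection can be sketched in a few lines. This is a simplification: lt-trim walks the two transducers state by state rather than enumerating full entries, and the real matching also handles lemqs and multiwords. The entries below are made up.

```python
# String-level sketch of trimming: keep only those analyser entries that some
# bidix entry, extended with an effective ".*", would accept.

analyser = ["foo<vblex><pres>", "foo<vblex><past>", "bar<n><sg>"]
bidix = ["foo<vblex>"]  # bidix entries often omit trailing tags

def trim(analyser_entries, bidix_entries):
    """Return the entries that survive trimming: those with some bidix prefix."""
    return [a for a in analyser_entries
            if any(a.startswith(b) for b in bidix_entries)]

print(trim(analyser, bidix))  # ['foo<vblex><pres>', 'foo<vblex><past>']
```

bar&lt;n&gt;&lt;sg&gt; has no bidix prefix, so it is trimmed away; both analyses of foo survive because "foo&lt;vblex&gt;" matches them as a prefix.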

hfst

hfst is the Helsinki finite-state toolkit, used to build finite state transducers for processing morphologies. Since several language pairs, such as apertium-sme-nob, apertium-fin-sme, and apertium-kaz-tat, use hfst instead of lttoolbox for morphological analysis, trimming needs to be done in hfst as well. The following snippet shows how trimming is done with hfst.

hfst-invert test-en.fst -o test-en.mor.fst
hfst-project -p upper test-en-eu.fst > test-en-eu.en.fst
echo " ?* " | hfst-regexp2fst > any.fst
hfst-concatenate -1 test-en-eu.en.fst -2 any.fst -o test-en-eu.en-prefixes.fst
hfst-compose-intersect -1 test-en-eu.en-prefixes.fst -2 test-en.mor.fst | hfst-invert -o test-en.trimmed.fst

test-en.fst is the compiled monolingual dictionary (input: source surface form, output: source lemma + analysis) and test-en-eu.fst is the compiled bilingual dictionary (input: source lemma + analysis, output: target lemma + analysis). The monodix fst is inverted (input and output labels exchanged), and the bidix fst is projected to create a transducer of just its input strings. Then any.fst is concatenated to the projected fst so that it accepts foo&lt;n&gt;&lt;sg&gt; even if the bidix only has foo&lt;n&gt;. The result is test-en-eu.en-prefixes.fst.

Then the prefixes fst and the inverted monodix fst are intersected (keeping only strings accepted by both), and a new fst is composed. This is finally inverted again so that surface forms are the input and lemma+analysis the output. test-en.trimmed.fst is the final trimmed morphological analyser.

Trimming in apertium-sme-nob

Trimming in sme-nob differs slightly from the implementation described above. apertium-sme-nob has compounds whose parts are separated by '+'. As mentioned earlier, these are split before bidix lookup, so the bidix will not contain any such multiwords.

To deal with this, the bidix is modified so that it matches normal words as discussed earlier, but can then additionally match an optional + followed by the entire bidix pattern again, up to two times. As a regex, this looks something like bidix [^+]* (+ bidix [^+]*){0,2}. In effect, a monodix multiword with at most two +s is matched by matching each of its parts against the bidix individually. Here is the recipe for this process, as found in the Makefile.

# Override prefixes from ap_include, since we need the derivation-pos-changes:
.deps/%.autobil.prefixes: %.autobil.bin .deps/.d
	lt-print $< | sed 's/ /@_SPACE_@/g' > .deps/$*.autobil.att
	hfst-txt2fst -e ε -i  .deps/$*.autobil.att -o .deps/$*.autobil-split.hfst
	hfst-head      -i .deps/$*.autobil-split.hfst -o .deps/$*.autobil-head.hfst
	hfst-tail -n+2 -i .deps/$*.autobil-split.hfst -o .deps/$*.autobil-tail.hfst
	hfst-union -2 .deps/$*.autobil-head.hfst -1 .deps/$*.autobil-tail.hfst -o .deps/$*.autobil.hfst
	hfst-project -p upper .deps/$*.autobil.hfst -o .deps/$*.autobil.upper                                   # bidix
	echo '[ "<n>" -> [ "<n>" | "<ex_n>" ] ] .o. [ "<adj>" -> [ "<adj>" | "<ex_adj>" ] ] .o. [ "<vblex>" -> [ "<vblex>" |"<ex_vblex>" ] ] .o. [ "<iv>" -> [ "<iv>" | "<ex_iv>" ] ] .o. [ "<tv>" -> [ "<tv>" | "<ex_tv>" ] ]' \
		| hfst-regexp2fst -o .deps/$*.derivpos.hfst
	hfst-compose -1 .deps/$*.autobil.upper -2 .deps/$*.derivpos.hfst -o .deps/$*.autobil-derivpos.hfst
	hfst-project -p lower .deps/$*.autobil-derivpos.hfst -o .deps/$*.autobil-derivpos.hfsa                  # bidix with n -> n|ex_n
	echo ' [ ? - %+ ]* ' | hfst-regexp2fst > .deps/$*.any-nonplus.hfst                                                        # [^+]*
	hfst-concatenate -1 .deps/$*.autobil-derivpos.hfsa -2 .deps/$*.any-nonplus.hfst -o .deps/$*.autobil.nonplussed    # bidix [^+]*
	echo ' %+ ' | hfst-regexp2fst > .deps/$*.single-plus.hfst                                                                 # +
	hfst-concatenate -1 .deps/$*.single-plus.hfst -2 .deps/$*.autobil.nonplussed -o .deps/$*.autobil.postplus # + bidix [^+]*
	hfst-repeat -f0 -t2 -i .deps/$*.autobil.postplus -o .deps/$*.autobil.postplus.0,2                      # (+ bidix [^+]*){0,2} -- gives at most two +
	hfst-concatenate -1 .deps/$*.autobil.nonplussed -2 .deps/$*.autobil.postplus.0,2 -o $@                 # bidix [^+]* (+ bidix [^+]*){0,2}
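The pattern the recipe builds, bidix [^+]* (+ bidix [^+]*){0,2}, can be mimicked with an ordinary regex over analysis strings. This is a simplification: the real operation composes transducers, and the bidix side is a full FST rather than an alternation. The entries foo&lt;n&gt; and bar&lt;vblex&gt; are made up.

```python
import re

# Stand-in for the projected bidix FST: an alternation of hypothetical entries.
bidix = "|".join(map(re.escape, ["foo<n>", "bar<vblex>"]))

# bidix [^+]* (+ bidix [^+]*){0,2}
pattern = re.compile(rf"(?:{bidix})[^+]*(?:\+(?:{bidix})[^+]*){{0,2}}$")

print(bool(pattern.match("foo<n><sg>")))                   # True
print(bool(pattern.match("foo<n><cmp>+bar<vblex><inf>")))  # True
print(bool(pattern.match("foo<n>+baz<n><sg>")))            # False: baz not in bidix
```

A compound analysis survives trimming only if every +-separated part starts with some bidix entry, which is exactly what the {0,2} repetition of "+ bidix [^+]*" enforces for up to three parts.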

Once the bidix is modified to deal with compounds, we move on to the trimming. Since Makefile rules don't execute in written order, I have rearranged the statements here into their order of execution.

# -------------------
# Northern Saami analysis:
# -------------------

.deps/$(LANG1).automorf.hfst: $(AP_SRC1)/apertium-und.$(LANG1)-und.LR.att.gz .deps/.d
	$(ZCAT) $< | hfst-txt2fst > $@

.deps/rm-deriv-cmp.hfst: rm-deriv-cmp.twol .deps/.d
	hfst-twolc -i $< -o $@

.deps/$(LANG1).automorf-rmderiv.hfst:           .deps/$(LANG1).automorf.hfst          .deps/rm-deriv-cmp.hfst
	hfst-compose-intersect -1 $< -2 .deps/rm-deriv-cmp.hfst -o $@

.deps/$(PREFIX1).automorf-rmderiv-trimmed.hfst: .deps/$(LANG1).automorf-rmderiv.hfst .deps/$(PREFIX1).autobil.prefixes
	hfst-compose-intersect -1 $< -2 .deps/$(PREFIX1).autobil.prefixes -o $@

.deps/$(PREFIX1).automorf-rmderiv-trimmed-min.hfst: .deps/$(PREFIX1).automorf-rmderiv-trimmed.hfst
	hfst-minimize -i $< -o $@

$(PREFIX1).automorf.hfst: .deps/$(PREFIX1).automorf-rmderiv-trimmed-min.hfst
	hfst-fst2fst -w -i $< -o $@

$(PREFIX1).automorf-untrimmed.hfst: .deps/$(LANG1).automorf.hfst
	hfst-fst2fst -w -i $< -o $@

In the first three rules, the twol file is compiled and then composed with the compiled lexc file to make the overall morphological analyser, as explained in Starting a new language with HFST.

Now for the trimming: in the fourth rule we do a compose-intersect between the automorf fst (compiled monodix) and the autobil prefixes prepared above (modified bidix). Then hfst-fst2fst converts the fst to an optimized-lookup (weighted) implementation, and the final trimmed file is named $(PREFIX1).automorf.hfst. The unmodified automorf is converted to the same optimized-lookup implementation and named $(PREFIX1).automorf-untrimmed.hfst. Note that PREFIX1=$(LANG1)-$(LANG2).

Work Plan (TODO)

Community Bonding Period (May 6 - May 27)

Week 1-4 (May 27 - )

Deliverable #1:

Evaluation 1: June 24-28

Week 5-8 (June 28)

Deliverable #2: Trimming eliminated without regression of benefits

Evaluation 2: July 22-26

Week 9-12 (July 26)


Final Evaluations: August 19-26

Project Completed

NOTE: The third phase of the project has extra time to deal with unforeseen issues and ideas


A description of how and who it will benefit in society

It will benefit most users of Apertium and will hopefully attract more people to the tool. By discarding the morphological analysis of a word too early, we currently prevent modules like lexical selection and transfer, which would really benefit from this information, from using it. Secondly, as described earlier, several projects and ideas become possible once trimming is eliminated, because developers will then have the option to use this morphological information anywhere in the pipeline. The morph guessing project will make Apertium's translation output much more comprehensible, helping with both gisting translation and post-editing, and hence all kinds of Apertium users.

I’m from India, and for a lot of our languages we don’t have the data to create reliable neural MT systems. Similarly, for all resource-poor languages, Apertium provides an easy and reliable MT system for their needs. That’s how Apertium benefits society already.

Reasons why Google and Apertium should sponsor it

I've been a regular contributor to Apertium for more than a year now, and this project aims to modify almost every part of the pipeline for the better. The funding I receive will help me focus my time and resources on this project so that it can be completed adequately in three months.

By funding this project, Google will help improve an important Open Source tool and promote Open Source development. In a world of proprietary software, this is an invaluable resource for society and supports innovation that everyone can benefit from.

Skills and Qualifications

I'm currently a fourth year student and an Undergraduate Researcher at IIIT Hyderabad where I'm studying Computational Linguistics. It is a dual degree where we study Computer Science, Linguistics, NLP and more. I'm working on Machine Translation in the LTRC lab in IIIT Hyderabad and I'm part of the MT group in our university.

I've been interested in linguistics from the very beginning, and thanks to rigorous programming courses I'm also adept at several languages and technologies like Python, C++, XML, and Bash scripting. I'm skilled in writing finite state transducers, algorithms, data structures, and machine learning algorithms as well.

I also have a lot of experience studying data which I feel is essential in solving any problem.

I've worked with Apertium as part of GSoC 2019, building the anaphora resolution module, and hence I'm familiar with the codebase and the community, which will help me dive right into the project and make a significant contribution from the start. I have worked on several other projects, such as a tool that predicts commas and sentence boundaries in ASR output using pitch, building a translation memory, detecting homographic puns, POS taggers, grammar and spell checkers, named entity recognisers, and chatbots, all of which required a working understanding of Natural Language Processing. Most of these projects were done offline in my research lab and aren't available on GitHub because of privacy settings, but they can be provided if needed.

I am fluent in English, Hindi and have basic knowledge of Spanish.

The details of my skills and work experience can be found here: CV

Non-Summer-Of-Code Plans

I have no plans apart from GSoC in the summer and can devote 30-40 hours a week for this project.