Difference between revisions of "User:Khannatanmai/GSoC2020Proposal Trimming"

From Apertium
Jump to navigation Jump to search
Line 149: Line 149:
In the first three statements, the twol file is compiled and then composed with the lexc compiled file to make the overall morph analyser, as explained in [[Starting_a_new_language_with_HFST]].
In the first three statements, the twol file is compiled and then composed with the lexc compiled file to make the overall morph analyser, as explained in [[Starting a new language with HFST]].
'''Now for the trimming, In the fourth statement we do a compose-intersect between the automorf fst (compiled monodix) and the autobil prefixes that we prepared above (modified bidix).''' Then, using <code>hfst-fst2fst</code> it converts the fst to an optimized-lookup (weighted) implementation and the final trimmed file is named <code>$(PREFIX1).automorf.hfst</code>. The unmodified automorf is put in the same optimised-lookup implementation and named <code>$(PREFIX1).automorf-untrimmed.hfst</code>. Note that <code>PREFIX1=$(LANG1)-$(LANG2)</code>.
'''Now for the trimming, In the fourth statement we do a compose-intersect between the automorf fst (compiled monodix) and the autobil prefixes that we prepared above (modified bidix).''' Then, using <code>hfst-fst2fst</code> it converts the fst to an optimized-lookup (weighted) implementation and the final trimmed file is named <code>$(PREFIX1).automorf.hfst</code>. The unmodified automorf is put in the same optimised-lookup implementation and named <code>$(PREFIX1).automorf-untrimmed.hfst</code>. Note that <code>PREFIX1=$(LANG1)-$(LANG2)</code>.

Revision as of 19:21, 27 March 2020

Personal Details

Name: Tanmai Khanna

E-mail address: khanna.tanmai@gmail.com , tanmai.khanna@research.iiit.ac.in

IRC: khannatanmai

GitHub: khannatanmai

LinkedIn: khannatanmai

Current Designation: Undergraduate Researcher in the LTRC Lab, IIIT Hyderabad (4th year student) and a Teaching Assistant for Linguistics courses

Time Zone: GMT+5:30

About Me

Open Source Softwares I use: I have used Apertium in the past, Ubuntu, Firefox, VLC.

Professional Interests: I’m an undergraduate researcher in Computational Linguistics and I have a particular interest in Linguistics and NLP tools, specifically Machine Translation and its components.

Hobbies: I love Parliamentary Debating, Singing, and Reading.

What I want to get out of GSoC

I’ve enjoyed using Apertium in various personal and academic projects and it’s amazing to me that I get an opportunity to work with them.

Computational Linguistics is my passion, and I would love to work with similarly passionate people at Apertium, to develop tools that people actually benefit from. This would be an invaluable experience that classes just can't match.

I am applying for GSoC, as the stipend would allow me to dedicate my full attention to the project during the 3 months.

Why is it that I am interested in Apertium and Machine Translation?

Apertium is an Open Source Rule-based MT system. I'm a researcher in the IIIT-H LTRC lab, currently working on Machine Translation and it interests me because it’s a complex problem which tries to achieve something most people believe is only achievable by humans. Translating data to other languages, and especially low- resource languages gives the speakers of those languages access to valuable data and can help in several domains, such as education, news, judiciary, etc. Machine Translation is often called NLP-Complete by my professors, i.e. it uses most of the tools NLP has to offer and hence if one learns to create good tools for MT, they learn most of Natural Language Processing.

Each part of Apertium's mission statement, especially the fact that they focus on Low Resource Languages, excites me to be working with them. While recent trends lean towards Neural Networks and Deep Learning, they fall short when it comes to resource-poor languages.

A tool which is rule-based and open source really helps the community with language pairs that are resource- poor and gives them free translations for their needs and that is why I want to work on improving on it.

I've worked with Apertium for GSoC 2019 and have continued to update and maintain the Anaphora Resolution module that I developed. I have also contributed to a paper written about the recent advances in Apertium. I have enjoyed every bit of the process and since I plan to be a long time contributor with Apertium, I'm applying for this project, that eliminates dictionary trimming and would help the users of this tool and help me develop a deep knowledge about the Apertium pipeline, which will help for all the future projects I do in Apertium as well.

Project Proposal

Which of the published tasks am I interested in? What do I plan to do?

The task I'm interested in is Eliminating Dictionary Trimming(Ideas_for_Google_Summer_of_Code/Eliminate_trimming). Dictionary trimming is the process of removing those words and their analyses from monolingual language models (FSTs compiled from monodixes) which don't have an entry in the bidix, to avoid a lot of untranslated lemmas (with an @ if debugging) in the output, which lead to issues with comprehension and post-editing the output. The section #WhyWeTrim explains the rationale behind trimming further.

However, by trimming the dictionary, you throw away valuable analyses of words in the source language, which, if preserved, can be used as context for lexical selection and analysis of the input. Also, several transfer rules don't match as the word is shown as unknown. There are several other project ideas that would become viable if we don't trim away the analyses of monodix words that aren't in the bidix, such as a morph-guessing module that can output a source language lemma with target language morphology. Eliminating trimming would also help as if we learn morphological data for a language from an external source, if we continue to trim, it is unusable until all those words are in the bidix. As a general principle, it is better to try and not discard useful information, to pass it in the pipeline and then work on it based on the task.

As part of this project, I plan to eliminate monodix trimming and propose a solution such that we don't lose the benefits of trimming. In the next section I will be describing a proposed solution as well as how I plan to work around everything in Why_we_trim.

Proposed Improvements

Several solutions are possible for avoiding trimming, some of which have been discussed by Unhammer here. [ADD LINK HERE] These involve keeping the surface form of the source word, and the lemma+analysis as well - use the analysis till you need it in the pipe and then propagate the source form as an unknown word (like it would be done in trimming).

Propagating the surface form

Modification of words in transfer based on TL tags

Compounds and multiwords

Idea Description

Working Example

Morph Guessing for unknown words

About Dictionary Trimming (Coding Challenge)

There are currently two ways of Automatically trimming a monodix - using lttoolbox or using hfst.


Lttoolbox has a command lt-trim. It takes as input a compiled monodix and the corresponding compiled bidix and gives a trimmed compiled monodix as output. Here is a small overview of how Lt-trim works:

It loads the analyser and bidix, and loops through all the analyser states, trying to do the same steps in parallel with the bidix. Transitions which are possible to do in both, are added to the trimmed analyser. So, only those analyses that would pass through bidix when doing lt-proc -b will stay in the final trimmed analyser.

The bidix is preprocessed to match the format of the monodix. This involves doing a union of all sections into one big section. Then an effective .* appended in the bidix entries so that if "foo<vblex>" is in there, it will match with "foo<vblex><pres>” in the monodix. Lastly, it also moves all lemq's (the # or <g> group elements) to after the tags in the bidix entries as the monodix always has the # part after tags, while bidix has them on the lemma.

Once this is done, the intersect takes place. It also deals with #-type multiwords, i.e. multiwords that have an invariable part, by changing the format of a multiword in the bidix take# out<vblex> to the format of a multiword in the monodix take<vblex># out. For multiwords with +, one part is matched with the bidix, and then for each consequence part, the analysis is searched again from the start of the bidix, in effect searching each part in the word individually.


hfst is the Helsinki finite-state toolkit, which is used to build finite state transducers for processing morphologies. Since several language pairs, such as apertium-sme-nob, apertium-fin-sme, apertium-kaz-tat, use hfst instead of lttoolbox for the morph analysis, the trimming needs to be done in hfst as well. The following is a snippet of code which shows how trimming is done using hfst.

hfst-invert test-en.fst -o test-en.mor.fst
hfst-project -p upper test-en-eu.fst > test-en-eu.en.fst
echo " ?* " | hfst-regexp2fst > any.fst
hfst-concatenate -1 test-en-eu.en.fst -2 any.fst -o test-en-eu.en-prefixes.fst
hfst-compose-intersect -1 test-en-eu.en-prefixes.fst -2 test-en.mor.fst | hfst-invert -o test-en.trimmed.fst

test-en.fst is the compiled monolingual dictionary (input: source surface form, output: source lemma + analysis) and test-en-eu.fst is the compiled bilingual dictionary (input: source lemma + analysis, output: target lemma + analysis). The monodix fst is inverted (input and output labels exchanged), and the bidix fst is projected to create a transducer of just the input strings. Then, the any.fst is concatenated to this projected fst so that it accepts foo<n><sg> even if the bidix only has foo<n>. This is called the test-en-eu.en-prefixes.fst.

Then the prefixes fst and the inverted monodix fst is intersected(only strings which will be accepted by both of them), and a new fst is composed. This is finally inverted to again have the surface forms as input and lemma+analysis as output. test-en.trimmed.fst is the final trimmed morph analyser.

Trimming in apertium-sme-nob

Trimming in sme-nob is a tad bit different from the implementation described above. apertium-sme-nob has compounds which have words separated by '+'. As mentioned earlier, these are split before a bidix lookup, so the bidix will not have any multiwords (with a +).

To deal with this, we modify the bidix such that it will match with normal words as discussed earlier, but after that it adds an optional + and then the entire bidix match again X2. In terms of regex, this looks something like bidix [^+]* (+ bidix [^+]*){0,2}. In effect, this means that it will later match a monodix multiword with at most two +s with the words in the bidix by matching part of the multiword with each word in the bidix. Here is the recipe for this process, as found in the Makefile.

# Override prefixes from ap_include, since we need the derivation-pos-changes:
.deps/%.autobil.prefixes: %.autobil.bin .deps/.d
	lt-print $< | sed 's/ /@_SPACE_@/g' > .deps/$*.autobil.att
	hfst-txt2fst -e ε -i  .deps/$*.autobil.att -o .deps/$*.autobil-split.hfst
	hfst-head      -i .deps/$*.autobil-split.hfst -o .deps/$*.autobil-head.hfst
	hfst-tail -n+2 -i .deps/$*.autobil-split.hfst -o .deps/$*.autobil-tail.hfst
	hfst-union -2 .deps/$*.autobil-head.hfst -1 .deps/$*.autobil-tail.hfst -o .deps/$*.autobil.hfst
	hfst-project -p upper .deps/$*.autobil.hfst -o .deps/$*.autobil.upper                                   # bidix
	echo '[ "<n>" -> [ "<n>" | "<ex_n>" ] ] .o. [ "<adj>" -> [ "<adj>" | "<ex_adj>" ] ] .o. [ "<vblex>" -> [ "<vblex>" |"<ex_vblex>" ] ] .o. [ "<iv>" -> [ "<iv>" | "<ex_iv>" ] ] .o. [ "<tv>" -> [ "<tv>" | "<ex_tv>" ] ]' \
		| hfst-regexp2fst -o .deps/$*.derivpos.hfst
	hfst-compose -1 .deps/$*.autobil.upper -2 .deps/$*.derivpos.hfst -o .deps/$*.autobil-derivpos.hfst
	hfst-project -p lower .deps/$*.autobil-derivpos.hfst -o .deps/$*.autobil-derivpos.hfsa                  # bidix with n -> n|ex_n
	echo ' [ ? - %+ ]* ' | hfst-regexp2fst > .deps/$*.any-nonplus.hfst                                                        # [^+]*
	hfst-concatenate -1 .deps/$*.autobil-derivpos.hfsa -2 .deps/$*.any-nonplus.hfst -o .deps/$*.autobil.nonplussed    # bidix [^+]*
	echo ' %+ ' | hfst-regexp2fst > .deps/$*.single-plus.hfst                                                                 # +
	hfst-concatenate -1 .deps/$*.single-plus.hfst -2 .deps/$*.autobil.nonplussed -o .deps/$*.autobil.postplus # + bidix [^+]*
	hfst-repeat -f0 -t2 -i .deps/$*.autobil.postplus -o .deps/$*.autobil.postplus.0,2                      # (+ bidix [^+]*){0,2} -- gives at most two +
	hfst-concatenate -1 .deps/$*.autobil.nonplussed -2 .deps/$*.autobil.postplus.0,2 -o $@                 # bidix [^+]* (+ bidix [^+]*){0,2}

Once the bidix is modified to deal with compounds, we move on to the trimming. Since makefiles don't execute in order, here I have rearranged the order of the statements in this Makefile so that they're in the order of execution.

# -------------------
# Northern Saami analysis:
# -------------------

.deps/$(LANG1).automorf.hfst: $(AP_SRC1)/apertium-und.$(LANG1)-und.LR.att.gz .deps/.d
	$(ZCAT) $< | hfst-txt2fst > $@

.deps/rm-deriv-cmp.hfst: rm-deriv-cmp.twol .deps/.d
	hfst-twolc -i $< -o $@

.deps/$(LANG1).automorf-rmderiv.hfst:           .deps/$(LANG1).automorf.hfst          .deps/rm-deriv-cmp.hfst
	hfst-compose-intersect -1 $< -2 .deps/rm-deriv-cmp.hfst -o $@

.deps/$(PREFIX1).automorf-rmderiv-trimmed.hfst: .deps/$(LANG1).automorf-rmderiv.hfst .deps/$(PREFIX1).autobil.prefixes
	hfst-compose-intersect -1 $< -2 .deps/$(PREFIX1).autobil.prefixes -o $@

.deps/$(PREFIX1).automorf-rmderiv-trimmed-min.hfst: .deps/$(PREFIX1).automorf-rmderiv-trimmed.hfst
	hfst-minimize -i $< -o $@

$(PREFIX1).automorf.hfst: .deps/$(PREFIX1).automorf-rmderiv-trimmed-min.hfst
	hfst-fst2fst -w -i $< -o $@

$(PREFIX1).automorf-untrimmed.hfst: .deps/$(LANG1).automorf.hfst
	hfst-fst2fst -w -i $< -o $@

In the first three statements, the twol file is compiled and then composed with the lexc compiled file to make the overall morph analyser, as explained in Starting a new language with HFST.

Now for the trimming, In the fourth statement we do a compose-intersect between the automorf fst (compiled monodix) and the autobil prefixes that we prepared above (modified bidix). Then, using hfst-fst2fst it converts the fst to an optimized-lookup (weighted) implementation and the final trimmed file is named $(PREFIX1).automorf.hfst. The unmodified automorf is put in the same optimised-lookup implementation and named $(PREFIX1).automorf-untrimmed.hfst. Note that PREFIX1=$(LANG1)-$(LANG2).

Work Plan

Community Bonding Period (May 6 - May 27)

Week 1-4 (May 27 - )

Deliverable #1:

Evaluation 1: June 24-28

Week 5-8 (June 28)

Deliverable #2: Trimming eliminated without regression of benefits

Evaluation 2: July 22-26

Week 9-12 (July 26)

Final Evaluations: August 19-26

Project Completed

NOTE: The third phase of the project has extra time to deal with unforeseen issues and ideas

A description of how and who it will benefit in society

It will definitely benefit most users of Apertium and hopefully will attract more people to the tool. By discarding the morphological analysis of the word too early we prevent modules like lexical selection and transfer from using it, which would really benefit from this information. Secondly, as I've described earlier, there are several projects and ideas that would be possible once trimming is eliminated, because now the developer will have the option to use this morphological information anywhere in the pipeline. The morph guessing project will make the output of translation from Apertium much for comprehensive and would help with both gisting translation and post-editing, and hence help all kinds of users of Apertium.

I’m from India and for a lot of our languages, we don’t have the data to create reliable Neural MT systems. Similarly, for all resource poor languages, Apertium provides an easy and reliable MT system for their needs. That’s how Apertium benefits society already.

Reasons why Google and Apertium should sponsor it

I've been a regular contributor with Apertium for more than a year now, and this project is one which aims to modify almost every part of the pipeline for the better. The funding that I receive will help me to focus my time and resources on this project so that it can be adequately completed in three months.

By funding this project, Google will help improve an important Open Source tool and promote Open Source Development. In a world of Proprietary softwares, this is an invaluable resource for society and supports innovation that everyone can benefit from.

Skills and Qualifications

I'm currently a fourth year student and an Undergraduate Researcher at IIIT Hyderabad where I'm studying Computational Linguistics. It is a dual degree where we study Computer Science, Linguistics, NLP and more. I'm working on Machine Translation in the LTRC lab in IIIT Hyderabad and I'm part of the MT group in our university.

I've been interested in linguistics from the very beginning and due to the rigorous programming courses, I'm also adept at several programming languages like Python, C++, XML, Bash Scripting, etc. I'm skilled in writing Finite State Transducers, Algorithms. Data Structures, and Machine Learning Algorithms as well.

I also have a lot of experience studying data which I feel is essential in solving any problem.

I've worked with Apertium as part of GSoC 2019 and built the Anaphora resolution module, and hence I'm familiar with the codebase and the community which will help me to dive right in the project and make a significant contribution right from the start. I have worked in several other projects, such as a tool that predicts commas and sentence boundaries in ASR output using pitch, building a Translation Memory, Detecting Homographic Puns, POS Taggers, Grammar and Spell Checkers, Named Entity Recognisers, Building Chatbots, etc. all of which required a working understanding of Natural Language Processing. Most of these projects were done offline in my research lab and aren't available on GitHub because of the privacy settings but can be provided if needed.

I am fluent in English, Hindi and have basic knowledge of Spanish.

The details of my skills and work experience can be found here: CV

Non-Summer-Of-Code Plans

I have no plans apart from GSoC in the summer and can devote 30-40 hours a week for this project.