Difference between revisions of "User:Khannatanmai/GSoC2020Proposal Trimming"

Revision as of 21:22, 29 March 2020

Modifying Apertium Stream Format to include arbitrary information and eliminating monodix trimming

Personal Details

Name: Tanmai Khanna

E-mail address: khanna.tanmai@gmail.com, tanmai.khanna@research.iiit.ac.in

IRC: khannatanmai

GitHub: khannatanmai

LinkedIn: khannatanmai

Current Designation: Undergraduate Researcher in the LTRC Lab, IIIT Hyderabad (4th year student) and a Teaching Assistant for Linguistics courses

Time Zone: GMT+5:30

About Me

Open Source Software I use: Apertium (which I have used in past projects), Ubuntu, Firefox, and VLC.

Professional Interests: I’m an undergraduate researcher in Computational Linguistics and I have a particular interest in Linguistics and NLP tools, specifically Machine Translation and its components.

Hobbies: I love Parliamentary Debating, Singing, and Reading.

What I want to get out of GSoC

I’ve enjoyed using Apertium in various personal and academic projects and it’s amazing to me that I get an opportunity to work with them.

Computational Linguistics is my passion, and I would love to work with similarly passionate people at Apertium, to develop tools that people actually benefit from. This would be an invaluable experience that classes just can't match.

I am applying for GSoC, as the stipend would allow me to dedicate my full attention to the project during the 3 months.

Why am I interested in Apertium and Machine Translation?

Apertium is an Open Source Rule-based MT system. I'm a researcher in the IIIT-H LTRC lab, currently working on Machine Translation, and it interests me because it's a complex problem that tries to achieve something most people believe only humans can do. Translating data into other languages, especially low-resource languages, gives the speakers of those languages access to valuable data and can help in several domains, such as education, news, and the judiciary. Machine Translation is often called NLP-Complete by my professors, i.e. it uses most of the tools NLP has to offer, and hence if one learns to create good tools for MT, one learns most of Natural Language Processing.

Each part of Apertium's mission statement, especially its focus on low-resource languages, excites me about working with the community. While recent trends lean towards Neural Networks and Deep Learning, these approaches fall short when it comes to resource-poor languages.

A rule-based, open-source tool really helps communities with resource-poor language pairs by giving them free translations for their needs, and that is why I want to work on improving it.

I worked with Apertium for GSoC 2019 and have continued to update and maintain the Anaphora Resolution module that I developed. I have also contributed to a paper about recent advances in Apertium. I have enjoyed every bit of the process, and since I plan to be a long-term contributor to Apertium, I'm applying for this project, which eliminates dictionary trimming. It would help the users of this tool, and it would help me develop a deep knowledge of the Apertium pipeline, which will serve all my future projects in Apertium as well.

Project Proposal

Which of the published tasks am I interested in? What do I plan to do?

The task I'm interested in is Eliminating Dictionary Trimming (Ideas_for_Google_Summer_of_Code/Eliminate_trimming). Dictionary trimming is the process of removing words and their analyses from monolingual language models (FSTs compiled from monodixes) when they don't have an entry in the bidix, to avoid lots of untranslated lemmas in the output (marked with an @ when debugging), which lead to issues with comprehension and post-editing. The section #WhyWeTrim explains the rationale behind trimming further.

However, by trimming the dictionary, we throw away valuable analyses of source-language words which, if preserved, could be used as context for lexical selection and analysis of the input. Also, several transfer rules fail to match because the word is treated as unknown. Several other project ideas become viable if we don't trim away the analyses of monodix words that aren't in the bidix, such as a morph-guessing module that can output a source-language lemma with target-language morphology. Eliminating trimming would also help when we learn morphological data for a language from an external source: if we continue to trim, that data is unusable until all those words are in the bidix. As a general principle, it is better not to discard useful information, but to pass it along the pipeline and then act on it based on the task.

As part of this project, I plan to eliminate monodix trimming and propose a solution such that we don't lose the benefits of trimming. In the next section I will be describing a proposed solution as well as how I plan to work around everything in Why_we_trim.

Proposed Workaround

Several solutions are possible for avoiding trimming, some of which have been discussed by Unhammer here. These involve keeping both the surface form of the source word and the lemma+analysis: use the analysis for as long as it is needed in the pipe, and then propagate the surface form as an unknown word (as would happen with trimming). I have carefully evaluated these options and more, and will discuss them here.

Since the primary reason for trimming the dictionary is that without it there are lots of untranslated lemmas in the output, the solution proposed here will avoid that without actually trimming. In fact, with this project I will try to work around everything in Why we trim.

Propagating the surface form

The solution that sounds the most viable is that instead of throwing away the surface form after the morph analysis, we should keep the surface form in the pipeline till the bidix lookup. If the word is not in the bidix, we can do one of two things:

  • We treat it as an unknown and pass on just the surface form from then on. The final output would then be the source language surface form with a *, as it happens with trimming. This would give us the benefits of trimming, and the source analysis would be useful for lexical selection. However, transfer rules would still suffer as there would be no analysis in the lexical unit.
  • A more interesting solution would be to propagate the surface lemma and the surface analysis and pretend that it's actually a target lexical unit. This would be useful in transfer rules and instead of just outputting the source surface form, we can output the source lemma + target morph. This would improve the post-editability as well as the comprehensibility of the output. This would involve creating a morph guessing module which will be discussed later.

It is worthwhile to discuss the possibility of propagating the surface form even past the bidix, and give the user an option to either just output the source surface form or source lemma + target morph based on their preference and task.
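As a rough illustration of the first option, a bidix lookup with a surface-form fallback might look like the sketch below. The dict-based bidix and the function name are hypothetical stand-ins; the real lookup is an FST operation in lt-proc.

```python
def translate_lu(lemma, tags, surface, bidix):
    """Toy bidix lookup: translate if lemma + first tag is known,
    otherwise fall back to the source surface form as an unknown word."""
    key = (lemma, tags[0])
    if key in bidix:
        # Copying all source tags to the target side is a simplification;
        # a real bidix entry can change tags (e.g. add gender).
        return '^' + bidix[key] + ''.join('<%s>' % t for t in tags) + '$'
    # Option 1 above: propagate just the surface form, marked unknown
    return '*' + surface

bidix = {('beer', 'n'): 'garagardo'}
print(translate_lu('beer', ['n', 'pl'], 'beers', bidix))       # ^garagardo<n><pl>$
print(translate_lu('potato', ['n', 'pl'], 'potatoes', bidix))  # *potatoes
```

The second option would replace the fallback line with an LU built from the source lemma and analysis, handing it to transfer as if it were a target-language word.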

Modification of words in transfer based on TL tags

Transfer rules quite often use target-language information from the bidix to fill in tags etc. We could simply use the source tags as target tags; this could give a decent result, unless it's a word whose gender/number changes in translation. This could be better than treating it as an unknown word, but it warrants a discussion with the community. If the source language doesn't have grammatical gender and the target language does, we could either choose not to translate its dependents, such as determiners, or translate them with a default gender.

Whichever we decide would also help us determine whether to propagate the surface form only till the bidix lookup, or till generation.

Compounds and multiwords

Before apertium-separable, multiwords and compounds were split into several units before bidix lookup. This caused problems, as a partially unknown multiword would hurt the comprehensibility of the final output; trimming prevents this by ensuring that a multiword or a compound doesn't stay in a monodix unless it will be fully translated. This was one of the reasons trimming was done. However, with apertium-separable, multiwords aren't split anymore, and trimming becomes detrimental to their translation.

If XY is a multiword in the monodix, the trimming algorithm checks whether both X and Y are in the bidix, and if they're not, it trims the multiword from the monodix. However, with separable, the bidix might have XY as one unit even if one of X or Y isn't there on its own, yet the trimming algorithm would still trim the multiword from the monodix. This warrants the elimination of trimming.

Compounds are still split in pretransfer, so the issue there would be that if a compound has one part in the bidix and not the other, the final translation would be odd. To me it seems that, since the parts of a compound were going to be translated individually anyway, translating only part of it shouldn't be a problem, and should actually help with post-editing.

If the community feels like this is fine, then we don't really need to do much about multiwords and compounds when we remove trimming. If not, there can be ideas about storing the multiword surface forms.

Implementation: Modifying apertium stream format to include arbitrary information

Rationale

To eliminate trimming, we need to modify the apertium stream format so that it can include the surface form of words as well. This would first need a formalism for a new stream format and then a modification to all the parsers in the pipeline.

However, if we are going to modify all the parsers to include the surface form in the lexical unit, then, as we concluded in our discussion (see the discussion page), it is a worthwhile exercise to modify the stream format so that each program can include and process an arbitrary amount of information in the Apertium stream, not just the surface form. With this proposal we're trying to prepare the Apertium stream for the future. Today we realised that we need the surface form in the stream; tomorrow we might need semantic tags, sentiment tags, etc. If we don't do this now, we will have to modify all the parsers in the pipeline each time we need more information in the pipe. This is why it's a good idea to modify the parsers so that they can handle an arbitrary amount of information.

Proposed Modification

The stream will now have primary information - all information currently available in the stream, such as the lemma and analysis. It will also have optional secondary information, in a feature:value format. We discussed several possible syntaxes for this new stream format, and the one that seems best is something like this:

^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>/patata<n><f><pl><more:other>$

Note that case here refers to capitalisation, not morphological case which is already a tag and hence primary information.

  • This doesn't mess with the current stream format too much. The primary information syntax is unchanged, and not prefixed.
  • The number of tags is already arbitrary so that helps.
  • The secondary tags contain a ":" that would help distinguish them from primary tags.

This is just an example, but the idea is that you can add any amount of information as you want, not only in the language models or the translation modules, but every program can add as well as read information from the stream.
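A minimal sketch of how a parser might separate primary tags from the proposed secondary feature:value tags. The function name is mine and the approach is purely illustrative; the real parsers live in lttoolbox and apertium and are written in C++.

```python
import re

def parse_side(side):
    """Split one side of a lexical unit into (lemma, primary, secondary).
    Secondary tags are the ones containing ':', per the proposed format."""
    parts = re.split(r'<([^<>]+)>', side)
    lemma, tags = parts[0], parts[1::2]
    primary = [t for t in tags if ':' not in t]
    secondary = dict(t.split(':', 1) for t in tags if ':' in t)
    return lemma, primary, secondary

lu = '^potato<n><pl><case:aa><sf:potatoes>/patata<n><f><pl>$'
sl, tl = lu.strip('^$').split('/')
print(parse_side(sl))  # ('potato', ['n', 'pl'], {'case': 'aa', 'sf': 'potatoes'})
```

A program that doesn't care about secondary information can simply ignore the ':' tags, which is what makes the format backwards-compatible.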

Implementation

For this project I will take a test-driven development approach; it fits the use case well, since we have a stable code base with a large user base who would not be happy with any regression, i.e. the changed parsers should still work with the current stream format, and the benefits of trimming shouldn't be lost either.

I will first write tests for the new stream format for every part of the pipeline, and then I will start modifying the individual parsers. This should be done by the end of the first phase of GSoC. Then I'll experiment with putting the surface form in the secondary format and eliminating dictionary trimming.

As the work plan will reflect, the solution will first be documented - its feasibility, benefits, disadvantages, and the modifications needed (what and where) - all before a single line of code is written. The development will then happen in a branch, with extensive regression testing.
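One backwards-compatibility property the tests should pin down is that lexical units in the current format, which contain no ':' tags, pass through a secondary-tag-stripping step byte-identical. A toy check, assuming the <feature:value> syntax proposed above:

```python
import re

SECONDARY = re.compile(r'<[^<>]*:[^<>]*>')  # proposed secondary tags contain ':'

def strip_secondary(lu):
    """Remove proposed secondary tags, recovering the current format."""
    return SECONDARY.sub('', lu)

# Current-format LU: must pass through unchanged (no regression).
assert strip_secondary('^vino<n><m><sg>$') == '^vino<n><m><sg>$'
# New-format LU: stripping secondary tags recovers the current format.
assert strip_secondary('^vino<n><m><sg><sf:vino>$') == '^vino<n><m><sg>$'
print('stream-format regression checks pass')
```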

Stretch Goal: Morph Guessing for words missing in the bidix

Once trimming is eliminated, I have explained above how we can maintain the benefits of trimming by outputting the source word surface form as an unknown word. However, we can do even better than that by outputting the source lemma + target morph.

For example, Translating from Basque to English:

"Andonik izarak izeki zuen" ('Andoni hung up the sheets') → "Andoni *izeki-ed the sheets".

This would help with the post-editing and comprehensibility of the output. It is important to note that it is only viable once we eliminate trimming. The idea is that we propagate the source lemma/source surface form, and based on the source analysis, we guess the corresponding morph in the target language using the target monodix.

Implementation of Morph Guessing

A naive way to implement this could be to use the pardefs in the TL monodix as an analysis-to-morph mapping, and for the cases where the pardefs aren't so clear, we could take the surface forms of all the words with that analysis and try to find a common substring to isolate the morph. This could work decently for prefixing and suffixing languages. We could also expand the morphological dictionary and use an algorithm such as OSTIA [1] to learn morphological analyses for word endings.

Example pardef:

<pardef n="beer__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
</pardef>
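A toy version of this idea, with the beer__n pardef above encoded as an analysis-to-suffix mapping. The names and data structures here are my own, purely illustrative of the pardef-lookup approach:

```python
# The <e> entries of the beer__n pardef above, as analysis -> suffix.
beer_n = {('n', 'sg'): '', ('n', 'pl'): 's'}

def guess_form(lemma, tags, pardef):
    """Attach the morph that the TL pardef associates with this analysis;
    fall back to an unknown word if the pardef has no matching entry."""
    suffix = pardef.get(tuple(tags))
    return lemma + suffix if suffix is not None else '*' + lemma

print(guess_form('txakur', ['n', 'pl'], beer_n))  # txakurs
print(guess_form('txakur', ['n', 'du'], beer_n))  # *txakur
```

The hard part the real module would face is choosing which pardef to apply to an unseen lemma, which is where the common-substring and OSTIA ideas come in.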

This idea could of course be a project on its own, so I will treat this as a stretch goal for this GSoC, and try to set the foundations for it as part of this project.

About Dictionary Trimming (Coding challenge)

There are currently two ways of automatically trimming a monodix - using lttoolbox or using hfst.

lttoolbox

lttoolbox has a command lt-trim. It takes as input a compiled monodix and the corresponding compiled bidix, and gives a trimmed compiled monodix as output. Here is a small overview of how lt-trim works:

It loads the analyser and bidix, and loops through all the analyser states, trying to do the same steps in parallel in the bidix. Transitions which are possible in both are added to the trimmed analyser. So only those analyses that would pass through the bidix when doing lt-proc -b will stay in the final trimmed analyser.

The bidix is preprocessed to match the format of the monodix. This involves taking a union of all sections into one big section. Then an effective .* is appended to the bidix entries so that if "foo<vblex>" is in there, it will match "foo<vblex><pres>" in the monodix. Lastly, it also moves all lemqs (the # or <g> group elements) to after the tags in the bidix entries, as the monodix always has the # part after the tags, while the bidix has it on the lemma.

Once this is done, the intersection takes place. lt-trim also deals with #-type multiwords, i.e. multiwords that have an invariable part, by changing the bidix format of a multiword (take# out<vblex>) to the monodix format (take<vblex># out). For multiwords with +, one part is matched against the bidix, and then each subsequent part is searched again from the start of the bidix, in effect looking up each part of the word individually.
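The effect of this intersection (ignoring multiwords and lemq movement) can be caricatured in a few lines, with the appended .* modelled as prefix matching. This is a toy model on plain Python dicts, not how the FST code works internally:

```python
def trim_analyses(analyser, bidix_entries):
    """Keep only analyses that some bidix entry accepts as a prefix,
    mimicking the effective '.*' appended to each bidix entry."""
    return {surface: [a for a in analyses
                      if any(a.startswith(b) for b in bidix_entries)]
            for surface, analyses in analyser.items()}

analyser = {'beers': ['beer<n><pl>'], 'potatoes': ['potato<n><pl>']}
print(trim_analyses(analyser, {'beer<n>'}))
# {'beers': ['beer<n><pl>'], 'potatoes': []}
```

lt-trim achieves the same result without ever enumerating the analyses, by walking the two transducers' states in parallel.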

hfst

hfst is the Helsinki Finite-State Toolkit, which is used to build finite-state transducers for processing morphologies. Since several language pairs, such as apertium-sme-nob, apertium-fin-sme, and apertium-kaz-tat, use hfst instead of lttoolbox for morph analysis, trimming needs to be done in hfst as well. The following snippet shows how trimming is done using hfst.

hfst-invert test-en.fst -o test-en.mor.fst
hfst-project -p upper test-en-eu.fst > test-en-eu.en.fst
echo " ?* " | hfst-regexp2fst > any.fst
hfst-concatenate -1 test-en-eu.en.fst -2 any.fst -o test-en-eu.en-prefixes.fst
hfst-compose-intersect -1 test-en-eu.en-prefixes.fst -2 test-en.mor.fst | hfst-invert -o test-en.trimmed.fst

test-en.fst is the compiled monolingual dictionary (input: source surface form, output: source lemma + analysis) and test-en-eu.fst is the compiled bilingual dictionary (input: source lemma + analysis, output: target lemma + analysis). The monodix fst is inverted (input and output labels exchanged), and the bidix fst is projected to create a transducer of just the input strings. Then, the any.fst is concatenated to this projected fst so that it accepts foo<n><sg> even if the bidix only has foo<n>. This is called the test-en-eu.en-prefixes.fst.

Then the prefixes fst and the inverted monodix fst are intersected (keeping only strings accepted by both), and a new fst is composed. This is finally inverted to again have surface forms as input and lemma+analysis as output. test-en.trimmed.fst is the final trimmed morph analyser.

Trimming in apertium-sme-nob

Trimming in sme-nob is slightly different from the implementation described above. apertium-sme-nob has compounds whose parts are separated by '+'. As mentioned earlier, these are split before bidix lookup, so the bidix will not have any multiwords (with a +).

To deal with this, we modify the bidix so that it matches normal words as discussed earlier, but then optionally a '+' followed by another full bidix match, repeated up to two times. In terms of regex, this looks something like bidix [^+]* (+ bidix [^+]*){0,2}. In effect, a monodix multiword with at most two +s is accepted by matching each of its parts against a word in the bidix. Here is the recipe for this process, as found in the Makefile.

# Override prefixes from ap_include, since we need the derivation-pos-changes:
.deps/%.autobil.prefixes: %.autobil.bin .deps/.d
	lt-print $< | sed 's/ /@_SPACE_@/g' > .deps/$*.autobil.att
	hfst-txt2fst -e ε -i  .deps/$*.autobil.att -o .deps/$*.autobil-split.hfst
	hfst-head      -i .deps/$*.autobil-split.hfst -o .deps/$*.autobil-head.hfst
	hfst-tail -n+2 -i .deps/$*.autobil-split.hfst -o .deps/$*.autobil-tail.hfst
	hfst-union -2 .deps/$*.autobil-head.hfst -1 .deps/$*.autobil-tail.hfst -o .deps/$*.autobil.hfst
	hfst-project -p upper .deps/$*.autobil.hfst -o .deps/$*.autobil.upper                                   # bidix
	echo '[ "<n>" -> [ "<n>" | "<ex_n>" ] ] .o. [ "<adj>" -> [ "<adj>" | "<ex_adj>" ] ] .o. [ "<vblex>" -> [ "<vblex>" |"<ex_vblex>" ] ] .o. [ "<iv>" -> [ "<iv>" | "<ex_iv>" ] ] .o. [ "<tv>" -> [ "<tv>" | "<ex_tv>" ] ]' \
		| hfst-regexp2fst -o .deps/$*.derivpos.hfst
	hfst-compose -1 .deps/$*.autobil.upper -2 .deps/$*.derivpos.hfst -o .deps/$*.autobil-derivpos.hfst
	hfst-project -p lower .deps/$*.autobil-derivpos.hfst -o .deps/$*.autobil-derivpos.hfsa                  # bidix with n -> n|ex_n
	echo ' [ ? - %+ ]* ' | hfst-regexp2fst > .deps/$*.any-nonplus.hfst                                                        # [^+]*
	hfst-concatenate -1 .deps/$*.autobil-derivpos.hfsa -2 .deps/$*.any-nonplus.hfst -o .deps/$*.autobil.nonplussed    # bidix [^+]*
	echo ' %+ ' | hfst-regexp2fst > .deps/$*.single-plus.hfst                                                                 # +
	hfst-concatenate -1 .deps/$*.single-plus.hfst -2 .deps/$*.autobil.nonplussed -o .deps/$*.autobil.postplus # + bidix [^+]*
	hfst-repeat -f0 -t2 -i .deps/$*.autobil.postplus -o .deps/$*.autobil.postplus.0,2                      # (+ bidix [^+]*){0,2} -- gives at most two +
	hfst-concatenate -1 .deps/$*.autobil.nonplussed -2 .deps/$*.autobil.postplus.0,2 -o $@                 # bidix [^+]* (+ bidix [^+]*){0,2}
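The shape of the resulting language can be checked with an ordinary regex. This is a toy stand-in where the whole bidix is just two entries in an alternation; the real pattern is built from FSTs:

```python
import re

bidix = r'(?:beer<n>|house<n>)'  # toy bidix as a regex alternation
# Mirrors the Makefile's  bidix [^+]* (+ bidix [^+]*){0,2}
pattern = re.compile(bidix + r'[^+]*(?:\+' + bidix + r'[^+]*){0,2}')

print(bool(pattern.fullmatch('beer<n><pl>')))                      # True
print(bool(pattern.fullmatch('house<n>+beer<n><pl>')))             # True
print(bool(pattern.fullmatch('wine<n>+beer<n>')))                  # False
print(bool(pattern.fullmatch('beer<n>+beer<n>+beer<n>+beer<n>')))  # False: >2 '+'
```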

Once the bidix is modified to deal with compounds, we move on to the trimming. Since makefiles don't execute in order, I have rearranged the statements of this Makefile here so that they appear in order of execution.

# -------------------
# Northern Saami analysis:
# -------------------

.deps/$(LANG1).automorf.hfst: $(AP_SRC1)/apertium-und.$(LANG1)-und.LR.att.gz .deps/.d
	$(ZCAT) $< | hfst-txt2fst > $@

.deps/rm-deriv-cmp.hfst: rm-deriv-cmp.twol .deps/.d
	hfst-twolc -i $< -o $@

.deps/$(LANG1).automorf-rmderiv.hfst:           .deps/$(LANG1).automorf.hfst          .deps/rm-deriv-cmp.hfst
	hfst-compose-intersect -1 $< -2 .deps/rm-deriv-cmp.hfst -o $@

.deps/$(PREFIX1).automorf-rmderiv-trimmed.hfst: .deps/$(LANG1).automorf-rmderiv.hfst .deps/$(PREFIX1).autobil.prefixes
	hfst-compose-intersect -1 $< -2 .deps/$(PREFIX1).autobil.prefixes -o $@

.deps/$(PREFIX1).automorf-rmderiv-trimmed-min.hfst: .deps/$(PREFIX1).automorf-rmderiv-trimmed.hfst
	hfst-minimize -i $< -o $@

$(PREFIX1).automorf.hfst: .deps/$(PREFIX1).automorf-rmderiv-trimmed-min.hfst
	hfst-fst2fst -w -i $< -o $@

$(PREFIX1).automorf-untrimmed.hfst: .deps/$(LANG1).automorf.hfst
	hfst-fst2fst -w -i $< -o $@

In the first three statements, the twol file is compiled and then composed with the compiled lexc file to make the overall morph analyser, as explained in Starting a new language with HFST.

Now for the trimming: in the fourth statement we do a compose-intersect between the automorf fst (compiled monodix) and the autobil prefixes prepared above (the modified bidix). Then hfst-fst2fst converts the fst to an optimised-lookup (weighted) implementation, and the final trimmed file is named $(PREFIX1).automorf.hfst. The unmodified automorf is converted to the same optimised-lookup implementation and named $(PREFIX1).automorf-untrimmed.hfst. Note that PREFIX1=$(LANG1)-$(LANG2).

Work Plan (TODO)

Application Review Period (April 1 - May 4)

  • Write tests to prepare for test driven development
  • Run some experiments with the new stream format - document how much change is needed in which parsers


Community Bonding Period (May 4 - June 1)

Week 1-4 (June 1 - )

Deliverable #1: Stream Modified such that the new format passes through each part of the pipeline

Evaluation 1: June

Week 5-8 (June )

Deliverable #2: Trimming eliminated without regression of benefits

Evaluation 2: July

Week 9-12 (July )


Final Evaluations: August 24-

Project Completed

NOTE: The third phase of the project has extra time to deal with unforeseen issues and ideas


A description of how and who it will benefit in society

It will benefit most users of Apertium and will hopefully attract more people to the tool. By discarding the morphological analysis of a word too early, we prevent modules like lexical selection and transfer from using information they would really benefit from. Secondly, as I've described earlier, several projects and ideas become possible once trimming is eliminated, because the developer will then have the option to use this morphological information anywhere in the pipeline. The morph-guessing project will make Apertium's translation output much more comprehensible, and would help with both gisting and post-editing, and hence help all kinds of users of Apertium.

By modifying the stream to include arbitrary information, we also open the doors for several future applications, which can now add and use any info in the pipeline. This would be useful for markup reordering, semantic information, capitalisation, theta roles, sentiment tags, etc.

I’m from India and for a lot of our languages, we don’t have the data to create reliable Neural MT systems. Similarly, for all resource poor languages, Apertium provides an easy and reliable MT system for their needs. That’s how Apertium benefits society already.

Reasons why Google and Apertium should sponsor it

I've been a regular contributor with Apertium for more than a year now, and this project is one which aims to modify almost every part of the pipeline for the better. The funding that I receive will help me to focus my time and resources on this project so that it can be adequately completed in three months.

By funding this project, Google will help improve an important Open Source tool and promote Open Source development. In a world of proprietary software, this is an invaluable resource for society and supports innovation that everyone can benefit from.

Skills and Qualifications

I'm currently a fourth year student and an Undergraduate Researcher at IIIT Hyderabad where I'm studying Computational Linguistics. It is a dual degree where we study Computer Science, Linguistics, NLP and more. I'm working on Machine Translation in the LTRC lab in IIIT Hyderabad and I'm part of the MT group in our university.

I've been interested in linguistics from the very beginning, and thanks to rigorous programming courses I'm also adept at several languages and technologies, such as Python, C++, XML, and Bash scripting. I'm skilled in writing Finite State Transducers, Algorithms, Data Structures, and Machine Learning algorithms as well.

I also have a lot of experience studying data which I feel is essential in solving any problem.

I worked with Apertium as part of GSoC 2019 and built the Anaphora Resolution module, so I'm familiar with the codebase, with writing parsers for the stream format, and with the community, which will help me dive right into the project and make a significant contribution from the start. I have worked on several other projects, such as a tool that predicts commas and sentence boundaries in ASR output using pitch, building a Translation Memory, detecting homographic puns, POS taggers, grammar and spell checkers, named entity recognisers, and building chatbots, all of which required a working understanding of Natural Language Processing. Most of these projects were done offline in my research lab and aren't available on GitHub because of privacy settings, but they can be provided if needed.

I am fluent in English and Hindi, and have basic knowledge of Spanish.

The details of my skills and work experience can be found here: CV

Non-Summer-Of-Code Plans

I have no plans apart from GSoC in the summer and can devote 30-40 hours a week for this project.