User:Khannatanmai/Eliminating Dictionary Trimming
Proposal: User:Khannatanmai/GSoC2020Proposal_Trimming
Rationale
The monodix of a language is generally larger than the bidix of a language pair involving that language. If the monodix is used as is, the output contains many translation errors (the ones marked with @): when a translation isn't available, the source-language lemma is simply emitted. Dictionary trimming was added to deal with this. Trimming removes a word from the monodix if it isn't present in the bidix, so the word goes through the pipeline as an unknown word and the source surface form appears in the final translation (marked with *). This is arguably better and more intelligible than the bare source lemma.
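The difference between the two behaviours can be sketched with toy dictionaries. This is only an illustration: the entries and the names `MONODIX`, `BIDIX`, `trim` and `translate_lemma` are invented here, and real trimming operates on compiled dictionaries rather than Python dicts.

```python
# Toy dictionaries, invented for illustration only.
MONODIX = {"potatoes": "potato", "tastier": "tasty"}  # surface form -> lemma
BIDIX = {"potato": "patata"}                          # "tasty" has no translation


def trim(monodix, bidix):
    """Keep only the monodix entries whose lemma also appears in the bidix."""
    return {sf: lemma for sf, lemma in monodix.items() if lemma in bidix}


def translate_lemma(surface, monodix, bidix):
    """Return the target lemma, or the marked fallback when lookup fails."""
    if surface not in monodix:
        return "*" + surface   # unknown word: source surface form
    lemma = monodix[surface]
    if lemma not in bidix:
        return "@" + lemma     # analysed but untranslated: source lemma
    return bidix[lemma]


untrimmed = translate_lemma("tastier", MONODIX, BIDIX)
trimmed = translate_lemma("tastier", trim(MONODIX, BIDIX), BIDIX)
```

Without trimming the untranslated word comes out as its lemma with @. With trimming it comes out as the surface form with *, but its analysis is never produced.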
However, trimming meant giving up certain benefits. Let's look at these benefits in greater detail:
- Lexical Selection: By discarding the analysis of a word in the source language, we lose the ability to use it as context to disambiguate the words around it. Assume a [Noun Adjective] sequence in which we don't know the translation of the Adjective, i.e. it isn't in the bidix. With trimming we would discard its analysis, so if the Noun is ambiguous, we have no way to disambiguate it using the Adjective, since the discarded analysis included the fact that it's an adjective.
- Transfer: In the same example, assume that in the target language, [Noun Adj] is to be rearranged into [Adj Noun]. With trimming, this can't be done as we've discarded the analysis of the Adjective, treating it as an unknown word.
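The transfer case can be illustrated with a minimal sketch. It assumes each unit is reduced to a (text, part-of-speech) pair, with `None` standing in for a trimmed, unknown word. The function name and the example words are hypothetical, and this is not how transfer rules are actually written.

```python
def reorder_noun_adj(units):
    """Swap each adjacent (noun, adjective) pair into (adjective, noun).

    Each unit is a (text, pos) pair; pos is None for an unknown word.
    """
    out = list(units)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "n" and out[i + 1][1] == "adj":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return out


# Illustrative words only; the second unit has no bidix entry in this scenario.
untrimmed = [("casa", "n"), ("blanca", "adj")]  # analysis kept
trimmed = [("casa", "n"), ("blanca", None)]     # analysis discarded by trimming
```

With the analysis kept, the rule matches and the pair is reordered. With the trimmed input nothing matches and the source order is left unchanged.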
Now, if we don't discard the analysis and don't trim, we would again fall into the earlier problem of untranslated lemmas.
This project is a way to have our cake and eat it too. We don't discard the analysis even if we don't know the translation, but we don't just output the lemma either: we output the source surface form. For a solution like this, it is essential that we propagate the surface form at least as far as transfer, or even as far as the generator, so that we can use the benefits of the source analysis and then, before the final translation is produced, discard it and use the source surface form.
Currently the source surface form is discarded at the tagger. This is where the stream modification comes in. It's a robust way to propagate the surface form through the stream with the least disruption to the current modules.
Solution
Propagate the surface form through the pipeline. If a word is in the monodix but not in the bidix, generate its source surface form. This maintains the benefits of trimming while also keeping the source analysis, which removes the disadvantages of trimming.
Modifications Needed
- Each module will be modified to have the ability to access and add to the stream secondary information in the form of secondary tags, as explained in the proposal. Given this modification, the following modules will be modified to eliminate dictionary trimming.
Tagger
- Modified to not remove the surface form of the LU and instead add it in a secondary tag, e.g. <sf:potatoes>.
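A minimal sketch of this step follows. It assumes the tagger receives a lexical unit of the form `^surface/analysis1/analysis2$` and that the secondary tag is appended after the primary tags; that placement is an assumption here, since the exact syntax is defined in the proposal. The function `tag_keep_surface` and its `chosen` index, which stands in for the tagger's real disambiguation, are hypothetical. Escaped characters in the stream are ignored.

```python
def tag_keep_surface(lexical_unit, chosen=0):
    """Turn '^surface/analysis1/...$' into '^analysis<sf:surface>$'.

    `chosen` is a stand-in for the tagger's actual choice of analysis.
    """
    if not (lexical_unit.startswith("^") and lexical_unit.endswith("$")):
        raise ValueError("not a lexical unit: " + lexical_unit)
    surface, *analyses = lexical_unit[1:-1].split("/")
    if not analyses:
        raise ValueError("no analyses in: " + lexical_unit)
    # Assumed placement: secondary tag appended after the primary tags.
    return "^{}<sf:{}>$".format(analyses[chosen], surface)


tagged = tag_keep_surface("^potatoes/potato<n><pl>$")
```

Here `tagged` is `^potato<n><pl><sf:potatoes>$`: the chosen analysis is kept, and the surface form travels with it instead of being dropped.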
Pretransfer
- Depends on what we decide with respect to compounds. With no modification, a compound XZY/X+Y, in which only Y is in the bidix, will translate as XZY Y, which can often make the translation worse.
- One solution is to effectively keep trimming compounds if we don't have the full translation of these.
- If we can somehow find the surface form of the parts of a compound, then we can go ahead with partial translations.
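The first option above (keep trimming compounds that lack a full translation) could look roughly like the sketch below. The function `handle_compound` is hypothetical, and representing the bidix as a set of translatable lemmas is a simplification made for this example.

```python
def handle_compound(surface, parts, bidix):
    """Return the units to pass on for a compound analysed as X+Y.

    `parts` is the list of part lemmas; `bidix` is a set of lemmas
    that have a translation.
    """
    if all(lemma in bidix for lemma in parts):
        return list(parts)      # every part translatable: split as usual
    return ["*" + surface]      # otherwise treat the whole compound as unknown


partial = handle_compound("XZY", ["X", "Y"], {"Y"})
full = handle_compound("XZY", ["X", "Y"], {"X", "Y"})
```

With only Y in the bidix the compound falls back to a single unknown word, which avoids the XZY Y output described above. The second option would need a surface form for each part, which this sketch does not attempt.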
Transfer
- Need to ensure that secondary tags stay attached to their counterparts in the target language (TL). This should already be handled by the stream modification.
Generator
- Modified to generate the source surface form of a word which doesn't have a translation, instead of the lemma with @.
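The generator change can be sketched as follows. It assumes that an untranslated word reaches the generator with @ prefixed to its lemma and with the `<sf:...>` secondary tag still attached. The function `generate_fallback` is hypothetical, and prefixing the output with * (as trimmed unknown words are) is also an assumption, exposed here as the `mark` parameter.

```python
import re

SF_TAG = re.compile(r"<sf:([^>]*)>")


def generate_fallback(lexical_unit, mark="*"):
    """Return the marked source surface form for an untranslated unit.

    Returns None when the unit is translated or carries no <sf:...> tag,
    in which case normal generation would apply.
    """
    body = lexical_unit.strip("^$")
    match = SF_TAG.search(body)
    if body.startswith("@") and match:
        return mark + match.group(1)
    return None


fallback = generate_fallback("^@tasty<adj><sf:tastier>$")
```

For the untranslated unit above, `fallback` is `*tastier` rather than `@tasty`. A translated unit returns `None` and would be generated as usual.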