Ideas for Google Summer of Code/Eliminate trimming
Page created by TommiPirinen
Latest revision as of 10:26, 23 April 2020
Dictionary trimming in Apertium means removing entries from the monolingual language models (FSTs compiled from monodixes) so that they contain only word-forms that the translation model (the FST compiled from the bidix) knows about. The practical effect is that words missing from the bidix are treated the same as words missing from the monodix, which makes debugging harder. This can be worked around by compiling trimmed and untrimmed FSTs separately for debugging and development, adding more modes, and trying to remember which modes go with which, but that is error-prone and unmanageable. Furthermore, throwing good information away early is a bad idea: even when the bidix is missing an entry, other parts of the pipeline may benefit from the analyses that got thrown out. Ideally we would keep as much as possible intact and usable, and only select programmatically what is displayed when: source language or target language, word-form or lemma...
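To make the debugging problem concrete, here is a minimal Python sketch. It is only an illustration under loud assumptions: plain dicts stand in for the compiled FSTs, and the names `MONODIX`, `BIDIX`, `trim` and `diagnose` are hypothetical and not part of lttoolbox or HFST. It shows that after trimming, a word missing from the bidix can no longer be told apart from a word missing from the monodix, whereas the untrimmed analyser still can.

```python
# Toy stand-ins for compiled dictionaries (assumption: real Apertium
# pairs use finite-state transducers, not Python dicts).
MONODIX = {
    "cats": [("cat", "n.pl")],
    "dogs": [("dog", "n.pl")],
}
BIDIX = {"cat": "katt"}  # "dog" is deliberately missing from the bidix


def trim(monodix, bidix):
    """Keep only the analyses whose lemma the bidix can translate."""
    trimmed = {}
    for form, analyses in monodix.items():
        kept = [(lemma, tags) for lemma, tags in analyses if lemma in bidix]
        if kept:
            trimmed[form] = kept
    return trimmed


def diagnose(form, analyser, bidix):
    """Say why a word-form can or cannot be translated."""
    analyses = analyser.get(form)
    if not analyses:
        return "unknown to analyser"
    if not any(lemma in bidix for lemma, _ in analyses):
        return "analysed, but missing from bidix"
    return "translatable"


trimmed = trim(MONODIX, BIDIX)

# With the trimmed analyser the two failure causes collapse into one:
print(diagnose("dogs", trimmed, BIDIX))   # monodix has it, bidix does not
print(diagnose("birds", trimmed, BIDIX))  # monodix does not have it

# With the untrimmed analyser the cause is still visible:
print(diagnose("dogs", MONODIX, BIDIX))
```

Both calls on the trimmed analyser report "unknown to analyser", while the untrimmed one reports "analysed, but missing from bidix" for "dogs". Keeping the full analyser and deciding late what to display is the behaviour this idea argues for.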
Task
Coding challenge
- Check out the different trimming methods used in Apertium pairs, including HFST trimming and lttoolbox trimming (Norwegian, North Sámi...)
- Figure out and explain trimming in https://github.com/apertium/apertium-sme-nob/ (hint: Makefile.am)
Further reading
- Why we trim
- Unhammer et al. (20xx) Trimming...