Lexical selection is the task of choosing, when a source-language (SL) word has several possible translations with the same part of speech (POS), the most adequate of them in the target language (TL). The task is related to word-sense disambiguation; the difference is that its aim is to find the most adequate translation, not the most adequate sense. Thus, it is not necessary to choose between a series of fine-grained senses if all these senses result in the same final translation.
This page has some links to pages about lexical selection in Apertium.
Current lexical selection module (2012–current)
This uses a module which runs after bidix, where the bidix output is ambiguous:
morf.analysis | morf.disambiguation | bidix | lexical selection | structural transfer | morf. generation
In a sense, it disambiguates the bidix output (in exactly the same way that morf.disambiguation disambiguates the morf.analysis output).
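As a rough illustration of what "disambiguating the bidix output" means, the sketch below keeps a single target reading for each ambiguous lexical unit. It assumes the stream form ^source/target1/target2$ for bidix output, does not handle escaped slashes, and uses a hypothetical fallback policy (keep the first translation listed); the function names are invented for this example and are not part of the actual module.

```python
import re

# One lexical unit in the stream: ^source/target1/target2$
LU = re.compile(r"\^([^$]*)\$")

def select(stream, choose):
    """Keep exactly one target reading per ambiguous lexical unit."""
    def fix(match):
        source, *targets = match.group(1).split("/")
        if len(targets) <= 1:
            return match.group(0)  # nothing to disambiguate
        return f"^{source}/{choose(source, targets)}$"
    return LU.sub(fix, stream)

def first_translation(source, targets):
    # Hypothetical fallback policy: keep the first translation listed.
    return targets[0]

line = "^ahte<CC>/at<cnjcoo><clb>/og at<cnjcoo><clb>$"
print(select(line, first_translation))  # ^ahte<CC>/at<cnjcoo><clb>$
```

A real rule set would replace first_translation with a function that looks at the surrounding words before deciding.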
- Rule-based lexical selection module
- Learning rules from parallel and non-parallel corpora
- How to get started with lexical selection rules
Old and alternative approaches
The slr/srl + CG approach (2010-2012)
This was used in apertium-sme-nob until 2014.
This uses a special Constraint Grammar (CG) file which runs after regular morphological disambiguation, but before bidix:
morf.analysis | morf.disambiguation (cg or apertium-tagger) | cg lexical selection | bidix | structural transfer | morf. generation
The CG rules add a number to the lemma of the word if we want a non-default translation, so
^ahte<CC>$ might turn into ^ahte:1<CC>$.
The bidix has entries like
<e> <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e slr="1"><p><l>ahte<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
This is pre-processed by an XSLT script, so the file that is given to lt-comp actually contains
<e> <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e r="LR"><p><l>ahte:1<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
So if the CG rule fired, and turned ahte into ahte:1, we get "og at" instead of "at".
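The pre-processing step can be approximated in a few lines of Python. This is only a sketch of the transformation described above, not the XSLT script itself: it assumes entries of the shape <e><p><l>…</l><r>…</r></p></e>, rewrites only entries that carry an slr attribute, and marks them with the r="LR" direction restriction. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

def append_to_lemma(side, suffix):
    """Append suffix to the lemma text, i.e. just before the first <s> tag."""
    last = None
    for child in side:
        if child.tag == "s":
            break
        last = child  # e.g. a <b/> inside a multiword lemma
    if last is None:
        side.text = (side.text or "") + suffix
    else:
        last.tail = (last.tail or "") + suffix

def expand_slr(entry_xml):
    """Turn <e slr="N"> with lemma X into <e r="LR"> with lemma X:N."""
    entry = ET.fromstring(entry_xml)
    number = entry.attrib.pop("slr", None)
    if number is not None:
        append_to_lemma(entry.find("./p/l"), ":" + number)
        entry.set("r", "LR")
    return entry

marked = expand_slr(
    '<e slr="1"><p><l>ahte<s n="CC"/></l>'
    '<r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>'
)
print(ET.tostring(marked, encoding="unicode"))
```

Entries without an slr attribute pass through unchanged, so the default translation keeps its plain lemma.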
Downsides with this approach:
- pairs which only want lex.sel require the user to install vislcg3
- developers need to remember when they write the rules that number 1 was "og at" and number 0 was "at", which can get confusing (especially if you decide to change the default) – more points of failure.
- On the other hand, lexical selection can most often be seen as a dichotomy between a default and a special case. A good way of working is to introduce each rule set with a comment listing what the numbers mean, e.g.: # leat 0 = være, 1 = ha, 2 = måtte («ha å»)
In 2014, apertium-sme-nob switched to running bidix before lexical selection (like the lrx-proc method), but it still uses vislcg3 rules instead of lrx-proc.
Transfer rule approach (2009)
You can write transfer rules that do lexical selection. It's not very elegant, but it works to a degree. The drawbacks are that you:
- get big transfer files
- mix transfer and lexical selection
- must write rules
This is the method used in most trunk pairs.
Lextor works using statistics and requires 1) slightly pre-processed dictionaries and 2) corpora to train the module. The module is turned off in most cases as it does not provide an improvement over the baseline.