Lexical selection is the task of choosing, for a source-language (SL) word that has several possible translations with the same part-of-speech (POS), the most adequate translation in the target language (TL). The task is related to word-sense disambiguation. The difference is that its aim is to find the most adequate translation, not the most adequate sense. Thus, it is not necessary to choose between a series of fine-grained senses if all these senses result in the same final translation.
This page has some links to pages about lexical selection in Apertium.
Current lexical selection module (2012–current)
This module was written by Francis Tyers and is deployed in the apertium-sh-mk and apertium-kaz-tat language pairs, where you can see an example.
The module runs after the bilingual dictionary (bidix) lookup and applies wherever the bidix output is ambiguous:
morf.analysis | morf.disambiguation | bidix | lexical selection | structural transfer | morf. generation
In a sense, it disambiguates the bidix output (in exactly the same way that morf.disambiguation disambiguates the morf.analysis output).
- Rule-based lexical selection module
- Generating lexical-selection rules from a parallel corpus
- How to get started with lexical selection rules
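The idea of disambiguating the bidix output can be sketched in a few lines of Python. This is only an illustration, not the actual module: it assumes the stream format `^source/target1/target2$` for ambiguous bidix output, the example word and its translations are chosen for illustration, and the `choose` callback is a hypothetical stand-in for real lexical selection rules (here it defaults to the first translation).

```python
import re

# One lexical unit in the stream: ^ ... $
# (escaped characters inside a unit are ignored in this sketch)
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def select(unit_body, choose):
    """Keep the source side and exactly one target-language translation."""
    source, *targets = unit_body.split("/")
    if len(targets) <= 1:
        # Nothing to disambiguate: zero or one translation.
        return unit_body
    return f"{source}/{choose(source, targets)}"


def lexical_selection(stream, choose=lambda source, targets: targets[0]):
    """Apply `choose` to every ambiguous lexical unit in the stream.

    `choose` is a hypothetical hook; real rules would inspect the context.
    """
    return LEXICAL_UNIT.sub(
        lambda m: "^" + select(m.group(1), choose) + "$", stream
    )


if __name__ == "__main__":
    ambiguous = "^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^de<pr>/of<pr>$"
    print(lexical_selection(ambiguous))
```

Only the first unit is changed, because the second one has a single translation and is therefore already unambiguous.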
The slr/srl approach (2010-2012)
Used in apertium-sme-nob.
This uses a special Constraint Grammar (CG) file which runs after regular morphological disambiguation, but before bidix:
morf.analysis | morf.disambiguation (cg or apertium-tagger) | cg lexical selection | bidix | structural transfer | morf. generation
The CG rules add a number to the lemma of the word if we want a non-default translation, so
^ahte<CC>$ might turn into ^ahte:1<CC>$
The bidix has entries like
<e>         <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e slr="1"> <p><l>ahte<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
This is pre-processed by an XSLT script, so the file that is given to lt-comp actually contains
<e>        <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e r="LR"> <p><l>ahte:1<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
So if the CG rule fired, and turned ahte into ahte:1, we get "og at" instead of "at".
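The pre-processing step can be sketched as follows. The real pair uses an XSLT script. This Python version is only an assumed equivalent for illustration, and the function name `expand_slr` is hypothetical. It rewrites an entry carrying `slr="N"` into a left-to-right-only entry whose left-side lemma is `lemma:N`.

```python
import xml.etree.ElementTree as ET


def expand_slr(entry):
    """Turn <e slr="N"> into a left-to-right entry with lemma 'lemma:N'.

    Entries without an slr attribute are returned unchanged.
    """
    number = entry.attrib.pop("slr", None)
    if number is None:
        return entry
    # Restrict the entry to the left-to-right direction.
    entry.set("r", "LR")
    left = entry.find("./p/l")
    # The lemma is the text before the first <s/> child of <l>.
    left.text = f"{left.text or ''}:{number}"
    return entry


if __name__ == "__main__":
    source = (
        '<e slr="1"><p><l>ahte<s n="CC"/></l>'
        '<r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>'
    )
    print(ET.tostring(expand_slr(ET.fromstring(source)), encoding="unicode"))
```

After this step the compiled dictionary contains both `ahte` and `ahte:1` as distinct left-side lemmas, so the number added by the CG rule is what picks the non-default entry.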
Downsides with this approach:
- pairs which need CG only for lexical selection still require the user to install vislcg3
- developers need to remember when they write the rules that number 1 was "og at" and number 0 was "at", which can get confusing (especially if you decide to change the default) – more points of failure.
- On the other hand, lexical selection can most often be seen as a default vs. special-case dichotomy. A good way of working is to introduce each rule set with a legend of the numbers, e.g.:
  # leat 0 = være, 1 = ha, 2 = måtte («ha å»)
Transfer rule approach (2009)
You can write transfer rules that do lexical selection. It's not very elegant, but it works to a degree. The drawbacks are that you:
- get big transfer files
- mix transfer and lexical selection
- must write rules
This is the method used in most pairs.
- Lextor – works using statistics and requires 1) slightly pre-processed dictionaries and 2) corpora to train the module. The module is turned off in most cases as it does not provide an improvement over the baseline.