Lexical selection

From Apertium

Lexical selection is the task of choosing, for a source-language (SL) word that has several possible translations with the same part-of-speech (POS), the most adequate of those translations in the target language (TL). The task is related to word-sense disambiguation. The difference is that its aim is to find the most adequate translation, not the most adequate sense. Thus, it is not necessary to choose between a series of fine-grained senses if all these senses result in the same final translation.

This page has some links to pages about lexical selection in Apertium.

General information:

Current lexical selection module (2012)

This was written by [Francis Tyers] and is deployed in the XX-XX language pair, where you can see an example.

The slr/srl approach (2010-2012)

This uses a special Constraint Grammar (CG) file which runs _after_ regular morphological disambiguation, but _before_ bidix:

morph. analysis | morph. disambiguation (CG or apertium-tagger) | CG lexical selection | bidix | transfer | morph. generation

The CG rules add a number to the lemma of the word if we want a non-default translation, so ^ahte<CC>$ might turn into ^ahte:1<CC>$.
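
The effect of such a rule on the stream can be sketched in Python. This is only an illustration of the lemma rewrite, not CG itself: the function name mark_sense and the regular expression are assumptions made for this example, and the sketch marks every matching unit unconditionally, whereas a real CG rule would only fire in a specific context.

```python
import re

# One lexical unit such as ^ahte<CC>$: a lemma followed by one or more tags.
LEXICAL_UNIT = re.compile(r"\^(?P<lemma>[^<$^]+)(?P<tags>(?:<[^>]+>)+)\$")


def mark_sense(stream: str, lemma: str, tag: str, sense: int) -> str:
    """Append ':<sense>' to the lemma of every unit with this lemma and first tag.

    Hypothetical helper: it stands in for the output side of a CG rule and
    applies no context conditions at all.
    """
    def rewrite(match: re.Match) -> str:
        tags = match.group("tags")
        if match.group("lemma") == lemma and tags.startswith(f"<{tag}>"):
            return f"^{lemma}:{sense}{tags}$"
        # Leave every other unit (including already-marked ones) untouched.
        return match.group(0)

    return LEXICAL_UNIT.sub(rewrite, stream)
```

For instance, mark_sense("^ahte<CC>$", "ahte", "CC", 1) gives ^ahte:1<CC>$, while units with a different lemma or tag pass through unchanged.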

The bidix has entries like

<e>            <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e slr="1"><p><l>ahte<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>

This is pre-processed by an XSLT script, so the file that is given to lt-comp actually contains

<e>            <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e r="LR"><p><l>ahte:1<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
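
The pre-processing step can be approximated in Python as follows. This is a sketch of the transformation described above, not the actual XSLT script: the function name expand_slr is made up for this example, and it assumes that the entry is restricted with the usual r="LR" direction attribute and that the left side starts with the plain lemma text.

```python
import xml.etree.ElementTree as ET


def expand_slr(entry_xml: str) -> str:
    """Rewrite one <e> entry: move the slr number into the left-side lemma.

    Hypothetical helper that mimics the pre-processing on a single entry.
    Entries without an slr attribute are returned with the same content.
    """
    entry = ET.fromstring(entry_xml)
    sense = entry.attrib.pop("slr", None)  # remove slr if it is present
    left = entry.find("./p/l")
    if sense is not None and left is not None:
        # The lemma is the text before the first <s> tag inside <l>.
        left.text = f"{left.text or ''}:{sense}"
        # Assumption: the marked entry only applies left to right.
        entry.set("r", "LR")
    return ET.tostring(entry, encoding="unicode")
```

Calling expand_slr on the slr="1" entry above yields an entry whose left side reads ahte:1 and which no longer carries the slr attribute. ElementTree may serialise empty tags with slightly different spacing than the hand-written file.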

So if the CG rule fired, and turned ahte into ahte:1, we get "og at" instead of "at".
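
The resulting lookup behaviour can be illustrated with a small Python sketch. The dictionary BIDIX and the function translate are hypothetical stand-ins for the compiled bidix, and the output format is simplified compared with what the real lookup produces; the blank written as <b/> in the entry appears here as a space.

```python
# Hypothetical in-memory stand-in for the two compiled bidix entries.
BIDIX = {
    "ahte<CC>": "at<cnjcoo><clb>",        # default translation
    "ahte:1<CC>": "og at<cnjcoo><clb>",   # chosen when the CG rule added :1
}


def translate(unit: str) -> str:
    """Look up one lexical unit such as ^ahte<CC>$ in the toy dictionary."""
    key = unit.strip("^$")  # drop the unit delimiters before the lookup
    return f"^{BIDIX[key]}$"
```

With this table, translate("^ahte<CC>$") selects the default "at" entry and translate("^ahte:1<CC>$") selects the "og at" entry, so the choice of translation is made entirely by whether the lemma was marked.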

Transfer rule approach (2009)

You can write transfer rules that do lexical selection. It's not very elegant, but it works, to a degree. The drawbacks are that you:

  • get big transfer files
  • mix transfer and lexical selection
  • must write rules

This is the method used in most pairs.

Deprecated (2007)