Lexical selection

Revision as of 08:21, 29 April 2015


Lexical selection is the task of choosing, for a source-language (SL) word that has several possible translations with the same part of speech (POS), the most adequate translation in the target language (TL). It is related to word-sense disambiguation, but the aim is to find the most adequate translation, not the most adequate sense. It is therefore unnecessary to choose between fine-grained senses when they all result in the same final translation.

This page has some links to pages about lexical selection in Apertium.

General information:

Current lexical selection module (2012–current)

This module was written by Francis Tyers and is deployed in the apertium-sh-mk and apertium-kaz-tat language pairs, which can serve as examples.

The module runs after bidix lookup, at the point where the bidix output may still be ambiguous:

morf.analysis | morf.disambiguation | bidix | lexical selection | structural transfer | morf. generation

In a sense, it disambiguates the bidix output (in exactly the same way that morf.disambiguation disambiguates the morf.analysis output).
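As a rough illustration of this step, the sketch below takes ambiguous bidix output in the `^source/target1/target2$` stream format and keeps one translation per word. It is not how lrx-proc is implemented: the `RULES` table, the one-word right context and the example words are all assumptions made for the illustration, and the first translation is treated as the default.

```python
import re

# One lexical unit in the stream: ^source/target1/target2$
UNIT = re.compile(r"\^(.*?)\$")

# Hypothetical rule table (not real lrx rules):
# (source lemma, lemma of the following word) -> target lemma to select
RULES = {("estació", "seca"): "season"}


def lemma(reading):
    """Return the lemma part of a reading such as 'season<n>'."""
    return reading.split("<", 1)[0]


def lexical_selection(stream):
    """Keep exactly one translation per unit; the first one is the default."""
    units = [m.group(1).split("/") for m in UNIT.finditer(stream)]
    out = []
    for i, (source, *targets) in enumerate(units):
        if not targets:
            # Nothing to choose from: pass the unit through unchanged.
            out.append(f"^{source}$")
            continue
        next_lemma = lemma(units[i + 1][0]) if i + 1 < len(units) else None
        wanted = RULES.get((lemma(source), next_lemma))
        chosen = next((t for t in targets if lemma(t) == wanted), targets[0])
        out.append(f"^{source}/{chosen}$")
    return " ".join(out)


ambiguous = "^estació<n>/station<n>/season<n>$ ^seca<adj>/dry<adj>$"
print(lexical_selection(ambiguous))
```

With the hypothetical rule in place, the first unit keeps `season<n>`; without a matching rule it would fall back to the first translation, `station<n>`.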

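Some documentation:

  • Rule-based lexical selection module
  • Learning rules from parallel and non-parallel corpora
  • How to get started with lexical selection rules
    • Как начать работу с правилами по выбору лексики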


Old and alternative approaches

The slr/srl + CG approach (2010-2012)

This was used in apertium-sme-nob until 2014.

This uses a special Constraint Grammar (CG) file which runs after regular morphological disambiguation, but before bidix:

morf.analysis | morf.disambiguation (cg or apertium-tagger) | cg lexical selection | bidix | structural transfer | morf. generation

The CG rules add a number to the lemma of the word if we want a non-default translation, so ^ahte<CC>$ might turn into ^ahte:1<CC>$.

The bidix has entries like

<e>            <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e slr="1"><p><l>ahte<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>

This is pre-processed by an XSLT script, so the file that is given to lt-comp actually contains

<e>            <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e r="LR"><p><l>ahte:1<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>

So if the CG rule fired, and turned ahte into ahte:1, we get "og at" instead of "at".
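The pre-processing step can be sketched in Python as follows. This is only an illustration of the transformation described above, not the actual XSLT script: the function name `expand_slr` and the wrapping `<section>` element are assumptions, and the left-to-right restriction is written as `r="LR"`, the form used in standard dix files.

```python
import xml.etree.ElementTree as ET


def expand_slr(section_xml):
    """Rewrite <e slr="N"> entries as left-to-right entries on lemma:N."""
    root = ET.fromstring(section_xml)
    for e in root.iter("e"):
        n = e.attrib.pop("slr", None)
        if n is None:
            continue  # default translation: leave the entry untouched
        left = e.find("p/l")
        # The lemma is the text before the first <s/> child of <l>.
        left.text = (left.text or "") + ":" + n
        # Assumption: restrict the numbered entry to the SL -> TL direction.
        e.set("r", "LR")
    return root


BIDIX = """<section>
<e><p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e slr="1"><p><l>ahte<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
</section>"""

entries = expand_slr(BIDIX).findall("e")
print([e.find("p/l").text for e in entries])
```

After the rewrite, the second entry matches only the numbered lemma `ahte:1`, so it is reached only when the CG rule has added the number.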


Downsides with this approach:

  • pairs which only want lex.sel require the user to install vislcg3
  • developers need to remember when they write the rules that number 1 was "og at" and number 0 was "at", which can get confusing (especially if you decide to change the default) and adds points of failure.
    • On the other hand, lexical selection can usually be seen as a default vs. special-case dichotomy. A good working practice is to introduce each rule set with a comment listing the numbering, e.g.: # leat 0 = være, 1 = ha, 2 = måtte («ha å»)

In 2014, apertium-sme-nob switched to running bidix before lexical selection (as in the lrx-proc method), but it still uses vislcg3 rules instead of lrx-proc.

Transfer rule approach (2009)

You can write transfer rules that do lexical selection. It's not very elegant, but it works to a degree. The drawbacks are that you:

  • get big transfer files
  • mix transfer and lexical selection
  • must write rules

This is the method used in most trunk pairs.

Lextor (2007)

Lextor works using statistics and requires 1) slightly pre-processed dictionaries and 2) corpora to train the module. The module is turned off in most cases as it does not provide an improvement over the baseline.

See also