Lexical selection

Latest revision as of 08:36, 29 April 2015

Lexical selection is the task of choosing, for a source-language (SL) word that has several possible translations with the same part of speech (POS), the most adequate translation in the target language (TL). The task is related to word-sense disambiguation. The difference is that its aim is to find the most adequate translation, not the most adequate sense. Thus, it is not necessary to choose between a series of fine-grained senses if all these senses result in the same final translation.

This page has some links to pages about lexical selection in Apertium.

General information:

Current lexical selection module (2012–current)

The Constraint-based lexical selection module / apertium-lex-tools was written by Francis Tyers and is deployed in the apertium-sh-mk and apertium-kaz-tat language pairs, which can serve as examples.

This uses a module which runs after bidix, where the bidix output is ambiguous:

morf.analysis | morf.disambiguation | bidix | lexical selection | structural transfer | morf. generation

In a sense, it disambiguates the bidix output (in exactly the same way that morf.disambiguation disambiguates the morf.analysis output).
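
The idea of disambiguating bidix output can be sketched in a few lines of Python. This is only a toy illustration, not the lrx-proc implementation: it assumes that each lexical unit arrives as ^SL/TL1/TL2$, that the first translation is the default, and that a rule is a hypothetical tuple (SL lemma, context offset, context lemma, TL lemma). The Catalan example data and the rule format are made up for illustration.

```python
def lemma(reading):
    """Return the lemma of a reading such as 'season<n>' (text before the first tag)."""
    return reading.split("<", 1)[0]


def parse_unit(unit):
    """Split '^sl/tl1/tl2$' into the SL reading and the list of TL readings."""
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError(f"not a lexical unit: {unit!r}")
    sl, *tls = unit[1:-1].split("/")
    return sl, tls


def select(units, rules):
    """Keep one translation per unit: the first by default, or the one a rule picks.

    Each rule is a hypothetical tuple (sl_lemma, offset, context_lemma, tl_lemma).
    """
    parsed = [parse_unit(u) for u in units]
    out = []
    for i, (sl, tls) in enumerate(parsed):
        if not tls:  # nothing to choose from, pass the unit through unchanged
            out.append(units[i])
            continue
        chosen = tls[0]  # default translation
        for sl_lemma, offset, context_lemma, tl_lemma in rules:
            j = i + offset
            if (lemma(sl) == sl_lemma and 0 <= j < len(parsed)
                    and lemma(parsed[j][0]) == context_lemma):
                for tl in tls:
                    if lemma(tl) == tl_lemma:
                        chosen = tl
                        break
        out.append(f"^{sl}/{chosen}$")
    return out


# Illustrative input: "estació" is ambiguous between "station" and "season".
units = ["^estació<n>/station<n>/season<n>$", "^de<pr>/of<pr>$", "^any<n>/year<n>$"]
# Hypothetical rule: pick "season" when "any" occurs two words to the right.
rules = [("estació", 2, "any", "season")]
print(" ".join(select(units, rules)))
```

With no rules, every unit simply keeps its first translation, which mirrors the default/special-case split that lexical selection rules usually follow.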

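Some documentation:

  • Rule-based lexical selection module
  • Learning rules from parallel and non-parallel corpora
  • How to get started with lexical selection rules
    • Как начать работу с правилами по выбору лексики (the same page in Russian)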

Old and alternative approaches

The slr/srl + CG approach (2010-2012)

This was used in apertium-sme-nob until 2014.

This uses a special Constraint Grammar (CG) file which runs after regular morphological disambiguation, but before bidix:

morf.analysis | morf.disambiguation (cg or apertium-tagger) | cg lexical selection | bidix | structural transfer | morf. generation

The CG rules add a number to the lemma of the word if we want a non-default translation, so ^ahte<CC>$ might turn into ^ahte:1<CC>$.

The bidix has entries like

<e>            <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e slr="1"><p><l>ahte<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>

This is pre-processed by an XSLT script, so the file that is given to lt-comp actually contains

<e>            <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e r="LR"><p><l>ahte:1<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>

So if the CG rule fired, and turned ahte into ahte:1, we get "og at" instead of "at".
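
The numbered-lemma mechanism described above can be mimicked with a small Python sketch. It is a toy model under stated assumptions, not the actual XSLT script or lt-comp: bidix entries are represented as plain tuples rather than XML, lexical units carry a single tag, and the function names (preprocess, mark, translate) are hypothetical.

```python
import re

# Bidix entries as (slr, lemma, tag, translation); slr=None marks the default entry.
ENTRIES = [
    (None, "ahte", "CC", "at"),
    ("1", "ahte", "CC", "og at"),
]

# Matches a single-tag lexical unit such as ^ahte<CC>$ or ^ahte:1<CC>$.
UNIT = re.compile(r"\^([^<$]+)<([^>]+)>\$")


def preprocess(entries):
    """Mimic the preprocessing step: entries with slr=N are keyed by 'lemma:N'."""
    table = {}
    for slr, lemma, tag, translation in entries:
        key = lemma if slr is None else f"{lemma}:{slr}"
        table[(key, tag)] = translation
    return table


def mark(unit, number):
    """Mimic a CG rule firing: append ':N' to the lemma of the unit."""
    lemma, tag = UNIT.fullmatch(unit).groups()
    return f"^{lemma}:{number}<{tag}>$"


def translate(unit, table):
    """Look the (possibly numbered) lemma up in the preprocessed table."""
    lemma, tag = UNIT.fullmatch(unit).groups()
    return table[(lemma, tag)]


table = preprocess(ENTRIES)
print(translate("^ahte<CC>$", table))           # default entry
print(translate(mark("^ahte<CC>$", 1), table))  # entry selected by the rule
```

The sketch also shows why the numbering is fragile: the rule writer has to know that 1 means "og at", and nothing in the rule itself says so.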


Downsides of this approach:

  • pairs which only want lex.sel require the user to install vislcg3
  • developers need to remember when they write the rules that number 1 was "og at" and number 0 was "at", which can get confusing (especially if you decide to change the default), so there are more points of failure.
    • On the other hand, lexical selection can most often be seen as a default/special-case dichotomy. A good way of working is to introduce each rule set with a comment listing the numbers, e.g.: # leat 0 = være, 1 = ha, 2 = måtte («ha å»)

In 2014, apertium-sme-nob switched to running bidix before lexical selection (as in the lrx-proc method), but it still uses vislcg3 rules instead of lrx-proc.

Transfer rule approach (2009)

You can write transfer rules that do lexical selection. It's not very elegant, but it works to a degree. The drawbacks are that you:

  • get big transfer files
  • mix transfer and lexical selection
  • must write rules

This is the method used in most trunk pairs.

Lextor (2007)

Lextor works using statistics and requires 1) slightly pre-processed dictionaries and 2) corpora to train the module. The module is turned off in most cases as it does not provide an improvement over the baseline.