Difference between revisions of "Lexical selection"

Revision as of 08:21, 29 April 2015

Lexical selection is the task of choosing, for a source-language (SL) word that has several possible translations with the same part-of-speech (POS), the most adequate of those translations in the target language (TL). The task is related to word-sense disambiguation. The difference is that its aim is to find the most adequate translation, not the most adequate sense. Thus, it is not necessary to choose between fine-grained senses if they all result in the same final translation.
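To make the contrast with word-sense disambiguation concrete, here is a minimal sketch. The sense inventory is a hypothetical toy example (English "bank" into Spanish), not data from any Apertium pair, and `translation_choices` is an illustrative helper name. It groups fine-grained senses by their translation, which shows that three senses leave only two candidates for lexical selection to choose between.

```python
from collections import defaultdict

# Hypothetical toy inventory: fine-grained SL sense -> TL translation.
SENSE_TO_TRANSLATION = {
    "bank/financial-institution": "banco",
    "bank/building": "banco",
    "bank/river-side": "orilla",
}


def translation_choices(sense_to_translation):
    """Group senses by translation; only the keys need to be told apart."""
    groups = defaultdict(list)
    for sense, translation in sense_to_translation.items():
        groups[translation].append(sense)
    return dict(groups)


choices = translation_choices(SENSE_TO_TRANSLATION)
# Lexical selection decides between the keys of `choices`; the two
# "banco" senses never have to be distinguished from each other.
```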

This page has some links to pages about lexical selection in Apertium.

General information:

Word sense disambiguation

Current lexical selection module (2012–current)

This module was written by Francis Tyers and is deployed in the apertium-sh-mk and apertium-kaz-tat language pairs, where you can see examples of its use.

This approach uses a module that runs after bidix lookup and handles the cases where the bidix output is ambiguous:

morf.analysis | morf.disambiguation | bidix | lexical selection | structural transfer | morf. generation

In a sense, it disambiguates the bidix output (in exactly the same way that morf.disambiguation disambiguates the morf.analysis output).
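As an illustration of what "disambiguating the bidix output" means, the following sketch picks one translation from an ambiguous lexical unit. It assumes the unit is written as `^SL/TL1/TL2$` (source form followed by slash-separated translations) and ignores escaped slashes. The function names and the `choose` callback are hypothetical and do not reproduce the rule engine of the actual module.

```python
from typing import Callable, List, Tuple

# A chooser receives the source form and its candidate translations.
Chooser = Callable[[str, List[str]], str]


def split_unit(unit: str) -> Tuple[str, List[str]]:
    """Split '^sl/tl1/tl2$' into ('sl', ['tl1', 'tl2'])."""
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError(f"not a lexical unit: {unit!r}")
    source, *translations = unit[1:-1].split("/")
    return source, translations


def select_translation(unit: str, choose: Chooser) -> str:
    """Keep one translation; unambiguous units pass through unchanged."""
    source, translations = split_unit(unit)
    if len(translations) <= 1:
        return unit
    return f"^{source}/{choose(source, translations)}$"


def default_choice(source: str, translations: List[str]) -> str:
    """Fallback when no rule applies: take the first listed translation."""
    return translations[0]


ambiguous = "^ahte<CC>/at<cnjcoo><clb>/og at<cnjcoo><clb>$"
resolved = select_translation(ambiguous, default_choice)
```

A context-sensitive chooser would look at neighbouring units before deciding; the callback keeps that decision separate from the stream handling.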

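Some documentation:

Rule-based lexical selection module
Learning rules from parallel and non-parallel corpora
How to get started with lexical selection rules
- Как начать работу с правилами по выбору лексики (Russian version)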

Old and alternative approaches

The slr/srl + CG approach (2010-2012)

This was used in apertium-sme-nob until recently.

This uses a special Constraint Grammar (CG) file which runs after regular morphological disambiguation, but before bidix:

morf.analysis | morf.disambiguation (cg or apertium-tagger) | cg lexical selection | bidix | structural transfer | morf. generation

The CG rules add a number to the lemma of the word if we want a non-default translation, so ^ahte<CC>$ might turn into ^ahte:1<CC>$.

The bidix has entries like

<e>            <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e slr="1"><p><l>ahte<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>

This is pre-processed by an XSLT script, so the file that is given to lt-comp actually contains

<e>            <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e r="LR"><p><l>ahte:1<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>

So if the CG rule fired, and turned ahte into ahte:1, we get "og at" instead of "at".
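The preprocessing step can be sketched in Python as follows. This is not the XSLT script itself, only an approximation of the transformation described above: an entry carrying `slr="N"` gets `:N` appended to its left-side lemma and is marked left-to-right only with Apertium's `r="LR"` restriction attribute. The function name `expand_slr` is hypothetical, and entries that already carry a restriction are not handled.

```python
import xml.etree.ElementTree as ET


def expand_slr(entry_xml: str) -> str:
    """Rewrite one bidix <e> entry: slr="N" becomes lemma:N plus r="LR"."""
    entry = ET.fromstring(entry_xml)
    slr = entry.attrib.pop("slr", None)
    if slr is None:
        # No slr attribute: the default translation passes through as is.
        return ET.tostring(entry, encoding="unicode")

    left = entry.find("./p/l")
    if left is None:
        raise ValueError("entry has no <p><l> side")

    # The lemma is the text before the first <s> tag. In a multiword
    # lemma it continues in the tail of the last <b/> child.
    last_before_tags = None
    for child in left:
        if child.tag == "s":
            break
        last_before_tags = child

    suffix = ":" + slr
    if last_before_tags is None:
        left.text = (left.text or "") + suffix
    else:
        last_before_tags.tail = (last_before_tags.tail or "") + suffix

    entry.set("r", "LR")
    return ET.tostring(entry, encoding="unicode")


source_entry = (
    '<e slr="1"><p><l>ahte<s n="CC"/></l>'
    '<r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>'
)
compiled_entry = expand_slr(source_entry)
```

The serialised output may differ from the hand-written entry in whitespace, so compare parsed elements rather than raw strings.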

Downsides with this approach:

pairs which only want lexical selection still require the user to install vislcg3
developers need to remember, when they write the rules, that number 1 was "og at" and number 0 was "at", which can get confusing (especially if you decide to change the default) and adds more points of failure.
- On the other hand, lexical selection can most often be seen as a default / special-case dichotomy. A good way to work is to introduce each rule set with a comment listing the numbers, e.g.: # leat 0 = være, 1 = ha, 2 = måtte («ha å»)

In 2014, apertium-sme-nob switched to running bidix before lexical selection (as in the lrx-proc method), but it still uses vislcg3 rules instead of lrx-proc.

Transfer rule approach (2009)

You can write transfer rules that do lexical selection. It's not very elegant, but it works to a degree. The drawbacks are that you:

get big transfer files
mix transfer and lexical selection
must write rules

This is the method used in most trunk pairs.

Lextor (2007)

Lextor works using statistics and requires 1) slightly pre-processed dictionaries and 2) corpora to train the module. The module is turned off in most cases as it does not provide an improvement over the baseline.

@@ Line 23: / Line 23: @@
 * [[Rule-based lexical selection module]]
 * [[Learning rules from parallel and non-parallel corpora]]
-** old page: [[Generating lexical-selection rules from a parallel corpus]]
 * [[How to get started with lexical selection rules]]
 ** [[Как начать работу с правилами по выбору лексики]]
-== The slr/srl approach (2010-2012)  ==
+== Old and alternative approaches ==
-Used in [[apertium-sme-nob]].
+=== The slr/srl + CG approach (2010-2012)  ===
+This was used in [[apertium-sme-nob]] until lately.
 This uses a special [[Constraint Grammar]] (CG) file which runs ''after'' regular morphological disambiguation, but ''before'' bidix:
@@ Line 56: / Line 58: @@
 ** On the other hand side, lexical selection can most often be seen as a / default - special case / dichotomy. A good mode of work is to introduce each rule set with the number array, e.g.: # leat 0 = være, 1 = ha, 2 = måtte («ha å»)
+apertium-sme-nob in 2014 switched to bidix before lex.sel (like the lrx-proc method), but still uses vislcg3 rules instead of lrx-proc.
-== Transfer rule approach (2009)  ==
+=== Transfer rule approach (2009)  ===
 You can make transfer rules that does lexical selection.
@@ Line 66: / Line 70: @@
 * must write rules
-This is the method used in most pairs.
+This is the method used in most trunk pairs.
-== Deprecated (2007) ==
+=== Lextor (2007) ===
-* [[Lextor]] – works using statistics and requires 1) slightly pre-processed dictionaries and 2) corpora to train the module. '''The module is turned off in most cases as it does not provide an improvement over the baseline.'''
+[[Lextor]] works using statistics and requires 1) slightly pre-processed dictionaries and 2) corpora to train the module. '''The module is turned off in most cases as it does not provide an improvement over the baseline.'''
 == See also ==
