Difference between revisions of "Lexical selection"

From Apertium
(remove links to deprecated stuff)
 
(27 intermediate revisions by 5 users not shown)
Line 1: Line 1:
[[Sélection lexicale|En français]]
Current lexical selection module:


{{TOCD}}
* [[Rule-based lexical selection module]]
'''Lexical selection''' is the task of choosing, for a source-language (SL) word that has several possible translations with the same part-of-speech (POS), the most adequate translation in the target language (TL). The task is related to [[word-sense disambiguation]]. The difference is that its aim is to find the most adequate translation, not the most adequate sense. Thus, it is not necessary to choose between a series of fine-grained senses if all these senses result in the same final translation.


This page has some links to pages about lexical selection in Apertium.
Deprecated:

General information:


* [[Word sense disambiguation]]

* [[Lextor]]
== Current lexical selection module (2012–current) ==
* [[Lexical selection in target language]]

* [[Limited rule-based lexical selection]]
The [[Constraint-based lexical selection module]] / apertium-lex-tools was written by [[User:Francis Tyers|Francis Tyers]] and is deployed in the apertium-sh-mk and apertium-kaz-tat language pairs, which can serve as examples.

The module runs ''after'' the bilingual dictionary (bidix) lookup and applies where the bidix output is ambiguous:
<pre>
morf.analysis | morf.disambiguation | bidix | lexical selection | structural transfer | morf. generation
</pre>
In a sense, it disambiguates the bidix output (in exactly the same way that morf.disambiguation disambiguates the morf.analysis output).

Some documentation:
* [[Rule-based lexical selection module]]
* [[Learning rules from parallel and non-parallel corpora]]
* [[How to get started with lexical selection rules]]
** [[Как начать работу с правилами по выбору лексики]]

== Old and alternative approaches ==

=== The slr/srl + CG approach (2010-2012) ===
This was used in [[apertium-sme-nob]] until recently.

This uses a special [[Constraint Grammar]] (CG) file which runs ''after'' regular morphological disambiguation, but ''before'' bidix:
<pre>
morf.analysis | morf.disambiguation (cg or apertium-tagger) | cg lexical selection | bidix | structural transfer | morf. generation
</pre>

The CG rules add a number to the lemma of the word if we want a non-default translation, so <code>^ahte&lt;CC&gt;$</code> might turn into <code>^ahte:1&lt;CC&gt;$</code>.

The bidix has entries like
<pre>
<e> <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e slr="1"><p><l>ahte<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
</pre>
This is pre-processed by an XSLT script, so the file that is given to lt-comp actually contains
<pre>
<e> <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e R="lr"><p><l>ahte:1<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
</pre>

So if the CG rule fires and turns '''ahte''' into '''ahte:1''', we get "og at" instead of "at".


Downsides of this approach:
* pairs that need CG only for lexical selection still require the user to install vislcg3
* when writing rules, developers need to remember that number 1 was "og at" and number 0 was "at", which can get confusing (especially if you decide to change the default) and adds points of failure.
** On the other hand, lexical selection can most often be seen as a dichotomy between a default and special cases. A good working practice is to introduce each rule set with a comment listing what the numbers mean, e.g.: # leat 0 = være, 1 = ha, 2 = måtte («ha å»)

In 2014, apertium-sme-nob switched to running bidix before lexical selection (as in the lrx-proc method), but it still uses vislcg3 rules instead of lrx-proc.

=== Transfer rule approach (2009) ===

You can make transfer rules that do lexical selection.
It is not very elegant, but it works to a degree.
The drawbacks are that you:

* get big transfer files
* mix transfer and lexical selection
* must write rules

This is the method used in most trunk pairs.

=== Lextor (2007) ===

[[Lextor]] works using statistics and requires 1) slightly pre-processed dictionaries and 2) corpora to train the module. '''The module is turned off in most cases as it does not provide an improvement over the baseline.'''






Latest revision as of 08:36, 29 April 2015

En français

Lexical selection is the task of choosing, for a source-language (SL) word that has several possible translations with the same part-of-speech (POS), the most adequate translation in the target language (TL). The task is related to word-sense disambiguation. The difference is that its aim is to find the most adequate translation, not the most adequate sense. Thus, it is not necessary to choose between a series of fine-grained senses if all these senses result in the same final translation.

This page has some links to pages about lexical selection in Apertium.

General information:

  • Word sense disambiguation

Current lexical selection module (2012–current)

The Constraint-based lexical selection module / apertium-lex-tools was written by Francis Tyers and is deployed in the apertium-sh-mk and apertium-kaz-tat language pairs, which can serve as examples.

The module runs after the bilingual dictionary (bidix) lookup and applies where the bidix output is ambiguous:

morf.analysis | morf.disambiguation | bidix | lexical selection | structural transfer | morf. generation

In a sense, it disambiguates the bidix output (in exactly the same way that morf.disambiguation disambiguates the morf.analysis output).
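
As a rough illustration of what disambiguating the bidix output means, the Python sketch below keeps one target-language reading per lexical unit. It assumes that ambiguous bidix output looks like ^SL/TL1/TL2$ in the Apertium stream; the function name select, the preferred mapping and the example words are hypothetical. The real module uses context-sensitive rules rather than a lookup table, and this sketch ignores escaped characters.

```python
import re

# One lexical unit in the stream: everything between ^ and $.
LU_RE = re.compile(r"\^(.*?)\$")


def select(stream, preferred=None):
    """Keep a single target-language (TL) reading per lexical unit.

    `preferred` maps an SL reading to the TL reading to keep; it is a
    hypothetical stand-in for real lexical selection rules. Without a
    usable entry, the first TL reading is kept as the default.
    """
    preferred = preferred or {}

    def pick(match):
        sl, *tl = match.group(1).split("/")
        if len(tl) <= 1:
            return match.group(0)  # nothing to disambiguate
        choice = preferred.get(sl, tl[0])
        if choice not in tl:
            choice = tl[0]  # fall back to the default translation
        return f"^{sl}/{choice}$"

    return LU_RE.sub(pick, stream)


# Illustrative input only; not taken from a real language pair.
ambiguous = "^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^de<pr>/of<pr>$"
print(select(ambiguous, {"estació<n><f><sg>": "station<n><sg>"}))
```

Units that already have a single translation pass through unchanged, which mirrors the idea that the module only acts where the bidix output is ambiguous.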

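Some documentation:

  • Rule-based lexical selection module
  • Learning rules from parallel and non-parallel corpora
  • How to get started with lexical selection rules
    • Как начать работу с правилами по выбору лексики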

Old and alternative approaches

The slr/srl + CG approach (2010-2012)

This was used in apertium-sme-nob until recently.

This uses a special Constraint Grammar (CG) file which runs after regular morphological disambiguation, but before bidix:

morf.analysis | morf.disambiguation (cg or apertium-tagger) | cg lexical selection | bidix | structural transfer | morf. generation

The CG rules add a number to the lemma of the word if we want a non-default translation, so ^ahte<CC>$ might turn into ^ahte:1<CC>$.

The bidix has entries like

<e>            <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e slr="1"><p><l>ahte<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>

This is pre-processed by an XSLT script, so the file that is given to lt-comp actually contains

<e>            <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e R="lr"><p><l>ahte:1<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
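
The effect of that pre-processing step on a single entry can be sketched in Python. This is not the XSLT script itself, only an illustration of the transformation shown in the two listings; the function name expand_slr is hypothetical, and the sketch assumes the entry has no R attribute of its own.

```python
import xml.etree.ElementTree as ET


def expand_slr(entry_xml):
    """Rewrite <e slr="N"> so that the left-side lemma becomes lemma:N.

    Entries without an slr attribute are returned unchanged.
    """
    entry = ET.fromstring(entry_xml)
    number = entry.attrib.pop("slr", None)
    if number is None:
        return entry_xml

    left = entry.find("./p/l")
    suffix = f":{number}"
    children = list(left)
    # The lemma ends just before the first <s> tag (or at the end of <l>).
    idx = next((i for i, c in enumerate(children) if c.tag == "s"), len(children))
    if idx == 0:
        left.text = (left.text or "") + suffix
    else:
        # Lemma contains elements such as <b/>; its last part is their tail.
        prev = children[idx - 1]
        prev.tail = (prev.tail or "") + suffix

    # The numbered lemma should only be used in the left-to-right direction.
    entry.set("R", "lr")
    return ET.tostring(entry, encoding="unicode")


print(expand_slr(
    '<e slr="1"><p><l>ahte<s n="CC"/></l>'
    '<r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>'
))
```

ElementTree may serialise empty tags with slightly different whitespace than the hand-written entries, but the structure is the same.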

So if the CG rule fires and turns ahte into ahte:1, we get "og at" instead of "at".
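
The whole round trip can be mimicked with a toy lookup table. The sketch below is only an illustration: BIDIX is a hand-written stand-in for the compiled dictionary containing the two entries above, and mark and translate are hypothetical helpers, not part of CG or lttoolbox.

```python
import re

# Toy stand-in for the compiled bidix, built from the two entries above.
BIDIX = {
    "ahte<CC>": "at<cnjcoo><clb>",
    "ahte:1<CC>": "og at<cnjcoo><clb>",
}


def mark(lu, number):
    """Append :number to the lemma of a lexical unit such as ^ahte<CC>$."""
    return re.sub(r"^\^([^<$]+)", lambda m: f"^{m.group(1)}:{number}", lu)


def translate(lu):
    """Look up the reading between ^ and $ in the toy dictionary."""
    return "^" + BIDIX[lu[1:-1]] + "$"


default = "^ahte<CC>$"
print(translate(default))           # default translation
print(translate(mark(default, 1)))  # translation selected by the "CG rule"
```

The unmarked lemma yields the default translation, while the marked one reaches the entry that was restricted with slr="1".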


Downsides of this approach:

  • pairs that need CG only for lexical selection still require the user to install vislcg3
  • when writing rules, developers need to remember that number 1 was "og at" and number 0 was "at", which can get confusing (especially if you decide to change the default) and adds points of failure.
    • On the other hand, lexical selection can most often be seen as a dichotomy between a default and special cases. A good working practice is to introduce each rule set with a comment listing what the numbers mean, e.g.: # leat 0 = være, 1 = ha, 2 = måtte («ha å»)

In 2014, apertium-sme-nob switched to running bidix before lexical selection (as in the lrx-proc method), but it still uses vislcg3 rules instead of lrx-proc.

Transfer rule approach (2009)

You can make transfer rules that do lexical selection. It is not very elegant, but it works to a degree. The drawbacks are that you:

  • get big transfer files
  • mix transfer and lexical selection
  • must write rules

This is the method used in most trunk pairs.

Lextor (2007)

Lextor works using statistics and requires 1) slightly pre-processed dictionaries and 2) corpora to train the module. The module is turned off in most cases as it does not provide an improvement over the baseline.