Difference between revisions of "Lexical selection"
(remove links to deprecated stuff) |
|||
(18 intermediate revisions by 4 users not shown) | |||
Line 1: | Line 1: | ||
[[Sélection lexicale|En français]] |
|||
{{TOCD}} |
'''Lexical selection''' is the task of choosing, for a source-language (SL) word that has several possible translations with the same part of speech (POS), the most adequate of those translations in the target language (TL). The task is related to [[word-sense disambiguation]]. The difference is that it aims to find the most adequate translation, not the most adequate sense. Thus, it is not necessary to choose between a series of fine-grained senses if all of them result in the same final translation.
Line 8: | Line 10: | ||
* [[Word sense disambiguation]] |
== Current lexical selection module ( |
== Current lexical selection module (2012–current) == |
||
The [[Constraint-based lexical selection module]] (apertium-lex-tools) was written by [[User:Francis Tyers|Francis Tyers]] and is deployed in the apertium-sh-mk and apertium-kaz-tat language pairs, either of which can serve as an example.
|||
This uses a module which runs ''after'' bidix, where the bidix output is ambiguous: |
|||
<pre>
morf.analysis | morf.disambiguation | bidix | lexical selection | structural transfer | morf. generation
</pre>
|||
In a sense, it disambiguates the bidix output (in exactly the same way that morf.disambiguation disambiguates the morf.analysis output). |
|||
Some documentation: |
|||
* [[Rule-based lexical selection module]] |
* [[ |
* [[Learning rules from parallel and non-parallel corpora]] |
||
* [[How to get started with lexical selection rules]] |
** [[Как начать работу с правилами по выбору лексики]] |
== Old and alternative approaches == |
|||
=== The slr/srl + CG approach (2010-2012) ===
Could someone from sme-nob please explain? |
|||
This was used in [[apertium-sme-nob]] until recently.
|||
This uses a special [[Constraint Grammar]] (CG) file which runs ''after'' regular morphological disambiguation, but ''before'' bidix: |
|||
<pre>
morf.analysis | morf.disambiguation (cg or apertium-tagger) | cg lexical selection | bidix | structural transfer | morf. generation
</pre>
|||
The CG rules add a number to the lemma of the word if we want a non-default translation, so <code>^ahte<CC>$</code> might turn into <code>^ahte:1<CC>$</code>. |
|||
The bidix has entries like |
|||
<pre>
<e>       <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e slr="1"><p><l>ahte<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
</pre>
|||
This is pre-processed by an XSLT script, so the file that is given to lt-comp actually contains |
|||
<pre>
<e>       <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e r="LR"><p><l>ahte:1<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
</pre>
|||
So if the CG rule fired, and turned '''ahte''' into '''ahte:1''', we get "og at" instead of "at". |
|||
Downsides with this approach: |
|||
* pairs which only want lex.sel require the user to install vislcg3 |
|||
* developers need to remember, when writing the rules, that number 1 was "og at" and number 0 was "at", which can get confusing (especially if you decide to change the default) and adds points of failure.
** On the other hand, lexical selection can most often be seen as a default versus special-case dichotomy. A good way of working is to introduce each rule set with a comment listing the numbers, e.g.: # leat 0 = være, 1 = ha, 2 = måtte («ha å»)
|||
apertium-sme-nob in 2014 switched to bidix before lex.sel (like the lrx-proc method), but still uses vislcg3 rules instead of lrx-proc. |
|||
=== Transfer rule approach (2009) ===
You can make transfer rules that do lexical selection.
Line 32: | Line 69: | ||
* must write rules |
This is the method used in most pairs. |
This is the method used in most trunk pairs. |
||
=== Lextor (2007) ===
[[Lextor]] works using statistics and requires 1) slightly pre-processed dictionaries and 2) corpora to train the module. '''The module is turned off in most cases as it does not provide an improvement over the baseline.''' |
|||
⚫ | |||
* [[Lextor]] |
|||
* [[Lexical selection in target language]] |
|||
* [[Limited rule-based lexical selection]] |
|||
* [[Generating lexical-selection rules]] |
|||
Latest revision as of 08:36, 29 April 2015
Lexical selection is the task of choosing, for a source-language (SL) word that has several possible translations with the same part of speech (POS), the most adequate of those translations in the target language (TL). The task is related to word-sense disambiguation. The difference is that it aims to find the most adequate translation, not the most adequate sense. Thus, it is not necessary to choose between a series of fine-grained senses if all of them result in the same final translation.
This page has some links to pages about lexical selection in Apertium.
General information:
Current lexical selection module (2012–current)
The Constraint-based lexical selection module (apertium-lex-tools) was written by Francis Tyers and is deployed in the apertium-sh-mk and apertium-kaz-tat language pairs, either of which can serve as an example.
This uses a module which runs after bidix, where the bidix output is ambiguous:
morf.analysis | morf.disambiguation | bidix | lexical selection | structural transfer | morf. generation
In a sense, it disambiguates the bidix output (in exactly the same way that morf.disambiguation disambiguates the morf.analysis output).
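As a rough illustration of what this stage does, the sketch below picks one translation per word from an ambiguous list, falling back to the first (default) candidate unless a context rule matches. It is a toy model in Python, not the lrx-proc implementation: the `Unit` and `Rule` types, the `select` function, and the Spanish-English example data are all invented for this illustration.

```python
from typing import List, NamedTuple


class Unit(NamedTuple):
    source: str            # source-language lemma
    candidates: List[str]  # translations offered by bidix, default first


class Rule(NamedTuple):
    source: str            # lemma the rule applies to
    following: List[str]   # source lemmas that must come right after it
    choice: str            # translation to select when the context matches


def select(units: List[Unit], rules: List[Rule]) -> List[str]:
    """Return one translation per unit (hypothetical helper, not an Apertium API)."""
    output = []
    for i, unit in enumerate(units):
        chosen = unit.candidates[0]  # default translation
        for rule in rules:
            context = [u.source for u in units[i + 1 : i + 1 + len(rule.following)]]
            if (
                rule.source == unit.source
                and context == rule.following
                and rule.choice in unit.candidates
            ):
                chosen = rule.choice
                break
        output.append(chosen)
    return output


# Invented example: "estación" is ambiguous between "station" and "season".
units = [
    Unit("estación", ["station", "season"]),
    Unit("de", ["of"]),
    Unit("el", ["the"]),
    Unit("año", ["year"]),
]
rules = [Rule("estación", ["de", "el", "año"], "season")]
print(select(units, rules))
```

With this rule, "estación" followed by "de el año" is translated as "season"; without that context it keeps the default "station".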
Some documentation:
- Rule-based lexical selection module
- Learning rules from parallel and non-parallel corpora
- How to get started with lexical selection rules
Old and alternative approaches
The slr/srl + CG approach (2010-2012)
This was used in apertium-sme-nob until recently.
This uses a special Constraint Grammar (CG) file which runs after regular morphological disambiguation, but before bidix:
morf.analysis | morf.disambiguation (cg or apertium-tagger) | cg lexical selection | bidix | structural transfer | morf. generation
The CG rules add a number to the lemma of the word if we want a non-default translation, so ^ahte<CC>$ might turn into ^ahte:1<CC>$.
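The effect of such a rule on the stream can be mimicked in a few lines of Python. This is only a sketch of the lemma rewrite, not CG itself: the function name `mark_sense` and the simplified regular expression for lexical units are assumptions, and in the real pipeline the decision about when to mark a word is made by CG context rules.

```python
import re

# Simplified pattern for one unambiguous lexical unit: ^lemma<tag><tag>...$
LEXICAL_UNIT = re.compile(r"\^([^<$/]+)((?:<[^>]+>)+)\$")


def mark_sense(stream: str, lemma: str, sense: int) -> str:
    """Append ':sense' to every occurrence of `lemma`; sense 0 is the default and is left unmarked."""

    def rewrite(match: re.Match) -> str:
        if match.group(1) == lemma and sense != 0:
            return f"^{lemma}:{sense}{match.group(2)}$"
        return match.group(0)

    return LEXICAL_UNIT.sub(rewrite, stream)


print(mark_sense("^ahte<CC>$", "ahte", 1))
```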
The bidix has entries like
<e>       <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e slr="1"><p><l>ahte<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
This is pre-processed by an XSLT script, so the file that is given to lt-comp actually contains
<e>       <p><l>ahte<s n="CC"/></l><r>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
<e r="LR"><p><l>ahte:1<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>
So if the CG rule fired, and turned ahte into ahte:1, we get "og at" instead of "at".
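The pre-processing step can be approximated as follows. The actual pair uses an XSLT script whose contents are not shown here, so this Python sketch only reproduces the transformation visible in the two listings above. The function name `expand_slr` is hypothetical, the restriction is written as `r="LR"` (the usual bidix spelling), and single-token left-side lemmas are assumed.

```python
import xml.etree.ElementTree as ET


def expand_slr(entry_xml: str) -> str:
    """Turn <e slr="N"> into a left-to-right entry whose left lemma is 'lemma:N'.

    Entries without an slr attribute are returned unchanged.
    """
    entry = ET.fromstring(entry_xml)
    sense = entry.attrib.pop("slr", None)
    if sense is None:
        return entry_xml
    left = entry.find("./p/l")
    if left is None:
        raise ValueError("entry has no <p><l> side")
    # The lemma is the text before the first <s/> tag (assumes no <b/> in the lemma).
    left.text = f"{left.text or ''}:{sense}"
    entry.set("r", "LR")
    return ET.tostring(entry, encoding="unicode")


marked = '<e slr="1"><p><l>ahte<s n="CC"/></l><r>og<b/>at<s n="cnjcoo"/><s n="clb"/></r></p></e>'
print(expand_slr(marked))
```

Keeping the sense number in the bidix as an attribute and generating the `lemma:N` form mechanically means the dictionary stays readable, while the compiled dictionary matches whatever the CG rules emit.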
Downsides with this approach:
- pairs which only want lex.sel require the user to install vislcg3
- developers need to remember when they write the rules that number 1 was "og at" and number 0 was "at", which can get confusing (especially if you decide to change the default) – more points of failure.
  - On the other hand, lexical selection can most often be seen as a default versus special-case dichotomy. A good way of working is to introduce each rule set with a comment listing the numbers, e.g.: # leat 0 = være, 1 = ha, 2 = måtte («ha å»)
apertium-sme-nob in 2014 switched to bidix before lex.sel (like the lrx-proc method), but still uses vislcg3 rules instead of lrx-proc.
Transfer rule approach (2009)
You can make transfer rules that do lexical selection. It is not very elegant, but it works to a degree. The drawbacks are that you:
- get big transfer files
- mix transfer and lexical selection
- must write rules
This is the method used in most trunk pairs.
Lextor (2007)
Lextor works using statistics and requires 1) slightly pre-processed dictionaries and 2) corpora to train the module. The module is turned off in most cases as it does not provide an improvement over the baseline.