Difference between revisions of "Constraint-based lexical selection module"

From Apertium
 
(94 intermediate revisions by 9 users not shown)
Line 1: Line 1:
{{TOCD}}
{{TOCD}}


'''apertium-lex-tools''' provides a module for compiling lexical selection rules and processing them in the pipeline. Rules can be manually written, or learnt from monolingual or parallel corpora.
==Lexical transfer==


==Installing==
Prerequisites and compilation steps are the same as for lttoolbox and apertium; on Debian/Ubuntu you also need zlib1g-dev.


<span style="color: #f00;">See [[Installation]]: for most operating systems you can now get pre-built packages of apertium-lex-tools (as well as other core tools) through your regular package manager.</span>
This is the output of <code>lt-proc -b</code> on an ambiguous bilingual dictionary.


==Lexical transfer in the pipeline==

lrx-proc runs between bidix lookup and the first stage of transfer, e.g.
<pre>
… apertium-pretransfer | lt-proc -b kaz-tat.autobil.bin | lrx-proc kaz-tat.lrx.bin \
| apertium-transfer -b apertium-kaz-tat.kaz-tat.t1x kaz-tat.t1x.bin | …
</pre>

This is the output of <code>lt-proc -b</code> on an ambiguous bilingual dictionary:
<pre>
<pre>
[74306] ^El<det><def><f><sg>/The<det><def><f><sg>$
[74306] ^El<det><def><f><sg>/The<det><def><f><sg>$
^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^més<preadv>/more<preadv>$ ^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$
^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^més<preadv>/more<preadv>$
^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ ^el<det><def><f><sg>/the<det><def><f><sg>$
^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$
^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$
^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ ^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$
^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ ^el<det><def><m><sg>/the<det><def><m><sg>$
^el<det><def><f><sg>/the<det><def><f><sg>$
^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$
^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$
^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$
^el<det><def><m><sg>/the<det><def><m><sg>$
^estiu<n><m><sg>/summer<n><sg>$^.<sent>/.<sent>$
^estiu<n><m><sg>/summer<n><sg>$^.<sent>/.<sent>$
</pre>
</pre>
Line 17: Line 32:
I.e.
I.e.
<pre>
<pre>
El estació més plujós ser el tardor, i el més sec el estiu
L'estació més plujosa és la tardor, i la més seca l'estiu
</pre>
</pre>


Line 25: Line 40:
</pre>
</pre>


Apertium/lttoolbox 3.3 and onwards support the -b option to lt-proc / apertium-transfer.
The module requires [[VM for transfer]], or another apertium transfer implementation without lexical transfer in order to work.


==Rule format==
==Usage==


Create a simple rule file:
A rule is made up of:


<pre>
* An action (select, remove)
<rules>
* A "centre" (the source language token that will be treated)
<rule>
* A target language pattern on which the action takes place
<match lemma="criminal" tags="adj"/>
* A source language context
<match lemma="court" tags="n.*"><select lemma="juzgado" tags="n.*"/></match>
</rule>
</rules>
</pre>


Then compile it:
===Text===


<pre>
<pre>
$ lrx-comp rules.xml rules.fst
s ("estació" n) ("season" n) (1 "plujós")
1: 32@32
s ("estació" n) ("season" n) (2 "plujós")
s ("estació" n) ("season" n) (1 "de") (3 "any")
s ("estació" n) ("station" n) (1 "de") (3 "Línia")
s ("prova" n) ("evidence" n) (1 "arqueològic")
s ("prova" n) ("test" n) (1 "estadístic")
s ("prova" n) ("event" n) (-3 "guanyador") (-2 "de")
s ("prova" n) ("testing" n) (-2 "tècnica") (-1 "de")
s ("joc" n) ("game" n) (1 "olímpic")
s ("joc" n) ("set" n) (1 "de") (2 "caràcter")
r ("pista" n) ("hint" n) (1 "més") (2 "llarg")
r ("pista" n) ("clue" n) (1 "més") (2 "llarg")
r ("motiu" n) ("motif" n) (-1 "aquest") (-2 "per")
s ("carn" n) ("flesh" n) (1 "i") (2 "os")
s ("sobre" pr) ("over" n) (-1 "victòria")
s ("dona" n) ("wife" n) (-1 "*" det pos)
s ("dona" n) ("wife" n) (-1 "el") (1 "de")
s ("dona" n) ("woman" n) (1 "de") (2 "*" det pos) (3 "somni")
r ("patró" n) ("pattern" n) (1 "*" np ant)
</pre>
</pre>


The input is the output of <code>lt-proc -b</code>:
===Usage===


<pre>
<pre>
$ echo "^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$
$ cat /tmp/test | python apertium-lex-rules.py rules.txt 2>/dev/null
^El<det><def><f><sg>/The<det><def><f><sg>$
^criminal<adj>/criminal<adj><mf>/delictivo<adj>$
^estació<n><f><sg>/season<n><sg>$ ^més<preadv>/more<preadv>$ ^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$
^court<n><sg>/corte<n><f><sg>/cancha<n><f><sg>/juzgado<n><m><sg>/tribunal<n><m><sg>$^.<sent>/.<sent>$" | ./lrx-proc -t rules.fst
1:SELECT<1>:court<n><sg>:<select>juzgado<n><ANY_TAG>
^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ ^el<det><def><f><sg>/the<det><def><f><sg>$
^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ ^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$
^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$
^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ ^el<det><def><m><sg>/the<det><def><m><sg>$
^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ ^court<n><sg>/juzgado<n><m><sg>$^.<sent>/.<sent>$
^estiu<n><m><sg>/summer<n><sg>$ ^.<sent>/.<sent>$
</pre>
</pre>


==Rule format==
;With rules

A rule is made up of an ordered list of:

* Matches
* Operations (select, remove)


<pre>
<pre>
<rule>
$ cat /tmp/test | python apertium-lex-rules.py rules.txt | apertium-vm -c ca-en.t1x.vmb | apertium-vm -c ca-en.t2x.vmb |\
<match lemma="el"/>
apertium-vm -c ca-en.t3x.vmb | lt-proc -g ca-en.autogen.bin
<match lemma="dona" tags="n.*">
<select lemma="wife"/>
</match>
<match lemma="de"/>
</rule>


<rule>
The
<match lemma="estació" tags="n.*">
rainiest season
<select lemma="season"/>
is the
</match>
autumn, and the
<match lemma="més"/>
driest the
<match lemma="plujós"/>
summer.
</rule>

<rule>
<match lemma="guanyador"/>
<match lemma="de"/>
<match/>
<match lemma="prova" tags="n.*">
<select lemma="event"/>
</match>
</rule>
</pre>
</pre>


===Weights===
;With bilingual dictionary defaults

Rules compete with each other, so each one is assigned a weight. When a word has several possible translations in the dictionary, all rules are evaluated. For each possible translation, the weights of the rules whose context matches the word's use in the sentence are added up, and the translation with the highest total is chosen. For instance, consider these two rules:


<pre>
<pre>
<rule weight="0.8">
$ cat /tmp/test | apertium-lex-defaults ca-en.autoldx.bin | apertium-vm -c ca-en.t1x.vmb | apertium-vm -c ca-en.t2x.vmb |\
<match lemma="ferotge" tags="adj.*"><select lemma="farouche"/></match>
apertium-vm -c ca-en.t3x.vmb | lt-proc -g ca-en.autogen.bin
</rule>
<rule weight="1.0">
<or>
<match lemma="animal" tags="n.*"/>
<match lemma="animau" tags="n.*"/>
</or>
<match lemma="ferotge" tags="adj.*"><select lemma="féroce"/></match>
</rule>
</pre>


If we have "un animal ferotge", the translation "farouche" will get 0.8 points, and "féroce" will get 1.0. The latter will be chosen.
The

rainiest station
===Operator OR===
is the

autumn, and the
The boolean operator OR can be used, as shown in the previous example:
driest the

summer.
<pre>
<rule weight="1.0">
<or>
<match lemma="animal" tags="n.*"/>
<match lemma="animau" tags="n.*"/>
</or>
<match lemma="ferotge" tags="adj.*"><select lemma="féroce"/></match>
</rule>
</pre>
</pre>


===XML===
===Sequences===


Often, the same words are used in several ORs. For readability and maintainability, they can be defined once in a special sequence block, for instance:


<pre>
==Rule application process==
<def-seqs>
<def-seq n="jorns"><or>
<match lemma="diluns" tags="n.*"/>
<match lemma="dimars" tags="n.*"/>
<match lemma="dimècres" tags="n.*"/>
<match lemma="dimèrcs" tags="n.*"/>
<match lemma="dijòus" tags="n.*"/>
<match lemma="dijaus" tags="n.*"/>
<match lemma="divendres" tags="n.*"/>
<match lemma="divés" tags="n.*"/>
<match lemma="dissabte" tags="n.*"/>
<match lemma="dimenge" tags="n.*"/>
</or></def-seq>


<def-seq n="meses"><or>
<match lemma="genèr" tags="n.*"/>
<match lemma="genièr" tags="n.*"/>
<match lemma="janvièr" tags="n.*"/>
<match lemma="gèr" tags="n.*"/>
<match lemma="febrièr" tags="n.*"/>
<match lemma="heurèr" tags="n.*"/>
<match lemma="hrevèr" tags="n.*"/>
<match lemma="herevèr" tags="n.*"/>
<match lemma="herbèr" tags="n.*"/>
<match lemma="hiurèr" tags="n.*"/>
<match lemma="març" tags="n.*"/>
<match lemma="abrial" tags="n.*"/>
<match lemma="abril" tags="n.*"/>
<match lemma="abriu" tags="n.*"/>
<match lemma="abrieu" tags="n.*"/>
<match lemma="mai" tags="n.*"/>
<match lemma="junh" tags="n.*"/>
<match lemma="julh" tags="n.*"/>
<match lemma="juin" tags="n.*"/>
<match lemma="gulh" tags="n.*"/>
<match lemma="julhet" tags="n.*"/>
<match lemma="gulhet" tags="n.*"/>
<match lemma="junhsèga" tags="n.*"/>
<match lemma="agost" tags="n.*"/>
<match lemma="aost" tags="n.*"/>
<match lemma="setembre" tags="n.*"/>
<match lemma="seteme" tags="n.*"/>
<match lemma="octobre" tags="n.*"/>
<match lemma="octòbre" tags="n.*"/>
<match lemma="novembre" tags="n.*"/>
<match lemma="noveme" tags="n.*"/>
<match lemma="decembre" tags="n.*"/>
<match lemma="deceme" tags="n.*"/>
</or></def-seq>
</def-seqs>
</pre>


They have to be referenced in the rules as follows:


<pre>
;Optimal application
<rule weight="1.0">
<or>
<seq n="jorns"/>
<seq n="meses"/>
<match lemma="prima" tags="n.*"/>
<match lemma="estiu" tags="n.*"/>
<match lemma="auton" tags="n.*"/>
<match lemma="ivèrn" tags="n.*"/>
</or>
<match lemma="passat" tags="adj.*"><select lemma="dernier"/></match>
</rule>
</pre>


Note that if you add a <code>&lt;def-seqs&gt;</code> section and you had only a <code>&lt;rules&gt;</code> section already, then you'll need to put both inside of an <code>&lt;lrx&gt;</code> section:
We are interested in the longest match, but not strictly left to right. We therefore build an automaton from the rule contexts (each rule is one transducer, and the transducers are then composed) and read the sentence through it, with each state corresponding to one lexical unit (LU). The automaton needs to be non-deterministic, so we keep a log of the live paths/states together with their "weight" (the number of transitions made so far). When we reach the end of the sentence, the longest path for each ambiguous word wins.
<pre>
<lrx>
<def-seqs>
...
</def-seqs>
<rules>
...
</rules>
</lrx>
</pre>


===Operator REPEAT===
==Writing and generating rules==


Imagine the translation of the Occitan word "còrn" that may be "corner" or "horn" (of an animal). We could have as a first version:
===Writing===


<pre>
A good way to start writing lexical selection rules is to take a corpus and search for the problem word. You can then look at how the word should be translated and at the contexts it appears in.
<rule weight="0.8">
<match lemma="còrn" tags="n.*"><select lemma="corner"/></match>
</rule>
<rule weight="1.0" >
<match lemma="còrn" tags="n.*"><select lemma="horn"/></match>
<match lemma="de" tags="pr"/>
<or>
<seq n="animals"/>
</or>
</rule>
</pre>


But this will not match if we have an adjective that follows "còrn" (usually adjectives follow the nouns in Occitan). We could add a rule like:


<pre>
<rule weight="1.0" >
<match lemma="còrn" tags="n.*"><select lemma="horn"/></match>
<match tags="adj.*"/>
<match lemma="de" tags="pr"/>
<or>
<seq n="animals"/>
</or>
</rule>
</pre>


Using the REPEAT operator, we can express this more compactly by extending rule 2:
===Generating===


<pre>
<rule weight="1.0" >
<match lemma="còrn" tags="n.*"><select lemma="horn"/></match>
<repeat from="0" upto="2">
<match tags="adj.*"/>
</repeat>
<match lemma="de" tags="pr"/>
<or>
<seq n="animals"/>
</or>
</rule>
</pre>


Note that now we are even accepting two adjectives after "còrn" instead of only one (without adding a fourth rule for dealing with two adjectives).


And, if we think that horn can be not only big, but also "very big", we can improve the rule this way:
===Jacob's critique of the rule format===


<pre>
By 'skip' you actually mean 'match'. And lemma selection should be called <select> :-)
<rule weight="1.0" >
<match lemma="còrn" tags="n.*"><select lemma="horn"/></match>
<repeat from="0" upto="3">
<or>
<match tags="adv"/>
<match tags="adj.*"/>
</or>
</repeat>
<match lemma="de" tags="pr"/>
<or>
<seq n="animals"/>
</or>
</rule>
</pre>


Next, a second REPEAT block could be added between the preposition "de" and the sequence to deal with the possible existence of determiners, adjectives, etc.
So I'd write

====REPEAT hack====

Sometimes the criteria for a lexical selection are fuzzy. For instance, the Occitan noun "cosina" may be "(female) cousin" or "kitchen". We can decide that the latter is the most usual translation, so it will be the default. On the other hand, we will select "cousin" if there is another kinship term nearby, such as "father", "mother" or "brother". For this we can do something like:


<pre>
<pre>
<rule>  
<rule weight="0.8">
<match lemma="el"/>  
<match lemma="cosina" tags="n.*"><select lemma="kitchen"/></match>
</rule>
<match lemma="dona" tags="n.*"> <select lemma="wife"/> </match>
<match lemma="de"/>
<rule weight="1.0" >
<match lemma="cosina" tags="n.*"><select lemma="cousin"/></match>
</rule>
<repeat from="0" upto="4">
<or>
<match tags=""/>
<match tags="*"/>
</or>
</repeat>
<or>
<seq n="familia"/>
</or>
</rule>
<rule weight="1.0" >
<or>
<seq n="familia"/>
</or>
<repeat from="0" upto="4">
<or>
<match tags=""/>
<match tags="*"/>
</or>
</repeat>
<match lemma="cosina" tags="n.*"><select lemma="cousin"/></match>
</rule>
</pre>
</pre>


Rule 2 selects "cousin" if a family word follows "cosina" with at most four words in between. Rule 3 does the same, but looks for the family word up to four words before "cosina". Note the OR operator within the REPEAT: <i><nowiki><match tags="*"/></nowiki></i> matches any known word (i.e. one that gets a morphological analysis), while <i><nowiki><match tags=""/></nowiki></i> matches unknown words (i.e. words without any morphological tag). Without the OR operator, the rules would try to match exactly a sequence of one unknown word followed by one known word.
and actually you would like to work on categories as well as lemmas.


===Macros===
Like, to prefer (human) beings and not things before "feel".

A macro is a set of rules for a common purpose that can be used for several words. For instance, quite often a verb has a different translation if it is pronominal or not, or if it is transitive or not.

Let's take as an example the Occitan verb "recordar", which is usually translated into French as "rappeler" ("remind"), but as a pronominal verb it would be "(se) souvenir" ("remember"). The problem is that recognising a pronominal context takes quite a few rules: they must check that there is an unstressed personal pronoun before (or after) the verb and that it has the same person and number as the verb. So a macro could be created like:


<pre>
<pre>
<!-- p1 = verb oci, p2 = verb fra no pron, p3 = verb fra pron -->
<rule>
<def-macro n="verb_nopron_pron" npar="3">
<match tags="n.*"> <select cat="beings"/> </match>
<match lemma="feel"/>
<rule weight="0.8">
<match plemma="1" tags="vblex.*"><select plemma="2"/></match>
</rule>
</rule>
<rule weight="1.0">
<match tags="prn.pro.p1.*.sg"/>
<match plemma="1" tags="vblex.*.p1.sg"><select plemma="3"/></match>
</rule>
<rule weight="1.0">
<match tags="prn.pro.p1.*.sg"/>
<match tags="prn.pro.*"/>
<match plemma="1" tags="vblex.*.p1.sg"><select plemma="3"/></match>
</rule>
<rule weight="1.0">
<match plemma="1" tags="vblex.*.p1.sg"><select plemma="3"/></match>
<match tags="prn.enc.p1.*.sg"/>
</rule>
<rule weight="1.0">
<match tags="prn.pro.p2.*.sg"/>
<match plemma="1" tags="vblex.*.p2.sg"><select plemma="3"/></match>
</rule>
<rule weight="1.0">
<match tags="prn.pro.p2.*.sg"/>
<match tags="prn.pro.*"/>
<match plemma="1" tags="vblex.*.p2.sg"><select plemma="3"/></match>
</rule>
<rule weight="1.0">
<match plemma="1" tags="vblex.*.p2.sg"><select plemma="3"/></match>
<match tags="prn.enc.p2.*.sg"/>
</rule>

</def-macro>
</pre>

The call to the macro should be:

<pre>
<macro n="verb_nopron_pron"><with-param v="recordar"/><with-param v="rappeler"/><with-param v="souvenir"/></macro>
</pre>
</pre>


For other verbs a call to the same macro is sufficient. The code is much more readable and maintainable than without macros.
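
One way to picture a macro call is as textual substitution: each <code>plemma="N"</code> attribute in the macro body is replaced by a <code>lemma</code> attribute holding the N-th <code>with-param</code> value. The Python sketch below illustrates that reading on a single rule from the macro above. It is an assumption about the semantics made for illustration, not the code that lrx-comp actually runs, and the helper name <code>expand_macro_rule</code> is hypothetical.

```python
import re

def expand_macro_rule(rule_template, params):
    """Replace plemma="N" with lemma="<N-th parameter>" (parameters are 1-indexed).

    Assumption: this mirrors how a macro call is conceptually expanded;
    it is not taken from the apertium-lex-tools sources.
    """
    def substitute(match):
        position = int(match.group(1))
        return 'lemma="%s"' % params[position - 1]
    return re.sub(r'plemma="(\d+)"', substitute, rule_template)

# First rule of the verb_nopron_pron macro, called with the three parameters
# from the example: p1 = recordar, p2 = rappeler, p3 = souvenir.
template = '<match plemma="1" tags="vblex.*"><select plemma="2"/></match>'
expanded = expand_macro_rule(template, ["recordar", "rappeler", "souvenir"])
print(expanded)
```

Under this reading, the default rule of the macro becomes an ordinary rule that matches "recordar" and selects "rappeler", while the pronominal rules would select "souvenir" through <code>plemma="3"</code>.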
--[[User:Jacob Nordfalk|Jacob Nordfalk]] 12:22, 30 November 2011 (UTC)


===Compiled===
===Special cases===
====Matching a capitalized word====


Below, the noun "audiència" will usually be translated as "audience", but if it is written as "Audiència", "cour# d'assises" (i.e. <nowiki>cour<g><b/>d'assises</g></nowiki>) will be selected:
The general structure is as follows:


<pre>
<pre>
<rule weight="0.8">
<match lemma="audiència" tags="n.*"><select lemma="audience"/></match>
</rule>
<rule weight="1.0">
<match lemma="Audiència" tags="n.*"><select lemma="cour# d'assises"/></match>
</rule>
</pre>


====Matching an unknown word====
LSSRECORD = id, len, weight;


Below, the noun "mossèn" will usually be translated as "curé", but if it is followed by an anthroponym (rule 2) or an unknown word (rule 3), "monseigneur" will be selected:
<ALPHABET>

<NUM_TRANSDUCERS>
<pre>
<TRANSDUCER>
<rule weight="0.8">
<TRANSDUCER>
<match lemma="mossèn" tags="n.*"><select lemma="curé"/></match>
<TRANSDUCER>
</rule>
...
<rule weight="1.0">
"main"
<match lemma="mossèn" tags="n.m.sg"><select lemma="monseigneur"/></match>
<TRANSDUCER>
<match tags="np.ant.*"/>
<LSRRECORD>
</rule>
<LSRRECORD>
<rule weight="1.0">
<LSRRECORD>
<match lemma="mossèn" tags="n.m.sg"><select lemma="monseigneur"/></match>
<match tags=""/>
</rule>
</pre>
</pre>


The last rule can be improved by specifying that the unknown word should be capitalized:
==Todo==

<pre>
<rule weight="1.0">
<match lemma="mossèn" tags="n.m.sg"><select lemma="monseigneur"/></match>
<match tags="" case="Aa"/>
</rule>
</pre>

=== Some more new stuff ===
* [https://github.com/apertium/apertium-lex-tools/commit/8f5493f28d944c9c1591e3449f6df8c4718ab2c6 contains]
* [https://github.com/apertium/apertium-lex-tools/commit/371b3740e74c27a2e6ff4278ce9286ee9e0b2319 suffix]
* [https://github.com/apertium/apertium-lex-tools/pull/92 <code>glob="star"</code>]

==Writing and generating rules==

===Writing===
{{main|How to get started with lexical selection rules}}
A good way to start writing lexical selection rules is to take a corpus and search for the problem word. You can then look at how the word should be translated and at the contexts it appears in.

===Generating===

;Parallel corpus
{{main|Learning rules from parallel and non-parallel corpora}}

;Monolingual corpora

{{main|Running_the_monolingual_rule_learning}}

==Todo and bugs==


* <s>xml compiler</s>
* <s>xml compiler</s>
* <s>compile rule operation patterns, as well as matching patterns</s>
* <s>compile rule operation patterns, as well as matching patterns</s>
* <s>make rules with gaps work</s>
* <s>make rules with gaps work</s>
* optimal coverage
* <s>optimal coverage</s>
* <s>fix bug with processing multiple sentences</s>
* <s>instead of having regex OR, insert separate paths/states.</s>
* <s>optimise the bestPath function (don't use strings to store the paths)</s>
* <s>autotoolsise build</s>
* <s>add option to compiler to spit out ATT transducers</s>
* <s>fix bug with outputting an extra '\n' at the end</s>
* <s>edit <code>transfer.cc</code> to allow input from <code>lt-proc -b</code></s>
* profiling and speed up
** <s>why do the regex transducers have to be minimised ?</s>
** <s>retrieve vector of strings corresponding to paths, instead of a single string corresponding to all of the paths</s>
** <s>stop using string processing to retrieve rule numbers</s>
** <s>retrieve vector of vectors of words, not string of words from lttoolbox</s>
** why does the performance drop substantially with more rules ?
** <s>add a pattern -> first letter map so we don't have to call recognise() with every transition</s> (didn't work so well)
* <s>there is a problem with the regex recognition code: see bug1 in <code>testing</code>.</s>
* <s>there is a problem with two defaults next to each other; bug2 in <code>testing</code>.</s>
* <s>default to case insensitive ? (perhaps case insensitive for lower case, case sensitive for uppercase) -- see bug4 in <code>testing/</code>.</s>
* make sure that <code>-b</code> works with <code>-n</code> too.
* testing
* null flush
* add option to processor to spit out ATT transducers
* use brown clusters to merge rules with the same context, or remove parts of context from rules which are not relevant?
* https://sourceforge.net/p/apertium/tickets/64/ <code><match tags="n.*"></match></code> never matches, while <code><match tags="n.*"/></code> does

; Performance

* 2011-12-12: 10,000 words / 97 seconds = 103 words/sec (71290 words, 14.84 sec = 4803 words/sec)
* 2011-12-19: 10,000 words / 4 seconds = 2,035 words/sec (71290 words, 8 secs = 8911 words/sec)

==Preparedness of language pairs==

{|class="wikitable"
! Pair !! LR (L) !! LR (L→R) !! Fertility !! Rules
|-
| <code>apertium-is-en</code> || 18,563 || 22,220 || 1.19 || 115
|-
| <code>apertium-es-fr</code> || || || ||
|-
| <code>apertium-eu-es</code> || 16,946 || 18,550 || 1.09 || 250
|-
| <code>apertium-eu-en</code> || || || ||
|-
| <code>apertium-br-fr</code> || 20,489 || 20,770 || 1.01 || 256
|-
| <code>apertium-mk-en</code> || 8,568 || 10,624 || 1.24 || 81
|-
| <code>apertium-es-pt</code> || || || ||
|-
| <code>apertium-es-it</code> || || || ||
|-
| <code>apertium-es-ro</code> || || || ||
|-
| <code>apertium-en-es</code> || 267,469 || 268,522 || 1.003 || 334
|-
| <code>apertium-en-ca</code> || || || ||
|-
|}


===Troubleshooting===
If you get the message <code>lrx-comp: error while loading shared libraries: libapertium3-3.2.so.0: cannot open shared object file: No such file or directory</code> you may need to put this in your ~/.bashrc
<pre>
export LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"
</pre>
Then open a new terminal before using lrx-comp/lrx-proc.

On a 64-bit machine, <code>make</code> for apertium-lex-tools may fail because zlib appears to be missing, even though zlib1g-dev is installed. If you get the error message <code>/usr/bin/ld: cannot find -lz</code>, install the package lib32z1-dev (which pulls in many other dependencies). Although it is a 32-bit package, it is needed to compile the sources.


==See also==
==See also==


* [[How to get started with lexical selection rules]]
* [https://apertium.svn.sourceforge.net/svnroot/apertium/branches/apertium-lex-tools SVN Module: apertium-lex-tools]
* [https://svn.code.sf.net/p/apertium/svn/trunk/apertium-lex-tools/ SVN Module: apertium-lex-tools]

==References==

* Tyers, F. M., Sánchez-Martínez, F., Forcada, M. L. (2012) "[https://rua.ua.es/dspace/bitstream/10045/27581/1/tyers12a.pdf Flexible finite-state lexical selection for rule-based machine translation]". Proceedings of the 17th Annual Conference of the European Association of Machine Translation, EAMT12
* Tyers et al [https://rua.ua.es/dspace/bitstream/10045/35848/1/thesis_FrancisMTyers.pdf#page=62 Feasible lexical selection for rule-based machine translation]
* Tyers et al [https://aclanthology.org/W15-4919.pdf Unsupervised training of maximum-entropy models for lexical selection in rule-based machine translation]


[[Category:Lexical selection]]
[[Category:Lexical selection]]
[[Category:Documentation in English]]

Latest revision as of 15:26, 11 April 2023

apertium-lex-tools provides a module for compiling lexical selection rules and processing them in the pipeline. Rules can be manually written, or learnt from monolingual or parallel corpora.

Installing

Prerequisites and compilation steps are the same as for lttoolbox and apertium; on Debian/Ubuntu you also need zlib1g-dev.

See Installation: for most operating systems you can now get pre-built packages of apertium-lex-tools (as well as other core tools) through your regular package manager.

Lexical transfer in the pipeline

lrx-proc runs between bidix lookup and the first stage of transfer, e.g.

… apertium-pretransfer | lt-proc -b kaz-tat.autobil.bin | lrx-proc kaz-tat.lrx.bin \
  | apertium-transfer -b apertium-kaz-tat.kaz-tat.t1x  kaz-tat.t1x.bin | …

This is the output of lt-proc -b on an ambiguous bilingual dictionary:

[74306] ^El<det><def><f><sg>/The<det><def><f><sg>$ 
^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^més<preadv>/more<preadv>$ 
^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ 
^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ 
^el<det><def><f><sg>/the<det><def><f><sg>$ 
^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ 
^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ 
^el<det><def><m><sg>/the<det><def><m><sg>$ 
^estiu<n><m><sg>/summer<n><sg>$^.<sent>/.<sent>$

I.e.

L'estació més plujosa és la tardor, i la més seca l'estiu

Goes to:

The season/station more rainy is the autumn/fall, and the more dry the summer.
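
To make the stream format concrete, here is a minimal Python sketch that splits lt-proc -b style output into lexical units and lists the ambiguous ones. It assumes the simple form shown above, ^source/target1/target2$, and does not handle escaped characters or superblanks; the function name parse_units is hypothetical and this is not how lrx-proc itself reads its input.

```python
import re

# One lexical unit looks like ^source/target1/target2$ (simplified assumption:
# no escaped "/", "^" or "$" inside the unit).
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")

def parse_units(stream):
    """Return a list of (source_reading, [target_readings]) pairs."""
    units = []
    for body in LEXICAL_UNIT.findall(stream):
        source, *targets = body.split("/")
        units.append((source, targets))
    return units

sample = ("^estació<n><f><sg>/season<n><sg>/station<n><sg>$ "
          "^més<preadv>/more<preadv>$ "
          "^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$")

# A unit is ambiguous when the bilingual dictionary gave more than one target.
ambiguous = [(source, targets) for source, targets in parse_units(sample)
             if len(targets) > 1]
for source, targets in ambiguous:
    print(source, "->", targets)
```

In this sample only "estació" and "tardor" carry more than one translation, so these are the units a lexical selection rule could act on.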

Apertium/lttoolbox 3.3 and onwards support the -b option to lt-proc / apertium-transfer.

Usage

Create a simple rule file:

<rules>
  <rule>
    <match lemma="criminal" tags="adj"/>
    <match lemma="court" tags="n.*"><select lemma="juzgado" tags="n.*"/></match>
  </rule>
</rules>

Then compile it:

$ lrx-comp rules.xml rules.fst
1: 32@32

The input is the output of lt-proc -b:

$ echo "^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ 
^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ 
^court<n><sg>/corte<n><f><sg>/cancha<n><f><sg>/juzgado<n><m><sg>/tribunal<n><m><sg>$^.<sent>/.<sent>$" | ./lrx-proc -t rules.fst 
1:SELECT<1>:court<n><sg>:<select>juzgado<n><ANY_TAG>
^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ 
^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ ^court<n><sg>/juzgado<n><m><sg>$^.<sent>/.<sent>$
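
The effect of the select operation in the trace above can be sketched in a few lines of Python: of the four translations of "court", only the one whose lemma is "juzgado" survives. The helpers lemma_of, select and remove are hypothetical names, tag patterns are ignored for brevity, and leaving the unit unchanged when nothing matches is an assumption of this sketch rather than documented lrx-proc behaviour.

```python
def lemma_of(reading):
    """The lemma is everything before the first tag: 'juzgado<n><m><sg>' -> 'juzgado'."""
    return reading.split("<", 1)[0]

def select(targets, lemma):
    """Keep only the target readings with the given lemma."""
    kept = [t for t in targets if lemma_of(t) == lemma]
    return kept or targets  # assumption: never leave a unit without translations

def remove(targets, lemma):
    """Drop the target readings with the given lemma."""
    kept = [t for t in targets if lemma_of(t) != lemma]
    return kept or targets  # same assumption as in select()

court_targets = ["corte<n><f><sg>", "cancha<n><f><sg>",
                 "juzgado<n><m><sg>", "tribunal<n><m><sg>"]
print(select(court_targets, "juzgado"))
print(remove(court_targets, "cancha"))
```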

Rule format

A rule is made up of an ordered list of:

  • Matches
  • Operations (select, remove)
<rule>  
  <match lemma="el"/>  
  <match lemma="dona" tags="n.*">    
    <select lemma="wife"/> 
  </match>  
  <match lemma="de"/>
</rule>

<rule>  
  <match lemma="estació" tags="n.*">    
    <select lemma="season"/> 
  </match>  
  <match lemma="més"/>
  <match lemma="plujós"/>
</rule>

<rule>  
  <match lemma="guanyador"/>
  <match lemma="de"/>
  <match/>
  <match lemma="prova" tags="n.*">    
    <select lemma="event"/> 
  </match>  
</rule>

Weights

Rules compete with each other, so each one is assigned a weight. When a word has several possible translations in the dictionary, all rules are evaluated. For each possible translation, the weights of the rules whose context matches the word's use in the sentence are added up, and the translation with the highest total is chosen. For instance, consider these two rules:

  <rule weight="0.8">
    <match lemma="ferotge" tags="adj.*"><select lemma="farouche"/></match>
  </rule>
  <rule weight="1.0">
    <or>
      <match lemma="animal" tags="n.*"/>
      <match lemma="animau" tags="n.*"/>
    </or>
    <match lemma="ferotge" tags="adj.*"><select lemma="féroce"/></match>
  </rule>

If we have "un animal ferotge", the translation "farouche" will get 0.8 points, and "féroce" will get 1.0. The latter will be chosen.
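
The scoring described above can be sketched in Python as follows. The function choose_translation is a hypothetical name, and its input is assumed to be the list of rules that already matched the context, each given as the translation it selects plus its weight; the sketch only reproduces the arithmetic, not the matching done by lrx-proc.

```python
from collections import defaultdict

def choose_translation(matching_rules):
    """Sum the weights per selected translation and return the best one.

    matching_rules: list of (selected_translation, weight) pairs for the
    rules whose context matched (assumed to be computed elsewhere).
    """
    scores = defaultdict(float)
    for translation, weight in matching_rules:
        scores[translation] += weight
    best = max(scores, key=scores.get)
    return best, dict(scores)

# "un animal ferotge": both rules match, one selecting each translation.
best, scores = choose_translation([("farouche", 0.8), ("féroce", 1.0)])
print(best, scores)
```

If several rules selecting the same translation matched, their weights would be added together before the comparison.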

Operator OR

The boolean operator OR can be used, as shown in the previous example:

  <rule weight="1.0">
    <or>
      <match lemma="animal" tags="n.*"/>
      <match lemma="animau" tags="n.*"/>
    </or>
    <match lemma="ferotge" tags="adj.*"><select lemma="féroce"/></match>
  </rule>

Sequences

Often, the same words are used in several ORs. For readability and maintainability, they can be defined once in a special sequence block, for instance:

  <def-seqs>
    <def-seq n="jorns"><or>
      <match lemma="diluns" tags="n.*"/>
      <match lemma="dimars" tags="n.*"/>
      <match lemma="dimècres" tags="n.*"/>
      <match lemma="dimèrcs" tags="n.*"/>
      <match lemma="dijòus" tags="n.*"/>
      <match lemma="dijaus" tags="n.*"/>
      <match lemma="divendres" tags="n.*"/>
      <match lemma="divés" tags="n.*"/>
      <match lemma="dissabte" tags="n.*"/>
      <match lemma="dimenge" tags="n.*"/>
    </or></def-seq>

    <def-seq n="meses"><or>
      <match lemma="genèr" tags="n.*"/>
      <match lemma="genièr" tags="n.*"/>
      <match lemma="janvièr" tags="n.*"/>
      <match lemma="gèr" tags="n.*"/>
      <match lemma="febrièr" tags="n.*"/>
      <match lemma="heurèr" tags="n.*"/>
      <match lemma="hrevèr" tags="n.*"/>
      <match lemma="herevèr" tags="n.*"/>
      <match lemma="herbèr" tags="n.*"/>
      <match lemma="hiurèr" tags="n.*"/>
      <match lemma="març" tags="n.*"/>
      <match lemma="abrial" tags="n.*"/>
      <match lemma="abril" tags="n.*"/>
      <match lemma="abriu" tags="n.*"/>
      <match lemma="abrieu" tags="n.*"/>
      <match lemma="mai" tags="n.*"/>
      <match lemma="junh" tags="n.*"/>
      <match lemma="julh" tags="n.*"/>
      <match lemma="juin" tags="n.*"/>
      <match lemma="gulh" tags="n.*"/>
      <match lemma="julhet" tags="n.*"/>
      <match lemma="gulhet" tags="n.*"/>
      <match lemma="junhsèga" tags="n.*"/>
      <match lemma="agost" tags="n.*"/>
      <match lemma="aost" tags="n.*"/>
      <match lemma="setembre" tags="n.*"/>
      <match lemma="seteme" tags="n.*"/>
      <match lemma="octobre" tags="n.*"/>
      <match lemma="octòbre" tags="n.*"/>
      <match lemma="novembre" tags="n.*"/>
      <match lemma="noveme" tags="n.*"/>
      <match lemma="decembre" tags="n.*"/>
      <match lemma="deceme" tags="n.*"/>
    </or></def-seq>
  </def-seqs>

They have to be referenced in the rules as follows:

  <rule weight="1.0">
    <or>
      <seq n="jorns"/>
      <seq n="meses"/>
      <match lemma="prima" tags="n.*"/>
      <match lemma="estiu" tags="n.*"/>
      <match lemma="auton" tags="n.*"/>
      <match lemma="ivèrn" tags="n.*"/>
    </or>
    <match lemma="passat" tags="adj.*"><select lemma="dernier"/></match>
  </rule>

Note that if you add a <def-seqs> section and you had only a <rules> section already, then you'll need to put both inside of an <lrx> section:

  <lrx>
    <def-seqs>
    ...
    </def-seqs>
    <rules>
    ...
    </rules>
  </lrx>
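
Long sequences like the ones above are tedious to type by hand. As a convenience, a def-seq element can be generated from a plain word list; the Python sketch below does this with the standard xml.etree.ElementTree module. The function name build_def_seq and the sequence name "sasons" are hypothetical, and the element and attribute names are simply copied from the examples above.

```python
import xml.etree.ElementTree as ET

def build_def_seq(name, lemmas, tags="n.*"):
    """Build <def-seq n="name"><or><match .../>...</or></def-seq>."""
    def_seq = ET.Element("def-seq", {"n": name})
    alternatives = ET.SubElement(def_seq, "or")
    for lemma in lemmas:
        ET.SubElement(alternatives, "match", {"lemma": lemma, "tags": tags})
    return def_seq

# The four season nouns used in the rule above, as a hypothetical sequence.
seasons = build_def_seq("sasons", ["prima", "estiu", "auton", "ivèrn"])
xml_text = ET.tostring(seasons, encoding="unicode")
print(xml_text)
```

The resulting element would be pasted inside the def-seqs section and referenced from a rule with a seq element carrying the same n value.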

Operator REPEAT

Imagine the translation of the Occitan word "còrn" that may be "corner" or "horn" (of an animal). We could have as a first version:

  <rule weight="0.8">
    <match lemma="còrn" tags="n.*"><select lemma="corner"/></match>
  </rule>
  <rule weight="1.0" >
    <match lemma="còrn" tags="n.*"><select lemma="horn"/></match>
    <match lemma="de" tags="pr"/>
    <or>
      <seq n="animals"/>
    </or>
  </rule>

But this will not match if we have an adjective that follows "còrn" (usually adjectives follow the nouns in Occitan). We could add a rule like:

  <rule weight="1.0" >
    <match lemma="còrn" tags="n.*"><select lemma="horn"/></match>
    <match tags="adj.*"/>
    <match lemma="de" tags="pr"/>
    <or>
      <seq n="animals"/>
    </or>
  </rule>

Using the REPEAT operator, we can express this more compactly by extending rule 2:

  <rule weight="1.0" >
    <match lemma="còrn" tags="n.*"><select lemma="horn"/></match>
    <repeat from="0" upto="2">
      <match tags="adj.*"/>
    </repeat>
    <match lemma="de" tags="pr"/>
    <or>
      <seq n="animals"/>
    </or>
  </rule>

Note that now we are even accepting two adjectives after "còrn" instead of only one (without adding a fourth rule for dealing with two adjectives).

And, if we think that horn can be not only big, but also "very big", we can improve the rule this way:

  <rule weight="1.0" >
    <match lemma="còrn" tags="n.*"><select lemma="horn"/></match>
    <repeat from="0" upto="3">
      <or>
        <match tags="adv"/>
        <match tags="adj.*"/>
      </or>
    </repeat>
    <match lemma="de" tags="pr"/>
    <or>
      <seq n="animals"/>
    </or>
  </rule>

Next, a second REPEAT block could be added between the preposition "de" and the sequence to deal with the possible existence of determiners, adjectives, etc.
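
A possible shape for that extension is sketched below. The second REPEAT block, its limit of two words and the choice of tags (det.* and adj.*) are assumptions made for illustration; they are not taken from an existing language pair:

```xml
<!-- Sketch only: the second <repeat> block (upto="2", det.* / adj.*) is an assumed example -->
<rule weight="1.0">
  <match lemma="còrn" tags="n.*"><select lemma="horn"/></match>
  <repeat from="0" upto="3">
    <or>
      <match tags="adv"/>
      <match tags="adj.*"/>
    </or>
  </repeat>
  <match lemma="de" tags="pr"/>
  <repeat from="0" upto="2">
    <or>
      <match tags="det.*"/>
      <match tags="adj.*"/>
    </or>
  </repeat>
  <or>
    <seq n="animals"/>
  </or>
</rule>
```

Because both blocks start from="0", this single rule still covers the plain "còrn de" + animal case handled by the earlier versions.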

REPEAT hack

Sometimes the criteria for a lexical selection are not clear-cut. For instance, the Occitan noun "cosina" may be "(female) cousin" or "kitchen". We can decide that the latter is the more usual translation and make it the default. On the other hand, we will select "cousin" if there is another kinship term nearby, such as "father", "mother" or "brother". For this we can do something like:

  <rule weight="0.8">
    <match lemma="cosina" tags="n.*"><select lemma="kitchen"/></match>
  </rule>
  <rule weight="1.0" >
    <match lemma="cosina" tags="n.*"><select lemma="cousin"/></match>
    <repeat from="0" upto="4">
      <or>
        <match tags=""/>
        <match tags="*"/>
      </or>
    </repeat>
    <or>
      <seq n="familia"/>
    </or>
  </rule>
  <rule weight="1.0" >
    <or>
      <seq n="familia"/>
    </or>
    <repeat from="0" upto="4">
      <or>
        <match tags=""/>
        <match tags="*"/>
      </or>
    </repeat>
    <match lemma="cosina" tags="n.*"><select lemma="cousin"/></match>
  </rule>

Rule 2 selects "cousin" if a family word follows "cosina" with at most four words in between. Rule 3 does the same in the other direction, with the family word coming up to four words before "cosina". Note the OR operator within the REPEAT: <match tags="*"/> matches any known word (i.e. one that gets a morphological analysis), while <match tags=""/> matches unknown words (i.e. words without any morphological tag). Without the OR operator, the rules would try to match precisely a sequence of one unknown word followed by one known word.

Macros

A macro is a set of rules for a common purpose that can be used for several words. For instance, quite often a verb has a different translation if it is pronominal or not, or if it is transitive or not.

Let's take as an example the Occitan verb "recordar", which is usually translated into French as "rappeler" ("remind"), but as a pronominal verb would be "(se) souvenir" ("remember"). The problem is that recognising a pronominal context takes quite a few rules: they have to check that there is an unstressed personal pronoun before (or after) the verb and that it has the same person and number as the verb. So a macro could be created like:

    <!-- p1 = verb oci, p2 = verb fra no pron, p3 = verb fra pron -->
    <def-macro n="verb_nopron_pron" npar="3">
      <rule weight="0.8">
        <match plemma="1" tags="vblex.*"><select plemma="2"/></match>
      </rule>
      <rule weight="1.0">
        <match tags="prn.pro.p1.*.sg"/>
        <match plemma="1" tags="vblex.*.p1.sg"><select plemma="3"/></match>
      </rule>
      <rule weight="1.0">
        <match tags="prn.pro.p1.*.sg"/>
        <match tags="prn.pro.*"/>
        <match plemma="1" tags="vblex.*.p1.sg"><select plemma="3"/></match>
      </rule>
      <rule weight="1.0">
        <match plemma="1" tags="vblex.*.p1.sg"><select plemma="3"/></match>
        <match tags="prn.enc.p1.*.sg"/>
      </rule>
      <rule weight="1.0">
        <match tags="prn.pro.p2.*.sg"/>
        <match plemma="1" tags="vblex.*.p2.sg"><select plemma="3"/></match>
      </rule>
      <rule weight="1.0">
        <match tags="prn.pro.p2.*.sg"/>
        <match tags="prn.pro.*"/>
        <match plemma="1" tags="vblex.*.p2.sg"><select plemma="3"/></match>
      </rule>
      <rule weight="1.0">
        <match plemma="1" tags="vblex.*.p2.sg"><select plemma="3"/></match>
        <match tags="prn.enc.p2.*.sg"/>
      </rule>

    </def-macro>

The macro is then called as follows:

  <macro n="verb_nopron_pron"><with-param v="recordar"/><with-param v="rappeler"/><with-param v="souvenir"/></macro>

For other verbs a call to the same macro is sufficient. The code is much more readable and maintainable than without macros.
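
As a rough illustration of what the call above stands for, and assuming that each plemma="N" is simply replaced by the value of the N-th <with-param>, the first two rules of the macro would read as follows for "recordar":

```xml
<!-- Sketch only: assumes plemma="N" is replaced by the N-th <with-param> value -->
<rule weight="0.8">
  <match lemma="recordar" tags="vblex.*"><select lemma="rappeler"/></match>
</rule>
<rule weight="1.0">
  <match tags="prn.pro.p1.*.sg"/>
  <match lemma="recordar" tags="vblex.*.p1.sg"><select lemma="souvenir"/></match>
</rule>
```

The remaining rules of the macro would be filled in the same way, which is the repetition the macro saves for every further verb.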

Special cases

Matching a capitalized word

Below, the noun "audiència" will usually be translated as "audience", but if it is written as "Audiència", "cour# d'assises" (i.e. cour<g><b/>d'assises</g>) will be selected:

  <rule weight="0.8">
    <match lemma="audiència" tags="n.*"><select lemma="audience"/></match>
  </rule>
  <rule weight="1.0">
    <match lemma="Audiència" tags="n.*"><select lemma="cour# d'assises"/></match>
  </rule>

Matching an unknown word

Below, the noun "mossèn" will usually be translated as "curé", but if it is followed by an anthroponym (rule 2) or an unknown word (rule 3), "monseigneur" will be selected:

  <rule weight="0.8">
    <match lemma="mossèn" tags="n.*"><select lemma="curé"/></match>
  </rule>
  <rule weight="1.0">
    <match lemma="mossèn" tags="n.m.sg"><select lemma="monseigneur"/></match>
    <match tags="np.ant.*"/>
  </rule>
  <rule weight="1.0">
    <match lemma="mossèn" tags="n.m.sg"><select lemma="monseigneur"/></match>
    <match tags=""/>
  </rule>

The last rule can be improved by specifying that the unknown word should be capitalized:

  <rule weight="1.0">
    <match lemma="mossèn" tags="n.m.sg"><select lemma="monseigneur"/></match>
    <match tags="" case="Aa"/>
  </rule>

Some more new stuff

Writing and generating rules

Writing

Main article: How to get started with lexical selection rules

A good way to start writing lexical selection rules is to take a corpus and search for the problem word. You can then look at how the word should be translated and at the contexts it appears in.

Generating

Parallel corpus
Main article: Learning rules from parallel and non-parallel corpora
Monolingual corpora
Main article: Running the monolingual rule learning

Todo and bugs

  • xml compiler
  • compile rule operation patterns, as well as matching patterns
  • make rules with gaps work
  • optimal coverage
  • fix bug with processing multiple sentences
  • instead of having regex OR, insert separate paths/states.
  • optimise the bestPath function (don't use strings to store the paths)
  • autotoolsise build
  • add option to compiler to spit out ATT transducers
  • fix bug with outputting an extra '\n' at the end
  • edit transfer.cc to allow input from lt-proc -b
  • profiling and speed up
    • why do the regex transducers have to be minimised?
    • retrieve vector of strings corresponding to paths, instead of a single string corresponding to all of the paths
    • stop using string processing to retrieve rule numbers
    • retrieve vector of vectors of words, not string of words from lttoolbox
    • why does the performance drop substantially with more rules?
    • add a pattern -> first letter map so we don't have to call recognise() with every transition (didn't work so well)
  • there is a problem with the regex recognition code: see bug1 in testing.
  • there is a problem with two defaults next to each other; bug2 in testing.
  • default to case insensitive? (perhaps case insensitive for lower case, case sensitive for uppercase) -- see bug4 in testing/.
  • make sure that -b works with -n too.
  • testing
  • null flush
  • add option to processor to spit out ATT transducers
  • use brown clusters to merge rules with the same context, or remove parts of context from rules which are not relevant?
  • https://sourceforge.net/p/apertium/tickets/64/ <match tags="n.*"></match> never matches, while <match tags="n.*"/> does
Performance
  • 2011-12-12: 10,000 words / 97 seconds = 103 words/sec (71290 words, 14.84 sec = 4803 words/sec)
  • 2011-12-19: 10,000 words / 4 seconds = 2,035 words/sec (71290 words, 8 secs = 8911 words/sec)

Preparedness of language pairs

Pair             LR (L)    LR (L→R)   Fertility   Rules
apertium-is-en   18,563    22,220     1.19        115
apertium-es-fr
apertium-eu-es   16,946    18,550     1.09        250
apertium-eu-en
apertium-br-fr   20,489    20,770     1.01        256
apertium-mk-en   8,568     10,624     1.24        81
apertium-es-pt
apertium-es-it
apertium-es-ro
apertium-en-es   267,469   268,522    1.003       334
apertium-en-ca


Troubleshooting

If you get the message "lrx-comp: error while loading shared libraries: libapertium3-3.2.so.0: cannot open shared object file: No such file or directory", you may need to add the following line to your ~/.bashrc:

export LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"

Then open a new terminal before using lrx-comp or lrx-proc.

On a 64-bit machine, running make for apertium-lex-tools may fail because zlib cannot be found, even though zlib1g-dev is installed. If you get the error message "/usr/bin/ld: cannot find -lz", install the package lib32z1-dev (which pulls in many other dependencies). Although it provides a 32-bit library, it is needed to compile the sources.

See also

References