Constraint-based lexical selection module


Revision as of 15:53, 1 March 2020

apertium-lex-tools provides a module for compiling lexical selection rules and processing them in the pipeline. Rules can be manually written, or learnt from monolingual or parallel corpora.

Installing

Prerequisites and compilation steps are the same as for lttoolbox and apertium, with the addition of zlib1g-dev on Debian/Ubuntu.

See Installation: on most operating systems you can now get pre-built packages of apertium-lex-tools (as well as other core tools) through your regular package manager.

Lexical transfer in the pipeline

lrx-proc runs between bidix lookup and the first stage of transfer, e.g.

… apertium-pretransfer | lt-proc -b kaz-tat.autobil.bin | lrx-proc kaz-tat.lrx.bin \
  | apertium-transfer -b apertium-kaz-tat.kaz-tat.t1x  kaz-tat.t1x.bin | …

This is the output of lt-proc -b on an ambiguous bilingual dictionary:

[74306] ^El<det><def><f><sg>/The<det><def><f><sg>$ 
^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^més<preadv>/more<preadv>$ 
^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ 
^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ 
^el<det><def><f><sg>/the<det><def><f><sg>$ 
^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ 
^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ 
^el<det><def><m><sg>/the<det><def><m><sg>$ 
^estiu<n><m><sg>/summer<n><sg>$^.<sent>/.<sent>$

That is, the Catalan sentence

L'estació més plujosa és la tardor, i la més seca l'estiu

comes out of bilingual dictionary lookup with every ambiguity preserved:

The season/station more rainy is the autumn/fall, and the more dry the summer.
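
The slash-separated format above can be read mechanically. The following is a minimal Python sketch, not part of apertium-lex-tools, that assumes each lexical unit has the shape ^source/target1/target2$ and ignores escaping and superblanks such as [74306]. The helper names parse_units and lemma are hypothetical.

```python
import re

# Assumed simplification: a lexical unit is "^source/target1/target2...$".
# Escaped characters and superblanks such as "[74306]" are not handled here.
LU_PATTERN = re.compile(r"\^([^$]*)\$")


def parse_units(stream):
    """Return a list of (source, [targets]) pairs found in the stream."""
    units = []
    for body in LU_PATTERN.findall(stream):
        source, *targets = body.split("/")
        units.append((source, targets))
    return units


def lemma(reading):
    """Strip the tags from a reading such as 'season<n><sg>'."""
    return reading.split("<", 1)[0]


stream = ("^estació<n><f><sg>/season<n><sg>/station<n><sg>$ "
          "^més<preadv>/more<preadv>$ "
          "^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$")

units = parse_units(stream)
# One word per unit, alternatives joined with "/".
gloss = " ".join("/".join(lemma(t) for t in targets) for _, targets in units)
# Source lemmas that still have more than one translation.
ambiguous = [lemma(s) for s, targets in units if len(targets) > 1]
print(gloss)
print(ambiguous)
```

On this fragment the sketch prints "season/station more rainy" and lists estació as the only word that still needs lexical selection.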

Apertium/lttoolbox 3.3 and onwards support the -b option to lt-proc / apertium-transfer.

Usage

Make a simple rule file:

<rules>
  <rule>
    <match lemma="criminal" tags="adj"/>
    <match lemma="court" tags="n.*"><select lemma="juzgado" tags="n.*"/></match>
  </rule>
</rules>

Then compile it:

$ lrx-comp rules.xml rules.fst
1: 32@32

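The input to lrx-proc is the output of lt-proc -b. With -t, lrx-proc prints a trace line for each rule it applies before the disambiguated output: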

$ echo "^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ 
^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ 
^court<n><sg>/corte<n><f><sg>/cancha<n><f><sg>/juzgado<n><m><sg>/tribunal<n><m><sg>$^.<sent>/.<sent>$" | lrx-proc -t rules.fst
1:SELECT<1>:court<n><sg>:<select>juzgado<n><ANY_TAG>
^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ 
^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ ^court<n><sg>/juzgado<n><m><sg>$^.<sent>/.<sent>$
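
To make the effect of the rule concrete, here is a rough Python sketch of what a select operation does to the translations of a matched word. It is an illustration only: lrx-proc works with compiled transducers, and the tag-matching semantics used here (each dot-separated element is one tag, a trailing * allows any further tags) are an assumption. All function names are hypothetical.

```python
import re


def split_reading(reading):
    """'court<n><sg>' -> ('court', ['n', 'sg'])."""
    lemma = reading.split("<", 1)[0]
    return lemma, re.findall(r"<([^>]+)>", reading)


def matches(pattern, reading):
    """Check a reading against a dict with optional 'lemma' and 'tags' keys."""
    lemma, tags = split_reading(reading)
    if "lemma" in pattern and pattern["lemma"] != lemma:
        return False
    if "tags" in pattern:
        wanted = pattern["tags"].split(".")
        if wanted[-1] == "*":  # trailing *: any further tags allowed (assumed)
            wanted = wanted[:-1]
            return tags[:len(wanted)] == wanted
        return tags == wanted  # a * elsewhere in the pattern is not handled
    return True


def apply_rule(rule, units):
    """Apply one rule at every position; units are (source, [targets]) pairs."""
    result = [(s, list(t)) for s, t in units]
    for start in range(len(result) - len(rule) + 1):
        window = result[start:start + len(rule)]
        if not all(matches(p, s) for p, (s, _) in zip(rule, window)):
            continue
        for offset, pattern in enumerate(rule):
            if "select" in pattern:
                source, targets = result[start + offset]
                kept = [t for t in targets if matches(pattern["select"], t)]
                if kept:  # never delete every translation
                    result[start + offset] = (source, kept)
    return result


# The rule from rules.xml above, written as plain dictionaries.
rule = [
    {"lemma": "criminal", "tags": "adj"},
    {"lemma": "court", "tags": "n.*",
     "select": {"lemma": "juzgado", "tags": "n.*"}},
]
units = [
    ("criminal<adj>", ["criminal<adj><mf>", "delictivo<adj>"]),
    ("court<n><sg>", ["corte<n><f><sg>", "cancha<n><f><sg>",
                      "juzgado<n><m><sg>", "tribunal<n><m><sg>"]),
]
for source, targets in apply_rule(rule, units):
    print("^" + "/".join([source] + targets) + "$")
```

As in the lrx-proc output above, criminal keeps both translations because the rule has no operation on it, while court is reduced to juzgado.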

Rule format

A rule is made up of an ordered list of:

  • Matches
  • Operations (select, remove)

An operation is nested inside the match it applies to. For example:

<rule>  
  <match lemma="el"/>  
  <match lemma="dona" tags="n.*">    
    <select lemma="wife"/> 
  </match>  
  <match lemma="de"/>
</rule>

<rule>  
  <match lemma="estació" tags="n.*">    
    <select lemma="season"/> 
  </match>  
  <match lemma="més"/>
  <match lemma="plujós"/>
</rule>

<rule>  
  <match lemma="guanyador"/>
  <match lemma="de"/>
  <match/>
  <match lemma="prova" tags="n.*">    
    <select lemma="event"/> 
  </match>  
</rule>
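
The ordering of matches and the nesting of operations can be inspected with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree to list the matches of the third rule in order. The describe helper is hypothetical, and reading an empty <match/> as "any word" is an assumption based on the example, not a statement about the compiler.

```python
import xml.etree.ElementTree as ET

RULES_XML = """
<rules>
  <rule>
    <match lemma="guanyador"/>
    <match lemma="de"/>
    <match/>
    <match lemma="prova" tags="n.*">
      <select lemma="event"/>
    </match>
  </rule>
</rules>
"""


def describe(rule):
    """Return one line per <match>, in order, with any nested operations."""
    lines = []
    for position, match in enumerate(rule.findall("match"), start=1):
        lemma = match.get("lemma", "(any lemma)")
        tags = match.get("tags", "(any tags)")
        # Operations are the child elements of the match they act on.
        ops = [f"{op.tag} {op.get('lemma')}" for op in match
               if op.tag in ("select", "remove")]
        lines.append(f"{position}: {lemma} {tags} {'; '.join(ops)}".rstrip())
    return lines


root = ET.fromstring(RULES_XML)
for rule in root.findall("rule"):
    print("\n".join(describe(rule)))
```

Only the fourth position carries an operation; the first three matches serve purely as left context for prova.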

Special cases

Matching a capitalized word

Below, the noun "audiència" will usually be translated as "audience", but if it is written with a capital, "Audiència", the rule with the higher weight selects "cour# d'assises" (i.e. cour<g><b/>d'assises</g>):

  <rule weight="0.8">
    <match lemma="audiència" tags="n.*"><select lemma="audience"/></match>
  </rule>
  <rule weight="1.0">
    <match lemma="Audiència" tags="n.*"><select lemma="cour# d'assises"/></match>
  </rule>
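
One way to picture how the two rules interact is the Python sketch below. It rests on two assumptions that are not confirmed by this page: that a lower-case lemma pattern matches regardless of case while a pattern containing a capital must match exactly, and that when several rules match the same word the one with the highest weight wins. The names RULES, lemma_matches and choose are hypothetical.

```python
# (weight, lemma pattern, selected translation), copied from the rules above.
RULES = [
    (0.8, "audiència", "audience"),
    (1.0, "Audiència", "cour# d'assises"),
]


def lemma_matches(pattern, lemma):
    """Assumed behaviour: lower-case patterns ignore case, others are exact."""
    if pattern == pattern.lower():
        return pattern == lemma.lower()
    return pattern == lemma


def choose(lemma):
    """Return the translation of the heaviest matching rule, or None."""
    candidates = [(weight, target) for weight, pattern, target in RULES
                  if lemma_matches(pattern, lemma)]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]


print(choose("audiència"))
print(choose("Audiència"))
```

Under these assumptions both rules match "Audiència", and the weight of 1.0 decides in favour of "cour# d'assises"; only the first rule matches "audiència".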

Matching an unknown word

Below, the noun "mossèn" will usually be translated as "curé", but if it is followed by an anthroponym (rule 2) or an unknown word (rule 3), "monseigneur" will be selected:

  <rule weight="0.8">
    <match lemma="mossèn" tags="n.*"><select lemma="curé"/></match>
  </rule>
  <rule weight="1.0">
    <match lemma="mossèn" tags="n.m.sg"><select lemma="monseigneur"/></match>
    <match tags="np.ant.*"/>
  </rule>
  <rule weight="1.0">
    <match lemma="mossèn" tags="n.m.sg"><select lemma="monseigneur"/></match>
    <match tags=""/>
  </rule>
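
The context test in rules 2 and 3 can be sketched in Python as follows. This is an illustration under assumptions: an unknown word is taken to be a reading with no tags at all (written here with a leading *), tags="" is taken to match exactly that case, and a trailing * in a tag pattern is taken to allow any further tags. The example words Jacint and Cinto and all function names are hypothetical.

```python
import re


def tags_of(reading):
    """'Jacint<np><ant><m><sg>' -> ['np', 'ant', 'm', 'sg']."""
    return re.findall(r"<([^>]+)>", reading)


def tags_match(pattern, reading):
    """Match a dot-separated tag pattern against a reading (assumed semantics)."""
    tags = tags_of(reading)
    if pattern == "":
        return tags == []  # no tags at all: an unknown word
    wanted = pattern.split(".")
    if wanted[-1] == "*":
        return tags[:len(wanted) - 1] == wanted[:-1]
    return tags == wanted


def translate_mossen(next_reading):
    """Pick the translation of 'mossèn' from the word that follows it."""
    if tags_match("np.ant.*", next_reading) or tags_match("", next_reading):
        return "monseigneur"  # rules 2 and 3
    return "curé"             # rule 1, the default


print(translate_mossen("Jacint<np><ant><m><sg>"))
print(translate_mossen("*Cinto"))
print(translate_mossen("venir<vblex><pri><p3><sg>"))
```

The first two calls give "monseigneur" (anthroponym and unknown word) and the third gives "curé".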

Writing and generating rules

Writing

Main article: How to get started with lexical selection rules

A good way to start writing lexical selection rules is to take a corpus and search for the problem word. You can then look at how the word should be translated and at the contexts it appears in.

Generating

Parallel corpus
    Main article: Learning rules from parallel and non-parallel corpora

Monolingual corpora
    Main article: Running_the_monolingual_rule_learning

Todo and bugs

  • xml compiler
  • compile rule operation patterns, as well as matching patterns
  • make rules with gaps work
  • optimal coverage
  • fix bug with processing multiple sentences
  • instead of having regex OR, insert separate paths/states.
  • optimise the bestPath function (don't use strings to store the paths)
  • autotoolsise build
  • add option to compiler to spit out ATT transducers
  • fix bug with outputting an extra '\n' at the end
  • edit transfer.cc to allow input from lt-proc -b
  • profiling and speed up
    • why do the regex transducers have to be minimised ?
    • retrieve vector of strings corresponding to paths, instead of a single string corresponding to all of the paths
    • stop using string processing to retrieve rule numbers
    • retrieve vector of vectors of words, not string of words from lttoolbox
    • why does the performance drop substantially with more rules ?
    • add a pattern -> first letter map so we don't have to call recognise() with every transition (didn't work so well)
  • (done) there is a problem with the regex recognition code: see bug1 in testing.
  • (done) there is a problem with two defaults next to each other; bug2 in testing.
  • (done) default to case insensitive? (perhaps case insensitive for lower case, case sensitive for uppercase) -- see bug4 in testing/.
  • make sure that -b works with -n too.
  • testing
  • null flush
  • add option to processor to spit out ATT transducers
  • use brown clusters to merge rules with the same context, or remove parts of context from rules which are not relevant?
  • Ticket https://sourceforge.net/p/apertium/tickets/64/: <match tags="n.*"></match> never matches, while <match tags="n.*"/> does

Performance
  • 2011-12-12: 10,000 words / 97 seconds = 103 words/sec (71290 words, 14.84 sec = 4803 words/sec)
  • 2011-12-19: 10,000 words / 4 seconds = 2,035 words/sec (71290 words, 8 secs = 8911 words/sec)

Preparedness of language pairs

Pair             LR (L)     LR (L→R)   Fertility   Rules
apertium-is-en   18,563     22,220     1.19        115
apertium-es-fr
apertium-eu-es   16,946     18,550     1.09        250
apertium-eu-en
apertium-br-fr   20,489     20,770     1.01        256
apertium-mk-en   8,568      10,624     1.24        81
apertium-es-pt
apertium-es-it
apertium-es-ro
apertium-en-es   267,469    268,522    1.003       334
apertium-en-ca
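
The page does not define the Fertility column, but the figures are consistent with it being LR (L→R) divided by LR (L), that is, the average number of translations per source entry. The Python sketch below checks that reading against the filled-in rows; the interpretation and the fertility helper are assumptions.

```python
# (LR (L), LR (L→R), published fertility), copied from the table above.
TABLE = {
    "apertium-is-en": (18563, 22220, 1.19),
    "apertium-eu-es": (16946, 18550, 1.09),
    "apertium-br-fr": (20489, 20770, 1.01),
    "apertium-mk-en": (8568, 10624, 1.24),
    "apertium-en-es": (267469, 268522, 1.003),
}


def fertility(lr_l, lr_l_to_r):
    """Assumed definition: entries in the L→R direction per source entry."""
    return lr_l_to_r / lr_l


for pair, (lr_l, lr_l_to_r, published) in TABLE.items():
    computed = fertility(lr_l, lr_l_to_r)
    print(f"{pair}: computed {computed:.3f}, published {published}")
```

A value close to 1 means that few source entries have more than one translation, so there is little for lexical selection rules to decide in that pair.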


Troubleshooting

If you get the message lrx-comp: error while loading shared libraries: libapertium3-3.2.so.0: cannot open shared object file: No such file or directory, you may need to put this in your ~/.bashrc:

export LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"

Then open a new terminal before using lrx-comp/lrx-proc.

On a 64-bit machine, make may fail in apertium-lex-tools with the error /usr/bin/ld: cannot find -lz, as if zlib were missing, even though zlib1g-dev is installed. In that case, install the package lib32z1-dev (which will pull in many other dependencies). Although it is a 32-bit package, it is needed to compile the sources.

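See also

  • How to get started with lexical selection rules
  • SVN Module: apertium-lex-tools (https://svn.code.sf.net/p/apertium/svn/trunk/apertium-lex-tools/)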
