Constraint-based lexical selection module

From Apertium
apertium-lex-tools provides a module for compiling lexical selection rules and processing them in the pipeline. Rules can be manually written, or learnt from monolingual or parallel corpora.

Installing

Prerequisites and compilation steps are the same as for lttoolbox and apertium. On Debian/Ubuntu you also need zlib1g-dev.

See Installation: for most operating systems you can now get pre-built packages of apertium-lex-tools (as well as the other core tools) through your regular package manager.

Lexical transfer in the pipeline

lrx-proc runs between bidix lookup and the first stage of transfer, e.g.

… apertium-pretransfer | lt-proc -b kaz-tat.autobil.bin | lrx-proc kaz-tat.lrx.bin \
  | apertium-transfer -b apertium-kaz-tat.kaz-tat.t1x  kaz-tat.t1x.bin | …

This is the output of lt-proc -b on an ambiguous bilingual dictionary:

[74306] ^El<det><def><f><sg>/The<det><def><f><sg>$ 
^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^més<preadv>/more<preadv>$ 
^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ 
^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ 
^el<det><def><f><sg>/the<det><def><f><sg>$ 
^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ 
^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ 
^el<det><def><m><sg>/the<det><def><m><sg>$ 
^estiu<n><m><sg>/summer<n><sg>$^.<sent>/.<sent>$

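That is, the Catalan sentence

L'estació més plujosa és la tardor, i la més seca l'estiu

comes out of the bilingual dictionary with its lexical ambiguities still unresolved: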

The season/station more rainy is the autumn/fall, and the more dry the summer.
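
To see which words still need a lexical selection decision, the stream can be inspected with a few lines of Python. This is only a sketch: the function name `ambiguous_units` is made up for illustration, and it assumes that lemmas contain no escaped `/`, `^` or `$` characters, which real streams may have.

```python
import re

# A lexical unit in the lt-proc -b output looks like ^source/target1/target2$.
# Assumption: no escaped "/", "^" or "$" inside the unit.
LU = re.compile(r"\^([^$]*)\$")


def ambiguous_units(stream):
    """Return (source, [targets]) for every unit with more than one translation."""
    result = []
    for body in LU.findall(stream):
        source, *targets = body.split("/")
        if len(targets) > 1:
            result.append((source, targets))
    return result


# A shortened copy of the example stream above.
stream = (
    "^estació<n><f><sg>/season<n><sg>/station<n><sg>$ "
    "^més<preadv>/more<preadv>$ "
    "^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$"
)

for source, targets in ambiguous_units(stream):
    print(source, "->", targets)
```

For the shortened stream this lists estació and tardor, the two words for which a rule could pick a translation.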

Apertium/lttoolbox 3.3 and onwards support the -b option to lt-proc / apertium-transfer.

Usage

Create a simple rule file, here called rules.xml:

<rules>
  <rule>
    <match lemma="criminal" tags="adj"/>
    <match lemma="court" tags="n.*"><select lemma="juzgado" tags="n.*"/></match>
  </rule>
</rules>

Then compile it:

$ lrx-comp rules.xml rules.fst
1: 32@32

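The input to lrx-proc is the output of lt-proc -b. In the run below, -t makes lrx-proc print a trace line for the rule that applies, followed by the disambiguated stream: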

$ echo "^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ 
^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ 
^court<n><sg>/corte<n><f><sg>/cancha<n><f><sg>/juzgado<n><m><sg>/tribunal<n><m><sg>$^.<sent>/.<sent>$" | ./lrx-proc -t rules.fst 
1:SELECT<1>:court<n><sg>:<select>juzgado<n><ANY_TAG>
^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ 
^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ ^court<n><sg>/juzgado<n><m><sg>$^.<sent>/.<sent>$
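
The effect of this rule on the stream can be imitated in plain Python. The sketch below is not how lrx-proc works internally and is no substitute for it. It handles only one preceding context word, compares only the first tag of each word instead of a full tag pattern, and assumes there are no escaped characters. The names `select_after` and `lemma_and_tags` are invented for this example.

```python
import re

LU = re.compile(r"\^([^$]*)\$")


def lemma_and_tags(reading):
    """Split 'court<n><sg>' into ('court', ['n', 'sg'])."""
    lemma = reading.split("<", 1)[0]
    return lemma, re.findall(r"<([^>]+)>", reading)


def select_after(stream, context, word, wanted):
    """Keep only translations with lemma `wanted` for `word` when the
    preceding unit is `context`; both are (lemma, first tag) pairs."""
    prev = None  # (lemma, first tag) of the preceding unit

    def rewrite(match):
        nonlocal prev
        source, *targets = match.group(1).split("/")
        lemma, tags = lemma_and_tags(source)
        current = (lemma, tags[0] if tags else None)
        if prev == context and current == word:
            kept = [t for t in targets if lemma_and_tags(t)[0] == wanted]
            targets = kept or targets  # leave the unit alone if nothing matches
        prev = current
        return "^" + "/".join([source] + targets) + "$"

    return LU.sub(rewrite, stream)


stream = (
    "^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ "
    "^court<n><sg>/corte<n><f><sg>/cancha<n><f><sg>"
    "/juzgado<n><m><sg>/tribunal<n><m><sg>$^.<sent>/.<sent>$"
)
print(select_after(stream, ("criminal", "adj"), ("court", "n"), "juzgado"))
```

As in the lrx-proc run above, only juzgado survives for court, and the other units pass through unchanged.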

Rule format

A rule is made up of an ordered list of:

  • Matches, one per word of source-language context
  • Operations (select, remove), nested inside the match they apply to

For example:

<rule>
  <match lemma="el"/>  
  <match lemma="dona" tags="n.*">    
    <select lemma="wife"/> 
  </match>  
  <match lemma="de"/>
</rule>

<rule>  
  <match lemma="estació" tags="n.*">    
    <select lemma="season"/> 
  </match>  
  <match lemma="més"/>
  <match lemma="plujós"/>
</rule>

<rule>  
  <match lemma="guanyador"/>
  <match lemma="de"/>
  <match/>
  <match lemma="prova" tags="n.*">    
    <select lemma="event"/> 
  </match>  
</rule>
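
When many rules share this shape, the XML can be produced by a script instead of typed by hand. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper `make_rule` and its parameters are hypothetical, and it covers only the case shown above: one target word with a single select operation and plain lemma matches as context.

```python
import xml.etree.ElementTree as ET


def make_rule(context_before, lemma, tags, translation, context_after=()):
    """Build one <rule>: context matches around a match that carries a <select>."""
    rule = ET.Element("rule")
    for ctx in context_before:
        ET.SubElement(rule, "match", lemma=ctx)
    target = ET.SubElement(rule, "match", lemma=lemma, tags=tags)
    ET.SubElement(target, "select", lemma=translation)
    for ctx in context_after:
        ET.SubElement(rule, "match", lemma=ctx)
    return rule


rules = ET.Element("rules")
# The first two rules shown above, rebuilt from their parts.
rules.append(make_rule(["el"], "dona", "n.*", "wife", ["de"]))
rules.append(make_rule([], "estació", "n.*", "season", ["més", "plujós"]))

# Serialise to a string; the result is a single unindented line of XML.
print(ET.tostring(rules, encoding="unicode"))
```

The printed XML can be saved to a file and compiled with lrx-comp in the same way as a hand-written rule file.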

Writing and generating rules

Writing

Main article: How to get started with lexical selection rules

A good way to start writing lexical selection rules is to take a corpus and search for the problem word. You can then look at how the word should be translated and at the contexts it appears in.
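
A corpus search of this kind can be sketched as a small concordance script. The function `concordance` and the two example sentences are made up for illustration. The script matches lower-cased surface forms only, so inflected forms of the problem word would need a lemmatised corpus or extra search terms.

```python
import re


def concordance(lines, word, width=3):
    """Return (left context, hit, right context) for each occurrence of `word`."""
    hits = []
    for line in lines:
        tokens = re.findall(r"\w+", line.lower())
        for i, tok in enumerate(tokens):
            if tok == word:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                hits.append((left, tok, right))
    return hits


# Hypothetical corpus lines; in practice, read them from a corpus file.
corpus = [
    "The criminal court heard the case on Monday.",
    "She booked the tennis court for an hour.",
]

for left, hit, right in concordance(corpus, "court"):
    print(f"{left:>25} [{hit}] {right}")
```

Words that keep appearing next to one sense of the problem word (here "criminal" versus "tennis") are natural candidates for the match elements of a rule.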

Generating

Parallel corpus
Main article: Learning rules from parallel and non-parallel corpora

Monolingual corpora
Main article: Running the monolingual rule learning

Todo and bugs

  • xml compiler
  • compile rule operation patterns, as well as matching patterns
  • make rules with gaps work
  • optimal coverage
  • fix bug with processing multiple sentences
  • instead of having regex OR, insert separate paths/states.
  • optimise the bestPath function (don't use strings to store the paths)
  • autotoolsise build
  • add option to compiler to spit out ATT transducers
  • fix bug with outputting an extra '\n' at the end
  • edit transfer.cc to allow input from lt-proc -b
  • profiling and speed up
    • why do the regex transducers have to be minimised ?
    • retrieve vector of strings corresponding to paths, instead of a single string corresponding to all of the paths
    • stop using string processing to retrieve rule numbers
    • retrieve vector of vectors of words, not string of words from lttoolbox
    • why does the performance drop substantially with more rules ?
    • add a pattern -> first letter map so we don't have to call recognise() with every transition (didn't work so well)
  • there is a problem with the regex recognition code: see bug1 in testing.
  • there is a problem with two defaults next to each other; bug2 in testing.
  • default to case insensitive ? (perhaps case insensitive for lower case, case sensitive for uppercase) -- see bug4 in testing/.
  • make sure that -b works with -n too.
  • testing
  • null flush
  • add option to processor to spit out ATT transducers
  • use brown clusters to merge rules with the same context, or remove parts of context from rules which are not relevant?
  • https://sourceforge.net/p/apertium/tickets/64/ <match tags="n.*"></match> never matches, while <match tags="n.*"/> does

Performance
  • 2011-12-12: 10,000 words / 97 seconds = 103 words/sec (71290 words, 14.84 sec = 4803 words/sec)
  • 2011-12-19: 10,000 words / 4 seconds = 2,035 words/sec (71290 words, 8 secs = 8911 words/sec)

Preparedness of language pairs

Pair             LR (L)    LR (L→R)   Fertility   Rules
apertium-is-en   18,563    22,220     1.19        115
apertium-es-fr
apertium-eu-es   16,946    18,550     1.09        250
apertium-eu-en
apertium-br-fr   20,489    20,770     1.01        256
apertium-mk-en   8,568     10,624     1.24        81
apertium-es-pt
apertium-es-it
apertium-es-ro
apertium-en-es   267,469   268,522    1.003       334
apertium-en-ca


Troubleshooting

If you get the message

lrx-comp: error while loading shared libraries: libapertium3-3.2.so.0: cannot open shared object file: No such file or directory

you may need to add the following line to your ~/.bashrc:

export LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"

Then open a new terminal before using lrx-comp/lrx-proc.

On a 64-bit machine, running make for apertium-lex-tools may fail because zlib cannot be found, even though zlib1g-dev is installed. If you get the error message /usr/bin/ld: cannot find -lz, install the package lib32z1-dev (which pulls in many other dependencies). Although it is a 32-bit library, it is needed to compile the sources.
