Constraint-based lexical selection module

From Apertium

Revision as of 09:50, 9 October 2012

Lexical transfer

This is the output of lt-proc -b with an ambiguous bilingual dictionary, i.e. one in which some source words have more than one translation.

[74306] ^El<det><def><f><sg>/The<det><def><f><sg>$ 
^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^més<preadv>/more<preadv>$ 
^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ 
^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ 
^el<det><def><f><sg>/the<det><def><f><sg>$ 
^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ 
^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ 
^el<det><def><m><sg>/the<det><def><m><sg>$ 
^estiu<n><m><sg>/summer<n><sg>$^.<sent>/.<sent>$

I.e.

L'estació més plujosa és la tardor, i la més seca l'estiu

Goes to:

The season/station more rainy is the autumn/fall, and the more dry the summer.
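
To make the stream format above concrete, here is a minimal Python sketch that splits lt-proc -b output into lexical units and lists the ambiguous ones. The function name parse_lexical_units is hypothetical and not part of any Apertium tool, and the sketch assumes there are no escaped / or $ characters inside a lexical unit.

```python
import re

# One lexical unit is everything between ^ and $.
LU_PATTERN = re.compile(r"\^(.*?)\$")


def parse_lexical_units(stream):
    """Return a list of (source, [translations]) pairs.

    Hypothetical helper: the first /-separated field is taken as the
    source-language side, the remaining fields as the translations.
    """
    units = []
    for body in LU_PATTERN.findall(stream):
        source, *translations = body.split("/")
        units.append((source, translations))
    return units


sample = (
    "^estació<n><f><sg>/season<n><sg>/station<n><sg>$ "
    "^més<preadv>/more<preadv>$ "
    "^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$"
)

units = parse_lexical_units(sample)

# A unit is ambiguous when it has more than one translation.
ambiguous = [source for source, translations in units if len(translations) > 1]
print(ambiguous)
```

For the sample, the ambiguous units are estació and tardor, which are the two words shown with alternatives in the gloss above.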

For the module to work, transfer must be run either with the VM for transfer or with apertium-transfer -b.

Compilation

Check out the code from the Subversion repository:

$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-lex-tools

Then build it with:

$ ./autogen.sh
$ make

Usage

Write a simple rule file, rules.xml:

<rules>
  <rule>
    <match lemma="criminal" tags="adj"/>
    <match lemma="court" tags="n.*"><select lemma="juzgado" tags="n.*"/></match>
  </rule>
</rules>

Then compile it:

$ ./apertium-lrx-comp rules.xml rules.fst
1
Written 1 rules, 3 patterns.

The processor takes the output of lt-proc -b as its input:

$ echo "^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ 
^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ 
^court<n><sg>/corte<n><f><sg>/cancha<n><f><sg>/juzgado<n><m><sg>/tribunal<n><m><sg>$^.<sent>/.<sent>$" | ./apertium-lrx-proc -t rules.fst 
1:SELECT:1 juzgado<n><m><sg>
^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ 
^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ ^court<n><sg>/juzgado<n><m><sg>$^.<sent>/.<sent>$
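
As an illustration of how the rule file above is organised, the following Python sketch reads it with the standard library and lists each rule's matches in order. This is not how apertium-lrx-comp works internally; load_rules and the dictionary layout are hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET

RULES_XML = """
<rules>
  <rule>
    <match lemma="criminal" tags="adj"/>
    <match lemma="court" tags="n.*"><select lemma="juzgado" tags="n.*"/></match>
  </rule>
</rules>
"""


def load_rules(xml_text):
    """Return each rule as a list of match dictionaries, in document order."""
    rules = []
    for rule in ET.fromstring(xml_text).findall("rule"):
        matches = []
        for match in rule.findall("match"):
            select = match.find("select")
            matches.append({
                "lemma": match.get("lemma"),  # None if the attribute is absent
                "tags": match.get("tags"),
                # The operation nested inside the match, if there is one.
                "select": dict(select.attrib) if select is not None else None,
            })
        rules.append(matches)
    return rules


rules = load_rules(RULES_XML)
for position, match in enumerate(rules[0]):
    print(position, match)
```

The second match carries the select operation, which is why only the translations of court are reduced in the output above.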

Rule format

A rule is made up of an ordered list of:

  • Matches
  • Operations (select, remove)

<rule>  
  <match lemma="el"/>  
  <match lemma="dona" tags="n.*">    
    <select lemma="wife"/> 
  </match>  
  <match lemma="de"/>
</rule>

<rule>  
  <match lemma="estació" tags="n.*">    
    <select lemma="season"/> 
  </match>  
  <match lemma="més"/>
  <match lemma="plujós"/>
</rule>

<rule>  
  <match lemma="guanyador"/>
  <match lemma="de"/>
  <match/>
  <match lemma="prova" tags="n.*">    
    <select lemma="event"/> 
  </match>  
</rule>
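
The rules above can be read as a sliding window over the sentence: every match must fit the word at its position, and a select keeps only the translations that fit its pattern. The Python sketch below illustrates that reading for a single rule. It is not the module's actual algorithm (it uses no transducers, weights or coverage optimisation), and the tag semantics are an assumption: tags are taken to be dot-separated, a trailing * stands for any remaining tags, and a missing attribute matches anything. All function names are hypothetical.

```python
import re


def split_lu(side):
    """Split 'dona<n><f><sg>' into ('dona', ['n', 'f', 'sg'])."""
    lemma = side.split("<", 1)[0]
    return lemma, re.findall(r"<([^>]+)>", side)


def tags_match(pattern, tags):
    """Assumed semantics: dot-separated tags, trailing '*' matches the rest."""
    if pattern is None:
        return True
    wanted = pattern.split(".")
    if wanted[-1] == "*":
        prefix = wanted[:-1]
        return tags[:len(prefix)] == prefix
    return tags == wanted


def side_matches(pattern, side):
    """An empty pattern, like <match/>, matches any word."""
    lemma, tags = split_lu(side)
    return (pattern.get("lemma") in (None, lemma)
            and tags_match(pattern.get("tags"), tags))


def apply_rule(rule, units):
    """rule: list of (match_pattern, select_pattern_or_None) pairs.
    units: list of (source, [translations]). Returns a new list."""
    out = [(source, list(translations)) for source, translations in units]
    for start in range(len(units) - len(rule) + 1):
        window = units[start:start + len(rule)]
        if not all(side_matches(match, source)
                   for (match, _), (source, _) in zip(rule, window)):
            continue
        for offset, (_, select) in enumerate(rule):
            if select is None:
                continue
            source, translations = out[start + offset]
            kept = [t for t in translations if side_matches(select, t)]
            if kept:  # never remove every translation
                out[start + offset] = (source, kept)
    return out


rule = [
    ({"lemma": "estació", "tags": "n.*"}, {"lemma": "season"}),
    ({"lemma": "més"}, None),
    ({"lemma": "plujós"}, None),
]
units = [
    ("estació<n><f><sg>", ["season<n><sg>", "station<n><sg>"]),
    ("més<preadv>", ["more<preadv>"]),
    ("plujós<adj><f><sg>", ["rainy<adj><sint><f><sg>"]),
]
result = apply_rule(rule, units)
print(result[0])
```

With this rule, estació keeps only season because it is followed by més and plujós; without that context both translations would remain.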


Writing and generating rules

Writing

Main article: How to get started with lexical selection rules

A good way to start writing lexical selection rules is to take a corpus and search it for the problem word. You can then look at how the word should be translated and at the contexts in which it appears.

Generating

Parallel corpus
Main article: Generating lexical-selection rules from a parallel corpus
Monolingual corpora

Compiled

The general structure is as follows:


LSRRECORD = id, len, weight;

<ALPHABET>
<NUM_TRANSDUCERS>
<TRANSDUCER>
<TRANSDUCER>
<TRANSDUCER>
...
"main"
<TRANSDUCER>
<LSRRECORD>
<LSRRECORD>
<LSRRECORD>

Todo and bugs

  • xml compiler
  • compile rule operation patterns, as well as matching patterns
  • make rules with gaps work
  • optimal coverage
  • fix bug with processing multiple sentences
  • instead of having regex OR, insert separate paths/states.
  • optimise the bestPath function (don't use strings to store the paths)
  • autotoolsise build
  • add option to compiler to spit out ATT transducers
  • fix bug with outputting an extra '\n' at the end
  • edit transfer.cc to allow input from lt-proc -b
  • profiling and speed up
    • why do the regex transducers have to be minimised ?
    • retrieve vector of strings corresponding to paths, instead of a single string corresponding to all of the paths
    • stop using string processing to retrieve rule numbers
    • retrieve vector of vectors of words, not string of words from lttoolbox
    • why does the performance drop substantially with more rules ?
    • add a pattern -> first letter map so we don't have to call recognise() with every transition (didn't work so well)
  • there is a problem with the regex recognition code: see bug1 in testing.
  • there is a problem with two defaults next to each other; bug2 in testing.
  • default to case insensitive ? (perhaps case insensitive for lower case, case sensitive for uppercase) -- see bug4 in testing/.
  • make sure that -b works with -n too.
  • testing
  • null flush
  • add option to processor to spit out ATT transducers

Performance
  • 2011-12-12: 10,000 words / 97 seconds = 103 words/sec (71290 words, 14.84 sec = 4803 words/sec)
  • 2011-12-19: 10,000 words / 4 seconds = 2,035 words/sec (71290 words, 8 secs = 8911 words/sec)

Preparedness of language pairs

Pair             LR (L)     LR (L→R)   Fertility   Rules
apertium-is-en   18,563     22,220     1.19        115
apertium-es-fr
apertium-eu-es   16,946     18,550     1.09        250
apertium-eu-en
apertium-br-fr   20,489     20,770     1.01        256
apertium-mk-en   8,568      10,624     1.24        81
apertium-es-pt
apertium-es-it
apertium-es-ro
apertium-en-es   267,469    268,522    1.003       334
apertium-en-ca
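
The Fertility column appears to be the second count divided by the first, that is, the average number of translations per entry. The page does not state this explicitly, so the short Python check below treats it as an assumption; the variable and function names are hypothetical.

```python
# (LR (L), LR (L→R)) counts copied from the rows that have data.
counts = {
    "apertium-is-en": (18563, 22220),
    "apertium-eu-es": (16946, 18550),
    "apertium-br-fr": (20489, 20770),
    "apertium-mk-en": (8568, 10624),
    "apertium-en-es": (267469, 268522),
}


def fertility(entries, translations):
    """Assumed definition: translations per entry."""
    return translations / entries


for pair, (entries, translations) in counts.items():
    print(f"{pair}: {fertility(entries, translations):.3f}")
```

Under this assumption the computed ratios agree with the table to within rounding. A value close to 1 means that few entries in that pair are ambiguous.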
