Difference between revisions of "Constraint-based lexical selection module"

From Apertium

Revision as of 22:03, 12 December 2011

Lexical transfer

This is the output of lt-proc -b on an ambiguous bilingual dictionary.

[74306] ^El<det><def><f><sg>/The<det><def><f><sg>$ 
^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^més<preadv>/more<preadv>$ 
^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ 
^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ 
^el<det><def><f><sg>/the<det><def><f><sg>$ 
^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ 
^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ 
^el<det><def><m><sg>/the<det><def><m><sg>$ 
^estiu<n><m><sg>/summer<n><sg>$^.<sent>/.<sent>$

That is, the sentence:

L'estació més plujós és el tardor, i la més sec el estiu

becomes, after lexical transfer:

The season/station more rainy is the autumn/fall, and the more dry the summer.
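To make the stream format concrete, here is a minimal Python sketch that splits lt-proc -b style output into lexical units and lists the ambiguous ones. The function names are hypothetical and not part of the module, and the sketch assumes that `^`, `/` and `$` never appear escaped inside a unit.

```python
import re

# One lexical unit looks like: ^source/target1/target2$
# Assumption: no escaped '^', '/' or '$' inside a unit.
LU_PATTERN = re.compile(r"\^(.*?)\$")


def parse_lexical_units(stream):
    """Return a list of (source, [targets]) pairs (hypothetical helper)."""
    units = []
    for body in LU_PATTERN.findall(stream):
        source, *targets = body.split("/")
        units.append((source, targets))
    return units


def ambiguous_units(units):
    """Keep only the units that have more than one translation."""
    return [(source, targets) for source, targets in units if len(targets) > 1]


sample = (
    "^estació<n><f><sg>/season<n><sg>/station<n><sg>$ "
    "^més<preadv>/more<preadv>$"
)
units = parse_lexical_units(sample)
for source, targets in ambiguous_units(units):
    print(source, "->", targets)
```

Only the units returned by `ambiguous_units` need lexical selection; the rest already have a single translation.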

The module requires VM (apertium-vm) for transfer, or another Apertium transfer implementation that does not perform lexical transfer itself.

Rule format

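A rule is an ordered list of:

  • Matches
  • Operations (select, remove)

For example, the following rule selects the translation "wife" for the noun "dona" when it appears between "el" and "de":
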
<rule>  
  <match lemma="el"/>  
  <match lemma="dona" tags="n.*">    
    <select lemma="wife"/> 
  </match>  
  <match lemma="de"/>
</rule>

<rule>  
  <match lemma="estació" tags="n.*">    
    <select lemma="season"/> 
  </match>  
  <match lemma="més"/>
  <match lemma="plujós"/>
</rule>

<rule>  
  <match lemma="guanyador"/>
  <match lemma="de"/>
  <match/>
  <match lemma="prova" tags="n.*">    
    <select lemma="event"/> 
  </match>  
</rule>
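
As an illustration of how such a rule could be applied, the following Python sketch parses one rule and keeps only the selected translation when the whole context matches. It is not the module's implementation: the helper names are hypothetical, only `select` is handled (not `remove`), and the reading of `tags="n.*"` as "first tag is n, followed by anything" is an assumption.

```python
import re
import xml.etree.ElementTree as ET

RULE = """<rule>
  <match lemma="estació" tags="n.*">
    <select lemma="season"/>
  </match>
  <match lemma="més"/>
  <match lemma="plujós"/>
</rule>"""


def split_reading(reading):
    """Split 'estació<n><f><sg>' into ('estació', ['n', 'f', 'sg'])."""
    return reading.split("<", 1)[0], re.findall(r"<([^>]+)>", reading)


def tags_match(pattern, tags):
    """Assumed semantics: 'n.*' means the first tag is 'n', then anything."""
    if pattern is None:
        return True
    parts = pattern.split(".")
    if parts[-1] == "*":
        return tags[:len(parts) - 1] == parts[:-1]
    return tags == parts


def matches_at(matches, units, start):
    """True if every <match> accepts the source reading at its position."""
    for match, (source, _targets) in zip(matches, units[start:]):
        lemma, tags = split_reading(source)
        if match.get("lemma") not in (None, lemma):
            return False
        if not tags_match(match.get("tags"), tags):
            return False
    return True


def apply_rule(rule_xml, units):
    """units is a list of (source_reading, [target_readings]) pairs."""
    matches = ET.fromstring(rule_xml).findall("match")
    result = [(source, list(targets)) for source, targets in units]
    for start in range(len(units) - len(matches) + 1):
        if not matches_at(matches, units, start):
            continue
        for offset, match in enumerate(matches):
            select = match.find("select")
            if select is None:
                continue
            source, targets = result[start + offset]
            wanted = select.get("lemma")
            kept = [t for t in targets if split_reading(t)[0] == wanted]
            if kept:  # never delete every translation
                result[start + offset] = (source, kept)
    return result


units = [
    ("estació<n><f><sg>", ["season<n><sg>", "station<n><sg>"]),
    ("més<preadv>", ["more<preadv>"]),
    ("plujós<adj><f><sg>", ["rainy<adj><sint><f><sg>"]),
]
selected = apply_rule(RULE, units)
print(selected[0])
```

A `<match/>` element with no attributes accepts any word in this sketch, which is how the gap in the "guanyador de ... prova" rule would be handled.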


Usage

$ cat /tmp/test | python apertium-lex-rules.py rules.txt 2>/dev/null
^El<det><def><f><sg>/The<det><def><f><sg>$ 
^estació<n><f><sg>/season<n><sg>$ ^més<preadv>/more<preadv>$ ^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ 
^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ ^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ 
^estiu<n><m><sg>/summer<n><sg>$ ^.<sent>/.<sent>$ 

With rules

$ cat /tmp/test | python apertium-lex-rules.py rules.txt | apertium-vm -c ca-en.t1x.vmb | apertium-vm -c ca-en.t2x.vmb |\
   apertium-vm -c ca-en.t3x.vmb | lt-proc -g ca-en.autogen.bin

The 
rainiest season 
is the 
autumn, and the 
driest the 
summer. 

With bilingual dictionary defaults

$ cat /tmp/test | apertium-lex-defaults ca-en.autoldx.bin | apertium-vm -c ca-en.t1x.vmb | apertium-vm -c ca-en.t2x.vmb |\
   apertium-vm -c ca-en.t3x.vmb | lt-proc -g ca-en.autogen.bin

The 
rainiest station 
is the 
autumn, and the 
driest the 
summer.

Rule application process

Optimal application

We are interested in the longest match, but not in strict left-to-right order. To get this, we build an automaton from the rule contexts: each rule is one transducer, and the transducers are then composed. We read the sentence through the automaton, where each state is a lexical unit (LU). The automaton needs to be non-deterministic, so we keep a log of the live paths/states together with their "weight" (the number of transitions made so far). When we reach the end of the sentence, the longest path for each ambiguous word is the winner.
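
The idea can be sketched in Python as follows. This is a simplified illustration, not the module's code: rule contexts are plain lists of lemmas rather than transducers, `None` stands for "any word", and the function name and rule tuples are hypothetical.

```python
def best_selections(lemmas, rules):
    """Pick, for each word, the translation from the longest matching context.

    rules: list of (context, target_index, translation), where context is a
    list of lemmas (None = any word) and target_index is the position inside
    the context of the word being translated.
    """
    alive = []    # live paths: (rule number, next context position, start)
    winners = {}  # sentence position -> (weight, translation)
    for pos, lemma in enumerate(lemmas):
        # Non-determinism: every rule may start a new path at every word.
        alive += [(r, 0, pos) for r in range(len(rules))]
        survivors = []
        for r, step, start in alive:
            context, target, translation = rules[r]
            if context[step] is not None and context[step] != lemma:
                continue  # this path dies
            step += 1
            if step == len(context):
                weight = len(context)  # number of transitions made
                where = start + target
                if where not in winners or weight > winners[where][0]:
                    winners[where] = (weight, translation)
            else:
                survivors.append((r, step, start))
        alive = survivors
    return {where: translation for where, (_w, translation) in winners.items()}


lemmas = ["el", "estació", "més", "plujós", "ser", "el", "tardor"]
rules = [
    (["estació"], 0, "station"),                  # short context
    (["estació", "més", "plujós"], 0, "season"),  # longer context wins
]
print(best_selections(lemmas, rules))
```

Both rules match at "estació", but the three-word context has made more transitions by the time it completes, so its translation replaces the one proposed by the one-word rule.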

Writing and generating rules

Writing

A good way to start writing lexical selection rules is to take a corpus and search for the problem word. You can then look at how the word should be translated and at the contexts in which it appears.
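
One way to do this search is a small concordance script. The Python sketch below is only an illustration: the function name is hypothetical, the two corpus sentences are made up for the example, and tokenisation is a plain word-character split.

```python
import re


def concordance(lines, word, width=3):
    """Yield (left context, word, right context) for each occurrence of word."""
    for line in lines:
        tokens = re.findall(r"\w+", line.lower())
        for i, token in enumerate(tokens):
            if token == word:
                left = tokens[max(0, i - width):i]
                right = tokens[i + 1:i + 1 + width]
                yield left, token, right


# Made-up sample sentences; in practice, read the lines from a corpus file.
corpus = [
    "L'estació més plujosa és la tardor.",
    "El tren arriba a l'estació de Girona.",
]
hits = list(concordance(corpus, "estació"))
for left, word, right in hits:
    print(" ".join(left), "[" + word + "]", " ".join(right))
```

Looking at the neighbouring words in each hit suggests which lemmas are worth using as `<match>` contexts in a rule.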


Generating

Compiled

The general structure is as follows:


LSRRECORD = id, len, weight;

<ALPHABET>
<NUM_TRANSDUCERS>
<TRANSDUCER>
<TRANSDUCER>
<TRANSDUCER>
...
"main"
<TRANSDUCER>
<LSRRECORD>
<LSRRECORD>
<LSRRECORD>

Todo

  • xml compiler
  • compile rule operation patterns, as well as matching patterns
  • make rules with gaps work
  • optimal coverage
  • fix bug with processing multiple sentences
  • instead of having regex OR, insert separate paths/states.
  • testing
  • optimise the bestPath function (don't use strings to store the paths)
  • edit transfer.cc to allow input from lt-proc -b
  • null flush
  • default to case insensitive ? (perhaps case insensitive for lower case, case sensitive for uppercase)
  • add option to compiler and processor to spit out ATT transducers

Performance

  • 2011-12-12: 10,000 words / 97 seconds = 103 words/sec

See also