Apertium has moved from SourceForge to GitHub.
If you have any questions, please come and talk to us on #apertium on irc.freenode.net or contact the GitHub migration team.

Constraint-based lexical selection module

From Apertium
Revision as of 12:44, 30 November 2011 by Jacob Nordfalk (Talk | contribs)

Jump to: navigation, search

Contents

Lexical transfer

This is the output of lt-proc -b on an ambiguous bilingual dictionary.

[74306] ^El<det><def><f><sg>/The<det><def><f><sg>$ 
^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^més<preadv>/more<preadv>$ ^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ 
^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ ^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ 
^estiu<n><m><sg>/summer<n><sg>$^.<sent>/.<sent>$

I.e.

El estació més plujós ser el tardor, i el més sec el estiu

Goes to:

The season/station more rainy is the autumn/fall, and the more dry the summer.

The module requires VM for transfer, or another apertium transfer implementation without lexical transfer in order to work.

Rule format

A rule is made up of:

  • An action (select, remove)
  • A "centre" (the source language token that will be treated)
  • A target language pattern on which the action takes place
  • A source language context

Text

s	("estació" n)	("season" n)	(1 "plujós")
s	("estació" n)	("season" n)	(2 "plujós")
s	("estació" n)	("season" n)	(1 "de") (3 "any")
s	("estació" n)	("station" n)	(1 "de") (3 "Línia")
s	("prova" n)	("evidence" n)	(1 "arqueològic")
s	("prova" n)	("test" n)	(1 "estadístic")
s	("prova" n)	("event" n)	(-3 "guanyador") (-2 "de") 
s	("prova" n)	("testing" n)	(-2 "tècnica") (-1 "de") 
s	("joc" n)	("game" n)	(1 "olímpic")
s	("joc" n)	("set" n)	(1 "de") (2 "caràcter")
r	("pista" n)	("hint" n)	(1 "més") (2 "llarg")
r	("pista" n)	("clue" n)	(1 "més") (2 "llarg")
r	("motiu" n)	("motif" n)	(-1 "aquest") (-2 "per")
s	("carn" n)	("flesh" n)	(1 "i") (2 "os")
s	("sobre" pr)	("over" n)	(-1 "victòria")
s       ("dona" n)      ("wife" n)      (-1 "*" det pos)
s       ("dona" n)      ("wife" n)      (-1 "el") (1 "de")
s       ("dona" n)      ("woman" n)     (1 "de") (2 "*" det pos) (3 "somni")
r       ("patró n)      ("pattern" n)   (1 "*" np ant)

Usage

$ cat /tmp/test | python apertium-lex-rules.py rules.txt 2>/dev/null
^El<det><def><f><sg>/The<det><def><f><sg>$ 
^estació<n><f><sg>/season<n><sg>$ ^més<preadv>/more<preadv>$ ^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ 
^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ ^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ 
^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ 
^estiu<n><m><sg>/summer<n><sg>$ ^.<sent>/.<sent>$ 
With rules
$ cat /tmp/test | python apertium-lex-rules.py rules.txt | apertium-vm -c ca-en.t1x.vmb | apertium-vm -c ca-en.t2x.vmb |\
   apertium-vm -c ca-en.t3x.vmb | lt-proc -g ca-en.autogen.bin

The 
rainiest season 
is the 
autumn, and the 
driest the 
summer. 
With bilingual dictionary defaults
$ cat /tmp/test | apertium-lex-defaults ca-en.autoldx.bin | apertium-vm -c ca-en.t1x.vmb | apertium-vm -c ca-en.t2x.vmb |\
   apertium-vm -c ca-en.t3x.vmb | lt-proc -g ca-en.autogen.bin

The 
rainiest station 
is the 
autumn, and the 
driest the 
summer.

XML

Rule application process

The following is an inefficient implementation of the rule application process:


# s	("prova" n)	("event" n)	(-3 "guanyador") (-2 "de") 
#
# tipus = "select";
# centre = "^prova<n>.*"
# tl_patro = ["^event<n>.*"]
# sl_patro = {-3: "^guanyador<", -2: "^de<"}

CLASS Rule: 
        tipus = enum('select', 'remove')
        centre = '';
        tl_patro = [];
        sl_patro = {};


rule_table = {}; # e.g. rule_table["estació"] = [rule1, rule2, rule3];
i = 0

DEFINE ApplyRule(rule, lu): 
    

    FOREACH target IN lu.tl: 
        SWITCH rule.tipus:
            'select': 
                 IF target NOT IN rule.tl_patro:
                     DELETE target
            'remove': 
                 IF target IN rule.tl_patro:
                     DELETE target



FOREACH pair(sl, tl) IN sentence:
   
    FOREACH centre IN rule_table: 

        IF centre IN sl: 

            FOREACH rule IN rule_table[centre]: 

                matched = False   

                FOREACH context_item IN rule_table[centre][rule]: 

                  IF context_item in sentence: 
                      matched = True
                  ELSE:
                      matched = False
                
                # If all of the context items have matched, and none of them have not matched
                # if a rule matches break and continue to the pair. 
                IF matched == True:
 
                      sentence[i] = ApplyRule(rule_table[centre][rule], sentence[i])
                      break 

    i = i + 1


Optimal application

We're interested in the longest match, but not left to right, so what we do is make an automata of the rule contexts (one rule is one transducer, then we compose them), and we read through them, each state is an LU, It needs to be non-deterministic, and you keep a log of alive paths/states, but also their "weight" (how many transitions have been made) -- the longest for each of the ambiguous words is the winner when we get to the end of the sentence.

Writing and generating rules

Writing

A good way to start writing lexical selection rules is to take a corpus, and search for the problem word, you can then look at how the word should be translated, and the contexts it appears in.


Generating

Rule formats

<rule>  
  <skip lemma="el"/>  
  <select lemma="dona" tags="n.*">    
    <acception lemma="wife"/> 
  </select>  
  <skip lemma="de"/>
</rule>

<rule>  
  <skip lemma="guanyador"/>
  <skip lemma="de"/>
  <skip/>
  <select lemma="prova" tags="n.*">    
    <acception lemma="event"/> 
  </select>  
</rule>

Compiled

The general structure is as follows:


LSSRECORD = id, len, weight;

<ALPHABET>
<NUM_TRANSDUCERS>
<TRANSDUCER>
<TRANSDUCER>
<TRANSDUCER>
...
"main"
<TRANSDUCER>
<LSRRECORD>
<LSRRECORD>
<LSRRECORD>

Todo

  • xml compiler
  • compile rule operation patterns, as well as matching patterns
  • make rules with gaps work
  • optimal coverage

See also

Personal tools