Constraint-based lexical selection module
Lexical transfer
This is the output of lt-proc -b on an ambiguous bilingual dictionary.
[74306] ^El<det><def><f><sg>/The<det><def><f><sg>$ ^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^més<preadv>/more<preadv>$ ^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ ^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ ^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ ^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ ^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^estiu<n><m><sg>/summer<n><sg>$^.<sent>/.<sent>$
I.e.
El estació més plujós ser el tardor, i el més sec el estiu
Goes to:
The season/station more rainy is the autumn/fall, and the more dry the summer.
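To make the stream format above concrete, here is a minimal Python sketch that splits lt-proc -b output into lexical units and lists the ambiguous ones. The names parse_lus and lemma are hypothetical helpers and are not part of the module. The sketch assumes that the stream contains no escaped "/" or "$" characters.

```python
import re

# One lexical unit looks like ^source/target1/target2$
LU_RE = re.compile(r"\^(.*?)\$")

def parse_lus(stream):
    """Return a list of (source, [targets]) pairs from lt-proc -b output."""
    units = []
    for body in LU_RE.findall(stream):
        # Assumption: no escaped slashes inside the lexical unit
        source, *targets = body.split("/")
        units.append((source, targets))
    return units

def lemma(analysis):
    """Strip the tags from an analysis such as 'estació<n><f><sg>'."""
    return analysis.split("<", 1)[0]

line = "^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^més<preadv>/more<preadv>$"
units = parse_lus(line)

# A unit is ambiguous when it has more than one target-language translation
ambiguous = [lemma(sl) for sl, tls in units if len(tls) > 1]
print(ambiguous)
```

Only units with more than one translation need lexical selection; all others pass through unchanged.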
The module requires the VM (apertium-vm) for transfer, or another Apertium transfer implementation that does not do lexical transfer itself.
Rule format
A rule is made up of:
- An action (select, remove)
- A "centre" (the source language token that will be treated)
- A target language pattern on which the action takes place
- A source language context
Text
s ("estació" n) ("season" n) (1 "plujós")
s ("estació" n) ("season" n) (2 "plujós")
s ("estació" n) ("season" n) (1 "de") (3 "any")
s ("estació" n) ("station" n) (1 "de") (3 "Línia")
s ("prova" n) ("evidence" n) (1 "arqueològic")
s ("prova" n) ("test" n) (1 "estadístic")
s ("prova" n) ("event" n) (-3 "guanyador") (-2 "de")
s ("prova" n) ("testing" n) (-2 "tècnica") (-1 "de")
s ("joc" n) ("game" n) (1 "olímpic")
s ("joc" n) ("set" n) (1 "de") (2 "caràcter")
r ("pista" n) ("hint" n) (1 "més") (2 "llarg")
r ("pista" n) ("clue" n) (1 "més") (2 "llarg")
r ("motiu" n) ("motif" n) (-1 "aquest") (-2 "per")
s ("carn" n) ("flesh" n) (1 "i") (2 "os")
s ("sobre" pr) ("over" n) (-1 "victòria")
s ("dona" n) ("wife" n) (-1 "*" det pos)
s ("dona" n) ("wife" n) (-1 "el") (1 "de")
s ("dona" n) ("woman" n) (1 "de") (2 "*" det pos) (3 "somni")
r ("patró" n) ("pattern" n) (1 "*" np ant)
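As an illustration of how the four parts of a rule map onto the text format, the following Python sketch reads one rule line into a small record. It is not the parser used by apertium-lex-rules.py. The Rule class and parse_rule function are hypothetical, and the sketch assumes that every parenthesised group is well formed (lemma in double quotes, tags as bare words, context groups starting with an integer offset).

```python
import re
from dataclasses import dataclass, field

GROUP_RE = re.compile(r"\(([^)]*)\)")        # one "(...)" group
TOKEN_RE = re.compile(r'"([^"]*)"|(\S+)')    # quoted lemma or bare word

@dataclass
class Rule:
    action: str                                   # "s" (select) or "r" (remove)
    centre: tuple                                 # (source lemma, [tags])
    target: tuple                                 # (target lemma, [tags])
    context: list = field(default_factory=list)   # [(offset, lemma, [tags])]

def _tokens(group):
    """Split the inside of a group into its lemma, tags and offset tokens."""
    return [quoted or bare for quoted, bare in TOKEN_RE.findall(group)]

def parse_rule(line):
    """Parse one rule line into a Rule (hypothetical helper)."""
    action, rest = line.split(None, 1)
    centre, target, *ctx = [_tokens(g) for g in GROUP_RE.findall(rest)]
    context = [(int(c[0]), c[1], c[2:]) for c in ctx]
    return Rule(action, (centre[0], centre[1:]), (target[0], target[1:]), context)

rule = parse_rule('s ("dona" n) ("woman" n) (1 "de") (2 "*" det pos) (3 "somni")')
print(rule.action, rule.centre, rule.target)
print(rule.context)
```

The first group is the centre, the second is the target language pattern, and every remaining group is one item of source language context at the given offset from the centre.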
Usage
$ cat /tmp/test | python apertium-lex-rules.py rules.txt 2>/dev/null
^El<det><def><f><sg>/The<det><def><f><sg>$ ^estació<n><f><sg>/season<n><sg>$ ^més<preadv>/more<preadv>$ ^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ ^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ ^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ ^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ ^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^estiu<n><m><sg>/summer<n><sg>$ ^.<sent>/.<sent>$
- With rules
$ cat /tmp/test | python apertium-lex-rules.py rules.txt | apertium-vm -c ca-en.t1x.vmb | apertium-vm -c ca-en.t2x.vmb |\
apertium-vm -c ca-en.t3x.vmb | lt-proc -g ca-en.autogen.bin
The rainiest season is the autumn, and the driest the summer.
- With bilingual dictionary defaults
$ cat /tmp/test | apertium-lex-defaults ca-en.autoldx.bin | apertium-vm -c ca-en.t1x.vmb | apertium-vm -c ca-en.t2x.vmb |\
apertium-vm -c ca-en.t3x.vmb | lt-proc -g ca-en.autogen.bin
The rainiest station is the autumn, and the driest the summer.
XML
Rule application process
- Optimal application
We're interested in the longest match, but not strictly left to right. So what we do is build an automaton of the rule contexts: each rule is one transducer, and we then compose them. We read the sentence through the automaton, where each state is a lexical unit (LU). The automaton needs to be non-deterministic, so we keep a log of the alive paths/states, along with their "weight" (how many transitions have been made). When we get to the end of the sentence, the longest path for each of the ambiguous words is the winner.
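The idea of letting the longest matching context win can be sketched in Python without any automata. The sketch below is a brute-force stand-in, not the transducer-based implementation described above: it tries every rule at every ambiguous position and uses the number of matched context items as the weight. The tuple layout of the rules, the lemma-only matching (no tags, no "*" wildcard) and the function names are all assumptions made for the example.

```python
def lemma(analysis):
    """'estació<n><f><sg>' -> 'estació'."""
    return analysis.split("<", 1)[0]

def rule_weight(rule, sentence, i):
    """Number of context items matched at position i, or None if the rule fails."""
    _action, centre, _target, context = rule
    if lemma(sentence[i][0]) != centre:
        return None
    for offset, ctx_lemma in context:
        j = i + offset
        if not 0 <= j < len(sentence) or lemma(sentence[j][0]) != ctx_lemma:
            return None
    return len(context)

def apply_rules(rules, sentence):
    """Apply, for each ambiguous unit, the matching rule with the highest weight."""
    result = []
    for i, (source, targets) in enumerate(sentence):
        best, best_weight = None, -1
        if len(targets) > 1:
            for rule in rules:
                weight = rule_weight(rule, sentence, i)
                if weight is not None and weight > best_weight:
                    best, best_weight = rule, weight
        if best is not None:
            action, _centre, target, _context = best
            if action == "s":    # select: keep only the named translation
                kept = [t for t in targets if lemma(t) == target]
            else:                # remove: drop the named translation
                kept = [t for t in targets if lemma(t) != target]
            targets = kept or targets   # never leave a unit with no translation
        result.append((source, targets))
    return result

# Rules as (action, centre lemma, target lemma, [(offset, context lemma)])
rules = [
    ("s", "estació", "station", [(1, "de")]),
    ("s", "estació", "season", [(2, "plujós")]),
]
sentence = [
    ("El<det><def><f><sg>", ["The<det><def><f><sg>"]),
    ("estació<n><f><sg>", ["season<n><sg>", "station<n><sg>"]),
    ("més<preadv>", ["more<preadv>"]),
    ("plujós<adj><f><sg>", ["rainy<adj><sint><f><sg>"]),
]
selected = apply_rules(rules, sentence)
print(selected[1])
```

Here only the second rule matches, so "season" is kept for "estació". When several rules match the same word, the one with more matched context items is applied, which mirrors the "longest wins" criterion.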
Writing and generating rules
Writing
A good way to start writing lexical selection rules is to take a corpus and search for the problem word. You can then look at how the word should be translated and at the contexts it appears in.
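A small keyword-in-context search is enough for this kind of corpus inspection. The Python sketch below is one possible way to do it. The concordance function is a hypothetical helper, the two corpus sentences are invented for the example, and the sketch assumes whitespace-tokenised text with one sentence per string.

```python
def concordance(sentences, word, window=3):
    """Yield (left context, match, right context) for each occurrence of word."""
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            # Ignore case and trailing punctuation when comparing
            if token.lower().strip(".,;:") == word:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                yield left, token, right

# Invented example sentences, for illustration only
corpus = [
    "La prova estadística va fallar",
    "Cap prova arqueològica ho confirma",
]
hits = list(concordance(corpus, "prova"))
for left, match, right in hits:
    print(f"{left:>20} [{match}] {right}")
```

The words that recur to the left and right of the problem word are candidates for the context part of a rule.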
Generating
Jacob's critique of the rule format
By 'skip' you actually mean 'match'. And lemma selection should be called <select> :-)
So I'd write
<rule>
  <match lemma="el"/>
  <match lemma="dona" tags="n.*">
    <select lemma="wife"/>
  </match>
  <match lemma="de"/>
</rule>
and actually you would like to work on categories as well as lemmas.
Like, to prefer (human) beings and not things before "feel".
<rule>
  <match tags="n.*">
    <select cat="beings"/>
  </match>
  <match lemma="feel"/>
</rule>
--Jacob Nordfalk 12:22, 30 November 2011 (UTC)
Compiled
The general structure is as follows:
LSSRECORD = id, len, weight;
<ALPHABET>
<NUM_TRANSDUCERS>
<TRANSDUCER>
<TRANSDUCER>
<TRANSDUCER>
...
"main"
<TRANSDUCER>
<LSRRECORD>
<LSRRECORD>
<LSRRECORD>
Todo
- xml compiler
- compile rule operation patterns, as well as matching patterns
- make rules with gaps work
- optimal coverage