Constraint-based lexical selection module
Revision as of 21:33, 10 September 2012
Lexical transfer
This is the output of lt-proc -b on an ambiguous bilingual dictionary.
[74306] ^El<det><def><f><sg>/The<det><def><f><sg>$ ^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^més<preadv>/more<preadv>$ ^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ ^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ ^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ ^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ ^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^estiu<n><m><sg>/summer<n><sg>$^.<sent>/.<sent>$
I.e.
L'estació més plujosa és la tardor, i la més seca l'estiu
Goes to:
The season/station more rainy is the autumn/fall, and the more dry the summer.
The module requires VM for transfer, or apertium-transfer -b, in order to work.
Compilation
Check out the code from:
$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-lex-tools
You can then build it using:
$ ./autogen.sh
$ make
Usage
Make a simple rule file:
<rules>
  <rule>
    <match lemma="criminal" tags="adj"/>
    <match lemma="court" tags="n.*"><select lemma="juzgado" tags="n.*"/></match>
  </rule>
</rules>
Then compile it:
$ ./apertium-lrx-comp rules.xml rules.fst
1
Written 1 rules, 3 patterns.
The input is the output of lt-proc -b:
$ echo "^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ ^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ ^court<n><sg>/corte<n><f><sg>/cancha<n><f><sg>/juzgado<n><m><sg>/tribunal<n><m><sg>$^.<sent>/.<sent>$" | ./apertium-lrx-proc -t rules.fst
1:SELECT:1 juzgado<n><m><sg>
^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ ^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ ^court<n><sg>/juzgado<n><m><sg>$^.<sent>/.<sent>$
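To make the stream format above easier to inspect, here is a minimal Python sketch that splits lt-proc -b output into lexical units and lists the ambiguous ones. It is not part of apertium-lex-tools: the function names are hypothetical, and it assumes every unit has the shape ^source/target1/target2$ with no escaped characters or superblanks.

```python
import re

# Assumption: each lexical unit looks like ^source/target1/target2$,
# as in the lt-proc -b output shown above. Escaped characters and
# superblanks are not handled in this sketch.
LU_PATTERN = re.compile(r"\^(.*?)\$")


def parse_lexical_units(stream):
    """Return a list of (source, [targets]) pairs found in the stream."""
    units = []
    for body in LU_PATTERN.findall(stream):
        source, *targets = body.split("/")
        units.append((source, targets))
    return units


def ambiguous_units(stream):
    """Return only the units that have more than one translation."""
    return [(s, t) for s, t in parse_lexical_units(stream) if len(t) > 1]


example = ("^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ "
           "^court<n><sg>/corte<n><f><sg>/cancha<n><f><sg>"
           "/juzgado<n><m><sg>/tribunal<n><m><sg>$^.<sent>/.<sent>$")

for source, targets in ambiguous_units(example):
    print(source, "->", len(targets), "translations")
```

The ambiguous units are the ones a lexical selection rule can act on; unambiguous units pass through unchanged.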
Rule format
A rule is made up of an ordered list of:
- Matches
- Operations (select, remove)
<rule>
  <match lemma="el"/>
  <match lemma="dona" tags="n.*">
    <select lemma="wife"/>
  </match>
  <match lemma="de"/>
</rule>

<rule>
  <match lemma="estació" tags="n.*">
    <select lemma="season"/>
  </match>
  <match lemma="més"/>
  <match lemma="plujós"/>
</rule>

<rule>
  <match lemma="guanyador"/>
  <match lemma="de"/>
  <match/>
  <match lemma="prova" tags="n.*">
    <select lemma="event"/>
  </match>
</rule>
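As a rough illustration of how a rule breaks down into matches and operations, the following Python sketch reads a rule file with the standard library XML parser and lists, for each match, the operations nested inside it. The function name is hypothetical and this is not how apertium-lrx-comp itself works; it only assumes the element and attribute names used in the rule files on this page.

```python
import xml.etree.ElementTree as ET

# A small rule file in the format used on this page.
RULES_XML = """
<rules>
  <rule>
    <match lemma="criminal" tags="adj"/>
    <match lemma="court" tags="n.*"><select lemma="juzgado" tags="n.*"/></match>
  </rule>
</rules>
"""


def summarise_rules(xml_text):
    """Return, per rule, a list of (lemma, tags, operations) for each match."""
    summary = []
    for rule in ET.fromstring(xml_text).findall("rule"):
        matches = []
        for match in rule.findall("match"):
            # Operations are the select/remove children of a match element.
            ops = [(op.tag, op.get("lemma"), op.get("tags"))
                   for op in match if op.tag in ("select", "remove")]
            matches.append((match.get("lemma"), match.get("tags"), ops))
        summary.append(matches)
    return summary


for number, matches in enumerate(summarise_rules(RULES_XML), start=1):
    for lemma, tags, ops in matches:
        print(number, lemma, tags, ops)
```

A match with no lemma attribute (as in the third rule above) comes back with lemma None, which corresponds to a gap that matches any word.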
Rule application process
- Optimal application
We are interested in the longest match, but not strictly left to right. To find it, we build an automaton from the rule contexts: each rule becomes one transducer, and we then compose them. We read the sentence through this automaton, where each state is a lexical unit (LU). The automaton needs to be non-deterministic, so we keep a log of the alive paths/states together with their "weight" (how many transitions have been made). When we get to the end of the sentence, the longest match for each of the ambiguous words is the winner.
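The idea of preferring the longest match for each ambiguous word can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the module's implementation: it matches rules by brute force over lists rather than with a composed transducer, it matches on source lemmas only, and the rule encoding (a list of (lemma, selection) pairs, with None as a wildcard or "no operation") and the one-word rule are hypothetical.

```python
def match_rule(rule, lemmas, start):
    """Return {position: selected lemma} if the rule matches at start, else None."""
    if start + len(rule) > len(lemmas):
        return None
    selections = {}
    for offset, (lemma, select) in enumerate(rule):
        # A lemma of None is a gap that matches any word.
        if lemma is not None and lemma != lemmas[start + offset]:
            return None
        if select is not None:
            selections[start + offset] = select
    return selections


def best_selections(rules, lemmas):
    """For each position, keep the selection made by the longest matching rule."""
    best = {}  # position -> (weight, selected lemma)
    for rule in rules:
        for start in range(len(lemmas)):
            selections = match_rule(rule, lemmas, start)
            if selections is None:
                continue
            weight = len(rule)  # number of transitions made along this path
            for pos, choice in selections.items():
                if pos not in best or weight > best[pos][0]:
                    best[pos] = (weight, choice)
    return {pos: choice for pos, (weight, choice) in best.items()}


rules = [
    [("estació", "station")],  # hypothetical one-word rule
    [("estació", "season"), ("més", None), ("plujós", None)],
]
lemmas = ["el", "estació", "més", "plujós", "ser", "el", "tardor"]
print(best_selections(rules, lemmas))  # prints {1: 'season'}
```

Both rules match at "estació", but the three-word context has the greater weight, so its selection wins.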
Writing and generating rules
Writing
- Main article: How to get started with lexical selection rules
A good way to start writing lexical selection rules is to take a corpus and search for the problem word. You can then look at how the word should be translated and at the contexts it appears in.
Generating
- Parallel corpus
- Main article: Generating lexical-selection rules from a parallel corpus
- Monolingual corpora
Compiled
The general structure is as follows:
LSRRECORD = id, len, weight;

<ALPHABET>
<NUM_TRANSDUCERS>
<TRANSDUCER> <TRANSDUCER> <TRANSDUCER> ... "main" <TRANSDUCER>
<LSRRECORD> <LSRRECORD> <LSRRECORD>
Todo and bugs
- <s>xml compiler</s>
- <s>compile rule operation patterns, as well as matching patterns</s>
- <s>make rules with gaps work</s>
- <s>optimal coverage</s>
- <s>fix bug with processing multiple sentences</s>
- <s>instead of having regex OR, insert separate paths/states.</s>
- <s>optimise the bestPath function (don't use strings to store the paths)</s>
- <s>autotoolsise build</s>
- <s>add option to compiler to spit out ATT transducers</s>
- <s>fix bug with outputting an extra '\n' at the end</s>
- <s>edit transfer.cc to allow input from lt-proc -b</s>
  - make sure that -b works with -n too.
- testing
- null flush
- add option to processor to spit out ATT transducers
- profiling and speed up
  - <s>why do the regex transducers have to be minimised ?</s>
  - <s>retrieve vector of strings corresponding to paths, instead of a single string corresponding to all of the paths</s>
  - stop using string processing to retrieve rule numbers
  - retrieve vector of vectors of words, not string of words from lttoolbox
  - why does the performance drop substantially with more rules ?
  - <s>add a pattern -> first letter map so we don't have to call recognise() with every transition</s> (didn't work so well)
  - each state with >10 out transitions could have a char-transition list
- <s>there is a problem with the regex recognition code: see bug1 in testing.</s>
- <s>there is a problem with two defaults next to each other; bug2 in testing.</s>
- default to case insensitive ? (perhaps case insensitive for lower case, case sensitive for uppercase); see bug4 in testing/.
- Performance
- 2011-12-12: 10,000 words / 97 seconds = 103 words/sec (71290 words, 14.84 sec = 4803 words/sec)
- 2011-12-19: 10,000 words / 4 seconds = 2,035 words/sec (71290 words, 8 secs = 8911 words/sec)
Preparedness of language pairs
Pair | LR (L) | LR (L→R) | Fertility | Rules |
---|---|---|---|---|
apertium-is-en | 18,563 | 22,220 | 1.19 | 115 |
apertium-es-fr | | | | |
apertium-eu-es | 16,946 | 18,550 | 1.09 | 250 |
apertium-eu-en | | | | |
apertium-br-fr | 20,489 | 20,770 | 1.01 | 256 |
apertium-mk-en | 8,568 | 10,624 | 1.24 | 81 |
apertium-es-pt | | | | |
apertium-es-it | | | | |
apertium-es-ro | | | | |
apertium-en-es | 267,469 | 268,522 | 1.003 | 334 |
apertium-en-ca | | | | |
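The Fertility column appears to be the LR (L→R) count divided by the LR (L) count. This reading is an inference from the figures in the table, not something the page states. A short Python check under that assumption (the function name is hypothetical):

```python
def fertility(entries_l, entries_lr):
    """Ratio of LR (L→R) to LR (L), assumed here to be the Fertility column."""
    return entries_lr / entries_l


# (LR (L), LR (L→R)) pairs copied from the table above.
pairs = {
    "apertium-is-en": (18563, 22220),
    "apertium-eu-es": (16946, 18550),
    "apertium-br-fr": (20489, 20770),
    "apertium-mk-en": (8568, 10624),
    "apertium-en-es": (267469, 268522),
}

for name, (entries_l, entries_lr) in pairs.items():
    print(f"{name}: {fertility(entries_l, entries_lr):.3f}")
```

The computed ratios agree with the table to about two decimal places; some table values look truncated rather than rounded.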
See also
References
- Tyers, F. M., Sánchez-Martínez, F., Forcada, M. L. (2012) "Flexible finite-state lexical selection for rule-based machine translation". Proceedings of the 16th Annual Conference of the European Association for Machine Translation, EAMT12