Constraint-based lexical selection module
Lexical transfer
This is the output of lt-proc -b on an ambiguous bilingual dictionary.
[74306] ^El<det><def><f><sg>/The<det><def><f><sg>$ ^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^més<preadv>/more<preadv>$ ^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ ^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ ^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ ^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ ^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^estiu<n><m><sg>/summer<n><sg>$^.<sent>/.<sent>$
That is, the sentence
L'estació més plujós és el tardor, i la més sec l'estiu
is transferred to the following, with the translation alternatives kept side by side:
The season/station more rainy is the autumn/fall, and the more dry the summer.
For the module to work, transfer must be run either with VM for transfer or with apertium-transfer -b.
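To see where the ambiguity sits in a stream like the one above, the lexical units can be split on their delimiters. The following is only a minimal Python sketch: the function name `ambiguous_units` is hypothetical, and it assumes that `^`, `/` and `$` never occur escaped inside a lemma, which real streams do not guarantee.

```python
import re

# One lexical unit looks like ^source/target1/target2$ in the stream.
LU_PATTERN = re.compile(r"\^(.*?)\$")

def ambiguous_units(stream):
    """Return (source, [targets]) for every unit with more than one target."""
    result = []
    for unit in LU_PATTERN.findall(stream):
        source, *targets = unit.split("/")
        if len(targets) > 1:
            result.append((source, targets))
    return result

# A shortened copy of the stream shown above.
sample = ("^El<det><def><f><sg>/The<det><def><f><sg>$ "
          "^estació<n><f><sg>/season<n><sg>/station<n><sg>$ "
          "^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$")

for source, targets in ambiguous_units(sample):
    print(source, "->", targets)
```

Units with a single translation, such as the determiner, are skipped; only the units that the lexical selection module would have to decide on are listed.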
Compilation
Check out the code with:
$ svn co https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-lex-tools
Then build and install it:
$ cd apertium-lex-tools
$ ./autogen.sh
$ make
$ sudo make install
Troubleshooting
If you get the message
lrx-comp: error while loading shared libraries: libapertium3-3.2.so.0: cannot open shared object file: No such file or directory
you may need to add the following line to your ~/.bashrc:
export LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"
Then open a new terminal before using lrx-comp or lrx-proc.
Usage
Make a simple rule file, here called rules.xml:
<rules>
  <rule>
    <match lemma="criminal" tags="adj"/>
    <match lemma="court" tags="n.*"><select lemma="juzgado" tags="n.*"/></match>
  </rule>
</rules>
Then compile it:
$ lrx-comp rules.xml rules.fst
1: 32@32
The input is the output of lt-proc -b:
$ echo "^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ ^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ ^court<n><sg>/corte<n><f><sg>/cancha<n><f><sg>/juzgado<n><m><sg>/tribunal<n><m><sg>$^.<sent>/.<sent>$" | ./lrx-proc -t rules.fst
1:SELECT<1>:court<n><sg>:<select>juzgado<n><ANY_TAG>
^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ ^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ ^court<n><sg>/juzgado<n><m><sg>$^.<sent>/.<sent>$
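In the output above, the line starting with 1:SELECT reports the rule that fired, and the stream that follows has court reduced to the single translation juzgado. The effect can be imitated with a toy Python sketch. This is not how lrx-proc works internally (it runs the compiled rules.fst); the helper names are hypothetical, tag patterns such as n.* are approximated by comparing only the first tag, and leaving a unit unchanged when no translation matches is an assumption.

```python
import re

def parse_unit(unit):
    """Split '^src/tl1/tl2$' into the source side and its translations."""
    source, *targets = unit.strip("^$").split("/")
    return source, targets

def lemma_of(analysis):
    """The lemma is everything before the first tag."""
    return analysis.split("<", 1)[0]

def first_tag(analysis):
    """Return the first tag name, or '' if the analysis has no tags."""
    found = re.search(r"<([^>]+)>", analysis)
    return found.group(1) if found else ""

def apply_rule(units, pattern):
    """pattern: list of (source lemma, first tag, lemma to select or None)."""
    units = list(units)
    for start in range(len(units) - len(pattern) + 1):
        window = [parse_unit(u) for u in units[start:start + len(pattern)]]
        if all(lemma_of(src) == lem and first_tag(src) == tag
               for (src, _), (lem, tag, _) in zip(window, pattern)):
            for offset, ((src, targets), (_, _, select)) in enumerate(zip(window, pattern)):
                if select is not None:
                    # Keep only the selected translation (assumption: fall
                    # back to all translations when none has that lemma).
                    kept = [t for t in targets if lemma_of(t) == select] or targets
                    units[start + offset] = "^" + "/".join([src] + kept) + "$"
    return units

sentence = [
    "^a<det><ind><sg>/uno<det><ind><GD><sg>$",
    "^criminal<adj>/criminal<adj><mf>/delictivo<adj>$",
    "^court<n><sg>/corte<n><f><sg>/cancha<n><f><sg>/juzgado<n><m><sg>/tribunal<n><m><sg>$",
]
rule = [("criminal", "adj", None), ("court", "n", "juzgado")]
print(" ".join(apply_rule(sentence, rule)))
```

Without criminal immediately before court, the pattern does not match and all four translations of court are kept.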
Rule format
A rule is made up of an ordered list of:
- Matches
- Operations (select, remove)
<rule>
  <match lemma="el"/>
  <match lemma="dona" tags="n.*">
    <select lemma="wife"/>
  </match>
  <match lemma="de"/>
</rule>
<rule>
  <match lemma="estació" tags="n.*">
    <select lemma="season"/>
  </match>
  <match lemma="més"/>
  <match lemma="plujós"/>
</rule>
<rule>
  <match lemma="guanyador"/>
  <match lemma="de"/>
  <match/>
  <match lemma="prova" tags="n.*">
    <select lemma="event"/>
  </match>
</rule>
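Because a rule is an ordered list of matches with operations nested inside them, a rule file can be inspected with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree and relies only on the elements and attributes shown above; the function name `summarise` and the `*` printed for a match without a lemma are conventions of this example, not part of the rule format.

```python
import xml.etree.ElementTree as ET

RULES = """<rules>
  <rule>
    <match lemma="estació" tags="n.*">
      <select lemma="season"/>
    </match>
    <match lemma="més"/>
    <match lemma="plujós"/>
  </rule>
  <rule>
    <match lemma="guanyador"/>
    <match lemma="de"/>
    <match/>
    <match lemma="prova" tags="n.*">
      <select lemma="event"/>
    </match>
  </rule>
</rules>"""

def summarise(xml_text):
    """Return one line per rule: matches in order, operations in brackets."""
    summaries = []
    for rule in ET.fromstring(xml_text).iter("rule"):
        steps = []
        for match in rule.findall("match"):
            step = match.get("lemma", "*")  # '*' marks a match with no lemma
            for operation in match:         # child elements: select or remove
                step += f" [{operation.tag} {operation.get('lemma')}]"
            steps.append(step)
        summaries.append(" ".join(steps))
    return summaries

for line in summarise(RULES):
    print(line)
```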
Writing and generating rules
Writing
- Main article: How to get started with lexical selection rules
A good way to start writing lexical selection rules is to take a corpus and search for the problem word. You can then look at how the word should be translated and at the contexts it appears in.
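The corpus search described above can be as simple as collecting the words on either side of each occurrence. A minimal Python sketch follows; the function name `contexts`, the window width and the two example sentences are made up for illustration, and tokenisation is plain whitespace splitting.

```python
def contexts(sentences, word, width=2):
    """Return (left context, right context) for every occurrence of word."""
    hits = []
    for sentence in sentences:
        # Lowercase and strip common punctuation so 'court.' matches 'court'.
        tokens = [t.strip(".,;:!?") for t in sentence.lower().split()]
        for i, token in enumerate(tokens):
            if token == word:
                left = tokens[max(0, i - width):i]
                right = tokens[i + 1:i + 1 + width]
                hits.append((" ".join(left), " ".join(right)))
    return hits

# Toy corpus, invented for this example.
corpus = [
    "There is a criminal court in the city.",
    "The tennis court was wet.",
]

for left, right in contexts(corpus, "court"):
    print(f"{left} [court] {right}")
```

Contexts that consistently go with one translation are the candidates for the match elements of a rule.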
Generating
- Parallel corpus
  - Main article: Generating lexical-selection rules from a parallel corpus
- Monolingual corpora
Todo and bugs
- xml compiler
- compile rule operation patterns, as well as matching patterns
- make rules with gaps work
- optimal coverage
- fix bug with processing multiple sentences
- instead of having regex OR, insert separate paths/states
- optimise the bestPath function (don't use strings to store the paths)
- autotoolsise build
- add option to compiler to spit out ATT transducers
- fix bug with outputting an extra '\n' at the end
- edit transfer.cc to allow input from lt-proc -b
- profiling and speed up
  - why do the regex transducers have to be minimised?
  - retrieve vector of strings corresponding to paths, instead of a single string corresponding to all of the paths
  - stop using string processing to retrieve rule numbers
  - retrieve vector of vectors of words, not string of words from lttoolbox
  - why does the performance drop substantially with more rules?
  - add a pattern -> first letter map so we don't have to call recognise() with every transition (didn't work so well)
- there is a problem with the regex recognition code: see bug1 in testing
- there is a problem with two defaults next to each other; bug2 in testing
- default to case insensitive? (perhaps case insensitive for lower case, case sensitive for uppercase) -- see bug4 in testing/
- make sure that -b works with -n too
- testing
- null flush
- add option to processor to spit out ATT transducers
- Performance
  - 2011-12-12: 10,000 words / 97 seconds = 103 words/sec (71290 words, 14.84 sec = 4803 words/sec)
  - 2011-12-19: 10,000 words / 4 seconds = 2,035 words/sec (71290 words, 8 secs = 8911 words/sec)
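The throughput figures are plain word counts divided by elapsed time, which can be rechecked as below. This is only an arithmetic sketch using the numbers listed above; `words_per_second` is a hypothetical helper. The last lines work backwards from the quoted 2,035 words/sec, which corresponds to a little under 5 seconds for 10,000 words, so the "4 seconds" in that entry is presumably a rounded-down time.

```python
def words_per_second(words, seconds):
    """Throughput as plain division; no rounding is applied here."""
    return words / seconds

# (label, words, seconds) copied from the list above.
runs = [
    ("2011-12-12, 10,000 words", 10_000, 97),
    ("2011-12-12, 71,290 words", 71_290, 14.84),
    ("2011-12-19, 71,290 words", 71_290, 8),
]

for label, words, seconds in runs:
    # Truncated to whole words, as the listed figures appear to be.
    print(f"{label}: {int(words_per_second(words, seconds))} words/sec")

# Going the other way: the time implied by 10,000 words at 2,035 words/sec.
implied_seconds = 10_000 / 2_035
print(f"implied time for the 2011-12-19 run: {implied_seconds:.1f} s")
```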
Preparedness of language pairs
| Pair | LR (L) | LR (L→R) | Fertility | Rules |
|---|---|---|---|---|
| apertium-is-en | 18,563 | 22,220 | 1.19 | 115 |
| apertium-es-fr | | | | |
| apertium-eu-es | 16,946 | 18,550 | 1.09 | 250 |
| apertium-eu-en | | | | |
| apertium-br-fr | 20,489 | 20,770 | 1.01 | 256 |
| apertium-mk-en | 8,568 | 10,624 | 1.24 | 81 |
| apertium-es-pt | | | | |
| apertium-es-it | | | | |
| apertium-es-ro | | | | |
| apertium-en-es | 267,469 | 268,522 | 1.003 | 334 |
| apertium-en-ca | | | | |
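The table does not define Fertility, but the listed values are consistent with the LR (L→R) count divided by the LR (L) count, that is, the average number of translations per left-side entry. The Python sketch below checks that reading against the filled-in rows; it is an inference from the numbers, not a definition taken from the source, and `fertility` is a hypothetical helper.

```python
# (pair, LR (L), LR (L→R), fertility as listed) for the rows that have data.
ROWS = [
    ("apertium-is-en", 18_563, 22_220, 1.19),
    ("apertium-eu-es", 16_946, 18_550, 1.09),
    ("apertium-br-fr", 20_489, 20_770, 1.01),
    ("apertium-mk-en", 8_568, 10_624, 1.24),
    ("apertium-en-es", 267_469, 268_522, 1.003),
]

def fertility(left, left_to_right):
    """Assumed definition: translation pairs per left-side entry."""
    return left_to_right / left

for pair, left, left_to_right, listed in ROWS:
    computed = fertility(left, left_to_right)
    print(f"{pair}: computed {computed:.3f}, listed {listed}")
```

Under this reading, a fertility close to 1 means that few entries in the pair have more than one translation for the module to choose between.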
References
- Tyers, F. M., Sánchez-Martínez, F., Forcada, M. L. (2012) "Flexible finite-state lexical selection for rule-based machine translation". Proceedings of the 16th Annual Conference of the European Association for Machine Translation (EAMT 2012)

