Constraint-based lexical selection module
Revision as of 12:50, 6 March 2016
apertium-lex-tools provides a module for compiling lexical selection rules and processing them in the pipeline. Rules can be manually written, or learnt from monolingual or parallel corpora.
Installing
Prerequisites and compilation are the same as lttoolbox and apertium, as well as (on Debian/Ubuntu) zlib1g-dev.
See Installation; for most real operating systems you can now get pre-built packages of apertium-lex-tools (as well as other core tools) through your regular package manager.
Lexical transfer in the pipeline
lrx-proc runs between bidix lookup and the first stage of transfer, e.g.
… apertium-pretransfer | lt-proc -b kaz-tat.autobil.bin | lrx-proc kaz-tat.lrx.bin \
  | apertium-transfer -b apertium-kaz-tat.kaz-tat.t1x kaz-tat.t1x.bin | …
This is the output of lt-proc -b on an ambiguous bilingual dictionary:
[74306] ^El<det><def><f><sg>/The<det><def><f><sg>$ ^estació<n><f><sg>/season<n><sg>/station<n><sg>$ ^més<preadv>/more<preadv>$ ^plujós<adj><f><sg>/rainy<adj><sint><f><sg>$ ^ser<vbser><pri><p3><sg>/be<vbser><pri><p3><sg>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ ^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$^,<cm>/,<cm>$ ^i<cnjcoo>/and<cnjcoo>$ ^el<det><def><f><sg>/the<det><def><f><sg>$ ^més<preadv>/more<preadv>$ ^sec<adj><f><sg>/dry<adj><sint><f><sg>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^estiu<n><m><sg>/summer<n><sg>$^.<sent>/.<sent>$
I.e.
L'estació més plujosa és la tardor, i la més seca l'estiu
Goes to:
The season/station more rainy is the autumn/fall, and the more dry the summer.
Apertium/lttoolbox 3.3 and onwards support the -b option to lt-proc / apertium-transfer.
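To make the stream format above easier to read, here is a minimal Python sketch that splits lt-proc -b output into lexical units and lists the ones that still have more than one translation. The function names are hypothetical and not part of apertium-lex-tools, and the sketch assumes no escaped `^`, `$` or `/` characters occur inside a unit.

```python
import re

# One lexical unit is everything between ^ and $.
# Simplification (assumption): escaped \^, \$ and \/ are not handled.
LU_PATTERN = re.compile(r"\^(.*?)\$")


def parse_biltrans(stream):
    """Return a list of (source, [targets]) pairs from lt-proc -b output."""
    units = []
    for body in LU_PATTERN.findall(stream):
        source, *targets = body.split("/")
        units.append((source, targets))
    return units


def ambiguous_units(units):
    """Keep only the units that have more than one target translation."""
    return [(source, targets) for source, targets in units if len(targets) > 1]


example = ("^estació<n><f><sg>/season<n><sg>/station<n><sg>$ "
           "^més<preadv>/more<preadv>$ "
           "^tardor<n><f><sg>/autumn<n><sg>/fall<n><sg>$")

for source, targets in ambiguous_units(parse_biltrans(example)):
    print(source, "->", targets)
```

The units printed here are the ones a lexical selection rule would need to decide on.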
Usage
Make a simple rule file,
<rules>
  <rule>
    <match lemma="criminal" tags="adj"/>
    <match lemma="court" tags="n.*"><select lemma="juzgado" tags="n.*"/></match>
  </rule>
</rules>
Then compile it:
$ lrx-comp rules.xml rules.fst
1: 32@32
The input is the output of lt-proc -b:
$ echo "^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ ^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ ^court<n><sg>/corte<n><f><sg>/cancha<n><f><sg>/juzgado<n><m><sg>/tribunal<n><m><sg>$^.<sent>/.<sent>$" | ./lrx-proc -t rules.fst
1:SELECT<1>:court<n><sg>:<select>juzgado<n><ANY_TAG>
^There<adv>/Allí<adv>$ ^be<vbser><pri><p3><sg>/ser<vbser><pri><p3><sg>$ ^a<det><ind><sg>/uno<det><ind><GD><sg>$ ^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ ^court<n><sg>/juzgado<n><m><sg>$^.<sent>/.<sent>$
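The effect of the rule above can be imitated in a few lines of Python. This is only an illustration of what the select operation does to the stream, not how lrx-proc is implemented (it works with compiled transducers). The rule is hard-coded, all function names are hypothetical, and the tag-pattern semantics (`n.*` read as "first tag n, then anything") are an assumption made for the sketch.

```python
import re


def parse(stream):
    """Split lt-proc -b output into [source, [targets]] lexical units."""
    units = []
    for body in re.findall(r"\^(.*?)\$", stream):
        source, *targets = body.split("/")
        units.append([source, targets])
    return units


def lemma(reading):
    return reading.split("<", 1)[0]


def tags_match(pattern, reading):
    """Assumed semantics: dot-separated tags, '*' matches any remainder."""
    wanted = pattern.split(".")
    have = re.findall(r"<([^>]+)>", reading)
    for i, tag in enumerate(wanted):
        if tag == "*":
            return True
        if i >= len(have) or have[i] != tag:
            return False
    return len(have) == len(wanted)


def select_court(units):
    """Hard-coded rule: criminal<adj> followed by court<n.*> selects juzgado."""
    for i in range(len(units) - 1):
        first, second = units[i][0], units[i + 1][0]
        if (lemma(first) == "criminal" and tags_match("adj", first)
                and lemma(second) == "court" and tags_match("n.*", second)):
            kept = [t for t in units[i + 1][1]
                    if lemma(t) == "juzgado" and tags_match("n.*", t)]
            if kept:  # sketch choice: only narrow when something survives
                units[i + 1][1] = kept
    return units


def render(units):
    return " ".join("^" + "/".join([s] + t) + "$" for s, t in units)


stream = ("^criminal<adj>/criminal<adj><mf>/delictivo<adj>$ "
          "^court<n><sg>/corte<n><f><sg>/cancha<n><f><sg>"
          "/juzgado<n><m><sg>/tribunal<n><m><sg>$")
print(render(select_court(parse(stream))))
```

As in the lrx-proc output, only the unit carrying the select operation is narrowed; the translations of criminal are left untouched.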
Rule format
A rule is made up of an ordered list of:
- Matches
- Operations (select, remove)
<rule>
  <match lemma="el"/>
  <match lemma="dona" tags="n.*">
    <select lemma="wife"/>
  </match>
  <match lemma="de"/>
</rule>
<rule>
  <match lemma="estació" tags="n.*">
    <select lemma="season"/>
  </match>
  <match lemma="més"/>
  <match lemma="plujós"/>
</rule>
<rule>
  <match lemma="guanyador"/>
  <match lemma="de"/>
  <match/>
  <match lemma="prova" tags="n.*">
    <select lemma="event"/>
  </match>
</rule>
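The "ordered list of matches and operations" structure can be made explicit by loading a rule file with Python's standard xml.etree.ElementTree. This is a sketch for inspecting rules, not the format lrx-comp produces. The `load_rules` helper and its dictionary layout are hypothetical, and the reading of an empty `<match/>` as "any word" is an assumption.

```python
import xml.etree.ElementTree as ET

RULES = """
<rules>
  <rule>
    <match lemma="estació" tags="n.*">
      <select lemma="season"/>
    </match>
    <match lemma="més"/>
    <match lemma="plujós"/>
  </rule>
  <rule>
    <match lemma="guanyador"/>
    <match lemma="de"/>
    <match/>
    <match lemma="prova" tags="n.*">
      <select lemma="event"/>
    </match>
  </rule>
</rules>
"""


def load_rules(xml_text):
    """Return each rule as an ordered list of match steps.

    Every step records the lemma and tag pattern it matches (None when the
    attribute is absent) and the operations attached to it.
    """
    rules = []
    for rule in ET.fromstring(xml_text.strip()).iter("rule"):
        steps = []
        for match in rule.findall("match"):
            ops = [(op.tag, op.get("lemma"), op.get("tags"))
                   for op in match if op.tag in ("select", "remove")]
            steps.append({"lemma": match.get("lemma"),
                          "tags": match.get("tags"),
                          "ops": ops})
        rules.append(steps)
    return rules


for number, steps in enumerate(load_rules(RULES), start=1):
    for position, step in enumerate(steps):
        print(number, position, step)
```

In the second rule the third step has neither lemma nor tags, which presumably corresponds to a one-word gap between de and prova.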
Writing and generating rules
Writing
- Main article: How to get started with lexical selection rules
A good way to start writing lexical selection rules is to take a corpus and search for the problem word. You can then look at how the word should be translated and at the contexts it appears in.
Generating
- Parallel corpus
- Main article: Learning rules from parallel and non-parallel corpora
- Monolingual corpora
- Main article: Running_the_monolingual_rule_learning
Todo and bugs
- xml compiler
- compile rule operation patterns, as well as matching patterns
- make rules with gaps work
- optimal coverage
- fix bug with processing multiple sentences
- instead of having regex OR, insert separate paths/states.
- optimise the bestPath function (don't use strings to store the paths)
- autotoolsise build
- add option to compiler to spit out ATT transducers
- fix bug with outputting an extra '\n' at the end
- edit transfer.cc to allow input from lt-proc -b
- profiling and speed up
  - why do the regex transducers have to be minimised ?
  - retrieve vector of strings corresponding to paths, instead of a single string corresponding to all of the paths
  - stop using string processing to retrieve rule numbers
  - retrieve vector of vectors of words, not string of words from lttoolbox
  - why does the performance drop substantially with more rules ?
  - add a pattern -> first letter map so we don't have to call recognise() with every transition (didn't work so well)
- there is a problem with the regex recognition code: see bug1 in testing.
- there is a problem with two defaults next to each other; bug2 in testing.
- default to case insensitive ? (perhaps case insensitive for lower case, case sensitive for uppercase) -- see bug4 in testing/.
- make sure that -b works with -n too.
- testing
- null flush
- add option to processor to spit out ATT transducers
- use brown clusters to merge rules with the same context, or remove parts of context from rules which are not relevant?
- https://sourceforge.net/p/apertium/tickets/64/ <match tags="n.*"></match> never matches, while <match tags="n.*"/> does
- Performance
- 2011-12-12: 10,000 words / 97 seconds = 103 words/sec (71290 words, 14.84 sec = 4803 words/sec)
- 2011-12-19: 10,000 words / 4 seconds = 2,035 words/sec (71290 words, 8 secs = 8911 words/sec)
Preparedness of language pairs
Pair | LR (L) | LR (L→R) | Fertility | Rules |
---|---|---|---|---|
apertium-is-en | 18,563 | 22,220 | 1.19 | 115 |
apertium-es-fr | | | | |
apertium-eu-es | 16,946 | 18,550 | 1.09 | 250 |
apertium-eu-en | | | | |
apertium-br-fr | 20,489 | 20,770 | 1.01 | 256 |
apertium-mk-en | 8,568 | 10,624 | 1.24 | 81 |
apertium-es-pt | | | | |
apertium-es-it | | | | |
apertium-es-ro | | | | |
apertium-en-es | 267,469 | 268,522 | 1.003 | 334 |
apertium-en-ca | | | | |
Troubleshooting
If you get the message lrx-comp: error while loading shared libraries: libapertium3-3.2.so.0: cannot open shared object file: No such file or directory, you may need to put this in your ~/.bashrc:
export LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"
Then open a new terminal before using lrx-comp/lrx-proc.
On a 64-bit machine, running make for apertium-lex-tools may fail because zlib appears to be missing, even though you have zlib1g-dev installed. If you get the error message /usr/bin/ld: cannot find -lz, install the package lib32z1-dev (which will pull in many other dependencies). Even though it is a 32-bit library, it is needed to compile the sources.
See also
References
- Tyers, F. M., Sánchez-Martínez, F., Forcada, M. L. (2012) "Flexible finite-state lexical selection for rule-based machine translation". Proceedings of the 17th Annual Conference of the European Association of Machine Translation, EAMT12