Automatically trimming a monodix
There used to be a problem in Apertium regarding copies. When we came to make a new language pair based on some existing resource in Apertium (e.g. a monodix), we would make a copy of that resource and then change it as needed. This was less than ideal because:
- it meant that any improvements we made weren't automatically carried over to the dictionary we copied from
- it meant that any improvements in the dictionary we copied from weren't carried over into our new dictionary.
So why did we do it? Because of testvoc: if we had entries in our monodix that were not in our bidix, we got lots of @ in our output, which is bad.
One way we got around the problem was to use ad hoc scripts for trimming the dictionaries (see for example apertium-af-nl, apertium-sme-nob and the trim-lexc.py script in trunk/apertium-tools), but these were less than ideal because they usually had to include specific hacks for differences in dictionary format.
The new and improved way is to take the intersection of our monodix and our bidix binaries, and use that for analysis. That's what is described below.
Note: This doesn't take care of the whole testvoc problem. It is still necessary to get rid of # symbols, which often arise due to bugs in transfer code.
Example
Suppose you have the monodix:
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
    <pardef n="beer__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="beer"><i>beer</i><par n="beer__n"/></e>
    <e lm="school"><i>school</i><par n="beer__n"/></e>
    <e lm="computer"><i>computer</i><par n="beer__n"/></e>
    <e lm="house"><i>house</i><par n="beer__n"/></e>
  </section>
</dictionary>
It generates the following strings:
$ lt-expand test-en.dix
beer:beer<n><sg>
beers:beer<n><pl>
school:school<n><sg>
schools:school<n><pl>
computer:computer<n><sg>
computers:computer<n><pl>
house:house<n><sg>
houses:house<n><pl>
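To make the expansion concrete, here is a minimal Python sketch that reproduces the list above for this toy dictionary. It is not how lt-expand is implemented; the function name `expand` is made up for illustration, and the code only handles the small subset of the dix format used in the example (one `<i>` stem plus one `<par>` per entry, paradigms made of `<l>`/`<r>` pairs with tags on the right).

```python
import xml.etree.ElementTree as ET

# The toy monodix from the example, embedded so the sketch is self-contained.
MONODIX = """<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs><sdef n="n"/><sdef n="sg"/><sdef n="pl"/></sdefs>
  <pardefs>
    <pardef n="beer__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="beer"><i>beer</i><par n="beer__n"/></e>
    <e lm="school"><i>school</i><par n="beer__n"/></e>
    <e lm="computer"><i>computer</i><par n="beer__n"/></e>
    <e lm="house"><i>house</i><par n="beer__n"/></e>
  </section>
</dictionary>"""


def expand(xml_text):
    """Return surface:analysis strings for the toy subset of the dix format."""
    root = ET.fromstring(xml_text)
    # Collect each paradigm as a list of (surface suffix, tag string) pairs.
    pardefs = {}
    for pardef in root.iter("pardef"):
        pairs = []
        for entry in pardef.findall("e"):
            suffix = entry.find("p/l").text or ""
            tags = "".join("<%s>" % s.get("n") for s in entry.find("p/r").findall("s"))
            pairs.append((suffix, tags))
        pardefs[pardef.get("n")] = pairs
    # Combine every stem in the section with every pair of its paradigm.
    expanded = []
    for entry in root.find("section").findall("e"):
        stem = entry.find("i").text or ""
        for suffix, tags in pardefs[entry.find("par").get("n")]:
            expanded.append(stem + suffix + ":" + stem + tags)
    return expanded


if __name__ == "__main__":
    print("\n".join(expand(MONODIX)))
```

Each stem is combined with both paradigm entries, which is why every noun shows up once as singular and once as plural.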
But our bidix is only:
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
  </pardefs>
  <section id="main" type="standard">
    <e><p><l>beer<s n="n"/></l><r>garagardo<s n="n"/></r></p></e>
    <e><p><l>house<s n="n"/></l><r>etxe<s n="n"/></r></p></e>
  </section>
</dictionary>
We don't want to include the entries for "computer" and "school", because then we would get @ in our output.
HFST
Here is a Makefile that, given two dictionaries test-en.dix (the monodix) and test-en-eu.dix (the bidix), will produce a binary transducer of the monodix in HFST format that contains only the strings matching prefixes in the bidix.
all:
	lt-comp lr test-en.dix test-en.bin
	lt-comp lr test-en-eu.dix test-en-eu.bin
	lt-print test-en.bin > test-en.att
	lt-print test-en-eu.bin > test-en-eu.att
	hfst-txt2fst -e ε < test-en.att > test-en.fst
	hfst-txt2fst -e ε < test-en-eu.att > test-en-eu.fst
	hfst-invert test-en.fst -o test-en.mor.fst
	hfst-project -p upper test-en-eu.fst > test-en-eu.en.fst
	echo " ?* " | hfst-regexp2fst > any.fst
	hfst-concatenate -1 test-en-eu.en.fst -2 any.fst -o test-en-eu.en-prefixes.fst
	hfst-compose-intersect -1 test-en-eu.en-prefixes.fst -2 test-en.mor.fst | hfst-invert -o test-en.trimmed.fst

clean:
	rm *.bin *.att *.fst
If we run hfst-fst2strings, we get:
$ hfst-fst2strings test-en.trimmed.fst
beer:beer<n><sg>
beers:beer<n><pl>
house:house<n><sg>
houses:house<n><pl>
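The effect of this pipeline on the toy example can be imitated in plain Python by working on the expanded strings instead of on transducers. This is only a sketch of the idea (keep an analysis if some bidix left side is a prefix of it). The function name `trim` and the hard-coded lists are made up for illustration, and nothing here calls HFST.

```python
# Toy model of prefix trimming on strings, not transducers.
# MONODIX holds the surface:analysis pairs of the example monodix;
# BIDIX_LEFT holds the left sides of the two example bidix entries.
MONODIX = [
    "beer:beer<n><sg>", "beers:beer<n><pl>",
    "school:school<n><sg>", "schools:school<n><pl>",
    "computer:computer<n><sg>", "computers:computer<n><pl>",
    "house:house<n><sg>", "houses:house<n><pl>",
]
BIDIX_LEFT = ["beer<n>", "house<n>"]


def trim(monodix, bidix_left):
    """Keep the pairs whose analysis starts with some bidix left side."""
    kept = []
    for pair in monodix:
        # Split only on the first colon: surface form, then analysis.
        _surface, analysis = pair.split(":", 1)
        if any(analysis.startswith(prefix) for prefix in bidix_left):
            kept.append(pair)
    return kept


if __name__ == "__main__":
    for pair in trim(MONODIX, BIDIX_LEFT):
        print(pair)
```

The prefix test plays the role of concatenating ?* onto the bidix side before intersecting: only the lemma and the first tag are constrained, and whatever follows is let through.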
Compounds vs trimming in HFST
The sme.lexc can't be trimmed using the simple HFST trick, due to compounds.
Say you have cake n sg, cake n pl, beer n pl and beer n sg in monodix, while bidix has beer n and wine n. The HFST method without compounding is to intersect (cake|beer) n (sg|pl) with (beer|wine) n .* to get beer n (sg|pl).
But HFST represents compounding as a transition from the end of the singular noun to the beginning of the (noun) transducer, so a compounding HFST actually looks like
- ((cake|beer) n sg)*(cake|beer) n (sg|pl)
The intersection of this with
- (beer|wine) n .*
is
- beer n sg ((cake|beer) n sg)* (cake|beer) n (sg|pl) | beer n (sg|pl)
when it should have been
- (beer n sg)*(beer n (sg|pl))
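The leak can be checked with ordinary regular expressions. The sketch below writes analyses as space-separated symbols, exactly as in the expressions above, and uses Python's `re` module as a stand-in for the transducers. The names `COMPOUND`, `PREFIX`, `WANTED` and the two helper functions are made up for this illustration.

```python
import re

# Compounding monodix: any number of singular nouns, then a final noun.
COMPOUND = re.compile(r"((cake|beer) n sg )*(cake|beer) n (sg|pl)")
# Bidix side with ".*" appended: only the first lemma and tag are constrained.
PREFIX = re.compile(r"(beer|wine) n( .*)?")
# What trimming should produce: every compound part must be in the bidix.
WANTED = re.compile(r"(beer n sg )*beer n (sg|pl)")


def in_intersection(analysis):
    """True if the analysis is accepted by both COMPOUND and PREFIX."""
    return bool(COMPOUND.fullmatch(analysis) and PREFIX.fullmatch(analysis))


def correctly_trimmed(analysis):
    """True if the analysis only uses lemmas that the bidix knows."""
    return bool(WANTED.fullmatch(analysis))


if __name__ == "__main__":
    samples = ["beer n pl", "beer n sg beer n pl", "beer n sg cake n pl", "cake n sg"]
    for analysis in samples:
        print(analysis, in_intersection(analysis), correctly_trimmed(analysis))
```

The analysis "beer n sg cake n pl" survives the intersection even though "cake" is not in the bidix, because the prefix restriction stops checking after the first compound part.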
Lttoolbox doesn't represent compounding by extra circular transitions, but instead by a special restart symbol interpreted while analysing, so lt-trim is able to understand compounds by simply skipping the compound tags.
lttoolbox
The command is named lt-trim.
The implementation is currently available from GitHub; see http://permalink.gmane.org/gmane.comp.nlp.apertium/4050
$ lt-comp rl apertium-en-ca.en-ca.dix ca-en.autobil.bin
$ lt-comp lr apertium-en-ca.ca.dix ca-en.automorf-full.bin
$ lt-trim ca-en.automorf-full.bin ca-en.autobil.bin ca-en.automorf.bin
where the left side of the second transducer (ca-en.autobil.bin) is altered so that it
- gets a ".*" appended to it (so if "foo<vblex>" is in there, it will let through "foo<vblex><pres>"), and
- has all lemqs moved to after the tags (so if "foo<vblex># bar" is in there, it will let through "foo<vblex><pres># bar"),
and only strings from the first transducer whose right side (the analysis) matches the input side of the altered second transducer are included in the final compilation.
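The two alterations can be sketched on strings as follows. This is not the lt-trim implementation, which works on compiled transducers; `bidix_pattern` and `trim_analyses` are hypothetical helper names, and the sketch makes the simplifying assumption that the appended ".*" only needs to let extra tags through.

```python
import re


def bidix_pattern(left_side):
    """Build a regex from a bidix left side such as 'foo<vblex># bar'.

    The entry is split into lemma plus tags and an optional lemq ('# ...').
    Extra tags are allowed before the lemq, which mimics both the appended
    ".*" and the lemq being moved to after the tags.
    """
    head, sep, queue = left_side.partition("#")
    lemq = sep + queue  # empty string when the entry has no lemq
    return re.compile(re.escape(head) + r"(<[^<>]+>)*" + re.escape(lemq))


def trim_analyses(analyses, bidix_left_sides):
    """Keep the analyses that fully match at least one bidix pattern."""
    patterns = [bidix_pattern(left) for left in bidix_left_sides]
    return [a for a in analyses if any(p.fullmatch(a) for p in patterns)]


if __name__ == "__main__":
    analyses = ["foo<vblex><pres>", "foo<vblex><pres># bar", "baz<n><sg>"]
    print(trim_analyses(analyses, ["foo<vblex>"]))
    print(trim_analyses(analyses, ["foo<vblex># bar"]))
```

With the entry "foo<vblex># bar", the pattern accepts "foo<vblex><pres># bar" because the extra tag slot sits between the bidix tags and the lemq, matching the second bullet.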