Automatically trimming a monodix
Revision as of 18:02, 6 October 2012 by Francis Tyers
Suppose you have the monodix:
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
    <pardef n="beer__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="beer"><i>beer</i><par n="beer__n"/></e>
    <e lm="school"><i>school</i><par n="beer__n"/></e>
    <e lm="computer"><i>computer</i><par n="beer__n"/></e>
    <e lm="house"><i>house</i><par n="beer__n"/></e>
  </section>
</dictionary>
It generates the following strings:
$ lt-expand test-en.dix
beer:beer<n><sg>
beers:beer<n><pl>
school:school<n><sg>
schools:school<n><pl>
computer:computer<n><sg>
computers:computer<n><pl>
house:house<n><sg>
houses:house<n><pl>
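To make the expansion concrete, here is a minimal Python sketch that mirrors the monodix above as plain data and combines each stem with each paradigm entry. The names `PARDEFS`, `ENTRIES` and `expand` are hypothetical and exist only for illustration; this is not how lt-expand works internally, only a model of the strings it lists.

```python
# Hypothetical in-memory mirror of the monodix above (illustrative names only).
# Each paradigm entry is (surface suffix, tags appended to the lemma).
PARDEFS = {
    "beer__n": [("", "<n><sg>"), ("s", "<n><pl>")],
}

# Each dictionary entry is (stem, paradigm name).
ENTRIES = [
    ("beer", "beer__n"),
    ("school", "beer__n"),
    ("computer", "beer__n"),
    ("house", "beer__n"),
]


def expand(entries, pardefs):
    """Return surface:analysis strings in the style of the listing above."""
    out = []
    for stem, par in entries:
        for suffix, tags in pardefs[par]:
            out.append(f"{stem}{suffix}:{stem}{tags}")
    return out


if __name__ == "__main__":
    for line in expand(ENTRIES, PARDEFS):
        print(line)
```

Four stems times two paradigm entries give the eight strings listed above.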
But our bidix is only:
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
  </pardefs>
  <section id="main" type="standard">
    <e><p><l>beer<s n="n"/></l><r>garagardo<s n="n"/></r></p></e>
    <e><p><l>house<s n="n"/></l><r>etxe<s n="n"/></r></p></e>
  </section>
</dictionary>
We don't want the monodix to include the entries for "computer" and "school": the analyser would recognise them, but the bidix has no translation for them, so they would show up marked with @ in our output.
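The effect can be modelled with a small Python sketch. It is a simplification that assumes the bidix is a plain mapping from a left-side prefix to a right-side prefix; the real lookup runs on a compiled transducer, and the names `BIDIX` and `lookup` are hypothetical.

```python
# Simplified stand-in for the bidix above: left-side prefix -> right-side prefix.
BIDIX = {
    "beer<n>": "garagardo<n>",
    "house<n>": "etxe<n>",
}


def lookup(analysis, bidix):
    """Translate an analysis, or mark it with '@' if no bidix entry matches."""
    for left, right in bidix.items():
        if analysis.startswith(left):
            # Carry over whatever tags follow the matched prefix.
            return right + analysis[len(left):]
    return "@" + analysis


if __name__ == "__main__":
    print(lookup("beer<n><pl>", BIDIX))    # garagardo<n><pl>
    print(lookup("school<n><sg>", BIDIX))  # @school<n><sg>
```

Every analysis that the monodix produces but the bidix cannot match ends up with the @ mark, which is why those entries should be trimmed out.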
Here is a Makefile that, given two dictionaries, test-en.dix (the monodix) and test-en-eu.dix (the bidix), produces a binary transducer of the monodix (in HFST format for now) containing only the strings whose analyses start with a prefix found in the bidix.
all:
	lt-comp lr test-en.dix test-en.bin
	lt-comp lr test-en-eu.dix test-en-eu.bin
	lt-print test-en.bin > test-en.att
	lt-print test-en-eu.bin > test-en-eu.att
	hfst-txt2fst -e ε < test-en.att > test-en.fst
	hfst-txt2fst -e ε < test-en-eu.att > test-en-eu.fst
	hfst-invert test-en.fst -o test-en.mor.fst
	hfst-project -p upper test-en-eu.fst > test-en-eu.en.fst
	echo " ?* " | hfst-regexp2fst > any.fst
	hfst-concatenate -1 test-en-eu.en.fst -2 any.fst -o test-en-eu.en-prefixes.fst
	hfst-compose-intersect -1 test-en-eu.en-prefixes.fst -2 test-en.mor.fst | hfst-invert -o test-en.trimmed.fst

clean:
	rm *.bin *.att *.fst
If we run hfst-fst2strings, we get:
$ hfst-fst2strings test-en.trimmed.fst
beer:beer<n><sg>
beers:beer<n><pl>
house:house<n><sg>
houses:house<n><pl>
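What the Makefile computes can be modelled on plain string pairs. The Python sketch below is an approximation under stated assumptions: it treats the monodix as a list of (surface, analysis) pairs and the bidix as a list of (left, right) pairs, and it replaces the transducer operations with a prefix test. The names `MONO`, `BIDIX_PAIRS` and `trim` are hypothetical.

```python
# (surface, analysis) pairs, as listed by lt-expand for the monodix.
MONO = [
    ("beer", "beer<n><sg>"), ("beers", "beer<n><pl>"),
    ("school", "school<n><sg>"), ("schools", "school<n><pl>"),
    ("computer", "computer<n><sg>"), ("computers", "computer<n><pl>"),
    ("house", "house<n><sg>"), ("houses", "house<n><pl>"),
]

# (left, right) pairs of the bidix.
BIDIX_PAIRS = [("beer<n>", "garagardo<n>"), ("house<n>", "etxe<n>")]


def trim(mono_pairs, bidix_pairs):
    """Keep monodix pairs whose analysis starts with a bidix left side."""
    # Models projecting the bidix onto its English side.
    prefixes = {left for left, _ in bidix_pairs}
    # Models concatenating with "?*" and composing with the monodix:
    # any continuation is allowed after a matching prefix.
    return [(s, a) for s, a in mono_pairs
            if any(a.startswith(p) for p in prefixes)]


if __name__ == "__main__":
    for surface, analysis in trim(MONO, BIDIX_PAIRS):
        print(f"{surface}:{analysis}")
```

On this data the sketch keeps the same four pairs as the trimmed transducer: the entries for "beer" and "house".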
How to implement this in lttoolbox directly
It might be nice to see it as:
$ lt-comp rl apertium-en-ca.en-ca.dix ca-en.autobil.bin
$ lt-comp lr apertium-en-ca.ca.dix ca-en.automorf.bin ca-en.autobil.bin
Here the left side of the second transducer (ca-en.autobil.bin) is converted into prefixes, and only those strings from the first transducer that match one of these prefixes are included in the final compilation.
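One way to picture the prefix matching is a trie over the symbols of the bidix left sides, walked in step with each analysis of the monodix. The Python sketch below is only an illustration of that idea, not lttoolbox code: it assumes each multicharacter tag such as `<n>` is a single symbol, and the names `symbols`, `build_trie` and `matches_prefix` are hypothetical.

```python
import re

END = None  # hypothetical marker key: a bidix left side ends at this node


def symbols(s):
    """Split 'beer<n><sg>' into ['b', 'e', 'e', 'r', '<n>', '<sg>']."""
    return re.findall(r"<[^>]+>|.", s)


def build_trie(prefixes):
    """Build a nested-dict trie over the symbol sequences of the prefixes."""
    root = {}
    for prefix in prefixes:
        node = root
        for sym in symbols(prefix):
            node = node.setdefault(sym, {})
        node[END] = True
    return root


def matches_prefix(analysis, trie):
    """True if some bidix left side is a symbol-wise prefix of the analysis."""
    node = trie
    for sym in symbols(analysis):
        if END in node:
            return True   # a full prefix was consumed; the rest is free
        if sym not in node:
            return False  # the analysis leaves every bidix prefix
        node = node[sym]
    return END in node


if __name__ == "__main__":
    trie = build_trie(["beer<n>", "house<n>"])
    for analysis in ["beer<n><pl>", "school<n><sg>", "house<n>"]:
        print(analysis, matches_prefix(analysis, trie))
```

Working on whole symbols rather than characters means a tag is compared as a unit, so a prefix ending in `<n>` is never matched against a longer tag name.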