Difference between revisions of "Automatically trimming a monodix"
Revision as of 17:58, 6 October 2012
Suppose you have the monodix:
<pre>
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
    <pardef n="beer__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="beer"><i>beer</i><par n="beer__n"/></e>
    <e lm="school"><i>school</i><par n="beer__n"/></e>
    <e lm="computer"><i>computer</i><par n="beer__n"/></e>
    <e lm="house"><i>house</i><par n="beer__n"/></e>
  </section>
</dictionary>
</pre>
It generates the following strings:
<pre>
$ lt-expand test-en.dix
beer:beer<n><sg>
beers:beer<n><pl>
school:school<n><sg>
schools:school<n><pl>
computer:computer<n><sg>
computers:computer<n><pl>
house:house<n><sg>
houses:house<n><pl>
</pre>
But our bidix is only:
<pre>
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
  </pardefs>
  <section id="main" type="standard">
    <e><p><l>beer<s n="n"/></l><r>garagardo<s n="n"/></r></p></e>
    <e><p><l>house<s n="n"/></l><r>etxe<s n="n"/></r></p></e>
  </section>
</dictionary>
</pre>
We don't want to include the entries for "computer" and "school": they have no translation in the bidix, so they would come out marked with <code>@</code> in our output.
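To see which entries are affected before building anything, the same comparison can be imitated on plain text. The sketch below is only an illustration: <code>uncovered</code> is a hypothetical helper, not an Apertium tool, and it assumes the <code>lt-expand</code> output and a hand-written list of bidix left-side prefixes (such as <code>beer<n></code>) are available as files.

```shell
#!/bin/sh
# Hypothetical helper (not part of Apertium): print the analyses that no
# bidix prefix covers, i.e. the ones that would come out marked with @.
#   $1: "surface:analysis" lines, as printed by lt-expand
#   $2: bidix left-side prefixes, one per line (e.g. beer<n>)
uncovered() {
    awk -F: -v pfile="$2" '
        BEGIN {
            # Load the prefixes, skipping blank lines.
            while ((getline line < pfile) > 0)
                if (line != "") prefixes[line] = 1
        }
        {
            covered = 0
            for (p in prefixes)
                if (index($2, p) == 1) covered = 1   # analysis starts with p
            if (!covered) print
        }
    ' "$1"
}

# Demo data copied from the example dictionaries on this page.
dir=$(mktemp -d)
cat > "$dir/expanded.txt" <<'EOF'
beer:beer<n><sg>
beers:beer<n><pl>
school:school<n><sg>
schools:school<n><pl>
computer:computer<n><sg>
computers:computer<n><pl>
house:house<n><sg>
houses:house<n><pl>
EOF
printf '%s\n' 'beer<n>' 'house<n>' > "$dir/prefixes.txt"
uncovered "$dir/expanded.txt" "$dir/prefixes.txt"
rm -r "$dir"
```

With the example data this prints the four "school" and "computer" lines, which are exactly the strings the trimmed transducer should no longer contain.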
Here is a Makefile that, given two dictionaries <code>test-en.dix</code> (the monodix) and <code>test-en-eu.dix</code> (the bidix), produces a binary transducer of the monodix (in HFST format for now) that only contains the strings whose analyses start with a left-side entry of the bidix.
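<pre>
all:
# Compile both dictionaries left-to-right and print them in ATT text format.
	lt-comp lr test-en.dix test-en.bin
	lt-comp lr test-en-eu.dix test-en-eu.bin
	lt-print test-en.bin > test-en.att
	lt-print test-en-eu.bin > test-en-eu.att
# Convert the ATT files to HFST transducers, reading ε as the epsilon symbol.
	hfst-txt2fst -e ε < test-en.att > test-en.fst
	hfst-txt2fst -e ε < test-en-eu.att > test-en-eu.fst
# Invert the monodix so that its analyses are on the input side.
	hfst-invert test-en.fst -o test-en.mor.fst
# Keep only the English (left) side of the bidix, then allow any suffix after it.
	hfst-project -p upper test-en-eu.fst > test-en-eu.en.fst
	echo " ?* " | hfst-regexp2fst > any.fst
	hfst-concatenate -1 test-en-eu.en.fst -2 any.fst -o test-en-eu.en-prefixes.fst
# Restrict the monodix to analyses with a bidix prefix and invert it back.
	hfst-compose-intersect -1 test-en-eu.en-prefixes.fst -2 test-en.mor.fst | hfst-invert -o test-en.trimmed.fst

clean:
	rm *.bin *.att *.fst
</pre>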
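The "prefixes" used by the Makefile are the left-side strings of the bidix (lemma plus tags), which it obtains with <code>hfst-project</code>. As a rough plain-text analogue, the sketch below lists them with <code>sed</code>. It is only an illustration: <code>bidix_prefixes</code> is a hypothetical helper, and it assumes one <code><e></code> entry per line containing a plain <code><p><l>…</l><r>…</r></p></code> pair, which real dictionaries do not guarantee.

```shell
#!/bin/sh
# Hypothetical helper (not an Apertium tool): list the left-side strings of a
# bidix, rewriting <s n="x"/> as <x> to match lt-expand's notation.
# Assumes one <e> entry per line and plain <p><l>...</l><r>...</r></p> pairs.
bidix_prefixes() {
    # First sed keeps only what is inside <l>...</l>; second rewrites the tags.
    sed -n 's/.*<l>\(.*\)<\/l>.*/\1/p' "$1" |
        sed 's/<s n="\([^"]*\)"\/>/<\1>/g'
}

# Demo data: the main section of the example bidix on this page.
dir=$(mktemp -d)
cat > "$dir/test-en-eu.dix" <<'EOF'
<section id="main" type="standard">
  <e><p><l>beer<s n="n"/></l><r>garagardo<s n="n"/></r></p></e>
  <e><p><l>house<s n="n"/></l><r>etxe<s n="n"/></r></p></e>
</section>
EOF
bidix_prefixes "$dir/test-en-eu.dix"
rm -r "$dir"
```

For the example bidix this prints <code>beer<n></code> and <code>house<n></code>. Every analysis kept by the Makefile starts with one of these strings, followed by anything (the <code>?*</code> transducer).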
If we run <code>hfst-fst2strings</code> on <code>test-en.trimmed.fst</code>, we get:
<pre>
$ hfst-fst2strings test-en.trimmed.fst
beer:beer<n><sg>
beers:beer<n><pl>
house:house<n><sg>
houses:house<n><pl>
</pre>