Difference between revisions of "Automatically trimming a monodix"

From Apertium
Jump to navigation Jump to search
Line 92: Line 92:
 
houses:house<n><pl>
 
houses:house<n><pl>
 
</pre>
 
</pre>
  +
  +
==How to implement this in lttoolbox directly==
  +
  +
It might be nice to see it as:
  +
<pre>
  +
  +
$ lt-comp rl apertium-en-ca.en-ca.dix ca-en.autobil.bin
  +
  +
$ lt-comp lr apertium-en-ca.ca.dix ca-en.automorf.bin ca-en.autobil.bin
  +
  +
</pre>
  +
  +
Where the left side of the second transducer (ca-en.autobil.bin) is converted into prefixes, and only strings from the first transducer which match the prefixes in the second transducer are included into the final compilation.
   
 
[[Category:Development]]
 
[[Category:Development]]

Revision as of 18:02, 6 October 2012

Suppose you have the monodix:


<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
    <pardef n="beer__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="beer"><i>beer</i><par n="beer__n"/></e>
    <e lm="school"><i>school</i><par n="beer__n"/></e>
    <e lm="computer"><i>computer</i><par n="beer__n"/></e>
    <e lm="house"><i>house</i><par n="beer__n"/></e>
  </section>
</dictionary>

It generates the following strings:

$ lt-expand test-en.dix
beer:beer<n><sg>
beers:beer<n><pl>
school:school<n><sg>
schools:school<n><pl>
computer:computer<n><sg>
computers:computer<n><pl>
house:house<n><sg>
houses:house<n><pl>

But our bidix is only:

<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
  </pardefs>
  <section id="main" type="standard">
    <e><p><l>beer<s n="n"/></l><r>garagardo<s n="n"/></r></p></e>
    <e><p><l>house<s n="n"/></l><r>etxe<s n="n"/></r></p></e>
  </section>
</dictionary>

We don't want to include the entries for "computer" and "school", because then we would get @ in our output.

Here is a Makefile that given two dictionaries test-en.dix (the monodix) and test-en-eu.dix (the bidix), will produce a binary transducer of the monodix (in HFST format for now) that only contains the strings matching prefixes in the bidix.

all:
	lt-comp lr test-en.dix test-en.bin
	lt-comp lr test-en-eu.dix test-en-eu.bin
	lt-print test-en.bin > test-en.att
	lt-print test-en-eu.bin > test-en-eu.att
	hfst-txt2fst -e ε <  test-en.att > test-en.fst
	hfst-txt2fst -e ε <  test-en-eu.att > test-en-eu.fst
	hfst-invert test-en.fst -o test-en.mor.fst
	hfst-project -p upper test-en-eu.fst > test-en-eu.en.fst
	echo " ?* " | hfst-regexp2fst > any.fst
	hfst-concatenate -1 test-en-eu.en.fst -2 any.fst -o test-en-eu.en-prefixes.fst
	hfst-compose-intersect -1 test-en-eu.en-prefixes.fst -2 test-en.mor.fst | hfst-invert -o test-en.trimmed.fst


clean:
	rm *.bin *.att *.fst

If we run hfst-fst2strings, we get:

$ hfst-fst2strings test-en.trimmed.fst
beer:beer<n><sg>
beers:beer<n><pl>
house:house<n><sg>
houses:house<n><pl>

How to implement this in lttoolbox directly

It might be nice to see it as:


$ lt-comp rl apertium-en-ca.en-ca.dix ca-en.autobil.bin

$ lt-comp lr apertium-en-ca.ca.dix ca-en.automorf.bin ca-en.autobil.bin

Where the left side of the second transducer (ca-en.autobil.bin) is converted into prefixes, and only strings from the first transducer which match the prefixes in the second transducer are included into the final compilation.