Difference between revisions of "Automatically trimming a monodix"

Revision as of 17:58, 6 October 2012

Suppose you have the monodix:


<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
    <pardef n="beer__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="beer"><i>beer</i><par n="beer__n"/></e>
    <e lm="school"><i>school</i><par n="beer__n"/></e>
    <e lm="computer"><i>computer</i><par n="beer__n"/></e>
    <e lm="house"><i>house</i><par n="beer__n"/></e>
  </section>
</dictionary>

It generates the following strings:

$ lt-expand test-en.dix
beer:beer<n><sg>
beers:beer<n><pl>
school:school<n><sg>
schools:school<n><pl>
computer:computer<n><sg>
computers:computer<n><pl>
house:house<n><sg>
houses:house<n><pl>

But our bidix is only:

<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
  </pardefs>
  <section id="main" type="standard">
    <e><p><l>beer<s n="n"/></l><r>garagardo<s n="n"/></r></p></e>
    <e><p><l>house<s n="n"/></l><r>etxe<s n="n"/></r></p></e>
  </section>
</dictionary>

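The monodix has entries for "computer" and "school" that the bidix lacks. We don't want to include these entries in the compiled analyser, because words that are analysed but cannot be translated come out marked with @ in our output.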
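To see in advance which lemmas would be affected, the analyses can be compared with the left-hand sides of the bidix at the string level. The following is only a minimal sketch: the function name `untranslatable` and the hard-coded lists are assumptions made for illustration (in practice the lists would come from running lt-expand on the two dictionaries), and it is not part of lttoolbox or HFST.

```shell
#!/bin/sh
# Analysis sides of the lt-expand output shown above.
analyses='beer<n><sg>
beer<n><pl>
school<n><sg>
school<n><pl>
computer<n><sg>
computer<n><pl>
house<n><sg>
house<n><pl>'

# Left-hand sides of the two bidix entries (hard-coded for the example).
bidix_left='beer<n> house<n>'

# Hypothetical helper: print each lemma whose analyses start with no bidix entry.
# $1: analyses, one per line; $2: whitespace-separated bidix left-hand sides.
untranslatable() {
    printf '%s\n' "$1" | while IFS= read -r analysis; do
        found=no
        for prefix in $2; do
            case $analysis in
                "$prefix"*) found=yes; break ;;
            esac
        done
        if [ "$found" = no ]; then
            # Strip everything from the first tag onwards to get the lemma.
            printf '%s\n' "${analysis%%<*}"
        fi
    done | sort -u
}

missing=$(untranslatable "$analyses" "$bidix_left")
printf '%s\n' "$missing"
```

For the example dictionaries this lists "computer" and "school", the two lemmas that the trimming should remove.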

Here is a Makefile that, given two dictionaries, test-en.dix (the monodix) and test-en-eu.dix (the bidix), produces a binary transducer of the monodix (in HFST format for now) containing only the strings whose analyses begin with a left-hand side from the bidix.

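.PHONY: all clean

all:
# Compile both dictionaries and dump them in AT&T text format.
	lt-comp lr test-en.dix test-en.bin
	lt-comp lr test-en-eu.dix test-en-eu.bin
	lt-print test-en.bin > test-en.att
	lt-print test-en-eu.bin > test-en-eu.att
# Read the AT&T files into HFST, with ε as the epsilon symbol.
	hfst-txt2fst -e ε < test-en.att > test-en.fst
	hfst-txt2fst -e ε < test-en-eu.att > test-en-eu.fst
# Invert the monodix so that the analyses are on the input side.
	hfst-invert test-en.fst -o test-en.mor.fst
# Keep only the English side of the bidix, then allow any suffix after it.
	hfst-project -p upper test-en-eu.fst > test-en-eu.en.fst
	echo " ?* " | hfst-regexp2fst > any.fst
	hfst-concatenate -1 test-en-eu.en.fst -2 any.fst -o test-en-eu.en-prefixes.fst
# Restrict the monodix to those prefixes and invert it back.
	hfst-compose-intersect -1 test-en-eu.en-prefixes.fst -2 test-en.mor.fst | hfst-invert -o test-en.trimmed.fst

clean:
	rm -f *.bin *.att *.fst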

If we run hfst-fst2strings, we get:

$ hfst-fst2strings test-en.trimmed.fst
beer:beer<n><sg>
beers:beer<n><pl>
house:house<n><sg>
houses:house<n><pl>

@@ Line 1: / Line 1: @@
-Suppose you have the monodix:
+Suppose you have the [[monodix]]:
 <pre>
@@ Line 41: / Line 40: @@
 </pre>
-But our bidix is only:
+But our [[bidix]] is only:
 <pre>
@@ Line 62: / Line 61: @@
 We don't want to include the entries for "computer" and "school", because then we would get <code>@</code> in our output.
-Here is a Makefile that given two dictionaries <code>test-en.dix</code> (the monodix) and <code>test-en-eu.dix</code> (the bidix), will produce a binary transducer of the monodix (in HFST format for now) that only contains the strings matching prefixes in the bidix.
+Here is a Makefile that given two dictionaries <code>test-en.dix</code> (the monodix) and <code>test-en-eu.dix</code> (the bidix), will produce a binary transducer of the monodix (in [[HFST]] format for now) that only contains the strings matching prefixes in the bidix.
 <pre>
@@ Line 84: / Line 83: @@
 </pre>
+If we run <code>hfst-fst2strings</code>, we get:
+<pre>
+$ hfst-fst2strings test-en.trimmed.fst
+beer:beer<n><sg>
+beers:beer<n><pl>
+house:house<n><sg>
+houses:house<n><pl>
+</pre>
 [[Category:Development]]
