Difference between revisions of "Automatically trimming a monodix"

From Apertium

Revision as of 09:22, 7 October 2012

At the moment we have a problem in Apertium regarding copies. When we start a new language pair based on an existing Apertium resource (e.g. a monodix), we usually make a copy of that resource and then change it as needed. This is less than ideal because:

  • it means that any improvements we make aren't automatically carried over to the dictionary we copied from
  • it means that any improvements in the dictionary we copied from aren't carried over into our new dictionary.

So why do we do it? Because of Testvoc. If we have entries in our monodix that aren't in our bidix, then we get lots of @ in our output (the mark Apertium puts on a lemma that the bidix could not translate). This is bad.

One way to get around the problem is to use ad hoc scripts for trimming the dictionaries (see for example: apertium-af-nl, apertium-sme-nob and the trim-lexc.py script in trunk/apertium-tools), but these are less than ideal because usually they have to include specific hacks for differences in dictionary format.

Another way might be to take the intersection of our monodix and our bidix, and use that for analysis. That's what is described below.

Note: This doesn't take care of the whole testvoc problem. It would still be necessary to get rid of # symbols, which mark forms that the target-language generator cannot generate.

Example

Suppose you have the monodix:


<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
    <pardef n="beer__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="beer"><i>beer</i><par n="beer__n"/></e>
    <e lm="school"><i>school</i><par n="beer__n"/></e>
    <e lm="computer"><i>computer</i><par n="beer__n"/></e>
    <e lm="house"><i>house</i><par n="beer__n"/></e>
  </section>
</dictionary>

It generates the following strings:

$ lt-expand test-en.dix
beer:beer<n><sg>
beers:beer<n><pl>
school:school<n><sg>
schools:school<n><pl>
computer:computer<n><sg>
computers:computer<n><pl>
house:house<n><sg>
houses:house<n><pl>

But our bidix is only:

<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
  </pardefs>
  <section id="main" type="standard">
    <e><p><l>beer<s n="n"/></l><r>garagardo<s n="n"/></r></p></e>
    <e><p><l>house<s n="n"/></l><r>etxe<s n="n"/></r></p></e>
  </section>
</dictionary>

We don't want to include the entries for "computer" and "school", because then we would get @ in our output.
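
To see which monodix entries would end up with @, one can compare the lt-expand output against the left sides of the bidix. The following is a minimal Python sketch, not an Apertium tool: the two lists are copied by hand from the example above, and the name missing_from_bidix is made up for illustration.

```python
# lt-expand output of the example monodix, as "surface:analysis" lines.
EXPANDED = [
    "beer:beer<n><sg>",
    "beers:beer<n><pl>",
    "school:school<n><sg>",
    "schools:school<n><pl>",
    "computer:computer<n><sg>",
    "computers:computer<n><pl>",
    "house:house<n><sg>",
    "houses:house<n><pl>",
]

# Left sides of the two example bidix entries (copied by hand).
BIDIX_LEFT = ["beer<n>", "house<n>"]


def missing_from_bidix(expanded, bidix_left):
    """Return the analyses that do not start with any bidix left side."""
    missing = []
    for line in expanded:
        _surface, analysis = line.split(":", 1)
        if not any(analysis.startswith(left) for left in bidix_left):
            missing.append(analysis)
    return missing


if __name__ == "__main__":
    for analysis in missing_from_bidix(EXPANDED, BIDIX_LEFT):
        print(analysis)
```

On the example data this lists the four analyses of "school" and "computer", which are exactly the entries we want to leave out.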

Here is a Makefile that, given two dictionaries, test-en.dix (the monodix) and test-en-eu.dix (the bidix), produces a binary transducer of the monodix (in HFST format for now) containing only the strings whose analyses begin with a left-side entry of the bidix.

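.PHONY: all clean

all:
# Compile both dictionaries and dump them in AT&T text format.
	lt-comp lr test-en.dix test-en.bin
	lt-comp lr test-en-eu.dix test-en-eu.bin
	lt-print test-en.bin > test-en.att
	lt-print test-en-eu.bin > test-en-eu.att
# Convert the text dumps to HFST transducers (ε is the epsilon symbol).
	hfst-txt2fst -e ε <  test-en.att > test-en.fst
	hfst-txt2fst -e ε <  test-en-eu.att > test-en-eu.fst
# Invert the monodix so that the analyses are on the input side.
	hfst-invert test-en.fst -o test-en.mor.fst
# Keep only the English (left) side of the bidix.
	hfst-project -p upper test-en-eu.fst > test-en-eu.en.fst
# Append "any string" so that each bidix entry becomes a prefix.
	echo " ?* " | hfst-regexp2fst > any.fst
	hfst-concatenate -1 test-en-eu.en.fst -2 any.fst -o test-en-eu.en-prefixes.fst
# Keep the monodix paths that start with a bidix prefix, then invert back.
	hfst-compose-intersect -1 test-en-eu.en-prefixes.fst -2 test-en.mor.fst | hfst-invert -o test-en.trimmed.fst

clean:
	rm -f *.bin *.att *.fst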

If we run hfst-fst2strings, we get:

$ hfst-fst2strings test-en.trimmed.fst
beer:beer<n><sg>
beers:beer<n><pl>
house:house<n><sg>
houses:house<n><pl>
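
The Makefile steps can be mimicked on plain sets of string pairs, which may make it easier to see why the result is what it is. This is only a sketch: the functions invert, project_upper and compose_with_prefixes are hypothetical stand-ins named after the HFST tools, they operate on finite sets rather than transducers, and they compare characters where the real tools compare symbols.

```python
def invert(pairs):
    """Swap the two sides of every pair (stand-in for hfst-invert)."""
    return {(b, a) for a, b in pairs}


def project_upper(pairs):
    """Keep only the first side of every pair (stand-in for hfst-project -p upper)."""
    return {a for a, _ in pairs}


def compose_with_prefixes(prefixes, pairs):
    """Keep pairs whose first side starts with a prefix.

    Stand-in for concatenating ?* to the prefixes and composing the result
    with the inverted monodix.
    """
    return {(a, b) for a, b in pairs if any(a.startswith(p) for p in prefixes)}


# (surface, analysis) pairs of the example monodix, as printed by lt-expand.
monodix = {
    ("beer", "beer<n><sg>"), ("beers", "beer<n><pl>"),
    ("school", "school<n><sg>"), ("schools", "school<n><pl>"),
    ("computer", "computer<n><sg>"), ("computers", "computer<n><pl>"),
    ("house", "house<n><sg>"), ("houses", "house<n><pl>"),
}
# (left, right) pairs of the example bidix.
bidix = {("beer<n>", "garagardo<n>"), ("house<n>", "etxe<n>")}

mor = invert(monodix)                 # analysis -> surface
prefixes = project_upper(bidix)       # English side of the bidix
trimmed = invert(compose_with_prefixes(prefixes, mor))

if __name__ == "__main__":
    for surface, analysis in sorted(trimmed):
        print(f"{surface}:{analysis}")
```

Run on the example data, this prints the same four lines as hfst-fst2strings above.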

How to implement this in lttoolbox directly

It might be nice to be able to do this in lt-comp itself, by passing the compiled bidix as an extra argument:


$ lt-comp rl apertium-en-ca.en-ca.dix ca-en.autobil.bin

$ lt-comp lr apertium-en-ca.ca.dix ca-en.automorf.bin ca-en.autobil.bin

Here the left side of the transducer given as the extra argument (ca-en.autobil.bin) would be converted into prefixes, and only those strings of the monodix being compiled that match one of these prefixes would be included in the final compilation.
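
As a rough idea of what such prefix matching could look like at the symbol level, here is a hedged Python sketch. It does not reflect how lttoolbox is implemented: the trie layout and the names symbols, build_trie and has_bidix_prefix are all assumptions made for illustration, and tags such as <n> are treated as single symbols.

```python
import re

# A symbol is either a whole tag such as <n> or a single character.
SYMBOL = re.compile(r"<[^>]+>|.")


def symbols(text):
    """Split a string into tags and single characters."""
    return SYMBOL.findall(text)


def build_trie(entries):
    """Build a nested-dict trie over the symbols of each bidix left side."""
    root = {}
    for entry in entries:
        node = root
        for sym in symbols(entry):
            node = node.setdefault(sym, {})
        node[None] = True  # marks the end of a bidix entry
    return root


def has_bidix_prefix(trie, analysis):
    """Return True if some bidix entry is a symbol-wise prefix of analysis."""
    node = trie
    for sym in symbols(analysis):
        if None in node:      # a complete bidix entry has been matched
            return True
        node = node.get(sym)
        if node is None:      # no bidix entry continues with this symbol
            return False
    return None in node


if __name__ == "__main__":
    trie = build_trie(["beer<n>", "house<n>"])
    for analysis in ["beer<n><pl>", "school<n><sg>"]:
        print(analysis, has_bidix_prefix(trie, analysis))
```

A real implementation would walk the two transducers together instead of enumerating strings, since a monodix with cyclic sections cannot be expanded into a finite list.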