Automatically trimming a monodix

From Apertium
Latest revision as of 14:37, 7 October 2014


There used to be a problem in Apertium regarding copies. When we made a new language pair based on some existing resource in Apertium (e.g. a monodix), we would make a copy of that resource and then change it as needed. This was less than ideal because:

  • it meant that any improvements we made weren't automatically carried over to the dictionary we copied from
  • it meant that any improvements in the dictionary we copied from weren't carried over into our new dictionary.

So why did we do it? Because of testvoc. If we had entries in our monodix that were not in our bidix, we got lots of @ in our output. This is bad.

One way we got around the problem was to use ad hoc scripts for trimming the dictionaries (see for example: apertium-af-nl, apertium-sme-nob and the trim-lexc.py script in trunk/apertium-tools), but these were less than ideal because usually they had to include specific hacks for differences in dictionary format.

The new and improved way is to take the intersection of our monodix and our bidix binaries, and use that for analysis. That's what is described below.

Note: This doesn't take care of the whole testvoc problem. It is still necessary to get rid of # symbols, which often arise due to bugs in transfer code.

Example

Suppose you have the monodix:


<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
    <pardef n="beer__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="beer"><i>beer</i><par n="beer__n"/></e>
    <e lm="school"><i>school</i><par n="beer__n"/></e>
    <e lm="computer"><i>computer</i><par n="beer__n"/></e>
    <e lm="house"><i>house</i><par n="beer__n"/></e>
  </section>
</dictionary>

It generates the following strings:

$ lt-expand test-en.dix
beer:beer<n><sg>
beers:beer<n><pl>
school:school<n><sg>
schools:school<n><pl>
computer:computer<n><sg>
computers:computer<n><pl>
house:house<n><sg>
houses:house<n><pl>

But our bidix is only:

<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
  </pardefs>
  <section id="main" type="standard">
    <e><p><l>beer<s n="n"/></l><r>garagardo<s n="n"/></r></p></e>
    <e><p><l>house<s n="n"/></l><r>etxe<s n="n"/></r></p></e>
  </section>
</dictionary>

We don't want to include the entries for "computer" and "school", because then we would get @ in our output.
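The goal can be sketched in a few lines of Python that work on the expanded strings instead of on transducers. This is only an illustration of the idea, not how the tools below are implemented; the function name trim and the hand-written list of bidix left sides are assumptions made for this sketch.

```python
def trim(monodix_pairs, bidix_prefixes):
    """Keep surface:analysis pairs whose analysis starts with a bidix left side."""
    prefixes = tuple(bidix_prefixes)
    return [
        pair for pair in monodix_pairs
        if pair.split(":", 1)[1].startswith(prefixes)
    ]

# Output of `lt-expand test-en.dix`, copied from the example above.
monodix_pairs = [
    "beer:beer<n><sg>", "beers:beer<n><pl>",
    "school:school<n><sg>", "schools:school<n><pl>",
    "computer:computer<n><sg>", "computers:computer<n><pl>",
    "house:house<n><sg>", "houses:house<n><pl>",
]

# Left sides of the two bidix entries (written out by hand for this sketch).
bidix_prefixes = ["beer<n>", "house<n>"]

trimmed = trim(monodix_pairs, bidix_prefixes)
for pair in trimmed:
    print(pair)
```

Working on expanded strings like this only scales to toy dictionaries; the methods described below do the same prefix matching directly on the compiled transducers.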

HFST

Here is a Makefile that, given two dictionaries, test-en.dix (the monodix) and test-en-eu.dix (the bidix), produces a binary transducer of the monodix in HFST format that contains only the strings matching prefixes in the bidix.

all:
	lt-comp lr test-en.dix test-en.bin
	lt-comp lr test-en-eu.dix test-en-eu.bin
	lt-print test-en.bin > test-en.att
	lt-print test-en-eu.bin > test-en-eu.att
	hfst-txt2fst -e ε <  test-en.att > test-en.fst
	hfst-txt2fst -e ε <  test-en-eu.att > test-en-eu.fst
	hfst-invert test-en.fst -o test-en.mor.fst
	hfst-project -p upper test-en-eu.fst > test-en-eu.en.fst
	echo " ?* " | hfst-regexp2fst > any.fst
	hfst-concatenate -1 test-en-eu.en.fst -2 any.fst -o test-en-eu.en-prefixes.fst
	hfst-compose-intersect -1 test-en-eu.en-prefixes.fst -2 test-en.mor.fst | hfst-invert -o test-en.trimmed.fst


clean:
	rm *.bin *.att *.fst

If we run hfst-fst2strings, we get:

$ hfst-fst2strings test-en.trimmed.fst
beer:beer<n><sg>
beers:beer<n><pl>
house:house<n><sg>
houses:house<n><pl>

Compounds vs trimming in HFST

The sme.lexc needs a more complicated trimming system, due to compounds.

Say you have cake n sg, cake n pl, beer n pl and beer n sg in monodix, while bidix has beer n and wine n. The HFST method without compounding is to intersect (cake|beer) n (sg|pl) with (beer|wine) n .* to get beer n (sg|pl).

But the sme lexicon represents compounding as a transition from the end of the singular noun to the beginning of the (noun) transducer, so a compounding HFST actually looks like

((cake|beer) n sg)*(cake|beer) n (sg|pl)

The intersection of this with

(beer|wine) n .*

is

beer n sg ((cake|beer) n sg)* (cake|beer) n (sg|pl) | beer n (sg|pl)

when it should have been

(beer n sg)* beer n (sg|pl)
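The leak can be reproduced with a small Python enumeration. This is a sketch under simplifying assumptions: compounds are limited to two parts, strings are plain space-separated tokens as in the expressions above, and the helper names (compounds, naive_ok, wanted_ok) are made up for the illustration.

```python
from itertools import product

NOUNS = ["cake", "beer"]      # monodix lemmas
BIDIX = ("beer n", "wine n")  # bidix left sides

def compounds(max_parts):
    """Enumerate ((cake|beer) n sg)* (cake|beer) n (sg|pl) up to max_parts parts."""
    for n in range(1, max_parts + 1):
        for lemmas in product(NOUNS, repeat=n):
            for number in ("sg", "pl"):
                parts = [f"{lemma} n sg" for lemma in lemmas[:-1]]
                parts.append(f"{lemmas[-1]} n {number}")
                yield parts

def naive_ok(parts):
    # Prefix intersection: only the start of the whole string is checked.
    return " ".join(parts).startswith(BIDIX)

def wanted_ok(parts):
    # What we want: every compound part is covered by the bidix.
    return all(part.startswith(BIDIX) for part in parts)

naive = [" ".join(p) for p in compounds(2) if naive_ok(p)]
wanted = [" ".join(p) for p in compounds(2) if wanted_ok(p)]
leaked = [s for s in naive if s not in wanted]
print(leaked)
```

The leaked strings are the compounds that start with beer but continue with cake, which the bidix cannot translate.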


Lttoolbox doesn't represent compounding by extra circular transitions, but instead by a special restart symbol interpreted while analysing, so lt-trim is able to understand compounds by simply skipping the compound tags.

lttoolbox actually has another type of multiword that is more similar to how compounds work in HFST: the <j/> (JOIN) multiword. Here the full path is in the FST, like foo<tag>+bar<othertag>. lt-trim handles this: when it sees a +, it moves back to the start in the bidix but keeps going from where it was in the monodix.

In sme-nob, we ensured that anything we wanted pretransfer to split would have a + in it, by using tag relabelling. So the FST would have paths like foo<tag>+bar<othertag>, which are split into foo<tag> and bar<othertag>. (Unlike with JOIN, we can have several + symbols in a row.)

The downside is that lexc writers really have to ensure that anything that is supposed to be two units in the bidix has a +, but they have to do that anyway!


Here's a make recipe for trimming with bidix [^+]* (+ bidix [^+]*)*:

.deps/$(PREFIX1).autobil.prefixes: $(PREFIX1).autobil.bin .deps/.d
        lt-print $< | sed 's/ /@_SPACE_@/g' > .deps/$(PREFIX1).autobil.att
        hfst-txt2fst -e ε <  .deps/$(PREFIX1).autobil.att > .deps/$(PREFIX1).autobil.hfst
        hfst-project -p upper .deps/$(PREFIX1).autobil.hfst > .deps/$(PREFIX1).autobil.upper
        echo ' [ ? - %+ ]* ' | hfst-regexp2fst > .deps/any-nonplus.hfst
        hfst-concatenate -1 .deps/$(PREFIX1).autobil.upper -2 .deps/any-nonplus.hfst -o .deps/$(PREFIX1).autobil.nonplussed # bidix [^+]*
        echo ' %+ ' | hfst-regexp2fst > .deps/single-plus.hfst
        hfst-concatenate -1 .deps/single-plus.hfst -2 .deps/$(PREFIX1).autobil.nonplussed -o .deps/$(PREFIX1).autobil.postplus # + bidix [^+]*
        hfst-repeat -f0 -t2 -i .deps/$(PREFIX1).autobil.postplus -o .deps/$(PREFIX1).autobil.postplus.0,2 # (+ bidix [^+]*){0,2} gives at most triple-compounds
        hfst-concatenate -1 .deps/$(PREFIX1).autobil.nonplussed -2 .deps/$(PREFIX1).autobil.postplus.0,2 -o $@
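The language accepted by the transducer this recipe builds can be approximated with an ordinary Python regular expression. This is a sketch for checking individual analyses by hand, not a replacement for the HFST pipeline; the function names and the two example bidix prefixes are assumptions, and the upper bound of two extra parts mirrors the hfst-repeat -f0 -t2 step.

```python
import re

def trimming_regex(bidix_prefixes, max_extra_parts=2):
    """Build bidix [^+]* (+ bidix [^+]*){0,max_extra_parts} as a Python regex."""
    bidix = "(?:" + "|".join(re.escape(p) for p in bidix_prefixes) + ")"
    part = bidix + r"[^+]*"
    return re.compile(rf"{part}(?:\+{part}){{0,{max_extra_parts}}}")

# Hypothetical bidix left sides, matching the foo/bar paths used above.
pattern = trimming_regex(["foo<tag>", "bar<othertag>"])

def accepted(analysis):
    """True if every +-separated part starts with some bidix prefix."""
    return pattern.fullmatch(analysis) is not None

print(accepted("foo<tag>+bar<othertag>"))   # both parts are in the bidix
print(accepted("foo<tag>+baz<othertag>"))   # second part is not
```

As in the recipe, a compound with more than three parts is rejected even if each part is in the bidix, because the repetition is bounded.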

lttoolbox

The command is named lt-trim.

$ lt-comp rl apertium-en-ca.en-ca.dix ca-en.autobil.bin

$ lt-comp lr apertium-en-ca.ca.dix ca-en.automorf-full.bin 

$ lt-trim ca-en.automorf-full.bin  ca-en.autobil.bin  ca-en.automorf.bin

where the left side of the second transducer (ca-en.autobil.bin) is altered in two steps:

  • first, a ".*" is appended to it (so if "foo<vblex>" is in there, it will let through "foo<vblex><pres>");
  • then all lemqs (lemma queues) are moved to come after the tags (so if "foo<vblex># bar" is in there, it will let through "foo<vblex><pres># bar").

Only strings from the first transducer whose right side (the analysis) matches the input side of the altered second transducer are included in the final compilation.
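The effect of the altered bidix on individual analyses can be imitated with a string-level Python sketch. This is a simplification and not lt-trim's actual algorithm, which operates on the transducers: here the room left for extra material before a lemq is modelled as zero or more extra tags, and the function names are hypothetical.

```python
import re

def bidix_entry_to_regex(entry):
    """Turn a bidix left side into a regex that tolerates extra tags."""
    head, sep, lemq = entry.partition("#")
    if sep:
        # Entry with a lemq: allow extra tags between the tags and the lemq.
        extra_tags = r"(?:<[^<>]+>)*"
        return re.compile(re.escape(head) + extra_tags + re.escape(sep + lemq))
    # Entry without a lemq: plain prefix match, i.e. ".*" appended.
    return re.compile(re.escape(head) + ".*")

def lets_through(bidix_entries, analysis):
    """True if some bidix entry accepts the monodix analysis."""
    return any(
        bidix_entry_to_regex(entry).fullmatch(analysis)
        for entry in bidix_entries
    )

entries = ["foo<vblex>", "foo<vblex># bar"]
print(lets_through(entries, "foo<vblex><pres>"))
print(lets_through(entries, "foo<vblex><pres># bar"))
```

An analysis such as "foo<n><pres># bar" is rejected by both entries, because the tags given in the bidix must match before any extra tags are allowed.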

See also