Weighting with the bilingual dictionary

Latest revision as of 23:16, 22 May 2020

$ cat bidix.dix 
<dictionary>
<alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
<sdefs>
<sdef n="n"/>
<sdef n="vblex"/>
<sdef n="inf"/>
<sdef n="sg"/>
<sdef n="pl"/>
</sdefs>
<section id="main" type="standard">
<e><p><l>cat<s n="n"/></l><r>gato<s n="n"/></r></p></e>
<e><p><l>bat<s n="n"/></l><r>murciélago<s n="n"/></r></p></e>
<e><p><l>rat<s n="vblex"/></l><r>delatar<s n="vblex"/></r></p></e>
</section>
</dictionary>
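
The trimming step further down only cares about the left-hand sides of these entries (lemma plus first tag). As a rough illustration, not part of the original recipe, the following Python sketch flattens each `<l>` element into the string form used in the output listing. The helper names `side_to_string` and `bidix_left_sides` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Copy of bidix.dix from above, embedded so the sketch is self-contained.
BIDIX = """<dictionary>
<alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
<sdefs>
<sdef n="n"/>
<sdef n="vblex"/>
<sdef n="inf"/>
<sdef n="sg"/>
<sdef n="pl"/>
</sdefs>
<section id="main" type="standard">
<e><p><l>cat<s n="n"/></l><r>gato<s n="n"/></r></p></e>
<e><p><l>bat<s n="n"/></l><r>murciélago<s n="n"/></r></p></e>
<e><p><l>rat<s n="vblex"/></l><r>delatar<s n="vblex"/></r></p></e>
</section>
</dictionary>"""


def side_to_string(side):
    """Flatten an <l> or <r> element into lemma text plus <tag> symbols."""
    parts = [side.text or ""]
    for child in side:
        if child.tag == "s":
            parts.append("<" + child.get("n") + ">")
        parts.append(child.tail or "")
    return "".join(parts)


def bidix_left_sides(xml_text):
    """Return the source-language side of every bidix entry, in file order."""
    root = ET.fromstring(xml_text)
    return [side_to_string(left) for left in root.iter("l")]


prefixes = bidix_left_sides(BIDIX)
print(prefixes)
```

Any analysis that begins with one of these strings counts as "found" in the bilingual dictionary in the steps that follow.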


$ cat analyser.dix 
<dictionary>
<alphabet/>
<sdefs>
<sdef n="n"/>
<sdef n="vblex"/>
<sdef n="ger"/>
<sdef n="inf"/>
<sdef n="pres"/>
<sdef n="sg"/>
<sdef n="pl"/>
</sdefs>
<pardefs>
  <pardef n="cat__n">
    <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
    <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
  </pardef>
  <pardef n="pat__vblex">
    <e><p><l></l><r><s n="vblex"/><s n="inf"/></r></p></e>
    <e><p><l>s</l><r><s n="vblex"/><s n="pres"/></r></p></e>
    <e><p><l>ting</l><r><s n="vblex"/><s n="ger"/></r></p></e>
  </pardef>
</pardefs>
<section id="main" type="standard">
<e lm="cat"><i>cat</i><par n="cat__n"/></e>
<e lm="rat"><i>rat</i><par n="cat__n"/></e>
<e lm="bat"><i>bat</i><par n="cat__n"/></e>
<e lm="pat"><i>pat</i><par n="cat__n"/></e>
<e lm="cat"><i>cat</i><par n="pat__vblex"/></e>
<e lm="rat"><i>rat</i><par n="pat__vblex"/></e>
<e lm="bat"><i>bat</i><par n="pat__vblex"/></e>
<e lm="pat"><i>pat</i><par n="pat__vblex"/></e>
</section>
</dictionary>
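
To make it easier to see which surface form and analysis pairs this dictionary defines, here is a small Python sketch that expands the paradigms by hand. It is only an approximation of what `lt-comp` builds (it handles just the `<i>` plus `<par>` entry shape used here), and the function names `flatten` and `expand` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Copy of analyser.dix from above, embedded so the sketch is self-contained.
ANALYSER = """<dictionary>
<alphabet/>
<sdefs>
<sdef n="n"/>
<sdef n="vblex"/>
<sdef n="ger"/>
<sdef n="inf"/>
<sdef n="pres"/>
<sdef n="sg"/>
<sdef n="pl"/>
</sdefs>
<pardefs>
  <pardef n="cat__n">
    <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
    <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
  </pardef>
  <pardef n="pat__vblex">
    <e><p><l></l><r><s n="vblex"/><s n="inf"/></r></p></e>
    <e><p><l>s</l><r><s n="vblex"/><s n="pres"/></r></p></e>
    <e><p><l>ting</l><r><s n="vblex"/><s n="ger"/></r></p></e>
  </pardef>
</pardefs>
<section id="main" type="standard">
<e lm="cat"><i>cat</i><par n="cat__n"/></e>
<e lm="rat"><i>rat</i><par n="cat__n"/></e>
<e lm="bat"><i>bat</i><par n="cat__n"/></e>
<e lm="pat"><i>pat</i><par n="cat__n"/></e>
<e lm="cat"><i>cat</i><par n="pat__vblex"/></e>
<e lm="rat"><i>rat</i><par n="pat__vblex"/></e>
<e lm="bat"><i>bat</i><par n="pat__vblex"/></e>
<e lm="pat"><i>pat</i><par n="pat__vblex"/></e>
</section>
</dictionary>"""


def flatten(elem):
    """Turn an <l>, <r> or <i> element into text with <tag> symbols."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "s":
            parts.append("<" + child.get("n") + ">")
        parts.append(child.tail or "")
    return "".join(parts)


def expand(xml_text):
    """List (surface, analysis) pairs for entries of the form <i>stem</i><par/>."""
    root = ET.fromstring(xml_text)
    pardefs = {
        pardef.get("n"): [(flatten(p.find("l")), flatten(p.find("r")))
                          for p in pardef.iter("p")]
        for pardef in root.iter("pardef")
    }
    pairs = []
    for entry in root.find("section").findall("e"):
        stem = flatten(entry.find("i"))
        for suffix, tags in pardefs[entry.find("par").get("n")]:
            pairs.append((stem + suffix, stem + tags))
    return pairs


pairs = expand(ANALYSER)
for surface, analysis in pairs:
    print(surface + ":" + analysis)
```

The four stems each take the noun paradigm (two forms) and the verb paradigm (three forms), which gives the twenty pairs seen in the output listing.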

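Solution using lt-trim. Analyses that the bilingual dictionary can translate keep weight 0; every other analysis is penalised with extra weight: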

.PHONY: all clean

all:
# Compile both dictionaries, then keep only the analyses covered by the bidix.
	lt-comp lr analyser.dix analyser.bin
	lt-comp lr bidix.dix bidix.bin
	lt-trim analyser.bin bidix.bin analyser-found.bin
# Convert the full and the trimmed analyser to HFST via AT&T text format.
	lt-print analyser.bin > analyser.att
	lt-print analyser-found.bin > analyser-found.att
	hfst-txt2fst -e ε analyser.att -o analyser.hfst
	hfst-txt2fst -e ε analyser-found.att -o analyser-found.hfst
# Unfound = full minus found; add weight to the unfound paths only.
	hfst-subtract -1 analyser.hfst -2 analyser-found.hfst -o analyser-unfound.hfst
	hfst-reweight -a 1 analyser-unfound.hfst -o analyser-unfound.weighted.hfst
# Recombine both halves and compile the result back to lttoolbox format.
	hfst-union -1 analyser-unfound.weighted.hfst -2 analyser-found.hfst -o analyser.weighted.hfst
	hfst-fst2txt analyser.weighted.hfst -o analyser.weighted.att
	lt-comp lr analyser.weighted.att analyser.weighted.bin

clean:
	rm -f *.hfst *.att
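
The logic of the recipe above can be modelled without any transducer tools. The Python sketch below is a toy model, not how lttoolbox or HFST work internally. It assumes that an analysis counts as "found" when it starts with a bidix left-hand side, and that the added weight is one unit per arc plus one for the final state, with the paradigm suffix and tags aligned symbol by symbol and padded with epsilons. Under those assumptions it reproduces the weights in the listing below. All names here are illustrative.

```python
import re

# Toy stand-ins for the two dictionaries (assumed to mirror the files above).
STEMS = ["cat", "rat", "bat", "pat"]
NOUN = [("", "<n><sg>"), ("s", "<n><pl>")]
VERB = [("", "<vblex><inf>"), ("s", "<vblex><pres>"), ("ting", "<vblex><ger>")]
BIDIX_PREFIXES = ["cat<n>", "bat<n>", "rat<vblex>"]


def n_symbols(text):
    """Count symbols: a <tag> is one symbol, any other character is one."""
    return len(re.findall(r"<[^>]+>|.", text))


def path_length(stem, suffix, tags):
    """Assumed arc count: one arc per stem letter, then suffix/tags in parallel."""
    return len(stem) + max(len(suffix), n_symbols(tags))


def weigh(penalty=1.0):
    """Map (surface, analysis) to a weight: 0 if found in the bidix, else penalised."""
    weighted = {}
    for stem in STEMS:
        for paradigm in (NOUN, VERB):
            for suffix, tags in paradigm:
                analysis = stem + tags
                found = any(analysis.startswith(p) for p in BIDIX_PREFIXES)
                # Unfound paths: penalty on every arc plus the final state (assumption).
                cost = 0.0 if found else penalty * (path_length(stem, suffix, tags) + 1)
                weighted[(stem + suffix, analysis)] = cost
    return weighted


weighted = weigh()
for (surface, analysis), cost in weighted.items():
    print(f"{surface}:{analysis}\t{cost:g}")
```

Because the penalty grows with path length in this model, longer forms such as the gerunds end up with a higher weight than the shorter forms, even though all of them are simply "not in the bidix".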

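Output (the paths of analyser.weighted.hfst with their weights, as printed by hfst-fst2strings -w analyser.weighted.hfst):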

pat:pat<vblex><inf>	6
pats:pat<vblex><pres>	6
patting:pat<vblex><ger>	8
pat:pat<n><sg>	6
pats:pat<n><pl>	6
bat:bat<vblex><inf>	6
bats:bat<vblex><pres>	6
batting:bat<vblex><ger>	8
rat:rat<n><sg>	6
rats:rat<n><pl>	6
cat:cat<vblex><inf>	6
cats:cat<vblex><pres>	6
catting:cat<vblex><ger>	8
cat:cat<n><sg>	0
cats:cat<n><pl>	0
rat:rat<vblex><inf>	0
rats:rat<vblex><pres>	0
ratting:rat<vblex><ger>	0
bat:bat<n><sg>	0
bats:bat<n><pl>	0


TODO:

  • Weight the bidix with IBM Model 1 (e.g. using fast_align), then propagate those weights to the FST.