Weighting with the bilingual dictionary
Latest revision as of 23:16, 22 May 2020
<pre>
$ cat bidix.dix
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"/>
    <sdef n="vblex"/>
    <sdef n="inf"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <section id="main" type="standard">
    <e><p><l>cat<s n="n"/></l><r>gato<s n="n"/></r></p></e>
    <e><p><l>bat<s n="n"/></l><r>murciélago<s n="n"/></r></p></e>
    <e><p><l>rat<s n="vblex"/></l><r>delatar<s n="vblex"/></r></p></e>
  </section>
</dictionary>

$ cat analyser.dix
<dictionary>
  <alphabet/>
  <sdefs>
    <sdef n="n"/>
    <sdef n="vblex"/>
    <sdef n="ger"/>
    <sdef n="inf"/>
    <sdef n="pres"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
    <pardef n="cat__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
    <pardef n="pat__vblex">
      <e><p><l></l><r><s n="vblex"/><s n="inf"/></r></p></e>
      <e><p><l>s</l><r><s n="vblex"/><s n="pres"/></r></p></e>
      <e><p><l>ting</l><r><s n="vblex"/><s n="ger"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="cat"><i>cat</i><par n="cat__n"/></e>
    <e lm="rat"><i>rat</i><par n="cat__n"/></e>
    <e lm="bat"><i>bat</i><par n="cat__n"/></e>
    <e lm="pat"><i>pat</i><par n="cat__n"/></e>
    <e lm="cat"><i>cat</i><par n="pat__vblex"/></e>
    <e lm="rat"><i>rat</i><par n="pat__vblex"/></e>
    <e lm="bat"><i>bat</i><par n="pat__vblex"/></e>
    <e lm="pat"><i>pat</i><par n="pat__vblex"/></e>
  </section>
</dictionary>
</pre>
Solution using lt-trim:
<pre>
all:
	lt-comp lr analyser.dix analyser.bin
	lt-comp lr bidix.dix bidix.bin
	lt-trim analyser.bin bidix.bin analyser-found.bin
	lt-print analyser.bin > analyser.att
	lt-print analyser-found.bin > analyser-found.att
	hfst-txt2fst -e ε analyser.att -o analyser.hfst
	hfst-txt2fst -e ε analyser-found.att -o analyser-found.hfst
	hfst-subtract -1 analyser.hfst -2 analyser-found.hfst -o analyser-unfound.hfst
	hfst-reweight -a 1 analyser-unfound.hfst -o analyser-unfound.weighted.hfst
	hfst-union -1 analyser-unfound.weighted.hfst -2 analyser-found.hfst -o analyser.weighted.hfst
	hfst-fst2txt analyser.weighted.hfst -o analyser.weighted.att
	lt-comp lr analyser.weighted.att analyser.weighted.bin

clean:
	rm *.hfst *.att
</pre>
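The trim, subtract and union steps can be pictured with plain sets of (surface form, analysis) pairs in place of transducers. The following is only a rough sketch: the names `STEMS`, `PARADIGMS`, `BIDIX_LEFT`, `expand_analyser` and `trim` are made up for illustration, and trimming is simplified to "the analysis starts with the left side of some bidix entry", which is an assumption about what lt-trim does on this toy data rather than a description of its implementation.

```python
# Toy model of the Makefile steps using sets of (surface, analysis) pairs.
# All names here are hypothetical; no lttoolbox or HFST call is involved.

STEMS = ["cat", "rat", "bat", "pat"]

# (surface suffix, tags) pairs copied from the pardefs in analyser.dix.
PARADIGMS = {
    "cat__n": [("", "<n><sg>"), ("s", "<n><pl>")],
    "pat__vblex": [
        ("", "<vblex><inf>"),
        ("s", "<vblex><pres>"),
        ("ting", "<vblex><ger>"),
    ],
}

# Left sides of the three entries in bidix.dix.
BIDIX_LEFT = ["cat<n>", "bat<n>", "rat<vblex>"]


def expand_analyser():
    """Every stem combined with every paradigm, as in analyser.dix."""
    return {
        (stem + suffix, stem + tags)
        for stem in STEMS
        for entries in PARADIGMS.values()
        for suffix, tags in entries
    }


def trim(analyser, bidix_left):
    """Keep pairs whose analysis begins with some bidix left side
    (a simplification of the lt-trim step, assumed for this sketch)."""
    return {
        (surface, analysis)
        for surface, analysis in analyser
        if any(analysis.startswith(left) for left in bidix_left)
    }


analyser = expand_analyser()          # analyser.hfst
found = trim(analyser, BIDIX_LEFT)    # analyser-found.hfst
unfound = analyser - found            # hfst-subtract
combined = unfound | found            # hfst-union (weights not modelled here)

for surface, analysis in sorted(found):
    print("found  ", surface, analysis)
for surface, analysis in sorted(unfound):
    print("unfound", surface, analysis)
```

On this data the sketch puts 7 of the 20 pairs in the found set and 13 in the unfound set; only the unfound ones are passed through the reweighting step in the Makefile.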
Output
<pre>
pat:pat<vblex><inf>	6
pats:pat<vblex><pres>	6
patting:pat<vblex><ger>	8
pat:pat<n><sg>	6
pats:pat<n><pl>	6
bat:bat<vblex><inf>	6
bats:bat<vblex><pres>	6
batting:bat<vblex><ger>	8
rat:rat<n><sg>	6
rats:rat<n><pl>	6
cat:cat<vblex><inf>	6
cats:cat<vblex><pres>	6
catting:cat<vblex><ger>	8
cat:cat<n><sg>	0
cats:cat<n><pl>	0
rat:rat<vblex><inf>	0
rats:rat<vblex><pres>	0
ratting:rat<vblex><ger>	0
bat:bat<n><sg>	0
bats:bat<n><pl>	0
</pre>
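One way to read the 6 and 8 in the listing: the weight of an unfound path appears to grow with its length. The sketch below reproduces the numbers under two assumptions that are not stated on this page: that each paradigm entry's surface suffix and tag sequence are aligned symbol by symbol, with the shorter side padded by epsilons, and that `hfst-reweight -a 1` adds 1 to every transition and once more at the final state. The helper names `arc_count` and `penalty` are hypothetical.

```python
# Hypothetical helpers that reproduce the weights shown in the output listing.

def arc_count(stem, suffix, tags):
    """Transitions on one path: one per stem letter, then one per aligned
    suffix/tag position (shorter side assumed padded with epsilon)."""
    return len(stem) + max(len(suffix), len(tags))


def penalty(stem, suffix, tags, per_arc=1):
    """Assumed effect of adding `per_arc` to each transition and once more
    at the final state."""
    return per_arc * (arc_count(stem, suffix, tags) + 1)


print(penalty("pat", "", ["vblex", "inf"]))      # 6, as for pat:pat<vblex><inf>
print(penalty("pat", "ting", ["vblex", "ger"]))  # 8, as for patting:pat<vblex><ger>
print(penalty("rat", "s", ["n", "pl"]))          # 6, as for rats:rat<n><pl>
```

If this reading is right, the penalty is a side effect of path length rather than a deliberate ranking: longer unfound forms such as the gerunds are penalised more than shorter ones.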
TODO:
- Weight the bidix with IBM Model 1 (e.g. using fast_align), then propagate those weights to the FST.
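For the TODO item, a common convention for weights where lower is better is the negative log of a probability. The sketch below only shows that conversion. The probabilities are invented placeholders, not output from any aligner, and `prob_to_weight` is a hypothetical helper; how the resulting weights would be attached to the bidix or carried into the analyser is left open, as in the TODO.

```python
import math


def prob_to_weight(p):
    """Convert a probability into a cost: higher probability, lower weight."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(p)


# Placeholder translation probabilities for illustration only.
example_probs = {
    ("cat<n>", "gato<n>"): 0.9,
    ("bat<n>", "murciélago<n>"): 0.6,
    ("rat<vblex>", "delatar<vblex>"): 0.2,
}

for (left, right), p in example_probs.items():
    print(f"{left}:{right}\t{prob_to_weight(p):.3f}")
```

With this convention a certain translation (probability 1) gets weight 0, which lines up with the zero weights that found entries currently receive.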