ATT format
Revision as of 21:24, 13 March 2017 by Francis Tyers
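ATT (AT&T) format is a plain-text format for finite-state transducers. Each transition is written on its own line as four tab-separated columns: source state, target state, input symbol and output symbol. A final state is written on a line by itself.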
Both lttoolbox and HFST can read ATT format as input to compile dictionaries (lt-comp, hfst-txt2fst), and print compiled dictionaries to ATT format (lt-print, hfst-fst2txt).
Example
Say we want to represent a transducer that maps test to foo. We can do it as follows:
$ cat test.dix
<dictionary>
<alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
<sdefs>
<sdef n="n"/>
</sdefs>
<section id="main" type="standard">
<e><p><l>test</l><r>foo</r></p></e>
</section>
</dictionary>
$ lt-comp lr test.dix test.bin
main@standard 5 4
$ lt-print test.bin
0 1 t f
1 2 e o
2 3 s o
3 4 t ε
4
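To make the column layout concrete, here is a minimal Python sketch that reads unweighted ATT text and lists the string pairs it encodes. It is not part of lttoolbox or HFST: the names parse_att, string_pairs and EPSILON_SYMBOLS are made up for illustration, state 0 is assumed to be the initial state, "@0@" is assumed as an alternative epsilon spelling next to the ε printed above, and the traversal only terminates on acyclic transducers.

```python
# Epsilon spellings this sketch treats as "no symbol" (an assumption).
EPSILON_SYMBOLS = {"ε", "@0@"}


def parse_att(text):
    """Parse unweighted ATT text into (transitions, finals)."""
    transitions = []  # list of (source, target, input symbol, output symbol)
    finals = set()    # set of final state numbers
    for line in text.splitlines():
        # The format is tab-separated; fall back to any whitespace so that
        # text copied from a web page (tabs turned into spaces) still parses.
        fields = line.split("\t") if "\t" in line else line.split()
        if len(fields) == 1:
            finals.add(int(fields[0]))
        elif len(fields) == 4:
            src, dst, isym, osym = fields
            transitions.append((int(src), int(dst), isym, osym))
        elif fields:
            raise ValueError(f"unexpected ATT line: {line!r}")
    return transitions, finals


def string_pairs(transitions, finals, state=0, left="", right=""):
    """Yield (input, output) for every path from `state` to a final state."""
    if state in finals:
        yield left, right
    for src, dst, isym, osym in transitions:
        if src == state:
            yield from string_pairs(
                transitions, finals, dst,
                left + ("" if isym in EPSILON_SYMBOLS else isym),
                right + ("" if osym in EPSILON_SYMBOLS else osym),
            )


ATT_TEXT = "0\t1\tt\tf\n1\t2\te\to\n2\t3\ts\to\n3\t4\tt\tε\n4\n"
transitions, finals = parse_att(ATT_TEXT)
print(list(string_pairs(transitions, finals)))  # [('test', 'foo')]
```

The epsilon on the last transition is what lets the four-letter input test map to the three-letter output foo.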
Weights
AT&T format also supports weights, given as an extra final column, for example to estimate likelihoods. By default, the bigger (heavier) the weight, the worse the path, so weights act as penalties. For example:
0 1 c c 1.000000
0 2 d d 2.000000
1 3 a a 0.000000
2 4 o o 0.000000
3 5 t t 0.000000
4 5 g g 0.000000
5 6 s s 10.000000
5 0.000000
6 0.000000
This assigns weight 1 to cat, 2 to dog, and an additional 10 for being a plural, so cats weighs 11 and dogs 12. Weights are commonly estimated from probabilities, e.g. by taking -log() of the probability.
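The weights in the example can be checked with a short Python sketch. It assumes the usual convention that weights are summed along a path, including the weight of the final state, and it ignores epsilon transitions for simplicity. The names load_weighted, analyses and weight_from_probability are hypothetical helpers, not lttoolbox or HFST functions.

```python
import math

WEIGHTED_ATT = (
    "0\t1\tc\tc\t1.000000\n"
    "0\t2\td\td\t2.000000\n"
    "1\t3\ta\ta\t0.000000\n"
    "2\t4\to\to\t0.000000\n"
    "3\t5\tt\tt\t0.000000\n"
    "4\t5\tg\tg\t0.000000\n"
    "5\t6\ts\ts\t10.000000\n"
    "5\t0.000000\n"
    "6\t0.000000\n"
)


def load_weighted(text):
    """Parse weighted ATT text into (arcs, finals)."""
    arcs = {}    # (source state, input symbol) -> [(target, output, weight)]
    finals = {}  # final state -> final weight
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) == 2:
            finals[int(fields[0])] = float(fields[1])
        elif len(fields) == 5:
            src, dst, isym, osym, weight = fields
            arcs.setdefault((int(src), isym), []).append(
                (int(dst), osym, float(weight)))
    return arcs, finals


def analyses(word, arcs, finals, state=0):
    """Yield (output, total weight) for each accepting path that reads `word`."""
    if not word:
        if state in finals:
            yield "", finals[state]
        return
    for target, osym, weight in arcs.get((state, word[0]), []):
        for rest, rest_weight in analyses(word[1:], arcs, finals, target):
            yield osym + rest, weight + rest_weight


def weight_from_probability(p):
    """Turn a probability into a penalty: likelier events get smaller weights."""
    return -math.log(p)


arcs, finals = load_weighted(WEIGHTED_ATT)
print(list(analyses("cat", arcs, finals)))   # [('cat', 1.0)]
print(list(analyses("dogs", arcs, finals)))  # [('dogs', 12.0)]
print(round(weight_from_probability(0.5), 3))  # 0.693
```

Because -log(p * q) = -log(p) - log(q), adding weights along a path corresponds to multiplying the underlying probabilities, which is why -log() is a natural way to derive penalties.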