Difference between revisions of "Hfst"

From Apertium

Revision as of 19:53, 25 November 2009

hfst is the Helsinki finite-state toolkit. It is formalism-compatible with both lexc and twolc, so it is to those tools roughly what foma is to xfst.

Prerequisites

  • automake, autoconf, libtool

Compiling

Subversion checkout

Mac OS X note: you need Xcode installed on your Mac. It ships with the computer and can also be downloaded from Apple (registration required).
$ svn co https://hfst.svn.sourceforge.net/svnroot/hfst/trunk hfst 
$ cd hfst/hfst/
$ autoreconf -i
$ ./configure --prefix=/home/fran/local/
$ make
$ sudo make install

Prepackaged tarball

Download the latest version from [1] and unpack it. Then follow the instructions in the README file, i.e.:

$ cd hfst-2.0/
$ ./configure
$ make
$ sudo make install

Using

$ svn co https://victorio.uit.no/langtech/trunk/st/fao
$ cd fao/src
$ make -f Makefile.hfst

$ echo "orð" | hfst-lookup ../bin/fao-morph.hfst
lookup> 
orð	orð+N+Neu+Sg+Nom+Indef
orð	orð+N+Neu+Sg+Acc+Indef
orð	orð+N+Neu+Pl+Nom+Indef
orð	orð+N+Neu+Pl+Acc+Indef

lookup>
$
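The analyses printed by hfst-lookup are easier to post-process once the lemma and the tags are separated. The following is a minimal sketch, assuming the output is tab-separated as in the sample above; the function name split_analysis is made up for this example and is not part of hfst.

```shell
#!/bin/sh
# Split "surface<TAB>lemma+Tag+Tag..." lines into surface, lemma and tags.
# Lines without exactly two tab-separated fields (prompts, blanks) are skipped.
split_analysis() {
    awk -F '\t' 'NF == 2 {
        n = split($2, parts, "[+]")      # parts[1] is the lemma, the rest are tags
        tags = ""
        for (i = 2; i <= n; i++)
            tags = tags (i > 2 ? " " : "") parts[i]
        print $1 "\t" parts[1] "\t" tags
    }'
}

# One line taken from the lookup session above.
printf 'orð\torð+N+Neu+Sg+Nom+Indef\n' | split_analysis
```

In real use the function would sit at the end of the lookup pipe, e.g. echo "orð" | hfst-lookup ../bin/fao-morph.hfst | split_analysis.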

To compile lexc code, first concatenate all the lexc files:

$ cat fao-lex.txt noun-fao-lex.txt noun-fao-morph.txt adj-fao-lex.txt \
adj-fao-morph.txt verb-fao-lex.txt verb-fao-morph.txt adv-fao-lex.txt \
abbr-fao-lex.txt acro-fao-lex.txt pron-fao-lex.txt punct-fao-lex.txt \
numeral-fao-lex.txt pp-fao-lex.txt cc-fao-lex.txt cs-fao-lex.txt \
interj-fao-lex.txt det-fao-lex.txt > ../tmp/lexc-all.txt

To compile the concatenated file, use the hfst-lexc program:

$ hfst-lexc < ../tmp/lexc-all.txt > ../bin/lexc-fao.bin

To compile the twol rules, use the hfst-twolc program:

$ hfst-twolc twol-fao.txt > twol-fao.bin

And then to compose the lexicon and rule file, use hfst-compose-intersect:

$ hfst-compose-intersect -l ../bin/lexc-fao.bin twol-fao.bin -o fao-gen.hfst

This creates a generator. If you want an analyser, invert the generator with hfst-invert:

$ hfst-invert fao-gen.hfst -o fao-morph.hfst
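The build steps above can be collected in one place. The sketch below only prints the commands, using the programs and flags already shown on this page; the function name hfst_build_steps and the choice to keep every output file in ../bin are assumptions made for the example, not something the fao Makefile is known to do.

```shell
#!/bin/sh
# Print the four build steps for one language, in order.
# Assumption: run from the language's src directory, with the concatenated
# lexc file in ../tmp and all compiled files written to ../bin.
hfst_build_steps() {
    lang=$1
    bin=../bin
    cat <<EOF
hfst-lexc < ../tmp/lexc-all.txt > $bin/lexc-$lang.bin
hfst-twolc twol-$lang.txt > $bin/twol-$lang.bin
hfst-compose-intersect -l $bin/lexc-$lang.bin $bin/twol-$lang.bin -o $bin/$lang-gen.hfst
hfst-invert $bin/$lang-gen.hfst -o $bin/$lang-morph.hfst
EOF
}

# Show the commands without running anything.
hfst_build_steps fao
```

To run the steps rather than just list them, the output can be saved and executed, for example hfst_build_steps fao > build.sh followed by sh -e build.sh, so that the build stops at the first failing command.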

What is it?

<jacobEo> why use that?
<spectie> because it has a really expressive formalism for languages with complex morphology, like finnish, sami and basque
<jacobEo> could you give an example of the most important thing it can do that lttoolbox cant?
<spectie> stem internal variation

<jacobEo> and how does it do that?
<spectie> by composing different transducers
<spectie> jacobEo, e.g. you have your lexical transducer, then you have your phonological transducer and you compose the two
<spectie> jacobEo, it's like the postgeneration in apertium, but much more integrated
<jacobEo> so if we imagine good, better, best was called good, geod, gyod and that was a rule for all adjectives?
<jacobEo> like that?
<jacobEo> i mean like a paradigm
<jacobEo> saying o=<pst>
<jacobEo> saying e=<cmp>
<jacobEo> saying y=<sup>
<jimregan> jacobEo, yes

<spectie> it's like taking care of "live+ed" --> "lived"
<spectie> instead of having two paradigms for "live" and "jump" you would have one paradigm
<spectie> +ed
<spectie> then you would have a phonological rule that says "at morpheme boundaries, collapse ee -> e"
<jacobEo> k

<jacobEo> so its much slower, I suppose
<jimregan> nah
<spectie> jacobEo, the compilation is slower
<jimregan> slower to compile, sure

<jimregan> you provide definitions for things like what is a vowel, what is a consonant, and where umlauting happens and what it is
<jimregan> ...in nightmarish syntax that escaped from the 70s
<spectie> http://paste2.org/p/532099

 LowerG2   = [
               [
                  (Cns:0)  LCnsPhon7 (Cns:0) LCnsPhon (Cns:0)            ! xy that cannot be G3, since x cannot form xy G3.
               |                                                         ! nijbe,
                  [           ! This section is for 3-cons G2. and for 2cns G2 that share the initial cns with 2cns G3
                     (Cns:0) [:j|:l|:m|:n|:v] (Cns:0) :s :t                  ! S9, 3cns-G2 bäjstov, etc.
                  |
                     (Cns:0) [ :l | :r | :n | :j ]      (Cns:0) :s :k                   ! S9, 3cns-G2 sválskes, etc.
                  |
                     (Cns:0) [  ! 2cns G2 that share the initial cns wit 2cns G3
                                :b (Cns:0) [ :d | :m | :j | :l | :n | :n :j | :r | :s | :t :j | :t :s  ] ! S9, initial b
                                |                                                                        ! gábdev
                                :d (Cns:0) [ :j | :n | :n :j ]                                           ! S7, initial d
                                |                                                                        ! iednev
                                :g (Cns:0) :ŋ                                                            ! S7, initial g
                                |                                                                        ! låg0ŋot
                                :k (Cns:0) [ :n | :k ]                                                   ! S7, initial k
                                |
                                :g (Cns:0) :n                                                            !däggna:degna
                             ]                                                                           !
                             |
                             :r (Cns:0)   :s :j :t                                                    ! S9, rsjt, bårsjtav
                  ]
               ]
                 - [
                     [ d t [s|j] ] |  b b | d d | g g | k [ s | t | t j | t s ]  |
                     f ':0 f | l ':0 l | m ':0 m | n ':0 n | n ':0 n j | ŋ ':0 ŋ | r ':0 r | s ':0 s | s ':0 s j | v ':0 v
                     ]
             ];


<spectie> the formalism is human-hostile
<Unhammer> ^^^ and sed-hostile
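
The "live+ed" example from the log can be imitated with sed. This is only a toy stand-in for composing a lexical transducer with a phonological rule, not hfst itself: the function names, the single shared "+ed" paradigm and the use of + as the morpheme boundary are all made up for the illustration.

```shell
#!/bin/sh
# "Phonological" step: at a morpheme boundary collapse e+e to e,
# then delete any remaining boundary symbols.
apply_rules() {
    sed -e 's/e+e/+e/' -e 's/+//g'
}

# "Lexical" step: one paradigm for every verb, which just appends +ed.
generate_past() {
    printf '%s+ed\n' "$1" | apply_rules
}

generate_past live   # prints: lived
generate_past jump   # prints: jumped
```

The point of the split is the one made in the log: "live" and "jump" share a single paradigm, and the spelling difference is handled once, by the rule, rather than once per paradigm.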

External links