Difference between revisions of "Hfst"
Revision as of 19:53, 25 November 2009
HFST is the Helsinki Finite-State Toolkit. It is formalism-compatible with both lexc and twolc, so it relates to those tools roughly as foma relates to xfst.
Prerequisites
- automake, autoconf, libtool
Compiling
Subversion checkout
- Mac OS X note: you need Xcode installed on your Mac. It came with your computer, and can also be downloaded from Apple (registration required).
$ svn co https://hfst.svn.sourceforge.net/svnroot/hfst/trunk hfst
$ cd hfst/hfst/
$ autoreconf -i
$ ./configure --prefix=/home/fran/local/
$ make
$ sudo make install
Prepackaged tarball
Download the latest version from [1], and unzip. Then follow the instructions in the README file, i.e.:
$ cd hfst-2.0/
$ ./configure
$ make
$ sudo make install
Using
$ svn co https://victorio.uit.no/langtech/trunk/st/fao
$ cd fao/src
$ make -f Makefile.hfst
$ echo "orð" | hfst-lookup ../bin/fao-morph.hfst
lookup> orð orð+N+Neu+Sg+Nom+Indef
orð orð+N+Neu+Sg+Acc+Indef
orð orð+N+Neu+Pl+Nom+Indef
orð orð+N+Neu+Pl+Acc+Indef
lookup>
$
To compile lexc code, first concatenate all the lexc files:
$ cat fao-lex.txt noun-fao-lex.txt noun-fao-morph.txt adj-fao-lex.txt \
      adj-fao-morph.txt verb-fao-lex.txt verb-fao-morph.txt adv-fao-lex.txt \
      abbr-fao-lex.txt acro-fao-lex.txt pron-fao-lex.txt punct-fao-lex.txt \
      numeral-fao-lex.txt pp-fao-lex.txt cc-fao-lex.txt cs-fao-lex.txt \
      interj-fao-lex.txt det-fao-lex.txt > ../tmp/lexc-all.txt
To compile the concatenated file, use the hfst-lexc program:
$ hfst-lexc < ../tmp/lexc-all.txt > ../bin/lexc-fao.bin
To compile the twol rules, use the hfst-twolc program:
$ hfst-twolc twol-fao.txt > twol-fao.bin
Then, to compose the lexicon and the rule file, use hfst-compose-intersect:
$ hfst-compose-intersect -l lexc-fao.bin twol-fao.bin -o fao-gen.hfst
This creates a generator. If you want an analyser, invert the generator with hfst-invert:
$ hfst-invert fao-gen.hfst -o fao-morph.hfst
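To see why inverting is all it takes, it can help to picture the generator as a plain list of analysis/surface pairs. The following is only a toy sketch in plain shell, not HFST itself: the invert_pairs helper is hypothetical, the sample pairs are taken from the lookup output above, and a real transducer is a compact automaton rather than a table. Swapping the two sides of every pair is what turns a generator into an analyser.

```shell
# Toy illustration only: a "generator" stored as analysis<TAB>surface pairs.
# invert_pairs is a hypothetical helper, not part of HFST.
invert_pairs() {
    # Swap the two tab-separated columns of every pair.
    awk -F'\t' 'BEGIN { OFS = "\t" } { print $2, $1 }'
}

# Two analysis -> surface pairs (printf reuses the format for each pair).
generator=$(printf '%s\t%s\n' \
    'orð+N+Neu+Sg+Nom+Indef' 'orð' \
    'orð+N+Neu+Sg+Acc+Indef' 'orð')

# After swapping, each line maps surface -> analysis.
analyser=$(printf '%s\n' "$generator" | invert_pairs)
printf '%s\n' "$analyser"
```

Each output line now starts with the surface form, which is the direction hfst-lookup needs for analysis.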
What is it?
<jacobEo> why use that?
<spectie> because it has a really expressive formalism for languages with complex morphology, like finnish, sami and basque
<jacobEo> could you give an example of the most important thing it can do that lttoolbox cant?
<spectie> stem internal variation
<jacobEo> and how does it do that?
<spectie> by composing different transducers
<spectie> jacobEo, e.g. you have your lexical transducer, then you have your phonological transducer and you compose the two
<spectie> jacobEo, it's like the postgeneration in apertium, but much more integrated
<jacobEo> so if we imaging good, better ,best was called good, geod, gyod and that was a rule for all adjectives?
<jacobEo> like that?
<jacobEo> i mean like a paradigm
<jacobEo> saying o=<pst>
<jacobEo> saying e=<cmp>
<jacobEo> saying y=<sup>
<jimregan> jacobEo, yes
<spectie> it's like taking care of "live+ed" --> "lived"
<spectie> instead of having two paradigms for "live" and "jump" you would have one paradigm
<spectie> +ed
<spectie> then you would have a phonological rule that says "at morpheme boundaries, collapse ee -> e
<jacobEo> k
<jacobEo> so its much slower, I suppose
<jimregan> nah
<spectie> jacobEo, the compilation is slower
<jimregan> slower to compile, sure
<jimregan> you provide definitions for things like what is a vowel, what is a consonant, and where umlauting happens and what it is
<jimregan> ...in nightmarish syntax that escaped from the 70s
<spectie> http://paste2.org/p/532099
LowerG2 = [
  [
    (Cns:0) LCnsPhon7 (Cns:0) LCnsPhon (Cns:0) ! xy that cannot be G3, since x cannot form xy G3.
    | ! nijbe,
    [ ! This section is for 3-cons G2. and for 2cns G2 that share the initial cns with 2cns G3
      (Cns:0) [:j|:l|:m|:n|:v] (Cns:0) :s :t ! S9, 3cns-G2 bäjstov, etc.
      |
      (Cns:0) [ :l | :r | :n | :j ] (Cns:0) :s :k ! S9, 3cns-G2 sválskes, etc.
      |
      (Cns:0) [ ! 2cns G2 that share the initial cns wit 2cns G3
        :b (Cns:0) [ :d | :m | :j | :l | :n | :n :j | :r | :s | :t :j | :t :s ] ! S9, initial b
        | ! gábdev
        :d (Cns:0) [ :j | :n | :n :j ] ! S7, initial d
        | ! iednev
        :g (Cns:0) :ŋ ! S7, initial g
        | ! låg0ŋot
        :k (Cns:0) [ :n | :k ] ! S7, initial k
        |
        :g (Cns:0) :n !däggna:degna
      ] !
      |
      :r (Cns:0) :s :j :t ! S9, rsjt, bårsjtav
    ]
  ]
  - [
    [ d t [s|j] ] | b b | d d | g g | k [ s | t | t j | t s ] |
    f ':0 f | l ':0 l | m ':0 m | n ':0 n | n ':0 n j | ŋ ':0 ŋ | r ':0 r | s ':0 s | s ':0 s j | v ':0 v
  ]
];
<spectie> the formalism is human-hostile
<Unhammer> ^^^ and sed-hostile
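The "live+ed" example from the log can be mimicked with a couple of string rewrites. This is a rough sketch in plain shell with sed, not lexc or twolc: the add_past and apply_rules helpers are hypothetical names, the "+" boundary marker follows the notation used in the log, and the rules run one after another as text substitutions rather than being compiled into a single composed transducer.

```shell
# Toy illustration only; add_past and apply_rules are hypothetical helpers.

# "Lexicon" side: a single paradigm that appends +ed to every verb stem.
add_past() {
    sed 's/$/+ed/'
}

# "Phonology" side: collapse e+e to e across the morpheme boundary,
# then delete any boundary markers that are left.
apply_rules() {
    sed -e 's/e+e/e/g' -e 's/+//g'
}

# Piping one step into the other plays the role of composition.
past_forms=$(printf '%s\n' live jump | add_past | apply_rules)
printf '%s\n' "$past_forms"
```

Both stems go through the same paradigm; only "live" triggers the e+e rule, so no separate paradigm for e-final verbs is needed.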