Talk:Hfst

From Apertium

What is it?

<jacobEo> why use that?
<spectie> because it has a really expressive formalism for languages with complex morphology, like Finnish, Sami and Basque
<jacobEo> could you give an example of the most important thing it can do that lttoolbox can't?
<spectie> stem internal variation

<jacobEo> and how does it do that?
<spectie> by composing different transducers
<spectie> jacobEo, e.g. you have your lexical transducer, then you have your phonological transducer and you compose the two
<spectie> jacobEo, it's like the postgeneration in apertium, but much more integrated

<spectie> it's like taking care of "live^ed" --> "lived" and "jump^ed" --> "jumped"
<spectie> instead of having two paradigms for "live" and "jump" you would have one paradigm
<spectie> +ed
<spectie> then you would have a phonological rule that says "at morpheme boundaries, collapse e^e -> e"
<jacobEo> k
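
The "live^ed" exchange above can be sketched in a few lines of Python. This is only an illustration of the idea (one shared paradigm plus one phonological rule), not HFST: plain string functions stand in for the lexical and phonological transducers, ordinary function composition stands in for transducer composition, and the names `lexical`, `phonological` and `generate` are made up for this sketch.

```python
# Toy illustration only: string functions stand in for transducers.
STEMS = ["live", "jump"]   # example stems from the discussion above
PAST_SUFFIX = "ed"         # the single shared paradigm: +ed
BOUNDARY = "^"             # morpheme boundary marker, as in "live^ed"


def lexical(stem: str) -> str:
    """Lexical stage: attach the suffix across a morpheme boundary."""
    return stem + BOUNDARY + PAST_SUFFIX


def phonological(form: str) -> str:
    """Phonological stage: collapse e^e to e, then drop remaining boundaries."""
    form = form.replace("e" + BOUNDARY + "e", "e")
    return form.replace(BOUNDARY, "")


def generate(stem: str) -> str:
    """Compose the two stages: lexical first, then phonological."""
    return phonological(lexical(stem))


if __name__ == "__main__":
    for stem in STEMS:
        print(lexical(stem), "-->", generate(stem))
```

Both stems go through the same `+ed` paradigm, and only the rule decides whether an `e` disappears, which is the point spectie makes about not needing two paradigms.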

<Unhammer> OTOH, if your verb paradigm looks like this:
<Unhammer> ROOT <p2><sg><pri>
<Unhammer> ROOT+t <p2><pl><pri>
<Unhammer> v+ROOT <p1><sg><pri>
<Unhammer> v+ROOT+t <p1><pl><pri>
<Unhammer> da+v+ROOT <p1><sg><fut>
<Unhammer> da+v+ROOT+t <p1><pl><fut>
<Unhammer> you might want to consider hfst ;)
<spectie> yeah :D
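
To make Unhammer's paradigm concrete, here is a minimal Python sketch that treats each form as a root with optional prefix and suffix slots. The slot table is copied from the listing above; the `PARADIGM` layout and the `expand` helper are assumptions made for illustration and say nothing about how lttoolbox or HFST store such a paradigm.

```python
# Each entry: (prefix morphemes, suffix morphemes, tags), copied from the
# paradigm in the chat log above.
PARADIGM = [
    ([], [], "<p2><sg><pri>"),
    ([], ["t"], "<p2><pl><pri>"),
    (["v"], [], "<p1><sg><pri>"),
    (["v"], ["t"], "<p1><pl><pri>"),
    (["da", "v"], [], "<p1><sg><fut>"),
    (["da", "v"], ["t"], "<p1><pl><fut>"),
]


def expand(root: str) -> list[tuple[str, str]]:
    """Return (segmented form, tags) for every slot combination of a root."""
    forms = []
    for prefixes, suffixes, tags in PARADIGM:
        # Join morphemes with "+" to mirror the notation used in the log.
        segmented = "+".join(prefixes + [root] + suffixes)
        forms.append((segmented, tags))
    return forms


if __name__ == "__main__":
    # "ROOT" is the placeholder from the log, not a real stem.
    for segmented, tags in expand("ROOT"):
        print(segmented, tags)
```

Because material is attached on both sides of the root, the root sits in the middle of every form rather than at a fixed edge.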

<jacobEo> so its much slower, I suppose
<jimregan> nah
<jimregan> slower to compile, sure

<jimregan> you provide definitions for things like what is a vowel, what is a consonant, and where umlauting happens and what it is
<jimregan> ...in nightmarish syntax that escaped from the 70s
<spectie> http://paste2.org/p/532099

 LowerG2   = [
               [
                  (Cns:0)  LCnsPhon7 (Cns:0) LCnsPhon (Cns:0)            ! xy that cannot be G3, since x cannot form xy G3.
               |                                                         ! nijbe,
                  [           ! This section is for 3-cons G2. and for 2cns G2 that share the initial cns with 2cns G3
                     (Cns:0) [:j|:l|:m|:n|:v] (Cns:0) :s :t                  ! S9, 3cns-G2 bäjstov, etc.
                  |
                     (Cns:0) [ :l | :r | :n | :j ]      (Cns:0) :s :k                   ! S9, 3cns-G2 sválskes, etc.
                  |
                     (Cns:0) [  ! 2cns G2 that share the initial cns with 2cns G3
                                :b (Cns:0) [ :d | :m | :j | :l | :n | :n :j | :r | :s | :t :j | :t :s  ] ! S9, initial b
                                |                                                                        ! gábdev
                                :d (Cns:0) [ :j | :n | :n :j ]                                           ! S7, initial d
                                |                                                                        ! iednev
                                :g (Cns:0) :ŋ                                                            ! S7, initial g
                                |                                                                        ! låg0ŋot
                                :k (Cns:0) [ :n | :k ]                                                   ! S7, initial k
                                |
                                :g (Cns:0) :n                                                            !däggna:degna
                             ]                                                                           !
                             |
                             :r (Cns:0)   :s :j :t                                                    ! S9, rsjt, bårsjtav
                  ]
               ]
                 - [
                     [ d t [s|j] ] |  b b | d d | g g | k [ s | t | t j | t s ]  |
                     f ':0 f | l ':0 l | m ':0 m | n ':0 n | n ':0 n j | ŋ ':0 ŋ | r ':0 r | s ':0 s | s ':0 s j | v ':0 v
                     ]
             ];


<spectie> the formalism is human-hostile
<Unhammer> ^^^ and sed-hostile
<spectie> but really awesome... in the modern and biblical senses of the word :)


human-hostility...

Wow! A short background, from one of the three authors of the LowerG2 definition (and yes, it will take me some time and svn log reading to reconstruct it...). The point was that Lule Sámi has diphthong simplification, but not across grade 3 (in a 3-grade consonant gradation system). The difference between the 3 grades cannot be determined by counting letters (or even phonemes), hence the cumbersome definitions, with the help of which we later wrote rules to block diphthong simplification across G3. Believe it or not, the setup was meant for readability... The S3, S9 etc. refer to the list of consonant gradation types in Spiik's Lulesamisk Grammatik. As for the nightmarishness of the syntax, what we see here is actually ordinary regex syntax (with "a:b c" meaning "first a segment with upper a and lower b, then a segment with c on both the upper and the lower side"). Part of it may thus even be from the 60s..., but yes, it is definitely still with us today. But for non-segmental morphology, twol/hfst or foma is simply what it takes :-) Trondtr 19:03, 29 March 2010 (UTC).
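
As a small aid for reading the "a:b c" notation described above, here is a hedged Python sketch that turns such an expression into (upper, lower) symbol pairs. It covers only the two cases the comment spells out (`a:b` and a bare `c`); brackets, `|`, `0`, and one-sided tokens such as `:j` from the LowerG2 listing are deliberately rejected instead of guessed at. The function names are invented for this example.

```python
def parse_pairs(expr: str) -> list[tuple[str, str]]:
    """Parse a space-separated pair expression into (upper, lower) tuples."""
    pairs = []
    for token in expr.split():
        upper, sep, lower = token.partition(":")
        if not sep:
            # Bare symbol: same symbol on the upper and the lower side.
            pairs.append((upper, upper))
        elif upper and lower:
            # "a:b": upper a, lower b.
            pairs.append((upper, lower))
        else:
            # One-sided tokens like ":j" are outside this sketch.
            raise ValueError(f"unsupported token: {token!r}")
    return pairs


def upper_string(pairs: list[tuple[str, str]]) -> str:
    """Concatenate the upper-side symbols."""
    return "".join(upper for upper, _ in pairs)


def lower_string(pairs: list[tuple[str, str]]) -> str:
    """Concatenate the lower-side symbols."""
    return "".join(lower for _, lower in pairs)


if __name__ == "__main__":
    pairs = parse_pairs("a:b c")
    print(pairs, upper_string(pairs), lower_string(pairs))
```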