Difference between revisions of "Talk:Agglutination"

From Apertium
Revision as of 10:02, 7 April 2009

In Hungarian, a word usually has about 2500 forms. A Hungarian dictionary listing every form would therefore contain 1 million × 2500 entries, that is 2.5 billion word forms, or approximately 20 GBytes. Computers cannot handle that, and handling it would make no sense anyway.
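As a quick check of the estimate above, the arithmetic can be written out. The figure of 8 bytes per stored form is an assumption inferred from the quoted totals (20 GBytes / 2.5 billion forms); it is not stated in the comment.

```python
# Back-of-the-envelope size of a full-form Hungarian dictionary.
LEMMAS = 1_000_000        # number of dictionary words, as quoted above
FORMS_PER_LEMMA = 2_500   # forms per word, as quoted above
BYTES_PER_FORM = 8        # ASSUMPTION: implied by 20 GBytes / 2.5 billion forms

total_forms = LEMMAS * FORMS_PER_LEMMA
total_bytes = total_forms * BYTES_PER_FORM

print(f"{total_forms:,} forms, about {total_bytes / 1e9:.0f} GBytes")
```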

Yes, hunspell handles that perfectly, it also handles vowel harmony.

What about using Apertium to handle Hungarian? Muki987 10:56, 6 April 2009 (UTC)

I've played about with hunmorph — one of its limitations iirc is that it cannot do generation, only analysis. My personal preference for handling languages like Hungarian and Finnish etc. is to use something like SFST (see also Omorfi). The problem of course is then to get someone to write the actual code. - Francis Tyers 12:21, 6 April 2009 (UTC)

In Hungarian there is simply too much to generate. I am not sure about Finnish, Turkish, Basque and Persian, but they also have a lot. Here is an example:

Hello, there is not "simply too much to generate", there are languages much more agglutinative than Hungarian that have FST morphologies. For example see this paper. - Francis Tyers 07:10, 7 April 2009 (UTC)
  • ház (house)
  • házhoz to the ..
  • háztól from the..
  • házig up to..
  • háznak of the..
  • háznál at the
  • házba into..
  • házban in the...
  • házból from the...
  • házról about ...
  • házra on top of the...
  • házon on the ....
  • házzá become a ...
  • házat (accusative) - 14
  • házam (my house - repeat all previous to this like:) -- 28
    • házamhoz ...

...

    • házamat ...
  • házad (your house repeat all previous to this) -42
  • háza (his, her, its house repeat all previous to this) - 56
  • házunk (our house repeat all previous to this) -- 70
  • házatok (your house repeat all previous to this) - 84
  • házuk (their house repeat all previous to this) - 98
  • házé (of the house repeat all previous to this) - 112
  • házamé (of my house -repeat all previous to this) - 126

...

  • házuké (of their house -repeat all previous to this) - 210
  • házak (plural - repeat all previous for this up to here) 420

...

  • házacska (a little house - repeat all previous up to here) 840
  • házikó (a little house - repeat all previous, except the last) 1260
  • házas (married - repeat all previous up to here, except the last 2) 1680

...

  • As you can see, it is almost trivial to get thousands of word forms for each noun, even without a grammar book.
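The running totals in the list above grow by multiplication: each new suffix slot multiplies the number of forms reached so far. A minimal sketch of that counting follows. The slot sizes are read off the totals in the list (14, 210, 420, 1680); in particular the factor 15 is simply 210 / 14 and is not a claim about Hungarian grammar, and the slot names are made up for illustration.

```python
from itertools import accumulate
from math import prod
from operator import mul

# Slot sizes taken from the running totals in the list (illustrative only).
SLOTS = {
    "case": 14,               # ház, házhoz, ..., házat
    "possessor_and_e": 15,    # ASSUMPTION: 210 / 14, as counted in the list
    "number": 2,              # singular, plural (210 -> 420)
    "derivation": 4,          # ház, házacska, házikó, házas (420 -> 1680)
}

def count_forms(slots):
    """Number of forms if every slot combines freely with every other."""
    return prod(slots.values())

# Running totals after each slot is added.
running_totals = list(accumulate(SLOTS.values(), mul))

print(running_totals)        # one total per slot
print(count_forms(SLOTS))    # size of the full-form list for one noun
```

The point of the sketch is that a full-form list grows with the product of the slot sizes, while a description by slots only grows with their sum (here 14 + 15 + 2 + 4 entries).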

In my opinion, if we get a word such as házaitokétól (which is not unusual), we need an analysis tool that shows:

  • házaitokétól (from something of your houses)
  • the stem is ház
  • it is plural
  • it carries the case ending that corresponds to the preposition "from"
  • the possessor is "you" (plural)
  • the houses own something
  • the owned thing is singular (otherwise it would be házaitokéitól)
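The analysis listed above can be sketched as a toy suffix stripper. This is only an illustration for this one word: the suffix table, the fixed stripping order, the feature labels and the function name `analyse` are all assumptions made for the example. It ignores vowel harmony and stem changes, and it is not how hunmorph, Omorfi or Apertium work internally.

```python
# Toy right-to-left suffix stripper for the example word házaitokétól.
# Suffixes are tried once each, outermost first (hypothetical table).
SUFFIXES = [
    ("tól", "case: ablative ('from')"),
    ("é", "owned thing: singular"),
    ("tok", "possessor: 2nd person plural"),
    ("ai", "noun: plural (possessed)"),
]

STEMS = {"ház"}  # a one-word lexicon, just for the example


def analyse(word):
    """Return (stem, features) or None if the word cannot be reduced to a known stem."""
    features = []
    rest = word
    for suffix, feature in SUFFIXES:
        if rest.endswith(suffix):
            rest = rest[: -len(suffix)]
            features.append(feature)
    if rest not in STEMS:
        return None
    # Report features from the stem outwards.
    return rest, features[::-1]


print(analyse("házaitokétól"))
```

With a table like this the segmentation comes out as ház + ai + tok + é + tól, which matches the bullet points above one to one.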

With this knowledge we can construct the English (or Spanish, German, etc...) form. Muki987 20:14, 6 April 2009 (UTC)

'With this knowledge we can construct the English' -- How? You don't seem to have given thought to that part.
'háza (his, her, its house repeat all previous to this) - 56' -- it strikes me as a) unlikely that you can chain all possible possessives in this manner and b) that you can do something useful that will convey an understandable meaning in another language even if it is.
'házas (married- repeat all previous for this up to here, except the last 2) 1680' -- a married house? Really?
'házacska' -- are there no lexicalised diminutives in Hungarian? I can theoretically add '-let' to any noun in English, but 'piglet' has a separate translation to most languages, and 'hamlet' is not a diminutive of 'ham'.
Just because you can theoretically infer meaning from an analysis doesn't mean that results will translate. -- Jimregan 05:24, 7 April 2009 (UTC)

To Jimregan

>'With this knowledge we can construct the English' -- How? You don't seem to have given thought to that part.

Of course I did. Whatever I can do as a human translator, the machine can also do, if I tell it how. I am absolutely optimistic about that, and I am looking for the proper technology.

>'háza (his, her, its house repeat all previous to this) - 56' -- it strikes me as a) unlikely that you can chain all possible possessives in this manner and b) that you can do something useful that will convey an understandable meaning in another language even if it is.

ház: házam, házad, háza, házunk, házatok, házuk (my house, your house, his/her/its house, our house, your house, their house). All relations to MY HOUSE are then expressed as in the case of ház: házban and házamban, házra and házamra, etc. It is simple and understandable in every cultural language.

>'házas (married- repeat all previous for this up to here, except the last 2) 1680' -- a married house? Really?

That word is a bit of an exception, since it has two meanings: házas means married, and also a man or woman who has a house. In the case of ing (shirt), inges means someone who wears a shirt.

>'házacska' -- are there no lexicalised diminutives in Hungarian? I can theoretically add '-let' to any noun in English, but 'piglet' has a separate translation to most languages, and 'hamlet' is not a diminutive of 'ham'.

-acska and -ikó are the diminutive suffixes. It is the same thing as pig and piglet.

>Just because you can theoretically infer meaning from an analysis doesn't mean that results will translate. -- Jimregan 05:24, 7 April 2009 (UTC)

I have translated so much already that I can say: you cannot say anything in any cultural language that cannot be translated into another one.

I hope you do not want to claim that there are untranslatable things? I would strongly disagree with that assumption, and would ask you to give me at least one example. Muki987 08:13, 7 April 2009 (UTC)

This is rather off the topic of the discussion, this page is more to discuss methods of representing agglutinative morphology in Apertium, rather than the translation problems of agglutinative languages (which are also interesting, but better reserved for another page, or the mailing list). :) - Francis Tyers 08:21, 7 April 2009 (UTC)
Glad to hear that you are convinced that Apertium technology is suitable for agglutinative languages. Having gone through the English-SerboCroatian example, I was not that sure. I am in the evaluation phase at the moment, and I am looking at all existing technologies. In my opinion, Google's translation technology, with its statistical, grammar-free approach, will never have the quality of a grammar-oriented one like Apertium. It will forever remain on the surface, with no real prospect of improvement. However, for some situations it is very helpful. That was my first step in this direction. We can continue this subject on my discussion page, if Jimregan wants. Muki987 10:02, 7 April 2009 (UTC)

Comparison of Omorfi and Hunmorph

Omorfi

http://www.ling.helsinki.fi/cgi-bin/omor/omorfi-cgi-demo.py

Omorfi - Demo of Finnish Morphology

These demos are based on the HFST implementation of Finnish morphology using SFST, and Nykysuomen sanalista. A guesser is used for missing words. For more information see the HFST home page.

Wordform Nykysuomen has no known analyses. The 6 best baseform and paradigm guesses were chosen:

*1. 	Nykysuomen 	32 noun 	sg nom
*2. 	Nykysuomi 	7 noun 	sg acc
*3. 	Nykysuomi 	7 noun 	sg gen
*4. 	Nykysuomi 	7 noun 	sg ins
*5. 	Nykysuomi 	25 noun 	sg acc
*6. 	Nykysuomi 	25 noun 	sg gen

As far as I can see here: http://wiki.apertium.org/wiki/Omorfi

$ echo "kaikki ihmiset syntyvät vapaina ja tasavertaisina arvoltaan ja oikeuksiltaan." | fst-proc omorfi/src/omorfi.sfstc

^kaikki/kaikki<noun><7><a><sg><nom>$ ^ihmiset/ihminen<noun><38><pl><acc>/ihminen<noun><38><pl><nom>$ 
^syntyvät/syntyä<verb><52><j><act><pcpva><pl><acc>/syntyä<verb><52><j><act><pcpva><pl><nom>/syntyä<verb><52><j><act><indv><pres><pl3>$ 
^vapaina/vapaa<noun><17><pl><ess>$ 
^ja/*ja$ ^tasavertaisina/*tasavertaisina$ ^arvoltaan/arvo<noun><1><sg><abl><pl3>/arvo<noun><1><sg><abl><sg3>$ ^ja/*ja$ 
^oikeuksiltaan/oikeus<noun><40><pl><abl><pl3>/oikeus<noun><40><pl><abl><sg3>$.
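The output above is in the Apertium stream format: each unit is written as ^surface/analysis1/analysis2$, and an analysis starting with * marks an unknown word. A minimal sketch of reading such output follows. The function name `parse_stream` is made up for the example, and the regular expression ignores the escaping rules of the real format, so treat it as an illustration rather than a full parser.

```python
import re

# One lexical unit: everything between ^ and $ (escaping is NOT handled here).
UNIT = re.compile(r"\^([^$]*)\$")


def parse_stream(text):
    """Split analyser output into (surface form, list of analyses) pairs."""
    units = []
    for match in UNIT.finditer(text):
        surface, *analyses = match.group(1).split("/")
        units.append((surface, analyses))
    return units


sample = (
    "^vapaina/vapaa<noun><17><pl><ess>$ ^ja/*ja$ "
    "^arvoltaan/arvo<noun><1><sg><abl><pl3>/arvo<noun><1><sg><abl><sg3>$"
)

for surface, analyses in parse_stream(sample):
    # A leading * means the analyser did not know the word.
    known = not analyses[0].startswith("*")
    print(surface, "known" if known else "unknown", analyses)
```

Read this way, the sample shows that ja and tasavertaisina were not in the lexicon, and that arvoltaan is ambiguous between a third person singular and plural possessor.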

Omorfi also analyses and that's it. I do not see any difference to hunmorph, do you? Muki987 21:24, 6 April 2009 (UTC)

Hunmorph

$ echo "ablakot" | ocamorph --aff lexicons/morphdb.hu/out/morphdb_hu.aff --dic lexicons/morphdb.hu/out/morphdb_hu.dic

and you get

> ablakot ablak/NOUN> 

This is pretty much the same, IMHO, as what Omorfi produces. What do you think? Muki987 21:28, 6 April 2009 (UTC)

The difference is that in Omorfi you can go the other way, from the analysis
^syntyä<verb><52><j><act><pcpva><pl><acc>$
back to the surface form syntyvät.
Can you do that in hunmorph? It was my understanding that you couldn't. - Francis Tyers 07:10, 7 April 2009 (UTC)
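The point about going the other way can be illustrated with a toy lookup table. A finite-state transducer relates surface forms and analyses in both directions; the sketch below only imitates that with two dictionaries built from the Omorfi output quoted earlier on this page. The names `ANALYSES`, `GENERATE`, `analyse` and `generate` are made up for the example, and a real transducer does not store the pairs as a list like this.

```python
# Analysis direction: surface form -> analyses (pairs copied from the Omorfi output).
ANALYSES = {
    "syntyvät": [
        "syntyä<verb><52><j><act><pcpva><pl><acc>",
        "syntyä<verb><52><j><act><pcpva><pl><nom>",
        "syntyä<verb><52><j><act><indv><pres><pl3>",
    ],
    "vapaina": ["vapaa<noun><17><pl><ess>"],
}

# Generation direction: the same relation, inverted.
GENERATE = {
    analysis: surface
    for surface, analyses in ANALYSES.items()
    for analysis in analyses
}


def analyse(surface):
    """Surface form -> list of analyses (empty if unknown)."""
    return ANALYSES.get(surface, [])


def generate(analysis):
    """Analysis -> surface form (None if unknown)."""
    return GENERATE.get(analysis)


print(generate("syntyä<verb><52><j><act><pcpva><pl><acc>"))
```

An analyser that only offers the first direction is enough for a source language, but a target language in a translation pair needs the second direction as well.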
Btw, if you want to look at an agglutinative language pair currently in SVN, check out Northern Sámi and Lule Sámi — the transducers were generated from full-form lists, which is not the ideal way to do it. A better way would have been to somehow compile the XFST source code using a free compiler (for example SFST/HFST), but unfortunately that isn't possible yet :( - Francis Tyers 07:17, 7 April 2009 (UTC)