Difference between revisions of "Talk:Agglutination"

From Apertium
{{TOCD}}
In Hungarian, a word usually has about 2500 forms.
A Hungarian dictionary listing every form would therefore contain 1 million * 2500 words,
that is 2.5 GWords or approximately 20 GBytes. Computers cannot handle that,
and handling it would make no sense anyway.
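
The estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the figure of 8 bytes per stored form is an assumption chosen to match the 20 GBytes quoted above, not a measured value.

```python
# Back-of-the-envelope size of a full-form Hungarian dictionary.
LEMMAS = 1_000_000        # dictionary entries, as stated above
FORMS_PER_LEMMA = 2_500   # typical number of forms per word, as stated above
BYTES_PER_FORM = 8        # assumption: average storage per form

total_forms = LEMMAS * FORMS_PER_LEMMA
total_bytes = total_forms * BYTES_PER_FORM

print(f"{total_forms / 1e9:.1f} G forms, about {total_bytes / 1e9:.0f} GB")
```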

Yes, hunspell handles that perfectly; it also handles vowel harmony.

What about using Apertium to handle Hungarian? [[User:Muki987|Muki987]] 10:56, 6 April 2009 (UTC)

:I've played about with [[hunmorph]] — one of its limitations iirc is that it cannot do generation, only analysis. My personal preference for handling languages like Hungarian and Finnish etc. is to use something like [[SFST]] (see also [[Omorfi]]). The problem of course is then to get someone to write the actual code. - [[User:Francis Tyers|Francis Tyers]] 12:21, 6 April 2009 (UTC)

In Hungarian there is simply too much to generate. I am not sure about Finnish, Turkish, Basque and Persian, but they also have a lot. Here is an example:

:Hello, there is not "simply too much to generate", there are languages much more agglutinative than Hungarian that have FST morphologies. For example see [http://www.morphologic.hu/downloads/publications/na/2008_lrec-saltmil_ural_na.pdf this paper]. - [[User:Francis Tyers|Francis Tyers]] 07:10, 7 April 2009 (UTC)

* ház (house)
* házhoz (to the ...)
* háztól (from the ...)
* házig (up to ...)
* háznak (of the ...)
* háznál (at the ...)
* házba (into ...)
* házban (in the ...)
* házból (out of the ...)
* házról (about ...)
* házra (on top of the ...)
* házon (on the ...)
* házzá (become a ...)
* házat (accusative) - 14
* házam (my house; repeat all of the previous endings for it, like this:) - 28
** házamhoz ...
...
** házamat ...
* házad (your house; repeat all of the previous endings) - 42
* háza (his, her, its house; repeat all of the previous endings) - 56
* házunk (our house; repeat all of the previous endings) - 70
* házatok (your (plural) house; repeat all of the previous endings) - 84
* házuk (their house; repeat all of the previous endings) - 98
* házé (of the house; repeat all of the previous endings) - 112
* házamé (of my house; repeat all of the previous endings) - 126
...
* házuké (of their house; repeat all of the previous endings) - 210
* házak (plural; repeat everything up to here) - 420
...
* házacska (a little house; repeat everything up to here) - 840
* házikó (a little house; repeat everything except the last) - 1260
* házas (married; repeat everything up to here, except the last 2) - 1680

...
* As you can see, it is almost trivial to get thousands of forms for each noun, even without a grammar book.
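
How quickly the forms multiply can be illustrated with a minimal Python sketch. It only concatenates a hand-picked subset of the stems and case endings listed above, chosen because they attach without any sound change. It deliberately ignores vowel harmony, linking vowels and assimilation (for example házzá), so it is an illustration of the combinatorics and not a real generator.

```python
from itertools import product

# The bare stem and two possessed stems taken from the list above.
STEMS = ["ház", "házam", "házad"]
# Case endings that attach to these stems without any sound change.
CASES = ["hoz", "tól", "ig", "nak", "nál", "ba", "ban", "ból", "ról", "ra"]

def full_forms(stems, cases):
    """Return every stem + case ending combination."""
    return [stem + case for stem, case in product(stems, cases)]

forms = full_forms(STEMS, CASES)
print(len(forms), forms[:3])
```

Every additional slot (possessor, plural, diminutive) multiplies the size of this list, which is why a full-form list grows so fast.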

In my opinion, if we get a word such as házaitokétól (which is not unusual), we need an analysis tool that shows:
* házaitokétól (from something of your houses)
* it is derived from ház
* it is plural
* it corresponds to the preposition "from"
* the possessor is "you" (plural)
* the houses own something
* the owned thing is singular (otherwise it would be házaitokéitól)

With this knowledge we can construct the English (or Spanish, German, etc...) form. [[User:Muki987|Muki987]] 20:14, 6 April 2009 (UTC)
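
The kind of analysis described in the list above can be sketched in Python as simple suffix stripping. This is a toy under stated assumptions: the suffix table covers only the example word, the stem is supplied by the caller, and the tag names are made up for this sketch (they are not hunmorph's tag set).

```python
# Suffixes of the example word, outermost first. The tag names are
# hypothetical labels for this sketch, not hunmorph's actual tag set.
SUFFIXES = [
    ("tól", "CASE=from"),
    ("é", "POSSESSED=sg"),
    ("tok", "POSSESSOR=2pl"),
    ("ai", "PLURAL"),
]

def analyse(word, stem):
    """Strip known suffixes from the right; return (stem, tags) or None."""
    rest, tags = word, []
    for suffix, tag in SUFFIXES:
        if rest.endswith(suffix) and len(rest) > len(suffix):
            rest = rest[: -len(suffix)]
            tags.append(tag)
    if rest != stem:
        return None
    return stem, tags[::-1]  # innermost suffix first

print(analyse("házaitokétól", "ház"))
```

A real analyser would look the stem up in a lexicon and handle ambiguity; the point here is only that each suffix contributes one piece of the information listed above.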

::I copied the discussion with Jimregan onto my discussion page. We can continue there. [[User:Muki987|Muki987]] 10:14, 7 April 2009 (UTC)

==Comparison of Omorfi and Hunmorph==
===Omorfi===
http://www.ling.helsinki.fi/cgi-bin/omor/omorfi-cgi-demo.py

Omorfi - Demo of Finnish Morphology

These demos are based on the HFST implementation of Finnish morphology using SFST, and Nykysuomen sanalista. A guesser is used for missing words. For more information see the HFST home page.
Wordform Nykysuomen has no known analyses. The 6 best baseform and paradigm guesses were chosen:
<pre>
*1. Nykysuomen 32 noun sg nom
*2. Nykysuomi 7 noun sg acc
*3. Nykysuomi 7 noun sg gen
*4. Nykysuomi 7 noun sg ins
*5. Nykysuomi 25 noun sg acc
*6. Nykysuomi 25 noun sg gen
</pre>
As far as I can see here: http://wiki.apertium.org/wiki/Omorfi
<pre>
$ echo "kaikki ihmiset syntyvät vapaina ja tasavertaisina arvoltaan ja oikeuksiltaan." | fst-proc omorfi/src/omorfi.sfstc

^kaikki/kaikki<noun><7><a><sg><nom>$ ^ihmiset/ihminen<noun><38><pl><acc>/ihminen<noun><38><pl><nom>$
^syntyvät/syntyä<verb><52><j><act><pcpva><pl><acc>/syntyä<verb><52><j><act><pcpva><pl><nom>/syntyä<verb><52><j><act><indv><pres><pl3>$
^vapaina/vapaa<noun><17><pl><ess>$
^ja/*ja$ ^tasavertaisina/*tasavertaisina$ ^arvoltaan/arvo<noun><1><sg><abl><pl3>/arvo<noun><1><sg><abl><sg3>$ ^ja/*ja$
^oikeuksiltaan/oikeus<noun><40><pl><abl><pl3>/oikeus<noun><40><pl><abl><sg3>$.
</pre>

Omorfi also only analyses, and that's it. I do not see any difference from hunmorph, do you? [[User:Muki987|Muki987]] 21:24, 6 April 2009 (UTC)
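
For readers unfamiliar with the <code>^surface/analysis$</code> notation in the output above, here is a minimal Python sketch that splits such a line into surface forms, lemmas and tags. It is a simplification: it assumes no escaped characters inside the units, and the function name is made up for this example.

```python
import re

def parse_lexical_units(stream):
    """Parse ^surface/analysis/...$ units into (surface, analyses) pairs."""
    units = []
    for body in re.findall(r"\^(.*?)\$", stream):
        surface, *readings = body.split("/")
        analyses = []
        for reading in readings:
            if reading.startswith("*"):  # unknown word, no analysis
                continue
            lemma = reading.split("<", 1)[0]
            tags = re.findall(r"<([^>]+)>", reading)
            analyses.append((lemma, tags))
        units.append((surface, analyses))
    return units

line = "^ihmiset/ihminen<noun><38><pl><acc>/ihminen<noun><38><pl><nom>$ ^ja/*ja$"
for surface, analyses in parse_lexical_units(line):
    print(surface, analyses)
```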

===Hunmorph===
<pre>
$ echo "ablakot" | ocamorph --aff lexicons/morphdb.hu/out/morphdb_hu.aff --dic lexicons/morphdb.hu/out/morphdb_hu.dic
</pre>
and you get

<pre>
> ablakot ablak/NOUN>
</pre>

This is pretty much the same, IMHO, as what Omorfi produces. What do you think? [[User:Muki987|Muki987]] 21:28, 6 April 2009 (UTC)

::The difference is that in Omorfi, you can go the other way. From

:::<code>^syntyä<verb><52><j><act><pcpva><pl><acc>$</code> → <code>syntyvät</code>

::Can you do that in hunmorph? It was my understanding that you couldn't. - [[User:Francis Tyers|Francis Tyers]] 07:10, 7 April 2009 (UTC)
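
The difference between analysis and generation can be illustrated with a toy Python sketch. A finite-state transducer relates surface forms and analyses and can be traversed in either direction; here a plain dictionary, filled with analyses copied from the Omorfi output above, stands in for the transducer, and inverting it gives the generation direction. The names are made up for this example.

```python
from collections import defaultdict

# Toy analysis table: surface form -> analyses, copied from the output above.
ANALYSES = {
    "syntyvät": [
        "syntyä<verb><52><j><act><pcpva><pl><acc>",
        "syntyä<verb><52><j><act><pcpva><pl><nom>",
        "syntyä<verb><52><j><act><indv><pres><pl3>",
    ],
    "vapaina": ["vapaa<noun><17><pl><ess>"],
}

def invert(table):
    """Build the generation direction: analysis -> surface forms."""
    generator = defaultdict(list)
    for surface, analyses in table.items():
        for analysis in analyses:
            generator[analysis].append(surface)
    return generator

GENERATE = invert(ANALYSES)
print(GENERATE["syntyä<verb><52><j><act><pcpva><pl><acc>"])
```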

:::Btw, if you want to look at an agglutinative language pair currently in SVN, check out [[Northern Sámi and Lule Sámi]] &mdash; the transducers were generated from full-form lists, which is not the ideal way to do it. A better way would have been to somehow compile the XFST source code using a free compiler (for example SFST/HFST), but unfortunately that isn't possible yet :( - [[User:Francis Tyers|Francis Tyers]] 07:17, 7 April 2009 (UTC)

==Moses==
I compared Moses to Apertium, and as far as I can see, Apertium is much better, cleaner and more usable. Moses is like Google's translation: not bad in certain situations, but it will never have acceptable quality. Unfortunately.
==Matxin==
Unfortunately I cannot read Spanish. I can read English, Hungarian and German.
If anybody translated it to English (using Apertium), it would be a great help for me.

:We have Catalan→English, Welsh→English, Spanish→English, Galician→English. - [[User:Francis Tyers|Francis Tyers]] 07:12, 8 April 2009 (UTC)

===Hungarian-whatever (inflecting)===
However, I have an idea for Hungarian-whatever-Hungarian translation.

A Hungarian sentence would be preprocessed, and the morphological analysis result would be attached to each word.

For example:
Ma megyek a házba. (Today (ma) I go (megyek) into the house (a házba))

I get this with hunmorph:
<pre>
> ma
ma/ADV
ma/NOUN
> megyek
megy/VERB<PERS<1>>
> a
a/ART
> házba
ház/NOUN<CAS<ILL>>
</pre>

I would feed this into Apertium.
I would expect a usable translation for non-agglutinative languages, and for agglutinative languages some output that hunmorph could in turn translate back into a more readable form.

:There is not even any need for a pre-process; there is no reason why we cannot replace the Apertium morphological analyser with hunmorph (apart from the fact that it is written in OCaml and slow!) ;). If you're interested in doing this I'll look at converting the output of hunmorph to apertium 'standard' (see [[Apertium stream format]]). - [[User:Francis Tyers|Francis Tyers]] 07:12, 8 April 2009 (UTC)
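
What such a conversion might look like can be sketched in Python, using the hunmorph analyses quoted above as input. The tag mapping tables are hypothetical and cover only the tags in this example, and the sketch only handles features of the nested form <code>&lt;NAME&lt;VALUE&gt;&gt;</code>; a real converter would need the full tag sets of both tools.

```python
import re

# Hypothetical tag mappings, for illustration only.
POS_TAGS = {"NOUN": "n", "VERB": "vblex", "ADV": "adv", "ART": "det"}
FEATURE_TAGS = {("CAS", "ILL"): "ill", ("PERS", "1"): "p1"}

def convert_analysis(analysis):
    """Turn e.g. 'ház/NOUN<CAS<ILL>>' into 'ház<n><ill>'."""
    lemma, rest = analysis.split("/", 1)
    pos = re.match(r"[A-Z]+", rest).group(0)
    features = re.findall(r"<(\w+)<(\w+)>>", rest)
    tags = [POS_TAGS[pos]] + [FEATURE_TAGS[f] for f in features]
    return lemma + "".join(f"<{tag}>" for tag in tags)

def to_lexical_unit(surface, analyses):
    """Wrap a surface form and its analyses as ^surface/analysis/...$."""
    return "^" + "/".join([surface] + [convert_analysis(a) for a in analyses]) + "$"

print(to_lexical_unit("házba", ["ház/NOUN<CAS<ILL>>"]))
print(to_lexical_unit("ma", ["ma/ADV", "ma/NOUN"]))
```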

::I wonder why OCaml is that slow. It is a compiled language, about as fast as C/C++. Also hunpos, which also delivers word types, works much faster (and is also an OCaml project!). [[User:Muki987|Muki987]] 12:10, 8 April 2009 (UTC)

::Is there system documentation for Apertium somewhere that would speed up my understanding of its structure and logic? I believe Apertium is the right project for me, and I would like to get a system overview first. Thanks. [[User:Muki987|Muki987]] 12:10, 8 April 2009 (UTC)

===Whatever (inflecting)-Hungarian===
Take the sentence "I go into the house".

Apertium knows from the rules:
* I go = megyek
* the = a
* into the house: ház/NOUN<CAS<ILL>> -> hunlex (or whatever) turns this into házba

What do you think?
[[User:Muki987|Muki987]] 20:59, 7 April 2009 (UTC)

:Yep, no problem. - [[User:Francis Tyers|Francis Tyers]] 07:12, 8 April 2009 (UTC)
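
The generation step discussed above (ház/NOUN<CAS<ILL>> to házba) can be sketched in Python. This is not how hunlex works; it is a toy that handles only the illative case and picks -ba or -be from the last vowel of the stem, a simplified view of vowel harmony that does not cover the exceptions of real Hungarian.

```python
BACK_VOWELS = set("aáoóuú")
FRONT_VOWELS = set("eéiíöőüű")

def illative(stem):
    """Attach -ba or -be depending on the last vowel of the stem."""
    for char in reversed(stem.lower()):
        if char in BACK_VOWELS:
            return stem + "ba"
        if char in FRONT_VOWELS:
            return stem + "be"
    raise ValueError(f"no vowel found in {stem!r}")

def generate(analysis):
    """Generate a surface form from 'lemma/NOUN<CAS<ILL>>' (illative only)."""
    lemma, tags = analysis.split("/", 1)
    if tags != "NOUN<CAS<ILL>>":
        raise NotImplementedError(tags)
    return illative(lemma)

print(generate("ház/NOUN<CAS<ILL>>"))
```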

Latest revision as of 07:55, 7 July 2009