
User talk:Muki987

From Apertium
Revision as of 15:13, 9 April 2009 by Jimregan (Talk | contribs)



Jimregan's remarks

'With this knowledge we can construct the English' -- How? You don't seem to have given thought to that part.
'háza (his, her, its house repeat all previous to this) - 56' -- it strikes me as a) unlikely that you can chain all possible possessives in this manner and b) that you can do something useful that will convey an understandable meaning in another language even if it is.
'házas (married- repeat all previous for this up to here, except the last 2) 1680' -- a married house? Really?
'házacska' -- are there no lexicalised diminutives in Hungarian? I can theoretically add '-let' to any noun in English, but 'piglet' has a separate translation to most languages, and 'hamlet' is not a diminutive of 'ham'.
Just because you can theoretically infer meaning from an analysis doesn't mean that results will translate. -- Jimregan 05:24, 7 April 2009 (UTC)

To Jimregan

>'With this knowledge we can construct the English' -- How? You don't seem to have given thought to that part.

Of course I did. Whatever I can do as a human translator, the machine can also do, if I tell it how. I am absolutely optimistic about that, and I am looking for the proper technology.

Great. 'To a person with a hammer, all problems look like nails'. I say that Hunmorph is your hammer; you are mixing derivational processes with agglutination and 'normal' morphology. Just because all of these things can be treated the same doesn't mean it always makes sense to do so, which is the point underlying everything I said. -- Jimregan 12:57, 7 April 2009 (UTC)
Your point remains unclear to me, but it may not be worth seeking clarity in this case, since your text seems philosophical to me. I am a practical person, not the philosophical type. I am rather new to using Hunmorph, anyway. Muki987 18:16, 7 April 2009 (UTC)
My point is that you have one solution to one problem; you're trying to use that solution for other problems. Clear? -- Jimregan 10:46, 8 April 2009 (UTC)
Sorry, no. Please explain what you want to say in more detail, with examples. Also explain why you are saying it. Muki987 11:49, 8 April 2009 (UTC)
I'll make it as simple as possible: you think there's only one problem; there are many more. You are ignoring them because you have one solution, and think it will work for them all. It won't -- Jimregan 15:52, 8 April 2009 (UTC)
You misunderstand me completely. I see very clearly that we have lots of things to do; I am just addressing one of them, that's all: the main one for me at the moment. Once that is fixed, I'll continue with the rest or, even better, since many of the problems are common ones, we solve the rest together. What I addressed is no problem for prefix-type language pairs, but very clearly a problem for me. My primary focus is English-Hungarian and German-Hungarian; second, English-German; third, German-English; fourth, Hungarian-German and Hungarian-English. The other option is to say that it is impossible to write a translator, and I think that is simply wrong. Muki987 18:14, 8 April 2009 (UTC)
If I misunderstood you; good. Because it seemed clear to me that you were ignoring other issues, thinking that they would all be solved by using HunMorph. I'd much rather make you angry now than see you work for months, only to find you have to redo everything. For English-Hungarian, you will most likely find it easier to treat certain types of words - derivatives - as separate words, rather than forms of a base word. It's still possible to do otherwise, but it causes a lot of unnecessary complication, and will have undesired side-effects. I find it hard to believe that Hungarian-German will be much different. (Or, honestly, anything other than Hungarian-Finnish and Hungarian-Estonian).
Yes, I agree with all that, especially with the remark about the German-Hungarian relation compared with the English-German relation. While translating lots of texts from both, I was faced with the same problems almost all the time. The only exception is that in English the same word often has many meanings, which fortunately is not the case in German or Hungarian. In the English-German relation the only grammatical problem I saw (besides the use of wrong words or expressions, which is a general problem for any language pair) is the position of verbs, which often come at the end of a structure in German, while in English they are in the middle of it.
Er dachte, sie würde in die Schule gehen.
He thought, she would go to school.
Er dachte, sie würde gehen in die Schule -- is unusual in German, and sounds un-German. One can see such bad structures in shallowly translated texts.
I believe Apertium has standard tools to fix this. I also think that handling words close to the stem is simpler than complicating our lives with derivatives (ház - házas: házas can, and clearly should, be considered an independent word). My example was just set up to illustrate the great number of derivatives (I even forgot some), and not to suggest any special way to translate that word. Muki987 09:31, 9 April 2009 (UTC)
Unfortunately, we don't have an effective way of dealing with this example -- but, we (well, Francis and I) recently learned something about our transfer architecture that can possibly be used to deal with this (n-level transfer), but neither of us have had a chance to experiment with it yet. (At least, I haven't; perhaps Francis has). -- Jimregan 10:58, 9 April 2009 (UTC)
Word order is very important, not only in English-German, but also in German-Hungarian relation.
He thought, she would go to school.
Azt hitte, iskolába megy. (azt=that, hitte=thought, iskolába=to school, megy=go), when we want to say that she goes to school, and not elsewhere.
Azt hitte, megy az iskolába (when we want to say that she went, and did not fly or swim).
This also exists in German: we put the part we want to stress at the beginning.
In die Schule wollte sie gehen. (To school she wanted to go)
Gehen wollte sie in die Schule. (She wanted to go to school) Muki987 11:18, 9 April 2009 (UTC)
Have a look at apertium-eu-es (Basque->Spanish). It's one direction only, but using HunMorph would limit you to only being able to translate from Hungarian anyway. (AFAIK, the main reason eu-es is one direction only is because Matxin already exists for the other direction) -- Jimregan 09:01, 9 April 2009 (UTC)
Yes, I am doing that at the moment, thanks. My priority is, as you know, English-Hungarian, German-Hungarian first. Muki987 09:31, 9 April 2009 (UTC)

>'háza (his, her, its house repeat all previous to this) - 56' -- it strikes me as a) unlikely that you can chain all possible possessives in this manner and b) that you can do something useful that will convey an understandable meaning in another language even if it is.

ház - házam, házad, háza, házunk, házatok, házuk (my house, your house, his/her/its house, our house, your house, their house). All relations to MY HOUSE are then expressed as in the case of ház: házban - házamban, házra - házamra, etc. It is simple and understandable in all cultural languages.

You aren't addressing my point. 'repeat all previous to this', implying that you can have some combination meaning 'my your his their house'. -- Jimregan 12:57, 7 April 2009 (UTC)
'Repeat all previous' means that I can express the relations to "my house", "your house" ... "their house" by using exactly the same inflections as for "house". See the example above with "ban" = in; all the others work in exactly the same way. Muki987 18:16, 7 April 2009 (UTC)
So, referring just to grammatical cases? Ok, that answers my question -- Jimregan 10:46, 8 April 2009 (UTC)
If you want to express it like that. Neither English, nor Hungarian have grammatical cases in fact, just to be precise. Muki987 11:56, 8 April 2009 (UTC)
Hungarian does have grammatical cases; 'I usually quote 17 following those established by Antal László in 1977' -- the first group in your set of examples are grammatical cases -- Jimregan 15:52, 8 April 2009 (UTC)
If I were you, I would be much more modest in my statements. Antal László's linguistic ideas are disputable. In Hungary, nobody speaks about n cases, because that is simply counterproductive. It is also counterproductive for foreigners learning Hungarian. I see now that it is useful for translation, so I will use the concept, but for this purpose only. Muki987 18:14, 8 April 2009 (UTC)
You misunderstand me here; I was not being immodest: all of the literature I could find in English is in agreement. It may be the case that the views are disputed, but that is not represented in English writings about Hungarian grammar, as far as I have seen at least. How are they considered, then? Because it may be the case that it could be easier to translate to and from Hungarian if the set of suffixes I would regard as case endings were instead treated as enclitic postfixes (that is, by splitting off the suffix and treating it as if it was a separate word: see, for example, how 'dímelo' in Spanish is split into 'decir<vblex><imp><p2><sg>+prpers<prn><enc><p1><mf><sg>+lo<prn><enc><p3><nt>'). Does Hungarian have vowel harmony, like Finnish? That may complicate things, but I think there's a (relatively) easy way around it. -- Jimregan 09:01, 9 April 2009 (UTC)
I see. You consider English literature authoritative for questions of Hungarian grammar? I would not do that. From the linguistic point of view, it is in fact written by illiterates. In my opinion, only native-speaker authors who agree with most other native Hungarian linguists are authoritative (of course, this includes linguists of the past as well, not only those of the present).
No, no; I don't speak Hungarian, so I can't check the literature in Hungarian: I have to rely on literature in English. -- Jimregan 10:52, 9 April 2009 (UTC)
You are however, IMHO absolutely correct, if you say, we must classify postfixes for translation purposes, since we MUST find a way to match our postfixes to the prepositions, and therefore we must classify them. We can call the classification anything, IMHO the best name is classification, but we can call them also cases, which has very little to do with German type cases.
Great. See Francis' example, below: this is basically what I propose, for analysis. For generation, I propose taking that 'pseudo word' system, and converting it into a string of tags, much like HunMorph generates, in a set of Hungarian-only rules (most of our rules are based on knowledge of both languages, but this set could be reused among language pairs using Hungarian). -- Jimregan 10:52, 9 April 2009 (UTC)
Yes, we do have vowel harmony; however, this is almost trivial: high words get high endings, low words get low endings, and there are a few exceptions that can be handled by rules. eéiíöüőű are high vowels, aáoóuú are low vowels. Every postfix has a low and a high form, for example ba, be (into): ajtó-ba, szék-be. If the word is mixed (mixed words are typically borrowed from foreign languages), for example rádió, either the last syllable decides or we use the low form, for example rádió-ba. The exceptions are some ancient words, for etymological reasons, for example derék, derékba and íj, íjba; they are only a handful of words, no problem. Muki987 09:31, 9 April 2009 (UTC)
Well, exceptions are exceptions, and every language has them. You're right; vowel harmony is not a big problem (at least, not in my opinion) -- but it does mean I need to ask you for more examples, as one set of suffixes is not enough. I know I can take them from Hunspell/Hunmorph (in fact, it would be a requirement IMO, to be able to reuse that data as quickly and easily as possible), but I'd rather focus on one small dataset to begin with, and expand later -- Jimregan 10:52, 9 April 2009 (UTC)
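
As a starting point for that small dataset, the rule of thumb described above (the last vowel decides, plus a short exception list) could be sketched roughly as follows. This is only an illustration in Python, not Apertium or Hunmorph code; the function names and the fallback for vowel-less input are assumptions, and the exception list holds only the two words mentioned in this thread.

```python
# Toy vowel-harmony chooser, following the rule of thumb described above.
FRONT_VOWELS = set("eéiíöüőű")   # "high" vowels
BACK_VOWELS = set("aáoóuú")      # "low" vowels

# Words that take the low form despite their vowels (examples from the discussion).
LOW_EXCEPTIONS = {"derék", "íj"}


def harmony_class(word):
    """Return "low" or "high" for a word; the last vowel decides."""
    lowered = word.lower()
    if lowered in LOW_EXCEPTIONS:
        return "low"
    for char in reversed(lowered):
        if char in BACK_VOWELS:
            return "low"
        if char in FRONT_VOWELS:
            return "high"
    return "low"  # assumption: fall back to the low form if no vowel is found


def attach(word, low_form, high_form):
    """Append the low or high variant of a postfix, e.g. ba/be."""
    suffix = low_form if harmony_class(word) == "low" else high_form
    return word + suffix


for noun in ["ajtó", "szék", "rádió", "derék", "íj"]:
    print(attach(noun, "ba", "be"))
```

A real system would take the exception list and the suffix pairs from Hunspell/Hunmorph data rather than hard-coding them.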


This conversation is a bit heated for me, but note that in our Basque→Spanish system we do something similar with Basque cases. For example, a typical way of representing "hegoak" would be:


Note how in our representation the case is marked as a postfix k<post> where in the more traditional analysis it is marked as a case ERG (ergative). Compared with Basque→Spanish, Hungarian→English would be easier in terms of word order:

S                                         O                      V
Txinako Poliziak, datu ofizialen arabera, 1.317 pertsona atzeman zituen 
la Policía de China, según los datos oficiales, 1.317 personas capturó
The Chinese police, according to official data, 1,317 people detained.

`According to official data the Chinese police detained 1,317 people.'

- Francis Tyers 10:13, 9 April 2009 (UTC)

I think the discussion is getting calmer, since I am starting to see Jimregan's point (and he mine), which is helpful and valid. I see that Matxin can handle both Basque-Spanish and Spanish-Basque, so I'll look into that thoroughly. Basque is also a hun language, as far as I know, very similar to Hungarian. Muki987 10:32, 9 April 2009 (UTC)
Actually, the Matxin system cannot handle Basque→Spanish, as there is no dependency analysis for Basque. Apertium is used for Basque→Spanish and Matxin for Spanish→Basque. As far as I know, Basque does not have any living relatives. - Francis Tyers 10:44, 9 April 2009 (UTC)
That's important for me to know, thanks. Muki987 11:30, 9 April 2009 (UTC)
Is there any difference between Apertium and Matxin in the main diagram "How Apertium works"? If yes, where? If not, what is the difference between Matxin and Apertium (apart from character encoding)? Muki987 12:37, 9 April 2009 (UTC)
Basque has lots of living relatives: Hungarian, Armenian, Turkish, Azerbaijani, Uyghur, Finnish, Estonian, Persian, Japanese (through Ainu = Hunnish influence), Quechua (the Inca language in South America), ancient Egyptian (no longer living, but the hieroglyphs show a great past), Etruscan (also no longer living, but with a great past), Hindi, and more. Muki987 11:30, 9 April 2009 (UTC)
I do not agree, although the issue is not really pertinent to our current discussion. - Francis Tyers 11:53, 9 April 2009 (UTC)

Considerations for prefix groups and possessions

Prefix groups

In English, one prefix (preposition) can govern several nouns, for example: I travel to England, France and Spain. This will be translated as: Utazom Angliá-ba, Franciaország-ba és Spanyolország-ba ("-" added for clarification). Utazom: I travel, Angliába: to England, ... , Spanyolországba: to Spain.

In English the prefix ...nouns structure will be closed by:

  • a dot (finishing the sentence)
  • a verb - I travel to England and Spain and will carry a bag - the word "will" closes the scope of "to".
  • a new prefix - I travel to England and Spain with train or aeroplane - the word "with" closes the scope of "to".
Co-ordinated noun phrase with case agreement. I would probably do this kind of thing in pre-transfer with a constraint grammar. Basically write a rule which does: "add accusative case to nouns following the preposition 'to' until a new preposition, verb or end-of-sentence". - Francis Tyers 10:22, 9 April 2009 (UTC)
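
The rule quoted above can be sketched in a few lines of Python to make the scope behaviour concrete. This is not constraint grammar syntax, only an illustration of the logic; the part-of-speech labels ("pr", "n", "vb", "sent") and the case label passed in are hypothetical choices made for the example.

```python
def spread_case(tokens, prep="to", case="CASE"):
    """Mark every noun in the scope of `prep` with `case`.

    `tokens` is a list of (word, pos) pairs. The scope opens at `prep`
    and closes at a new preposition, a verb, or the end of the sentence.
    Returns (word, pos, case_or_None) triples.
    """
    marked = []
    in_scope = False
    for word, pos in tokens:
        if pos == "pr":
            # A preposition either opens the scope or closes the current one.
            in_scope = word.lower() == prep
            marked.append((word, pos, None))
        elif pos in ("vb", "sent"):
            in_scope = False
            marked.append((word, pos, None))
        elif pos == "n" and in_scope:
            marked.append((word, pos, case))
        else:
            marked.append((word, pos, None))
    return marked


sentence = [
    ("I", "prn"), ("travel", "vb"), ("to", "pr"), ("England", "n"),
    ("and", "cnj"), ("Spain", "n"), ("with", "pr"), ("train", "n"),
    (".", "sent"),
]
for triple in spread_case(sentence):
    print(triple)
```

Here "England" and "Spain" receive the case label, while "train" does not, because "with" has closed the scope of "to".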


In English the possessor may come before the possession: Peter's coffee and tea

but also after it: the coffee and tea of Peter

In Hungarian the possessor always comes strictly before the possession; both phrases above must be translated as: Péter kávé-ja és teá-ja (again, "-" just for clarity).

In English the possession structure will be closed by

  • a dot (finishing the sentence)
  • a verb - Peter's coffee and tea looks like a bag - the word "looks" closes the scope of the possession structure
  • a new prefix - Peter's coffee and tea with sugar - the word "with" closes the scope of the possession structure

In the case of a "the coffee and tea of Peter" type possession relation: when an enumeration of nouns starts, the translator must watch for what follows. If the enumeration ends with "of", it is a possession structure and must be translated as such.

Combination of possession and prefix

  • With Peter's coffee and tea - Péter kávé-já-val és teá-já-val - "ja" is possession, "val"/"vel" is "with"
  • With the coffee and tea of Peter - as above

Adding plural

  • With Peter's coffee and teas - Péter kávé-já-val és teá-i-val - "i" is the plural possession marker for tea
Oh. That's interesting, that plurality 'goes with' the possessive. Not really an extra problem, but it is interesting. -- Jimregan 11:31, 9 April 2009 (UTC)
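
To make the suffix order in the examples above explicit (stem, then possession with number, then the postfix), here is a toy generator in Python. The possessed forms are simply copied from the examples in this section into a lookup table; nothing is derived by rule, and the table and function name are hypothetical. Only the "val" variant is used, because all three possessed forms here end in a low vowel.

```python
# Possessed third-person forms, copied from the examples in this section.
POSSESSED = {
    ("kávé", "sg"): "kávéjá",
    ("tea", "sg"): "teájá",
    ("tea", "pl"): "teái",
}


def with_possessions(possessor, items):
    """Build 'possessor + possessed nouns + val' for a list of (stem, number)."""
    forms = [POSSESSED[(stem, number)] + "val" for stem, number in items]
    return possessor + " " + " és ".join(forms)


print(with_possessions("Péter", [("kávé", "sg"), ("tea", "sg")]))
print(with_possessions("Péter", [("kávé", "sg"), ("tea", "pl")]))
```

The point of the sketch is only that the postfix is repeated on every possessed noun and always comes last, after the possession marker.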


These kinds of structures caused me the most manual work when translating texts from English/German, so it is very important to set up their proper translation. Thanks in advance for any criticism/thoughts/comments. Muki987 10:12, 9 April 2009 (UTC)

Yes; they pose quite a problem, because the phrase boundary needs to be detected: in 'Peter's coffee and tea', 'coffee and tea' is the part that's possessed, but in the sentence 'I drank Peter's coffee and tea was spilled on the ground' only 'coffee' is possessed. We can use CG to add boundaries here, but it will be a lot of work. -- Jimregan 11:29, 9 April 2009 (UTC)
An interesting example. This is a fourth kind of structure-closing signal: a noun immediately followed by a verb also ends the structure:
I drank Peter's coffee and children played near to us.
I saw Peter's coffee and tea smell like sugar - this sentence is ambiguous even in English - what smells like sugar, both or only the tea? Would a comma after coffee limit possession to coffee? Muki987 11:48, 9 April 2009 (UTC)
I'm not sure if this is ambiguous. The ambiguity is resolved by the inflection of the verb in this particular case.
?I saw Peter's coffee and tea smell like sugar
I saw [Peter's coffee] and [tea] smell like sugar
I saw [Peter's coffee and tea] smell like sugar
- Francis Tyers 13:36, 9 April 2009 (UTC)
This particular example aside, the premise is sound: English has ambiguities that can cause difficulties in determining phrase boundaries. -- Jimregan 14:13, 9 April 2009 (UTC)
Yes. Apertium's transfer works on left to right longest match. We were hoping that someone would be interested in integrating CG's dependency analysis for GSoC, which would help to resolve these ambiguities; at the moment, we have to simply pick the most common cases, and fail in others. -- Jimregan 13:32, 9 April 2009 (UTC)
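
To illustrate why left-to-right longest match picks one reading and fails on the other, here is a minimal Python sketch of the matching strategy. It is not Apertium's transfer code; the tag names and the two patterns are hypothetical and exist only for this example.

```python
# Hypothetical patterns over part-of-speech tags, mapped to a chunk label.
PATTERNS = {
    ("n_gen", "n"): "possessed-noun",
    ("n_gen", "n", "cnj", "n"): "possessed-coordination",
}


def lrlm_chunks(tags):
    """Chunk a tag sequence left to right, always taking the longest match.

    Returns (label, start, end) triples; unmatched tags become "word" chunks.
    """
    longest = max(len(pattern) for pattern in PATTERNS)
    chunks = []
    i = 0
    while i < len(tags):
        for size in range(min(longest, len(tags) - i), 0, -1):
            window = tuple(tags[i:i + size])
            if window in PATTERNS:
                chunks.append((PATTERNS[window], i, i + size))
                i += size
                break
        else:
            # No pattern starts here: pass the single token through.
            chunks.append(("word", i, i + 1))
            i += 1
    return chunks


# "I drank Peter's coffee and tea was spilled"
tags = ["prn", "vb", "n_gen", "n", "cnj", "n", "vb", "vb"]
print(lrlm_chunks(tags))
```

With these two patterns the longer one always wins, so "Peter's coffee and tea" is chunked as a single possessed coordination even in the sentence where only "coffee" is possessed. That is the kind of case where boundaries would have to be added beforehand.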

Further subjects

>'házas (married- repeat all previous for this up to here, except the last 2) 1680' -- a married house? Really?

That word is a bit of an exception, since it has two meanings: házas means married, and also a man/woman who has a house. In the case of ing (shirt), inges means someone who wears a shirt.

Ah; now I see what you mean. I thought you meant that the suffix meant married, not the word -- Jimregan 10:46, 8 April 2009 (UTC)
Then it's a derivation, and better treated as a separate word. -- Jimregan 12:57, 7 April 2009 (UTC)

>'házacska' -- are there no lexicalised diminutives in Hungarian? I can theoretically add '-let' to any noun in English, but 'piglet' has a separate translation to most languages, and 'hamlet' is not a diminutive of 'ham'.

acska or ikó is the diminutive. It is the same thing as pig-piglet.

I know what a diminutive is; did you understand my question? 'piglet' is a diminutive of pig, but it is a separate word in its own right, which would have its own translation -- it is lexicalised. Many (most) other diminutives are unproductive, and can be safely treated in terms of the original word. -- Jimregan 12:57, 7 April 2009 (UTC)
Yes, there might be some words whose diminutive form modifies the original word's meaning; however, I can't think of even a single one in Hungarian at the moment. Piglet means a little pig or a baby pig. What do you want with these words and examples? English is very hard to translate because so many words, like prime and the like, have tens of very different meanings. This is a very specific English problem; Hungarian and German do not have it. Are you addressing this problem? If yes, can you see any practical solution for it? Muki987 18:16, 7 April 2009 (UTC)
You're changing the issue again. If you want a German example; 'piglet' should be translated as 'Ferkel', not 'Schweinchen'; 'Mädchen', which is a lexicalised diminutive, should not be considered a form of 'Mäd'.
Word sense disambiguation is not a problem specific to English. -- Jimregan 10:46, 8 April 2009 (UTC)
Not specific to English, but sharper in English than in any other cultural language. What are your ideas for solving it? Muki987 11:54, 8 April 2009 (UTC)
Word sense disambiguation -- Jimregan 15:52, 8 April 2009 (UTC)
We don't currently have a good working lexical selection module, but it is one of the ideas we're hoping to get implemented through GSOC. - Francis Tyers 21:48, 8 April 2009 (UTC)

From Wikipedia:

In some cases, the diminutive suffix has become part of the basic form. These are no longer regarded as diminutive forms:


  • -ka/ke: fóka (seal), róka (fox), csóka (jackdaw), pulyka (turkey), szarka (magpie)
  • -cska/cske: macska (cat), kecske (goat), fecske (swallow), szöcske (grasshopper)

...which answers my question; yes, Hungarian does have lexicalised diminutives. -- Jimregan 10:50, 8 April 2009 (UTC)
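
In practice this suggests checking a list of lexicalised forms before treating a diminutive ending as productive. A rough Python sketch of that lookup order follows; the word list is the one quoted from Wikipedia above, the suffix list is deliberately tiny, and the function name is hypothetical. Stripping by string ending alone would over-apply to any word that happens to end in -ka or -ke, so a real analyser would need the full lexicon.

```python
# Lexicalised forms quoted above: these are entries in their own right.
LEXICALISED = {
    "fóka": "seal", "róka": "fox", "csóka": "jackdaw", "pulyka": "turkey",
    "szarka": "magpie", "macska": "cat", "kecske": "goat",
    "fecske": "swallow", "szöcske": "grasshopper",
}

# Candidate diminutive endings, longest first so "acska" wins over "ka".
DIMINUTIVE_SUFFIXES = ("acska", "cska", "cske", "ka", "ke")


def analyse(word):
    """Return (lemma, is_diminutive), consulting the lexicalised list first."""
    if word in LEXICALISED:
        return word, False
    for suffix in DIMINUTIVE_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)], True
    return word, False


print(analyse("házacska"))
print(analyse("macska"))
```

So 'házacska' is analysed as a diminutive of 'ház', while 'macska' stays a word of its own.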

You see, you get better answers in Wikipedia. You are right, this is an issue for translations, but it is one of the issues that can easily be covered. Muki987 11:54, 8 April 2009 (UTC)
Yes; it's an issue; one that you weren't considering. -- Jimregan 15:52, 8 April 2009 (UTC)
Sure, and a lot of others also not. One after the other. Muki987 18:14, 8 April 2009 (UTC)

>Just because you can theoretically infer meaning from an analysis doesn't mean that results will translate. -- Jimregan 05:24, 7 April 2009 (UTC)

I have translated so much already that I can say: you cannot say anything in any cultural language that cannot be translated into another one.

I hope, you do not want to stress that there are untranslatable things? I would strongly disagree with that assumption, and would ask you to give me at least one example. Muki987 08:13, 7 April 2009 (UTC)

Continuation JR

Ok, let me refine what I meant: the results won't translate in a meaningful way. There are all sorts of ways of inferring from derivational processes what a word 'means', but they tend to be useful only to linguists/translators who can then determine the best way to represent that in the target language.
Yes, there are certain words that are not directly translatable between languages: their concepts may be conveyed in other ways, but it's an explanation, not a translation. -- Jimregan 12:57, 7 April 2009 (UTC)
I call an explanation in the target language a way of translating it. For example, in German Hammelsprung means a sort of voting in which those who say yes leave the room through some doors, and those who say no through others. IMHO this cannot be translated directly into any language, but must be explained; I then call the explanation a translation, which it is. What do you think? Muki987 18:09, 7 April 2009 (UTC)
'Hammelsprung' -> 'parliamentary division', or just 'division', in context. That's the kind of translation MT should give: something as closely equivalent as possible, that fits into the same context. Your long explanation doesn't. -- Jimregan 11:02, 8 April 2009 (UTC)
Well, that might be the case for German-English, but as far as I know, not the case for German-Hungarian. C'est la vie. Muki987 12:00, 8 April 2009 (UTC)
Maybe. Still, it's better to use something shorter, that fits into the same general category, than to give a long winded explanation. -- Jimregan 15:55, 8 April 2009 (UTC)
The shortest possible explanation, but it must be understandable for every reader Muki987 18:17, 8 April 2009 (UTC)
By that, yes, you're right that if a 'close equivalent' is used, that should be as understandable as possible. On the other hand, it's perfectly acceptable to use specific terminology, which may not be understandable to everyone. -- Jimregan 08:38, 9 April 2009 (UTC)
Yes, if one exists. I doubt we have something in Hungarian, but I might be wrong. I can imagine "kimenős szavazás" (voting by/at leaving), but I would not understand that without further explanation. Muki987 09:02, 9 April 2009 (UTC)
This is rather off the topic of the discussion, this page is more to discuss methods of representing agglutinative morphology in Apertium, rather than the translation problems of agglutinative languages (which are also interesting, but better reserved for another page, or the mailing list). :) - Francis Tyers 08:21, 7 April 2009 (UTC)
Glad to hear that you are convinced Apertium technology is suitable for agglutinative languages. Having gone through the English-SerboCroatian example, I was not that sure. I am in the evaluation phase at the moment, and I am looking at all existing technologies. At present, in my opinion, Google's translation technology, with its statistical, grammar-free approach, will never have the quality of a grammar-oriented one like Apertium. It will forever remain on the surface, with no real prospect of improvement. However, for some situations it is very helpful. That was my first step in this direction. We can continue this subject on my discussion page, if Jimregan wants. Muki987 10:02, 7 April 2009 (UTC)
Regarding other free grammar-focussed MT engines, you might also check out Open Logos and Matxin. Open Logos has the downside of not supporting UTF-8 and not having very active development, while Matxin requires a dependency grammar to be written in Freeling format. If you want to go from English→Hungarian then this might be the answer, as they already have one written for English, but for Hungarian→English, it might take some extra development time. The Constraint grammar formalism for disambiguation and syntactic annotation might also be interesting. I'm quite happy to discuss other options and if you have any questions, please contact us on the mailing list, personally or through IRC. - Francis Tyers 10:36, 7 April 2009 (UTC)
PS. Are you the one asking on the hunmorph list about generation ('morp visszafele')? :) - Francis Tyers 12:00, 7 April 2009 (UTC)
Yes, I am an old language "rabbit" :-) Peter H. says, hunlex knows something similar, we are waiting for Victor, the author, he might know..... Muki987 18:04, 7 April 2009 (UTC)


Perhaps we could make a page of free resources for Hungarian ? - Francis Tyers 12:59, 9 April 2009 (UTC)

Sure, why not. As I go ahead, I'll think of the idea, and collect things. Muki987 13:31, 9 April 2009 (UTC)
