User talk:Francis Tyers

From Apertium

Francis, there's an import and export feature in the MediaWiki engine. If you tweak it somehow, we could edit whole dictionary articles here in the wiki and simply export them to Apertium XML format! This would make the whole process unbelievably simpler. We could also make use of the wiki's template facilities.

Hi Francis, thanks for the message. Worked on the table for Pivə. Will work more later. Great project. Good luck. --Mehrdad 19:11, 1 September 2007 (BST)
Really glad to see this project, and I hope I can contribute more in the future. I have to admit, though, that although I am a native speaker of Azerbaijani, I had no formal education in this language, so I may be wrong in some cases. Yes, I have an MSN account and will send you the ID via email. --Mehrdad 11:24, 4 September 2007 (BST)

Hi. I don't know about 'awesome' yet - I have a bug or two to work out ("Error: Unsupported transducer type for ."), some tag reordering to do, and a heck of a lot more pardefs to write (Polish morphology is... extensive :)

Actually, I'm interested in Polish-{English,Irish,Russian} and Irish-English. I have a lot of spare time :) Yes, I would be very interested in help with SVN, thank you.

IM... hmm... I have Google Chat, Tlen, and ICQ (if it still works). Any preference?

I have a few Polish-English wordlists that I've built up over the past few years of learning the language; I just need to gather them, sort them, and add morphological information. I'll certainly have a look at it. Thanks again. -- Jimregan 21:16, 6 October 2007 (BST)

English

Hi Francis, it was I who made the edits to the Apertium HOWTO. One thing: I changed the spelling to "realized" because that is how it is spelled in international English (US, Canada, etc.). Your spelling is a British variation, a French derivative (not German like the "z" spelling). The same goes for many other words. For instance, internationalization, not internationalisation (it is even indicated as a spelling error in this wiki editor, with red text, and so is "realise"). --Laseray 14:33, 17 April 2008 (BST)

Breton

  • Hi, besides TermOfis, our terminological database, we are building a lexicographical one. I haven't checked it recently, but there should be around 60 000 lemmatized forms in it by now. Do you think it could help you? --Fulup 17:31, 9 November 2008 (UTC)

Yes definitely! Do you know if it includes part-of-speech also?

No, it does not (I'll show you in December), but OmegaWiki has some.

We have been working on extracting information from Jan Deloof's dictionary of Breton--Dutch, which he let us use under the GPL. Do you know of it? - Francis Tyers 17:33, 9 November 2008 (UTC)

Yes, I have it here at home, but I haven't used it (I don't speak Dutch). I think it should be OK to use anyway. --Fulup 17:40, 9 November 2008 (UTC)

Icelandic

Hi Fran. I removed the two "bread" sentences from the regression tests because they have never worked for me. I was going to show how the tests worked, so I wanted them to be on the "right" side of the pending/regression tests, which is why I moved them. I've always done svn up and make before testing (learned that the hard way), so I don't understand why they work for you but not for me. --Martha 16:41, 5 March 2009 (UTC)

Basque

Is there any difference in the main "How Apertium works" diagram between Apertium and Matxin? If yes, where? If not, what is the difference between Matxin and Apertium (apart from character encoding (Matxin only ISO) and the use of FreeLing (Matxin))?

My primary aim is

  • 1. English-Hungarian and German-Hungarian, -- Apertium Matxin (es-eu, en-eu)
  • 2. English-German and German-English, -- Apertium or Matxin (es-en, en-es Apertium)
  • 3. Hungarian-English and Hungarian-German. -- Matxin Apertium (eu-en, eu-es)

For 1 Apertium is the right tool, for 2 Apertium, for 3 Matxin, right?

Muki987 12:37, 9 April 2009 (UTC)

I would reverse the order.
  • 1. Matxin -- We have 'deep' analysis for English, so we should use it.
  • 2. Matxin or Apertium -- Again, 'deep' analysis is available for English. But there are many other tools which do en-de so I would count this as reasonably low priority.
  • 3. Apertium -- We have POS tagging and morphological analysis for Hungarian, so we should take advantage of this. But there is no free parser available.
Matxin now supports Unicode; I have updated the page. The main difference between Apertium and Matxin is that the latter uses FreeLing to do chunking and dependency parsing and then does re-ordering based on that. Whereas Apertium is restricted to re-ordering fixed-length patterns, Matxin has some degree of recursion. We are planning to extend Apertium this year to support recursive re-ordering, and any resources made now will be able to be re-used in the future. A brief breakdown of current resources would be good, e.g. English analysis (Apertium or FreeLing), English generation (Apertium), English--Hungarian bilingual lexicon (?), Hungarian analysis (Hunmorph), Hungarian generation (?). - Francis Tyers 12:51, 9 April 2009 (UTC)
Yes, now I see, Spanish-Basque is in Matxin. I will start with Matxin.
I'll check whether the analysis of English in Matxin is good enough for Hungarian.
I have quite a good English-Hungarian lexicon, and I don't think converting it to Apertium XML format will cause any problems. I also know hunspell and the tools behind it quite well. I think that will help with Hungarian generation, and I still hope to get some support from the Hunmorph group for that.
Do you know of any usable de-en or en-de tool that you consider to be at the quality of Apertium?
You forgot my first question, about the main "How Apertium works" diagram for Matxin. Such a diagram is very helpful for a beginner. Muki987 13:21, 9 April 2009 (UTC)
If you provide some example sentences in English, I can send you back the results of FreeLing analysis, in case you don't want to install FreeLing yourself.
Regarding the lexicon, if you send it to me I'd be happy to take a look at how difficult it would be to convert.
Yes, hunspell is good.
Free software tools for English--German, unfortunately not. There are many commercial tools though.
Regarding the diagram, it is difficult to express like that, as both Apertium and Matxin are typical "transfer" systems. The easiest way of expressing it is that Matxin works on trees, while Apertium works on chunks. To get an idea of the difference, take a look at these two diagrams: chunking and parsing. Apertium analysis is more similar to the first, while Matxin approaches the second. - Francis Tyers
  • You can find my word collections on http://tkltrans.sf.net
  • I will install everything on my pc, so I'll generate examples myself.
  • I checked Promt, which is at present the best according to the test; it is in fact miserable (E-G, G-E).
  • I think the selection of the right word is unsolved, and even more unsolved is finding and using expressions like "no space left on" and the like.

Muki987 13:21, 10 April 2009 (UTC)

Expressions

What about expressions? For example, "look after one's fences" is at present not handled at all:

   * Peter looked after Martha's fences
   * Peter miraba después de las vallas de Martha 

The expression is not recognized at all (the intended meaning is that Peter acted in Martha's interest).

Is something planned for this? Are there working examples available? 20-30% of our speech is expressions! Muki987 13:38, 10 April 2009 (UTC)

We have two methods of handling expressions. The first is with multiword units in our dictionaries. Please try "He took away the rubbish". The second way is with TMX files; you probably know about them, they contain translation segments. The example you have given would be a multiword unit. Probably "look after" → "cuidar", but I'll ask the maintainer of es-en when she gets back from holiday. We tend to gear our development towards translating "news text", where these kinds of expressions tend to be less frequent. So you'll have to excuse us if we don't have full coverage :) - Francis Tyers 20:20, 10 April 2009 (UTC)
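To make the multiword-unit idea concrete, here is a minimal sketch of longest-match lookup over lemmas. It is not how Apertium's dictionaries are implemented; the `LEXICON` table, its entries and the `lookup` function are hypothetical and only illustrate why "look after" needs its own entry instead of being translated word by word.

```python
# Toy bilingual lexicon keyed by lemma sequences. The entries are
# illustrative assumptions, not taken from the real en-es dictionaries.
LEXICON = {
    ("look", "after"): "cuidar",
    ("take", "away"): "sacar",
    ("look",): "mirar",
    ("after",): "después de",
    ("take",): "tomar",
}
MAX_LEN = max(len(key) for key in LEXICON)


def lookup(lemmas):
    """Greedy left-to-right lookup that prefers the longest matching entry."""
    out = []
    i = 0
    while i < len(lemmas):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_LEN, len(lemmas) - i), 0, -1):
            key = tuple(lemmas[i:i + n])
            if key in LEXICON:
                out.append(LEXICON[key])
                i += n
                break
        else:
            # No entry at all: pass the word through, flagged as unknown.
            out.append("*" + lemmas[i])
            i += 1
    return out


print(lookup(["look", "after", "fences"]))
print(lookup(["look", "fences"]))
```

With the two-word entry present, "look after" comes out as one unit; without it, the same words would fall back to "mirar" and "después de", which is the kind of output complained about above.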
  • He took away the rubbish -- is this at all an expression??
  • Sacó la basura -- word for word the same thing??
"Take away" is a phrasal verb which is best translated in Spanish by either "llevar" or "sacar". Of course, it has other meanings, for example "take away meal", "a take away", "two take away three". But the most frequent is probably the one we have. As I mentioned on your talk page, lexical selection is something we'd like to work on. - Francis Tyers 21:30, 10 April 2009 (UTC)
  • He took the minutes -- he wrote the protocol
  • Tomó los minutos -- no mention of a protocol; wrong again, I think
It would be "apuntar las actas" or "tomar las actas". - Francis Tyers 21:30, 10 April 2009 (UTC)
  • He took air - he breathed
  • Tomó aire - "took air"; bad again, it should be "breathed"
I'm not sure I would use this in English. I'd say "take a breath". It doesn't sound very natural. - Francis Tyers 21:30, 10 April 2009 (UTC)

I could not find any working examples yet. If you have one, please also explain the English one; my English is not so good. Thanks, Muki987 21:22, 10 April 2009 (UTC)

I've extracted a list of the phrasal verbs we have and you can find them here: http://www.nopaste.com/p/afrZuyaJ3 - Francis Tyers 21:30, 10 April 2009 (UTC)
  • They went Dutch that evening - They each paid their own bill
  • Pagaron a escote que anochecer
Again, that expression isn't part of my lexicon.
  • They want to go Dutch that evening - They want to each pay their own bill
  • Quieren pagar a escote aquel anochecer - Google says: Want to pay a neckline that evening
I hope that is correct, and only Google is too stupid for this. Muki987 21:45, 10 April 2009 (UTC)
And I'm presumably too stupid for not knowing an obscure expression too? ;) - Francis Tyers 21:49, 10 April 2009 (UTC)
  • He takes a backseat in this project - he played a subordinate role
  • Toma un backseat en este proyecto - no mention of a subordinate role; bad
This is quite colloquial. As I mention above, we target our development towards translating news text, so if you can't find it on a search of site:news.bbc.co.uk in Google, the chances are we don't have it. This is not to say the system doesn't support it, just for our purposes we don't yet find the reward sufficient for the effort. - Francis Tyers 21:33, 10 April 2009 (UTC)

Word with multiple meanings

  • He primed the car's petrol tank - He filled the gasoline tank
  • Él primed la gasolina del coche tanque (probably does not understand the word prime)

What are you, writing novels? I'd never say this, and Zipf's law would probably agree. The problem with "car's petrol tank" → "la gasolina del coche tanque" is a serious one and we should fix that. In fact, it kind of works if you remove the preceding article.

'Prime' as a verb means to prepare a mechanism for work. In reference to a petrol tank, though I have never heard that usage before, I understand it to mean to fill it -- and fill it to the top; in terms of a weapon (a much more common use), it means to arm it. I think it more likely that the phrase was 'primed the engine', which (among other preparations) includes filling it with fuel. 'prime' also means to apply a coat of primer (paint); in either case, 'preparar' is the most acceptable general purpose Spanish translation. -- Jimregan 09:46, 11 April 2009 (UTC)


No, I would just like to be rid of the hard part of translation. Muki987 21:51, 10 April 2009 (UTC)
The Spanish←→English system is not suitable for post-editing, and probably won't be in the near future... that is, unless you are translating a lot of repetitive text and have a large translation memory. We are an open-source project; we have to focus our limited resources on the achievable. - Francis Tyers 21:59, 10 April 2009 (UTC)
$ echo "car's petrol tank"  | apertium -d . en-es
El tanque de gasolina de coche

I'll see if I can fix that now. - Francis Tyers 21:44, 10 April 2009 (UTC)

$ echo "the car's petrol tank"  | apertium -d . en-es
El tanque de gasolina de coche
Done. - Francis Tyers 21:50, 10 April 2009 (UTC)
  • He woke up at prime time - he woke up very early
  • Él woke arriba en tiempo primo - does not understand the word "woke" (past tense of "wake")

"prime time" does not equate to "early" in English.

$ echo "He woke up" | apertium -d . en-es
Despertó

Almost right, should be "se despertó". - Francis Tyers 21:43, 10 April 2009 (UTC)

So again not a single working example. Any ideas? Muki987 21:37, 10 April 2009 (UTC)

Yes, as I mentioned above, you can give examples of phrases which don't work "until the cows come home", but that isn't what we focus our efforts on. If you want to focus your efforts on that fine... we're primarily interested in dealing with the most frequent structures first. - Francis Tyers 21:43, 10 April 2009 (UTC)
You see, I am happy if one example works. The rest is diligence. I understand that finding the right word and the right expression is by far the hardest part. Then comes word order change, which, as far as I can see, is also handled. If one example works, one day all will work. Muki987 21:51, 10 April 2009 (UTC)
If you're looking for "set phrases", then here is a list of some we have collected (personally I think this is a waste of time, but some people like it). If you are looking for phrasal verbs, please see the examples in this list. If you are looking for "this phrase I heard one time in a film" to work, then possibly you have the wrong project. - Francis Tyers 21:57, 10 April 2009 (UTC)
I am not searching for anything extravagant or unusual. All I'd like is to let the machine do the dirty work of translation. Maybe English is not as picturesque a language as Hungarian and German are, but I can say from my own experience that in more than 10% of our speech we (Hungarians) use expressions that have a different meaning as a group of words than the words simply following each other. When I start testing, I can certainly show you a lot of them. This is also the case for German. Now I am going to understand, install and test. I hope all of that will make sense. Muki987 10:04, 11 April 2009 (UTC)
When we build translators, we pay a lot of attention to frequency. That is, instead of starting with the low-frequency "jewels" of the language, we start with the high-frequency "building blocks" (this terminology thanks to Mikel, and you might enjoy this paper). Probably more than 10% of spoken English is expressions, but we consider it less important to translate these correctly than to translate correctly, for example, "article noun" and "article adjective noun" type phrases. What you are referring to is non-compositionality (the meaning of two words together is not the sum of the meanings of the constituent words, e.g. "compact disc"), and it is one of the main "open issues" in machine translation. The importance of frequency in building MT systems usually takes a while to sink in, as it is usually completely the opposite of what linguists and translators think of as important, but most people get it eventually (although, if they are like me, they'll waste a good deal of time in the process!). - Francis Tyers 10:19, 11 April 2009 (UTC)

Expressions

1. I found Mikel's study interesting. However, he does not address the quality of commercial translation systems at all. At present the best one is Promt, a Russian one, and it produces 60-80% accuracy, which is far from being usable. Why is that?

  • Not even the words are completely available for any language pair in the world. That is because one person, the editor, is not able to understand and handle all the words of a major language. I can give you good examples of this.
I will always sacrifice completeness to frequency. Le mieux est l'ennemi du bien. (The better is the enemy of the good) See this paper as well. - Francis Tyers 11:21, 11 April 2009 (UTC)
  • The expression coverage is even worse. Hungarian is in my opinion a very coherent language, with far fewer special words and special pronunciations than English or German, and still, even for Hungarian I do not know of any collection of expressions that is even close to complete, even though the language itself is rich in expressions. I assume English and German look even worse.
  • Statistical approaches, like Google's, look very promising at first glance. At second glance they show that there is no room for improvement in them, and that they will remain forever at their 60-90% level because of the lack of internal understanding and intelligence.
Personally I consider the way forward in MT to be a combination of rule-based and statistical approaches. For doing lexical selection for example, statistical approaches have many benefits over rule-based ones. - Francis Tyers 11:21, 11 April 2009 (UTC)

I tried to interpret your sentence: The importance of frequency in building MT systems usually takes a while to sink in.

Sink in means decrease?

Sink in here means "to understand fully" ("to assimilate") - Francis Tyers 11:11, 11 April 2009 (UTC)

Do you mean that an MT system must first be built frequently, and later on less frequently, because quality gets better and better?

No, I mean that when building a machine translation system, it is of vital importance to plan the work according to the frequency in the language. For example, it is more important to be able to translate "the" and "a" well than "communitarianism". It is more important to be able to translate simple structures (e.g. basic noun phrases... article noun 'the book', article adjective noun 'the big book') than complex relative clauses. With 1,000 words you can cover around 50% and with 20,000 words you can cover 90% of any English text. For systems that I build from scratch typically I set the "gauge of quality" for a 0.1 release to be "translates reasonably well sentences of 5--7 words". Starting from scratch, this typically takes around 6 months. - Francis Tyers 11:11, 11 April 2009 (UTC)
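The coverage figures above can be checked on any corpus with a few lines of code. The following is only a sketch: the `coverage` function is a hypothetical helper, and the sample sentence is made up and far too small to reproduce the 50% and 90% figures, which would need a large corpus.

```python
from collections import Counter


def coverage(tokens, top_n):
    """Fraction of running tokens covered by the top_n most frequent word types."""
    if not tokens:
        raise ValueError("need at least one token")
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / len(tokens)


# Tiny made-up sample, whitespace-tokenised for simplicity.
text = "the cat saw the dog and the dog saw a cat near the big house"
tokens = text.split()

for n in (1, 3, 5):
    print(n, round(coverage(tokens, n), 2))
```

Even in this toy sample the single most frequent word ("the") accounts for more than a quarter of the tokens, which is the effect that makes high-frequency words the first priority.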

The expression "take a while" is not in your collection; however, the translation seems to be OK:

  • it takes a while to listen to you
  • Toma un rato para escuchar a ti
The 'a while' is well translated, 'toma' should probably be 'cuesta', 'escuchar a ti' might be better 'escucharte'. Although the sentence in English doesn't make much sense, do you mean "It takes a while to understand you" ? If so, in Spanish I'd probably say "Cuesta entenderte" (although I'm not a native speaker, so any of my translations into Spanish are suspect) - Francis Tyers 11:11, 11 April 2009 (UTC)

Muki987 10:54, 11 April 2009 (UTC)

Ispell-aspell

The Matxin docs say: 5.8 Morphological dictionary (eu_morph_gen)

Basque morphology is complex and owing to its agglutinative character, much of the standard free software for dealing with morphology (such as ispell or aspell) is not well adapted for it.


The above is not quite correct. The complete situation is as follows:

  • Ispell was written by Geoff Kuenning, an Englishman; however, the checking algorithm was developed by Dömölki Bálint, a Hungarian.

Ispell is, as it stands, quite well suited to agglutinative languages thanks to its suffix/prefix concept, which has been central to it from the beginning. The only disadvantages are:

    • a. the limited number of affixes and
    • b. the lack of two-level suffixes (it has just one level)

However, with that concept as it is, it is possible to write a very well-working Hungarian spell checker, and I doubt that any other language is more sophisticated in agglutination than Hungarian.

  • Myspell was developed by Kevin ..., Canada, and started as Ispell in C++. It had all the features of Ispell from the very beginning. Németh László from Hungary added to that:
    • two-level affixing/prefixing
    • faster dictionary loading
    • morphological capabilities
    • handling of the UTF-8 character set

and the product was renamed hunspell.

  • Aspell is a story of its own. It was originally focused on word correction and was not at all sufficient for agglutinating languages, because it did not support the suffix/prefix concept. From version 0.60, however, Kevin Atkinson added Ispell's suffix/prefix concept, which made aspell as good as ispell for agglutinating languages, and from version 0.60.6 (maybe even earlier, not sure) it also supports two-level prefixing/affixing, exactly as hunspell does. Muki987 19:03, 12 April 2009 (UTC)
I'm just the translator here, please feel free to send this commentary to Aingeru for correction. - Francis Tyers 19:07, 12 April 2009 (UTC)
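The one-level versus two-level suffix distinction discussed above can be shown with a toy example. This is a sketch only: the rule lists and the functions `one_level` and `two_level` are hypothetical, they are not ispell or hunspell affix-file syntax, and they ignore vowel harmony and linking vowels, which real Hungarian dictionaries must handle.

```python
# Hypothetical toy rules for the stem "ház" (house).
FIRST_LEVEL = ["ak"]           # plural
SECOND_LEVEL = ["at", "ban"]   # accusative, inessive


def one_level(stem, suffixes):
    """Stem plus at most one suffix taken from a single flat list."""
    return {stem} | {stem + s for s in suffixes}


def two_level(stem, first, second):
    """Stem, then optionally a first-level suffix, then optionally a second-level one."""
    bases = {stem} | {stem + f for f in first}
    return bases | {b + s for b in bases for s in second}


stacked = two_level("ház", FIRST_LEVEL, SECOND_LEVEL)
# With one level only, every combined ending must be listed as its own rule.
flat = one_level("ház", ["ak", "at", "ban", "akat", "akban"])

print(sorted(stacked))
print(stacked == flat)
```

Both approaches accept the same six forms here, but the flat list needs five rules where the two-level version needs three, and the gap grows quickly as more suffix slots are added.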

Hungarian generator works fine

The secret is: dictionaries have to be taken from http://magyarispell.sf.net. chmorph and analyze are part of hunspell 1.2.8 (current version). Example on my discussion page, at the end. Muki987 20:04, 15 April 2009 (UTC)

Phases

I'd like to see the phases of translation. Here is what I did:

/tmp/x: Martha's cat and Peter are sweet

en@anonymous:~/tmp/download/forditas/apertium-en-es-0.6$ apertium -d . en-es-anmor /tmp/x
^Martha/Martha<np><ant><f><sg>$^'s/'s<gen>$ ^cat/cat<n><sg>$ ^and/and<cnjcoo>$ ^Peter/Peter<np><ant><m><sg>$ ^are/be<vbser><pres>$ ^sweet/sweet<adj><sint>/sweet<n><sg>$^./.<sent>$

en@anonymous:~/tmp/download/forditas/apertium-en-es-0.6$ apertium -d . en-es-tagger /tmp/x
^Martha<np><ant><f><sg>$^'s<gen>$ ^cat<n><sg>$ ^and<cnjcoo>$ ^Peter<np><ant><m><sg>$ ^be<vbser><pres>$ ^sweet<adj><sint>$^.<sent>$

en@anonymous:~/tmp/download/forditas/apertium-en-es-0.6$ apertium -d . en-es-pretransfer /tmp/x
^Martha<np><ant><f><sg>$^'s<gen>$ ^cat<n><sg>$ ^and<cnjcoo>$ ^Peter<np><ant><m><sg>$ ^be<vbser><pres>$ ^sweet<adj><sint>$^.<sent>$

en@anonymous:~/tmp/download/forditas/apertium-en-es-0.6$ apertium -d . en-es-chunker /tmp/x
^nom_genitiu_nom<SN><DET><GD><sg>{^el<det><def><3><4>$ ^gato<n><3><4>$ ^de<pr>$ ^Martha<np><ant><f><sg>$}$ ^cnj<cnjcoo>{^y<cnjcoo>$}$ ^nom<SN><UNDET><m><sg>{^Peter<np><ant><3><4>$}$ ^be<Vcop><vbser><pri><PD><ND>{^ser<vbser><3><4><5>$}$ ^adj<SA><mf><ND>{^dulce<adj><2><3>$}$^punt<sent>{^.<sent>$}$

en@anonymous:~/tmp/download/forditas/apertium-en-es-0.6$ apertium -d . en-es-interchunk /tmp/x
^nom_genitiu_nom<SN><DET><m><sg>{^el<det><def><3><4>$ ^gato<n><3><4>$ ^de<pr>$ ^Martha<np><ant><f><sg>$}$ ^cnj<cnjcoo>{^y<cnjcoo>$}$ ^nom<SN><PDET><m><sg>{^Peter<np><ant><3><4>$}$ ^be<Vcop><vbser><pri><p3><pl>{^ser<vbser><3><4><5>$}$ ^adj<SA><mf><pl>{^dulce<adj><2><3>$}$^punt<sent>{^.<sent>$}$

en@anonymous:~/tmp/download/forditas/apertium-en-es-0.6$ apertium -d . en-es-postchunk /tmp/x
^el<det><def><m><sg>$ ^gato<n><m><sg>$ ^de<pr>$ ^Martha<np><ant><f><sg>$ ^y<cnjcoo>$ ^Peter<np><ant><m><sg>$ ^ser<vbser><pri><p3><pl>$ ^dulce<adj><mf><pl>$^.<sent>$

en@anonymous:~/tmp/download/forditas/apertium-en-es-0.6$ apertium -d . en-es-generador /tmp/x
el gato ~de Martha ~y Peter son dulces

Is that the correct order?

Yes, exactly correct. One is missing, though: en-es-anmor, which is the first stage. - Francis Tyers 09:22, 16 April 2009 (UTC)
Thanks, added anmor. Muki987 09:28, 16 April 2009 (UTC)
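For inspecting the output of the first stage, a short script can pull the lexical units out of the analyser's output and list the words that are still ambiguous before the tagger runs. This is a sketch under assumptions: the helper names are made up, and the regular expression ignores escaped characters such as `\/` or `\$`, which the real stream format allows.

```python
import re

# One lexical unit looks like ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def parse_analyses(line):
    """Return (surface form, list of analyses) pairs from analyser output."""
    return [(m.group(1), m.group(2).split("/"))
            for m in LEXICAL_UNIT.finditer(line)]


def ambiguous(line):
    """Surface forms that have more than one analysis."""
    return [surface for surface, analyses in parse_analyses(line)
            if len(analyses) > 1]


# Part of the en-es-anmor output shown above.
sample = ("^cat/cat<n><sg>$ ^are/be<vbser><pres>$ "
          "^sweet/sweet<adj><sint>/sweet<n><sg>$^./.<sent>$")

for surface, analyses in parse_analyses(sample):
    print(surface, analyses)
print(ambiguous(sample))
```

In the sample only "sweet" has two analyses (adjective and noun), and the en-es-tagger stage above is where the adjective reading gets chosen.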

Comments

Added some comments on hunmorph speed to my talk page. Muki987 20:25, 16 April 2009 (UTC)
Please see the end of my talk page, thanks. Muki987 21:01, 22 April 2009 (UTC)
Please check again. Thanks. Muki987 22:44, 22 April 2009 (UTC)
x Muki987 08:12, 23 April 2009 (UTC)
x. Muki987 08:48, 23 April 2009 (UTC)
x Muki987 10:42, 25 April 2009 (UTC)