User:Mjaskowski

Name: Maciej Jaśkowski

E-mail address: maciej.jaskowski on gmail account

IRC: mjaskowski (not registered!) or mjaskow

Skype: maciej.jaskowski (usually I'm offline though)

I live in Poland, so I am on Central European Time (CET)

Abstract

The input to an MT system is often corrupted, for instance by incorrect diacritics or other errors.

Since the system is meant to be reliable, it should cope with such input anyway, as translate.google.com does.

The aim of this project is to find a solution to the problem stated above and make it accessible to all Apertium users.

Why is it you are interested in machine translation?

I can speak 4 languages (Polish, French, English and German) and learning a new language is always exciting for me. I also know how much time one needs to learn yet another language. I can only imagine the problems that arise if one wants to learn a language which is not as popular as the languages I have mentioned above (for example, try learning Polish!)

But why do we learn languages in the first place? Everyone has his own reasons but arguably the most important one is simply that we want to understand what other people have to say, what they have said or written already. In other words, for numerous reasons, there is a lot of information which doesn't get translated into your mother tongue but still it might be relevant for you!

The knowledge of languages helps me quite a lot in everyday web crawling. In fact I can't imagine ignoring the English or French part of the Web. And information is arguably one of the most important things in the 21st century! MT may give us access to all the information that people all over the world have gathered. Translation is of particularly high importance for the EU, where each language is equally important according to EU law.

That is why MT is so important these days, and also why it is interesting. Working on something that is really needed: I love it!

I like to think about MT as a way to reverse the curse of Babel[1] :-)

And MT is like a Holy Grail for Computational Linguistics, a field I find particularly interesting.

Apart from all that, MT combines to a large extent all my scientific interests: machine learning, data mining and linguistics.


Why is it that you are interested in the Apertium project?

To be honest, I knew nothing about Apertium before I started looking for a nice Open-Source community for GSoC. I believe that in such a community, through knowledge sharing, I can learn many things not necessarily directly connected with programming or scientific stuff but also how to maintain big projects and how to work in a sizable group.

From the beginning, however, I knew that I wanted to get involved in a project related to linguistics, machine learning or data mining. These are the things I am interested in and want to dive into during my PhD studies. And that is how I found Apertium.

But the more I IRC/mail with you, the more I learn about Apertium and the more code I read, the more I am persuaded that Apertium is not only the best-known (to Google[2]) open-source MT system but also one with a great community around it, which I believe is of crucial importance for huge projects like this.

I strongly believe that during this project I will come to understand in detail how Apertium works, which might help me in my future commitments as a PhD student.

Which of the published tasks are you interested in?

"Accent and diacritic restoration"

Reasons why Google and Apertium should sponsor it

The main reason is: because it enhances the ways one can use Apertium (see: next paragraph).

On the other hand, if you ask why Google and Apertium should sponsor me to do this project, I would answer:

  • because I like Computational Linguistics
  • because I hope to keep helping Apertium beyond GSoC
  • because I am a promising student and I can learn a lot from you while working on this project

A description of how and who it will benefit in society

The input to an MT system such as Apertium is often corrupted. This might be because of incorrect diacritics, spelling mistakes, mistakes in the format (html...), etc. Since, however, the system is meant to be reliable, it should cope with such input anyway, as translate.google.com[3] does.

If we focus here on diacritic restoration, it is because it is a common problem and most of the languages in the world use diacritics beyond the ASCII standard. Writing a text without diacritics is easier and faster, and the text is still 100% understandable for a human being. Many WWW users are therefore tempted to drop the diacritics totally or partially (an example from my backyard:[4]). Once the text is written, whatever the reason for writing it in pure ASCII, no one ever wants to restore the diacritics manually.

Nevertheless, this information might be relevant for an Apertium user!

The work will have an impact on the tools already provided by Apertium, among others Geriaoueg[5], the XChat plugin[6], www.apertium.org[7] and the oooapertium plugin[8]. Thanks to that, people learning a particular language or trying to chat on IRC using Apertium may benefit from the system.

If we stick to the original idea of writing a charlifter C++ port, we might reuse the code and write a Firefox add-on (though, I admit, I don't have much experience in it). Whichever approach we take, we may add the charlifter capabilities to Open Office, and xerakko will help us (I hope! :-)

Understanding of the problem

We are to write an application (in C++) which takes text as input (the output of the deformatter) and outputs text in UTF-8 with the diacritics restored, leaving the superblanks untouched.
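The requirement to leave superblanks untouched can be illustrated with a minimal Python sketch (not the planned C++ implementation). It assumes that superblanks are wrapped in square brackets and that a backslash escapes a literal character, as in the Apertium stream format; `toy_restore` is a hypothetical stand-in for the real restoration step.

```python
import re

# Assumption: a superblank is "[...]" and a backslash escapes the next
# character. Alternatives, in order: a superblank, an escaped character,
# or a run of ordinary text.
TOKEN = re.compile(r"\[(?:\\.|[^\]\\])*\]|\\.|[^\\\[]+", re.DOTALL)


def restore_outside_superblanks(text, restore):
    """Apply `restore` only to ordinary text, never to superblanks."""
    def handle(match):
        token = match.group(0)
        if token.startswith("[") or token.startswith("\\"):
            return token  # superblank or escaped character: keep as is
        return restore(token)
    return TOKEN.sub(handle, text)


def toy_restore(chunk):
    """Hypothetical restoration step, hard-coded for one phrase."""
    return chunk.replace("pol litra", "pół litra")


print(restore_outside_superblanks("[<b>]pol litra[</b>]", toy_restore))
```

Text inside a superblank is passed through verbatim even if it happens to look like a word that could be restored.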

Thanks to the work of Kevin Scannell we already have a Perl script which does the work for us. A drawback of his script is that it's... a script. It can serve us as a reference, though.

pipeline: The application can be introduced into the Apertium pipeline between the deformatter and the morphological analyser. To this end we will modify the modes.xml files. Diacritic restoration would be the first program in the pipeline. (thanks unhammer!)

details: We are in fact to write 3 applications performing the same task [9] (LLU, LL2, FS) but in different ways. The first two are word-based and rely heavily on dictionaries: LLU is a unigram model and LL2 is a bigram model. The third, FS, is letter/grapheme-based. The dictionary-based algorithms are generally better. Their disadvantage, however, is that they cannot provide an answer for a word never seen in the dictionary, and they work only if the dictionary is big enough.

Each app will share the same interface. Each one should allow at least for:

  • learning a new language (i.e. taking as input a corpus in some language and transforming it into the data necessary for diacritic restoration)
  • diacritic restoration of a text (the output of the deformatter) given its language

learning: Each app will be able to 'learn a new language', i.e. it will read an XML corpus (with correct diacritics) in a given language and produce a binary output file. The XML corpora might either be gathered from Wikipedia, or we may reuse the data already gathered by Kevin (we have the binary data used by his charlifter).
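As a rough illustration of the learning step, the Python sketch below turns a list of correctly spelled words into relative frequencies keyed by the asciified form. It is only a sketch under assumptions: the real app would read an XML corpus and write a binary file, the names `asciify` and `learn` are invented here, and the table of letters without a Unicode decomposition (such as Polish 'ł') is far from complete.

```python
import unicodedata
from collections import Counter, defaultdict

# Letters such as 'ł' do not decompose into base letter + combining mark,
# so they need an explicit mapping. Illustrative, incomplete table.
NO_DECOMPOSITION = str.maketrans({"ł": "l", "Ł": "L", "ø": "o", "Ø": "O"})


def asciify(word):
    """Strip diacritics: map special letters, decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word.translate(NO_DECOMPOSITION))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def learn(corpus_words):
    """Map each asciified form to the relative frequency of its spellings."""
    counts = defaultdict(Counter)
    for word in corpus_words:
        counts[asciify(word)][word] += 1
    return {
        key: {form: n / sum(forms.values()) for form, n in forms.items()}
        for key, forms in counts.items()
    }


model = learn(["pół", "pół", "pół", "pol", "litra"])
print(model["pol"])
```

Here the toy corpus contains 'pół' three times and 'pol' once, so the asciified key 'pol' maps to both spellings with frequencies 0.75 and 0.25.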

jimregan suggested, and I like that idea, that the 'inner pipeline' should look like this: corpus -> xml frequencies -> fst data along with probabilities/frequencies. The fst data will say something like 'pol' -> 'pół' (0.99), 'pol' -> 'pol' (0.01) or (in LL2) 'pol litra' -> 'pół litra' (0.999), 'pol litra' -> 'pol litra' (0.001) (the details of FSTProcessor are to be investigated).

The diacritic restoration part will then simply read the fst data along with the probabilities, so it can focus directly on restoring diacritics.
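The lookup described above can be sketched in Python as follows. The tables are plain dictionaries holding only the probabilities quoted in the example (the real data would live in an fst), the input is assumed to be already asciified and tokenised, and the greedy bigram-with-unigram-fallback strategy in `restore_ll2` is an assumption made for illustration, not Kevin Scannell's algorithm.

```python
# Toy tables with the probabilities from the example above.
UNIGRAMS = {"pol": {"pół": 0.99, "pol": 0.01}}
BIGRAMS = {
    ("pol", "litra"): {("pół", "litra"): 0.999, ("pol", "litra"): 0.001},
}


def best(candidates):
    """Return the candidate with the highest probability."""
    return max(candidates, key=candidates.get)


def restore_llu(words):
    """Unigram restoration: unknown words are passed through unchanged."""
    return [best(UNIGRAMS[w]) if w in UNIGRAMS else w for w in words]


def restore_ll2(words):
    """Greedy bigram restoration that falls back to the unigram table."""
    out = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in BIGRAMS:
            out.extend(best(BIGRAMS[pair]))
            i += 2
        else:
            out.extend(restore_llu(words[i:i + 1]))
            i += 1
    return out


print(restore_llu(["pol", "litra"]))
print(restore_ll2(["pol", "litra", "mleka"]))
```

A word missing from both tables, such as 'mleka' here, is simply copied to the output, which is exactly the weakness of the dictionary-based approaches mentioned earlier.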

evaluation: Once all three are implemented, we should run automatic performance tests in order to choose the best combination of the three and build on top of them a meta-application (CMB) combining them in the best possible way for a given language.

We are to write a simple script counting the number of words whose diacritics were restored correctly and incorrectly (or computing precision and recall).

  • The tests should be composed of parts of the corpus which were never seen by the app.
  • Each test set will be composed of the original file (with the right diacritics) and an input file (asciified or partially asciified).
  • In the case of languages with little training data we will perform 10-fold cross-validation.
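The counting script and the cross-validation split could look roughly like the Python sketch below. It assumes that restoration never changes the number of whitespace-separated words, and the function names are invented for this illustration.

```python
def word_accuracy(reference, restored):
    """Return (correct, incorrect) word counts for one restored text."""
    ref_words = reference.split()
    out_words = restored.split()
    if len(ref_words) != len(out_words):
        raise ValueError("restoration must not change the number of words")
    correct = sum(r == o for r, o in zip(ref_words, out_words))
    return correct, len(ref_words) - correct


def k_folds(sentences, k=10):
    """Yield (train, test) pairs; each sentence is tested exactly once."""
    for i in range(k):
        test = sentences[i::k]
        train = [s for j, s in enumerate(sentences) if j % k != i]
        yield train, test


print(word_accuracy("z piękną polką", "z piekna polką"))
```

In the example the second word was left without its diacritics, so two words count as correct and one as incorrect. With `k_folds`, the model is trained on `train` and evaluated on `test` for each fold, so no test sentence is ever seen during training.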

MT evaluation: Finally, one should check whether the app improves or deteriorates MT. It is quite probable that unicodifying a file whose diacritics were mostly correct may deteriorate the quality of translation, but it is worth checking. See also the "texts partially deprived of diacritics" idea. To this end we will use the apertium-eval-translator tool to measure Word Error Rate (WER).

post PoS evaluation: We can also consider measuring the robustness of the PoS tagger with and without diacritic restoration. If we decide to implement some kind of diacritic restoration built into one of the existing apps in the pipeline, this measure will help us compare the different approaches.
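For readers unfamiliar with the metric, WER is the word-level edit distance between a reference translation and the system output, divided by the length of the reference. The Python sketch below shows the standard computation; it is not the code of apertium-eval-translator, and the function name is chosen here for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("the reference must contain at least one word")
    # previous[j] = edit distance between the processed reference prefix
    # and the first j words of the hypothesis.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("a b c", "a x c"))
```

Comparing the WER of translations produced with and without the restoration step on the same test set would show whether the step helps or hurts.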

Some ideas and remarks

Texts partially deprived of diacritics

In real-world applications it might very well be that the input file is only partially deprived of diacritics. We could asciify the file completely before processing, but it seems important to take advantage of the diacritics that are given.

The assumption that the diacritics given are the right ones seems plausible; instead of asciifying the file, we can employ a lazy approach and (roughly speaking) asciify only if we can't find any other solution for a given word (in a given context).

How about text with the wrong diacritics? e.g. seeing ǎ where it should be ă ? - Francis Tyers 20:17, 2 April 2010 (UTC)

For example, Francis,

  • if we see 'połka' we look for words whose asciification yields 'polka' and whose third letter is 'ł'.

And we get 'półka' or 'półką' (but not: 'polka' or 'polką')

  • if, however, we see 'pólska' (i.e. the diacritics are wrong) then we can't find any relevant word. We then asciify it to 'polska' and find two possible answers: 'polską' and 'polska'
  • finally, if we see a word whose diacritics are wrong but we do find a word as in the first example above then... well... no one is perfect :-) it is just a heuristic which might help with some languages but deteriorate results in others.

If we were to consider such cases we would have to build a system that corrects spelling errors.
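The lazy approach from the examples above can be sketched in Python as follows. The lexicon is a toy word list, all words are assumed to be in composed (NFC) form so that letters can be compared position by position, and the function names are invented for this illustration.

```python
import unicodedata

# 'ł' has no Unicode decomposition, so it is mapped explicitly
# (illustrative, incomplete table).
NO_DECOMPOSITION = str.maketrans({"ł": "l", "Ł": "L"})


def asciify(word):
    decomposed = unicodedata.normalize("NFD", word.translate(NO_DECOMPOSITION))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def keeps_given_diacritics(seen, candidate):
    """True if every non-ASCII letter of `seen` is unchanged in `candidate`."""
    return len(seen) == len(candidate) and all(
        s == c for s, c in zip(seen, candidate) if not s.isascii()
    )


def candidates(seen, lexicon):
    """Prefer words that respect the given diacritics; otherwise asciify."""
    seen = unicodedata.normalize("NFC", seen)
    same_skeleton = [w for w in lexicon if asciify(w) == asciify(seen)]
    strict = [w for w in same_skeleton if keeps_given_diacritics(seen, w)]
    return strict or same_skeleton


LEXICON = ["polka", "polką", "półka", "półką", "polska", "polską"]
print(candidates("połka", LEXICON))
print(candidates("pólska", LEXICON))
```

For 'połka' only the words that keep the 'ł' survive, while for 'pólska' no word keeps the 'ó', so the function falls back to every word with the same asciified form.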

Ideas to improve LL

WSD: Although Kevin Scannell is not sure whether my proposal will give us any improvement, I am keen to check the impact of applying Word Sense Disambiguation methods to the LL algorithm. Of course the algorithm might work only if we have a big enough dictionary (which is also the case for ordinary LL and LL2).

dictionary asciification: Our aim is to improve Apertium in such a way that it can translate a text even if it is partially deprived of diacritics. Therefore, it is reasonable to check whether simple asciification of the training corpus and dictionary would yield comparable or better results.

In some languages it might help a lot. Consider a noun phrase without diacritics: 'z piekna polka'. It is obvious to anyone speaking Polish that it should be converted to 'z piękną polką', because 'z' is followed by the instrumental. That information, however, is known to the PoS tagger, not to our app.

Obviously using such asciified dictionaries might in turn deteriorate overall performance of Apertium. But that is also the case with apps in the pipeline. If we choose to try that approach we will need an evaluation method to compare it with the standard approach. To this end, we might use the post PoS evaluation proposed above.

Investigating occurring errors

It is tempting to look in detail at the output of each of the algorithms to figure out what kinds of errors are made. E.g. for the LL and LL2 algorithms one can foresee the following kinds of errors:

  0. a word is misspelled
  1. the asciified word is spelled correctly but has never occurred in the dictionary
  2. two or more unicodifications of an asciified word occur in the dictionary in the same context

The last three proposals are rather "low priority", to be done if time allows.

A detailed work plan

Deliverables

  • charlifter in C++
    • LLU
    • LL2
    • FS

each of the above needs:

    • data sets for training (source: Wikipedia, for each language!)
      • write/get a Wikipedia crawler and/or reuse data gathered by Kevin Scannell
        • and create xml data store gathering all the information needed
        • and create fst data store from xmls
      • write/get app for (partial) asciification
      • pass them through deformatter
  • evaluation
    • a bunch of scripts for automatic evaluation
      • standard evaluation (as described above; generally, as in Kevin's charlifter now)
      • post MT evaluation (as described above)
      • post PoS tagging evaluation (as described above, if we choose to do it)
    • results of evaluation for each language => choice of the best combination of apps for the particular language
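The (partial) asciification tool listed among the deliverables could be as simple as the Python sketch below, which strips the diacritic from each accented letter with a given probability. The parameter names, the fixed random seed and the small table of letters without a Unicode decomposition are assumptions made for this illustration.

```python
import random
import unicodedata

# 'ł' has no Unicode decomposition, so it is mapped explicitly
# (illustrative, incomplete table).
NO_DECOMPOSITION = str.maketrans({"ł": "l", "Ł": "L"})


def asciify(text):
    decomposed = unicodedata.normalize("NFD", text.translate(NO_DECOMPOSITION))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def partially_asciify(text, rate, seed=0):
    """Asciify each non-ASCII letter independently with probability `rate`."""
    rng = random.Random(seed)  # seeded so that test sets are reproducible
    text = unicodedata.normalize("NFC", text)
    return "".join(
        asciify(ch) if not ch.isascii() and rng.random() < rate else ch
        for ch in text
    )


print(partially_asciify("z piękną półką", rate=1.0))
print(partially_asciify("z piękną półką", rate=0.5))
```

A rate of 1.0 produces a fully asciified input file and a rate of 0.0 leaves the text unchanged; values in between give the partially asciified test files.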


The plan is to work on a daily basis with 2 different languages (say French and Polish) and to start introducing other languages only after everything is set up.

Thanks to that approach, we will have things up and running quickly.

The same applies to different diacritic restoration approaches: I wish to start with LLU (since, arguably, it is the simplest app to be coded) and only when all the other steps are done (i.e. we have evaluation scripts etc.) I will start coding other apps.

The first four weeks are planned in more detail. Every four weeks a plan for next four weeks will be released (that will take into consideration the lessons learned from previous weeks) (I don't have much experience in such planning either...)

Since there is still some stuff to be set up and to be discussed only the parts which are sure are included in the timeline below.

Time Line

Community Bonding Period: Weeks -3 to 1, April 27 - May 30 (I have some commitments then, so my involvement may not be very high)

  • gathering data sets for training (at least French and Polish)
  • checking which existing Apertium code I can reuse (e.g. FSTProcessor) and how to do it properly. Playing with it a bit.
  • unit tests for LLU correctness
  • plan of how to implement LLU
  • write a script which converts the current charlifter's binary data files into a more human-readable and editable format, possibly XML.

At the end of the period:

  • plan for next 12 weeks revisited

Coding Period
Week 2: May 31 - June 6:

  • LLU is up and running
  • plan for writing the bunch of scripts

Week 3: June 7 - June 13

  • writing the bunch of scripts (limited to standard evaluation, though)
  • correcting bugs in LLU

Deliverable: charlifter in C++ (with LLU implemented)

Week 4: June 14 - June 20

  • performing standard evaluation (at least for French and Polish)
  • putting app in the pipeline
  • plan for next 8 weeks revisited


Week 5-8: June 21 - July 18

  • working on LL2 and FS
  • evaluation performed, the right parameters chosen
  • plan for next 4 weeks revisited

Deliverable:

  • charlifter in C++ for at least 2 languages
  • MT evaluation done

Week 9-12: July 19 - August 16:

  • working on other languages

Non-Summer-of-Code plans

Until May 30th -> classes (10h/week, half-time job 20h/week)
After May 30th -> free of commitments. I may need 3-4 free days in June due to exams.

List your skills and give evidence of your qualifications

Education: 2004-March 2010: I have just graduated in mathematics from the University of Warsaw [10] (I got 5 on my MSc diploma on a 2-5 scale; 5 is the best mark in Poland). I spent 10 months on Erasmus at Ecole Polytechnique (Palaiseau) and had some brilliant results.

2005-now: pursuing an MSc in computer science at the same faculty (expected graduation date: June 2011).
My GPA in both subjects exceeds 4.6 (5 being the best mark, 2 the worst one; top 10% of students every year).

2003: finalist of Polish Olympiad in Informatics

2003,2002: finalist of Polish Olympiad in Mathematics

Relevant Coursework:

  • Object-Oriented Programming (A)
  • Algorithmics and Data Structures (B+)
  • Computational Mathematics (A)
  • Approximate Reasoning (B+)
  • Probability Theory I (B+)
  • Probability Theory II (A!)


Right now I am looking for a good opportunity to start PhD studies on problems related to machine learning and/or computational linguistics in general.



Linguistics: Last semester I had the opportunity to take part in the course "Linguistic Engineering -- Words" conducted by Adam Przepiórkowski [nlp.ipipan.waw.pl/~adamp/]. Right now I am taking the second part of that course. Mr Przepiórkowski has a very good opinion of me, my skills and the project I did during the course[11]. See: [12]

C++: My biggest extra-academic experience with C++ was last summer, during an internship at ICM [13]. We were to port an app from Fortran to C++ and then to CUDA. We managed the first part; the second turned out to be a bit more complicated, and we resumed work on it last month.

Open-Source experience: I have, however, no experience in Open-Source projects. It might be seen as a shortcoming of my proposal. On the other hand, I am keen to join a fine community like Apertium seems to be and GSOC is a great opportunity to do so and to gain this experience :-)

Team work: A thing I am truly proud of is my presidency of the MIMUW[14] Students Government[15] between November 2009 and March 2010.
We managed to make a long jump from an almost non-existent SG with a single active person to an SG where 10 people (or 1% of the students) are highly involved and another 10-15 support it from time to time. An SG which works like a small company :-)
Here[16] you can find some statistics.

References: Adam Przepiórkowski, PhD [17] Andrzej Skowron, prof. [18] Tadeusz Mostowski, prof. [19]