Difference between revisions of "User:Mjaskowski"


Revision as of 20:31, 2 April 2010

Name: Maciej Jaśkowski

E-mail address: maciej.jaskowski on gmail account

I live in Poland, so I am on Central European Time (CET)


Why is it you are interested in machine translation?

linguistics (course, 4 languages), data mining, machine learning
Google Translate
combines my interests

Why is it that they are interested in the Apertium project?

how did I find Apertium on GSoC? (computational linguistics)
combines my interests
open source (cool!), joining the community
experience -> PhD


Which of the published tasks are you interested in?

"Accent and diacritic restoration"

Reasons why Google and Apertium should sponsor it

A description of how and who it will benefit in society

Understanding of the problem

We are to write an application (in C++) which takes text as input (the output of the deformatter) and outputs text in UTF-8 with diacritics restored, leaving the superblanks untouched.
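A minimal sketch of the "leave the superblanks untouched" requirement, under the assumption that superblanks arrive enclosed in unescaped square brackets and that a backslash escapes the following character. The function name `process_stream` and the `restore` callback are hypothetical; `restore` stands for whichever restoration algorithm is plugged in.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Applies `restore` to the text outside superblanks and copies superblanks
// verbatim. Assumption: a superblank is delimited by unescaped '[' and ']',
// and '\' escapes the next character.
std::string process_stream(
    const std::string& input,
    const std::function<std::string(const std::string&)>& restore) {
    std::string output;
    std::string text;      // text collected outside superblanks
    bool in_blank = false;

    for (std::size_t i = 0; i < input.size(); ++i) {
        const char c = input[i];
        if (c == '\\' && i + 1 < input.size()) {
            // Escaped character: keep both bytes, never treat as a delimiter.
            std::string& target = in_blank ? output : text;
            target += c;
            target += input[++i];
        } else if (!in_blank && c == '[') {
            output += restore(text);  // flush the text seen so far
            text.clear();
            output += c;
            in_blank = true;
        } else if (in_blank && c == ']') {
            output += c;
            in_blank = false;
        } else if (in_blank) {
            output += c;              // superblank content is copied as-is
        } else {
            text += c;
        }
    }
    output += restore(text);          // flush trailing text
    return output;
}
```

Note that escape sequences outside superblanks are passed to `restore` unchanged, so the callback has to tolerate backslashes in its input.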

pipeline
As such, the application can be introduced into the Apertium pipeline between the deformatter and the morphological analyser. Changes to the apertium-header.sh script are therefore necessary; we need to introduce a new switch that includes the application in the pipeline.

details
In fact, we are to write three applications performing the same task [1] (LL, LL2, FS). The first two are dictionary-based and work at the word level; the last is letter-based. The dictionary-based algorithms are generally better. Their biggest disadvantage is that they cannot provide an answer for a word that has never been seen in the dictionary, and they work only if the dictionary is big enough.
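To illustrate the dictionary-based idea and its main weakness, here is a rough sketch of a word-level lookup: each ASCII key maps to the restored forms seen in training, and the most frequent one is chosen. This is only an assumption about how such a lookup could be organised, not the algorithm from [1]; the names `Lexicon`, `add_word` and `restore_word` are hypothetical, and the caller is assumed to supply the ascified key.

```cpp
#include <map>
#include <string>
#include <unordered_map>

// ASCII key -> (restored form -> number of times seen in training data).
// The inner std::map keeps iteration order deterministic.
using Lexicon = std::unordered_map<std::string, std::map<std::string, int>>;

// Records one occurrence of `form` under its ascified key.
void add_word(Lexicon& lexicon, const std::string& ascii_key,
              const std::string& form) {
    ++lexicon[ascii_key][form];
}

// Returns the most frequent restored form for `ascii_word`.
// An unseen word is returned unchanged: the dictionary has no answer for it.
std::string restore_word(const Lexicon& lexicon, const std::string& ascii_word) {
    const auto entry = lexicon.find(ascii_word);
    if (entry == lexicon.end()) {
        return ascii_word;
    }
    const std::string* best = nullptr;
    int best_count = 0;
    for (const auto& candidate : entry->second) {
        if (candidate.second > best_count) {  // ties: first in sorted order wins
            best = &candidate.first;
            best_count = candidate.second;
        }
    }
    return best != nullptr ? *best : ascii_word;
}
```

The fallback branch is exactly the case where a letter-based method would have to take over.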

evaluation
Once they are all implemented, we should run automatic performance tests in order to choose the best combination of the three and build on top of them a meta-application (CMB) combining them in the best possible way for a given language.

Thanks to the work of Kevin Scannell, we already have a Perl script which does the work for us. A drawback of his script is that it's... a script. We can, however, modify it slightly so that it does not unicodify but only measures performance, given the original (UTF-8) file from which the input (ASCII) file was derived and the output produced by our application.
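The measurement itself can be sketched as a word-by-word comparison of the original UTF-8 text with the restored output. This is not the Perl script's logic; it is an illustration that assumes whitespace tokenisation and that restoration never changes the number of words. `Score`, `split_words` and `score_restoration` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits text on whitespace.
std::vector<std::string> split_words(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) {
        words.push_back(word);
    }
    return words;
}

struct Score {
    std::size_t total = 0;    // words in the reference
    std::size_t correct = 0;  // words restored exactly
    double accuracy() const {
        return total == 0 ? 0.0 : static_cast<double>(correct) / total;
    }
};

// Compares the restored text with the reference position by position.
// Words missing from the restored text count as errors.
Score score_restoration(const std::string& reference,
                        const std::string& restored) {
    const std::vector<std::string> ref = split_words(reference);
    const std::vector<std::string> out = split_words(restored);
    Score score;
    score.total = ref.size();
    const std::size_t n = std::min(ref.size(), out.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (ref[i] == out[i]) {
            ++score.correct;
        }
    }
    return score;
}
```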

MT evaluation
Finally, one should check whether the application improves or degrades MT. It may well be that unicodifying a file whose diacritics were mostly correct degrades the quality of translation, but it is worth checking. See also the "texts partially deprived of diacritics" idea. To this end we will use the apertium-eval-translator tool to measure Word Error Rate (WER).
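For reference, WER is commonly defined as the word-level edit distance between a hypothesis and a reference translation, divided by the number of words in the reference. The sketch below shows that definition only; it is not the code of apertium-eval-translator, and the whitespace tokenisation and the handling of an empty reference are assumptions made here.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits text on whitespace (assumed tokenisation for this sketch).
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) {
        words.push_back(word);
    }
    return words;
}

// Word-level Levenshtein distance divided by the reference length.
double word_error_rate(const std::string& reference,
                       const std::string& hypothesis) {
    const std::vector<std::string> ref = tokenize(reference);
    const std::vector<std::string> hyp = tokenize(hypothesis);
    if (ref.empty()) {
        return hyp.empty() ? 0.0 : 1.0;  // convention chosen for this sketch
    }

    // prev[j]: distance between the first i-1 reference words and the
    // first j hypothesis words; curr[j] is the same for i reference words.
    std::vector<std::size_t> prev(hyp.size() + 1), curr(hyp.size() + 1);
    for (std::size_t j = 0; j <= hyp.size(); ++j) {
        prev[j] = j;
    }
    for (std::size_t i = 1; i <= ref.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= hyp.size(); ++j) {
            const std::size_t substitution =
                prev[j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
            const std::size_t deletion = prev[j] + 1;
            const std::size_t insertion = curr[j - 1] + 1;
            curr[j] = std::min({substitution, deletion, insertion});
        }
        std::swap(prev, curr);
    }
    return static_cast<double>(prev[hyp.size()]) / ref.size();
}
```

Comparing the WER of translations produced with and without the restoration step would show whether the step helps or hurts.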

Some ideas and remarks

texts partially deprived of diacritics
In real-world applications it may well be that the input file is only partially deprived of diacritics. We could ascify the file completely before processing, but it seems important to take advantage of the diacritics given.

The assumption that the diacritics given are the right ones seems plausible; instead of ascifying the whole file, we can employ a lazy approach and (roughly speaking) ascify a word only if we cannot find any other solution for it in the given context.

How about text with the wrong diacritics? e.g. seeing ǎ where it should be ă ? - Francis Tyers 20:17, 2 April 2010 (UTC)
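One possible reading of the lazy approach, sketched per word and ignoring context: a word that already carries diacritics and is a known form is trusted; a plain ASCII word is restored directly; only a word with diacritics that is not a known form is ascified first. All names here (`restore_lazily`, `known_forms`, `ascify`, `restore`) are hypothetical, and the language-specific `ascify` and `restore` steps are assumed to be supplied by the caller.

```cpp
#include <functional>
#include <string>
#include <unordered_set>

using Transform = std::function<std::string(const std::string&)>;

// True if the UTF-8 string contains any byte outside the ASCII range.
bool has_non_ascii(const std::string& word) {
    for (const char c : word) {
        if (static_cast<unsigned char>(c) >= 0x80) {
            return true;
        }
    }
    return false;
}

std::string restore_lazily(const std::string& word,
                           const std::unordered_set<std::string>& known_forms,
                           const Transform& ascify,
                           const Transform& restore) {
    if (!has_non_ascii(word)) {
        return restore(word);          // nothing to preserve: restore directly
    }
    if (known_forms.count(word) > 0) {
        return word;                   // given diacritics form a known word
    }
    return restore(ascify(word));      // last resort: drop them and retry
}
```

Under this reading, a word with a wrong diacritic would reach the last branch only if it is not itself a known form.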

ideas to improve LL
Although Kevin Scannell is not sure whether my proposal will give us any improvement, I am keen to check the impact of applying Word Sense Disambiguation methods to the LL algorithm. Of course, the algorithm can work only if we have a big enough dictionary (which is also the case for ordinary LL and LL2).

investigating occurring errors
It is tempting to look in detail at the output of each of the algorithms to figure out what kinds of errors are made. E.g. for the LL and LL2 algorithms one can foresee the following kinds of errors:
0. a word is misspelled
1. the ascified word is spelled correctly but has never occurred in the dictionary
2. two or more unicodifications of an ascified word occur in the dictionary in the same context

The last two proposals are rather "low priority", to be done if time permits.

A detailed work plan

can the Perl data be read from C++?
a plugin for OpenOffice? --> ooo-something?

Time Line:
Community Bonding Period
Week 1: April 27 - May 2

Week 2: May 3 - May 9

Week 3: May 10 - May 16

Week 4: May 17 - May 23


Coding Period
Week 5: May 24 - May 30

Week 6: May 31 - June 6

Week 7: June 7 - June 13

Week 8: June 14 - June 20

Deliverable:

Week 9: June 21 - June 27

Week 10: June 28 - July 4

Week 11: July 5 - July 11

Week 12: July 12 - July 18 (Mid-term Evaluation)

Deliverable:

Week 13: July 19 - July 25

Week 14: July 26 - August 1

Week 15: August 2 - August 8

Week 16: August 9 - August 16

Final Evaluation: August 9 - August 16



non-summer-of-Code plans

until May 30th -> classes (10h/week), half-time job (20h/week)
after May 30th -> free of commitments. I may need 3-4 free days in June due to exams.

List your skills and give evidence of your qualifications

MSc in math (5, the best grade), Ecole Polytechnique
pursuing MSc in computer science (graduation date: June 2011)
PhD -> machine learning and/or computational linguistics
ICM (C++), Paris V (research), GPA, IL -> hardest project done

No experience in Open-Source projects -- I am keen to join!