User:Jalopeura/GSOC2010Application

From Apertium

Name

Sean Healy

E-mail address

Gmail: sean.max

Hotmail: jalopeura

Other information that may be useful to contact you

IRC: SeanH

Why is it you are interested in machine translation?

I recently went back to school after eight years as a professional programmer. I decided to combine my professional abilities with the interest I've always had in languages and go into Natural Language Processing. I am currently in the first year of my Master's degree program.

Why is it that you are interested in the Apertium project?

I am interested in seeing how people outside of my particular academic program are doing Machine Translation. Apertium's open-source nature means I can work on it without being a student at a particular university or an employee of a particular company.

Which of the published tasks are you interested in? What do you plan to do?

Title

French-Portuguese language pair for Apertium

Why Google and Apertium should sponsor it

A new language pair is always good for Apertium's visibility; as one Apertium contributor put it, language pairs are Apertium's "bread and butter". French is the third most widely spoken Romance language, after Spanish and Portuguese. As such, within the domain of Romance languages, pairings with French would seem to be the next logical target for Apertium.

How and who will it benefit in society

French is the third most widely used language in the European Union, after English and German (http://wapedia.mobi/en/Languages_of_the_European_Union?t=3.). Simply put, more information is available in French than in Portuguese in the EU. An open-source machine translation system from French to Portuguese would therefore give Portuguese speakers access to that information.

Work plan

Community bonding period (26.04-23.05): Familiarize myself with Apertium's dictionary and transfer-rule formats

  • Week 1 (24.05-30.05): Generate dictionaries using crossdics
  • Week 2 (31.05-06.06): Verify 100% coverage for closed categories and inflection paradigms, and 80% coverage otherwise, in the French monolingual dictionary
  • Week 3 (07.06-13.06): Verify 100% coverage for closed categories and inflection paradigms, and 80% coverage otherwise, in the Portuguese monolingual dictionary
  • Week 4 (14.06-20.06): Verify that all words from the monolingual dictionaries are present in the bilingual dictionary (using testvoc); copy the transfer rules from Spanish-French as a starting point
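
As a rough illustration of the coverage check planned for Weeks 2 and 3, the sketch below computes the share of known tokens in analyser output. It assumes the text has already been run through the monolingual analyser and is in the Apertium stream format, where each lexical unit looks like `^surface/analysis$` and an unknown word is marked with a leading `*`. The function name `coverage` and the sample sentence are hypothetical, and the regular expression ignores escaped characters, so treat this as a simplification rather than the project's actual tooling.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped "^" or "$" inside a unit are not handled.
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the fraction of lexical units that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        # Split the surface form from its analyses; unknown words look like "foo/*foo".
        _surface, _sep, analyses = unit.partition("/")
        if not analyses.startswith("*"):
            known += 1
    return known / len(units)


# Hypothetical analyser output: three known words and one unknown word.
sample = (
    "^le/le<det><def><m><sg>$ ^chat/chat<n><m><sg>$ "
    "^zorglub/*zorglub$ ^dort/dormir<vblex><pri><p3><sg>$"
)
print(f"{coverage(sample):.0%}")
```

Measured this way, the 80% target would mean that at most one token in five of a test corpus comes back marked as unknown.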

Deliverable #1: Dictionaries and first ("Spanishesque") version of translator

  • Week 5 (21.06-27.06): Transfer rules
  • Week 6 (28.06-04.07): Transfer rules
  • Week 7 (05.07-11.07): Transfer rules
  • Week 8 (12.07-18.07): Transfer rules

Deliverable #2: Second version of translator

  • Week 9 (19.07-25.07): Test on large blocks of text; debug rules and dictionaries, add entries as necessary
  • Week 10 (26.07-01.08): Continuation of testing
  • Week 11 (02.08-08.08): Generate statistics (correction rates); documentation
  • Week 12 (09.08-15.08): Final evaluation
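
The plan does not define "correction rates". One common reading, assumed here, is the word error rate (WER): the number of word insertions, deletions and substitutions needed to turn the translator's output into a corrected reference, divided by the length of the reference. The sketch below is a minimal version of that calculation; the function name and the example sentences are hypothetical.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level edit distance between hypothesis and reference, per reference word."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the hypothesis words seen so far
    # and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution_cost = 0 if hyp_word == ref_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # drop a hypothesis word
                    curr[j - 1] + 1,                  # add a reference word
                    prev[j - 1] + substitution_cost,  # keep or substitute
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: raw output versus a post-edited reference.
raw_output = "o gato dorme em o sofá"
post_edited = "o gato dorme no sofá"
print(word_error_rate(raw_output, post_edited))
```

A lower value means fewer corrections were needed, so the figure can be tracked across the testing weeks to see whether rule and dictionary fixes are helping.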


List your skills and give evidence of your qualifications

I have the following other language skills appropriate to my project idea:

French: Minored in it, good explicit knowledge of grammar, but until recently not much practice in speaking it with native speakers. However, I have been studying in France for the last six months and steadily improving.

Portuguese (Brazilian): Lived for three years with a Brazilian roommate while taking Portuguese classes; we spoke mostly Portuguese in the apartment.

As for programming, I know both Perl and PHP. I was a professional programmer for eight years before returning to school and have experience with additional technologies, but these seem the most relevant to the project.

I have contributed to multiple Perl modules, both through mailing-list discussions and through code contributions. I have also been following the development of the Haiku operating system. I have not yet contributed any code to that project, but I have written programs on the OS.

List any non-Summer-of-Code plans you have for the Summer

I have a large due June 2, and I must present it at the end of June, so I will have other obligations during Weeks 1, 2 and 6. I foresee no difficulties in finding 30 hours during Weeks 2 and 6, but during Week 1 I may be unable to spend a full 30 hours on this project. I have no other outside constraints on my time during the 12 weeks of GSOC 2010.