User:MaryX/



Why is it that you are interested in machine translation?

I've been interested in languages and how they're structured and how they relate to one another for a long time, and machine translation strikes me as a wonderfully sensible way to approach those questions.

Why is it that you are interested in the Apertium project?

The interest in Apertium follows from the interest in machine translation - I came across the program via the Google Summer of Code list of organizations, and it looked like a way to do something interesting (machine translation) in support of a good cause (I figure having good-quality machine translation readily available for lots of different languages is pretty crucial for making the internet accessible to people from all over the world).

Which of the published tasks are you interested in? What do you plan to do?

I plan to build the Hebrew-Arabic language pair, which currently doesn't exist. Hebrew and Arabic have a lot of grammar in common and often have identical syntax, making it easier to create a coherent machine translation between the two languages. I will also be able to adapt some of the material from the Maltese-Arabic and Maltese-Hebrew pairs, both of which are currently in staging. A Hebrew-Arabic pair would add to the body of resources for Semitic languages which, despite the large number of speakers, is comparatively small. A language pair involving Arabic could also be expanded to include various dialects of colloquial Arabic, since existing language resources are almost entirely restricted to Modern Standard Arabic. In terms of benefiting society, improving translation resources between Hebrew and Arabic would help enable increased individual dialogue and cultural exchange between Israelis and Arabs, which in turn could have positive repercussions for the Middle East peace process.

Work Plan

  • Week 0 (June 17-23): Familiarize myself with existing monolingual dictionaries (taken from mt-ar and mt-he) and make a list of parts needing expansion/completion; plan tests; make sure <sent>, <cm>, and <ij> are testvoc-clean
  • Week 1 (June 24-30): Expand bidix; work on verb scripts; make sure <num> and <pr> are testvoc-clean
  • Week 2 (July 1-7): Expand bidix (goal: 50% coverage); work on verb scripts; make sure <cnjcoo>, <cnjadv>, and <cnjsub> are testvoc-clean
  • Week 3 (July 8-14): Expand bidix; work on verb scripts; make sure <adv> is testvoc-clean
  • Week 4 (July 15-21): Expand bidix (goal: 60% coverage); make sure <prn> and <det> are testvoc-clean
  • Week 5 (July 22-28): Expand bidix; make sure <adj> is testvoc-clean
  • Week 6 (July 29 - August 4): Wildcard (use to work on whatever particular issues need extra attention); conduct tests and write midterm evaluation (due August 2)
  • Week 7 (August 5-11): Expand bidix; work on transfer rules; make sure <n> and <np> are testvoc-clean
  • Week 8 (August 12-18): Expand bidix (goal: 75% coverage); work on transfer rules
  • Week 9 (August 19-25): Work on transfer rules and disambiguation; make sure <v> is testvoc-clean
  • Week 10 (August 26 - September 1): Corpus testvoc
  • Week 11 (September 2-8): Wildcard (use to work on whatever particular issues need extra attention); all categories should be testvoc-clean
  • Week 12 (September 9-15): Run tests (goal: 80% bidix coverage); expand documentation; tie up loose ends
  • Week 13 (September 16-23): ["Pencils down" date is Sept. 23 at 19:00 UTC] Tidy up, clarify documentation, write final evaluation (due Sept. 27)
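
The per-category testvoc goals in the plan above could be tracked with a small script. The following is only a minimal sketch under stated assumptions: it takes the final translated surface forms, one per line, and relies on the Apertium convention that a leading `@` marks a lemma missing from the bidix and a leading `#` marks a failed generation. The function names and the transliterated toy data are hypothetical, and the real testvoc scripts in a language pair may work differently.

```python
import re

# Assumption: in the final output, a token starting with '@' (missing from
# the bidix) or '#' (generation failure) marks a testvoc error.
ERROR_MARK = re.compile(r"(?:^|\s)[@#]\S")


def dirty_lines(lines):
    """Return the output lines that contain at least one error marker."""
    return [line for line in lines if ERROR_MARK.search(line)]


def is_testvoc_clean(heb_to_ara_output, ara_to_heb_output):
    """A category counts as clean only if both directions are error-free."""
    return not dirty_lines(heb_to_ara_output) and not dirty_lines(ara_to_heb_output)


if __name__ == "__main__":
    # Hypothetical toy output for one category in each direction.
    heb_ara = ["bayt", "#kitab"]
    ara_heb = ["bayit", "sefer"]
    print(dirty_lines(heb_ara))
    print(is_testvoc_clean(heb_ara, ara_heb))
```

Checking both directions in one function mirrors the plan's requirement that a category is only done once Hebrew-to-Arabic and Arabic-to-Hebrew are both clean.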

Notes: I start fall classes on September 3, so I have written this plan with the expectation of working much more during Weeks 0-10 than in Weeks 11-13. The monodix and bidix will be expanded as needed even when this is not specified in the schedule. Bidix coverage refers to stems that exist in the bidix as well as in the Hebrew monodix and the Arabic monodix. Testvoc-clean means clean in both directions. This plan is based in large part on the Maltese-Arabic work plan for GSoC 2012.
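
To make the coverage goals concrete, here is a minimal Python sketch of one way the bidix coverage figure could be computed. The note above defines which stems count but not what they are measured against, so the sketch assumes coverage is taken over the stems of a Hebrew test corpus. The function name, the data layout, and the transliterated toy entries are all hypothetical.

```python
def bidix_coverage(corpus_stems, heb_monodix, bidix, ara_monodix):
    """Fraction of corpus stems found in all three dictionaries.

    A stem counts as covered when it is in the Hebrew monodix, has a bidix
    entry, and at least one of its translations is in the Arabic monodix.
    Assumption: the denominator is the number of corpus stems.
    """
    if not corpus_stems:
        return 0.0
    covered = 0
    for stem in corpus_stems:
        translations = bidix.get(stem, set())
        if stem in heb_monodix and any(t in ara_monodix for t in translations):
            covered += 1
    return covered / len(corpus_stems)


if __name__ == "__main__":
    # Hypothetical toy dictionaries using transliterated stems.
    corpus = ["bayit", "sefer", "shalom", "kelev"]
    heb = {"bayit", "sefer", "shalom"}
    bidix = {"bayit": {"bayt"}, "sefer": {"kitab"}}
    ara = {"bayt"}
    print(f"{bidix_coverage(corpus, heb, bidix, ara):.0%}")
```

In this toy data only "bayit" is present in all three dictionaries, so one of the four corpus stems is covered.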

Skills and Qualifications

I'm majoring in religious studies and vacillating between a major and a minor in math as well - not the most obviously applicable disciplines, I know, but I've been supplementing the math with computer science classes and my religious studies major involves a lot of work with languages. I've been learning German since middle school and in college I've studied Hebrew (both biblical and modern) and Arabic. Last summer I started building a website (to be launched for beta testing this fall, hopefully) that serves as a platform for collaborative grammatical analysis of the Hebrew Bible - since the grammatical interpretation of the text is often ambiguous, the site allows users to hand-tag words based on their grammatical features and also compare interpretations with other users and published sources. However, I haven't programmed in an open-source project before.

Coding Challenge

I am currently in the process of doing the coding challenge described here. My work so far can be found at http://apertium.svn.sourceforge.net/viewvc/apertium/incubator/apertium-ara-heb/. Unfortunately, I've been rather busy with classes these past few weeks and it doesn't look like I'll be able to translate the entire text, so I've been focusing on the first three paragraphs in the Hebrew-to-Arabic direction.

Summer Plans

I will be traveling June 1 - June 20, and I return to school for the fall at the start of September. The time in between is currently completely unscheduled, so I can devote as much of my time to the project as is needed. (I will probably wind up doing a few other small things as well, but these can be scheduled around the project.) Once I start classes I will have much less free time, so my general idea is to get almost all of the work done before then, and use those last few weeks for tying up loose ends and making sure everything is documented.

Contact Info

I'm reachable via this wiki, via the #apertium IRC (also as MaryX), and by email at miss [dot] mary [dot] x [at] gmail [dot] com.