Difference between revisions of "User:MaryX/"
Latest revision as of 13:28, 14 May 2013
Why is it you are interested in machine translation?
I've been interested in languages and how they're structured and how they relate to one another for a long time, and machine translation strikes me as a wonderfully sensible way to approach those questions.
Why is it that you are interested in the Apertium project?
The interest in Apertium follows from the interest in machine translation - I came across the program via the Google Summer of Code list of organizations, and it looked like a way to do something interesting (machine translation) in support of a good cause (I figure having good-quality machine translation readily available for lots of different languages is pretty crucial for making the internet accessible to people from all over the world).
Which of the published tasks are you interested in? What do you plan to do?
I plan to build the Hebrew-Arabic language pair, which currently doesn't exist. Hebrew and Arabic have a lot of grammar in common and often have identical syntax, making it easier to create a coherent machine translation between the two languages. I will also be able to adapt some of the material from the Maltese-Arabic and Maltese-Hebrew pairs, both of which are currently in staging. A Hebrew-Arabic pair would add to the body of resources for Semitic languages which, despite the large number of speakers, is comparatively small. A language pair involving Arabic could also be expanded to include various dialects of colloquial Arabic, since existing language resources are almost entirely restricted to Modern Standard Arabic. In terms of benefiting society, improving translation resources between Hebrew and Arabic would help enable increased individual dialogue and cultural exchange between Israelis and Arabs, which in turn could have positive repercussions for the Middle East peace process.
Work Plan
- Week 0 (June 17-23): Familiarize self with existing monolingual dictionaries (taken from mt-ar and mt-he) and make list of parts needing expansion/completion; plan tests; make sure <sent>, <cm>, and <ij> are testvoc-clean
- Week 1 (June 24-30): Expand bidix; work on verb scripts; make sure <num> and <pr> are testvoc-clean
- Week 2 (July 1-7): Expand bidix (goal: 50% coverage); work on verb scripts; make sure <cnjcoo>, <cnjadv>, and <cnjsub> are testvoc-clean
- Week 3 (July 8-14): Expand bidix; work on verb scripts; make sure <adv> is testvoc-clean
- Week 4 (July 15-21): Expand bidix (goal: 60% coverage); make sure <prn> and <det> are testvoc-clean
- Week 5 (July 22-28): Expand bidix; make sure <adj> is testvoc-clean
- Week 6 (July 29 - August 4): Wildcard (use to work on whatever particular issues need extra attention); conduct tests and write midterm evaluation (due August 2)
- Week 7 (August 5-11): Expand bidix; work on transfer rules; make sure <n> and <np> are testvoc-clean
- Week 8 (August 12-18): Expand bidix (goal: 75% coverage); work on transfer rules
- Week 9 (August 19-25): Work on transfer rules and disambiguation; make sure <v> is testvoc-clean
- Week 10 (August 26 - September 1): Corpus testvoc
- Week 11 (September 2-8): Wildcard (use to work on whatever particular issues need extra attention); all categories should be testvoc-clean
- Week 12 (September 9-15): Run tests (goal: 80% bidix coverage); expand documentation; tie up loose ends
- Week 13 (September 16-23): ["Pencils down" date is Sept. 23 at 19:00 UTC] Tidy up, clarify documentation, write final evaluation (due Sept. 27)
Notes: I start fall classes on September 3, so I have written this plan expecting to work much more during Weeks 0-10 than in Weeks 11-13. The monodix and bidix will be extended as needed even when this is not specified in the schedule. Bidix coverage refers to stems that exist in the bidix as well as in the Hebrew monodix and the Arabic monodix. Testvoc-clean means clean in both directions. This plan is based in large part on the Maltese-Arabic work plan for GSoC 2012.
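To make the coverage goals above concrete, here is a minimal sketch of how the bidix coverage figure could be computed. It is only an illustration: the function names, the transliterated toy stems, and the choice of a test-corpus lemma list as the denominator are all assumptions, not part of Apertium or of this plan.

```python
def covered_stems(bidix_pairs, heb_monodix, ara_monodix):
    """Return the Hebrew stems that count as covered.

    A stem counts only if its bidix entry links a stem present in the
    Hebrew monodix to a stem present in the Arabic monodix.
    """
    return {
        heb
        for heb, ara in bidix_pairs
        if heb in heb_monodix and ara in ara_monodix
    }


def coverage(corpus_lemmas, covered):
    """Fraction of corpus lemmas that are covered (0.0 for an empty corpus).

    Assumption: coverage is measured over the lemmas of a test corpus.
    """
    if not corpus_lemmas:
        return 0.0
    hits = sum(1 for lemma in corpus_lemmas if lemma in covered)
    return hits / len(corpus_lemmas)


# Hypothetical toy data (transliterated stems, not real dictionary content).
heb_monodix = {"bayit", "sefer", "yeled"}
ara_monodix = {"bayt", "kitab"}
bidix_pairs = [("bayit", "bayt"), ("sefer", "kitab"), ("yeled", "walad")]

covered = covered_stems(bidix_pairs, heb_monodix, ara_monodix)
corpus_lemmas = ["bayit", "sefer", "yeled", "kelev"]
print(f"{coverage(corpus_lemmas, covered):.0%}")  # prints 50%
```

In this toy example "yeled" is in the bidix but its Arabic counterpart is missing from the Arabic monodix, so it does not count toward coverage.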
Skills and Qualifications
I'm majoring in religious studies and vacillating between a major and a minor in math as well - not the most obviously applicable disciplines, I know, but I've been supplementing the math with computer science classes and my religious studies major involves a lot of work with languages. I've been learning German since middle school and in college I've studied Hebrew (both biblical and modern) and Arabic. Last summer I started building a website (to be launched for beta testing this fall, hopefully) that serves as a platform for collaborative grammatical analysis of the Hebrew Bible - since the grammatical interpretation of the text is often ambiguous, the site allows users to hand-tag words based on their grammatical features and also compare interpretations with other users and published sources. However, I haven't programmed in an open-source project before.
Coding Challenge
I am currently in the process of doing the coding challenge described here. My work so far can be found at http://apertium.svn.sourceforge.net/viewvc/apertium/incubator/apertium-ara-heb/. Unfortunately, I've been rather busy with classes these past few weeks and it doesn't look like I'll be able to translate the entire text, so I've been focusing on the first three paragraphs in the Hebrew-to-Arabic direction.
Summer Plans
I will be traveling June 1 - June 20, and I return to school for the fall at the start of September. The time in between is currently completely unscheduled, so I can pretty much devote as much of my time to the project as is needed. (I will probably wind up doing a few small other things as well, but these can be scheduled around the project.) Once I start classes I will have much less free time, so my general idea is to get almost all of the work done before then, and use those last few weeks for tying up loose ends and making sure everything is documented.
Contact Info
I'm reachable via this wiki, via the #apertium IRC (also as MaryX), and by email at miss [dot] mary [dot] x [at] gmail [dot] com.