User:Irene/proposal
Contents
- 1 Contact Info
- 2 Why are you interested in machine translation? / Why are you interested in Apertium?
- 3 Which of the published tasks are you interested in? What do you plan to do?
- 4 How and who will it benefit in society?
- 5 Why should Google and Apertium sponsor it?
- 6 Work Plan
- 7 List your skills and give evidence of your qualifications.
- 8 List any non-Summer of Code plans you have for the summer.
Contact Info
Name: Irene Tang
E-mail: itang1@swarthmore.edu
IRC nick: irene_
Location: Pennsylvania, USA / California, USA
Time zone: UTC-05:00 / UTC-08:00
Why are you interested in machine translation? / Why are you interested in Apertium?
I became interested in machine translation earlier this school year, when I was introduced to an organisation that translates the Bible for people who want to read it, in particular speakers of minority languages in which the text is not yet available. The representative mentioned that the translation process would be much easier and faster if they had a computer program that could produce a first-pass translation for linguists to reference, rather than starting from scratch by hand. This is a cause that I care about, and I’m sure there are many other groups and individuals who would appreciate machine translation as a handy supplement to their endeavors. I figured I could use my background in computer science and linguistics to help build machine translation tools for the public to use.
I am applying to Apertium because I believe in its success. Apertium is currently one of the more successful translation endeavors, and while it lacks the data and traffic available to Google Translate, it stands out from corporate undertakings by being open-source and by catering to uncommon, lesser-resourced languages. From my interactions on IRC I’ve also noticed an active community of dedicated linguists and programmers, and I’ve read about how much Apertium has accomplished since its birth in 2004. I’m excited about Apertium’s mission.
Which of the published tasks are you interested in? What do you plan to do?
Discontiguous Multiwords
For an overview of Apertium’s discontiguous multiwords problem, consider the following set of sentences:
- (1) I take out the rubbish.
- (2) I take the rubbish out.
- (3) Saco la basura.
- (4) *Tomo la basura fuera.
Discontiguous multiwords are multiword expressions whose parts are separated by intervening material. In the set of sentences above, take out is a multiword verb. When it is split by the noun phrase the rubbish, it becomes a discontiguous multiword.
Apertium currently doesn’t support discontiguous multiwords, and this is a source of many translation errors. Apertium can seamlessly translate (1) into (3) from English to Spanish: in (1), the phrasal verb take out is contiguous, so Apertium can recognize and translate it as one unit. Take out correctly becomes saco, the first-person singular present form of sacar. However, Apertium mistranslates (2) as (4): in (2), the phrasal verb take out is separated by the noun phrase the rubbish, so Apertium doesn’t recognize it as a unit and translates the two words separately. Take becomes tomo and out becomes fuera, independently, which is not what we want. This shows that discontiguous multiwords cause significant errors in the translation process.
My plan is to extend the multiword processor so that it can recognize when a sentence contains a discontiguous multiword, and then reorder the sentence so that the parts of the multiword are placed together before bilingual dictionary lookup occurs. As noted on the wiki page for this project, this involves (1) creating a typology of discontiguous multiword expressions in some Germanic, Celtic, Romance, Turkic, and Uralic languages; (2) creating a module for recognising and reordering discontiguous multiword expressions; and (3) supporting discontiguous multiwords specifically for the English-Spanish pair.
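The recognise-and-reorder step described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the planned module: the (lemma, tag) token representation, the MULTIWORD_VERBS lexicon, the GAP_TAGS set, and the max_gap limit are all hypothetical names chosen for this example.

```python
from typing import List, Tuple

# Hypothetical token representation: (lemma, coarse part-of-speech tag).
Token = Tuple[str, str]

# Hypothetical lexicon of separable multiword verbs: (verb lemma, particle lemma).
MULTIWORD_VERBS = {("take", "out"), ("pick", "up"), ("turn", "off")}

# Assumed tags that may sit between a verb and its particle (a simple noun phrase).
GAP_TAGS = {"det", "adj", "n", "prn"}


def reorder_discontiguous(tokens: List[Token], max_gap: int = 4) -> List[Token]:
    """Return a copy of tokens in which a separated particle directly follows its verb."""
    result = list(tokens)
    i = 0
    while i < len(result):
        lemma, tag = result[i]
        if tag == "vblex":
            # Scan right across at most max_gap noun-phrase tokens.
            j = i + 1
            while j < len(result) and j - i <= max_gap and result[j][1] in GAP_TAGS:
                j += 1
            # j now points at the first token after the gap (if any).
            has_gap = j > i + 1
            if has_gap and j < len(result) and (lemma, result[j][0]) in MULTIWORD_VERBS:
                # Move the particle next to the verb so the pair is contiguous.
                result.insert(i + 1, result.pop(j))
                i += 1  # skip over the particle that was just placed
        i += 1
    return result
```

For sentence (2), a token list such as prpers/prn, take/vblex, the/det, rubbish/n, out/adv would come back with out placed directly after take, which is the order the existing contiguous-multiword handling already expects. Sentences where the verb and particle are already adjacent are returned unchanged.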
How and who will it benefit in society?
Discontiguous multiwords are common in Germanic, Celtic, Romance, Turkic, and Uralic languages. These groups make up the majority of Apertium’s language database. All Apertium users who work with languages in these five groups stand to benefit from this project.
Why should Google and Apertium sponsor it?
This issue is rather large, but a solution is within reach and the payoff is substantial. Discontiguous multiwords are quite common in everyday speech (in the languages that have them), so fixing the problem will improve translation quality across the board. The problem should be addressed sooner rather than later, yet this project has been sitting among the GSoC ideas on the wiki since 2010.
Work Plan
Community bonding period – (begin typology)
Part I: preparing data – create a typology of the different types of discontiguous multiword expressions in Germanic, Celtic, Romance, Turkic, and Uralic languages. This will give me an idea of how to build the module in Part II. I estimate that it will take 2-5 days to investigate multiword expressions in each language, depending on how familiar I am with the language. I chose the following languages for their significance in Apertium’s database and for how accessible they are to me; I’m more familiar with the Romance languages than the others. Creating typologies for roughly 10-12 languages will take at least a full month, so I plan to start on Part I during the community bonding period.
- Week 1 (5/22): Germanic: English, Swedish | Celtic: Welsh
- Week 2 (5/29): Romance: Portuguese, Spanish, French
- Week 3 (6/5): Romance: Italian, Romanian
- Week 4 (6/12): Turkic: | Uralic: Finnish
Deliverable #1: typologies of discontiguous multiword expressions for 10-12 languages currently supported by Apertium, with at least one from each of the five language categories.
Part II: building the module – create a module/script for recognising and reordering discontiguous multiword expressions
- Week 5 (6/19):
- Week 6 (6/26):
- Week 7 (7/3):
- Week 8 (7/10):
Deliverable #2: functioning discontiguous multiword processor, not yet integrated into Apertium
Part III: integrating the module into Apertium – insert it between apertium-pretransfer and lt-proc -b
- Week 9 (7/17):
- Week 10 (7/24):
- Week 11 (7/31): include support for discontiguous multiwords in specific pairs
- Week 12 (8/7): include support for discontiguous multiwords in specific pairs
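To make the Part III integration point more concrete, here is a minimal sketch of how a reordering step might operate on Apertium stream-format text of the kind that flows between pipeline stages. It is an assumption-laden illustration rather than a design: the SEPARABLE table, MAX_GAP, and the function names are hypothetical, the tags in the example are illustrative, and escaped characters and joined lexical units are not handled. The point it tries to show is that only the lexical units are moved, while the blanks between them stay in place.

```python
import re

# One lexical unit in stream format after tagging: ^lemma<tag1><tag2>$
UNIT = re.compile(r"\^([^<$/]+)((?:<[^>]+>)*)\$")

# Hypothetical table of separable verbs: verb lemma -> particle lemma.
SEPARABLE = {"take": "out"}
MAX_GAP = 4  # assumed maximum number of units allowed between verb and particle


def split_stream(line):
    """Split a line into lexical units and the blanks around them."""
    units, blanks, pos = [], [], 0
    for match in UNIT.finditer(line):
        blanks.append(line[pos:match.start()])
        units.append(match.group(0))
        pos = match.end()
    blanks.append(line[pos:])
    return units, blanks


def lemma_and_tag(unit):
    """Return the lemma and first tag of a unit produced by split_stream."""
    match = UNIT.fullmatch(unit)
    tags = re.findall(r"<([^>]+)>", match.group(2))
    return match.group(1), (tags[0] if tags else "")


def reorder_units(units):
    """Move a separated particle so that it directly follows its verb."""
    units = list(units)
    for i in range(len(units)):
        lemma, tag = lemma_and_tag(units[i])
        particle = SEPARABLE.get(lemma)
        if tag != "vblex" or particle is None:
            continue
        for j in range(i + 1, min(i + 2 + MAX_GAP, len(units))):
            other_lemma, other_tag = lemma_and_tag(units[j])
            if other_tag in ("vblex", "sent"):
                break  # do not search past another verb or the sentence end
            if other_lemma == particle:
                if j > i + 1:
                    units.insert(i + 1, units.pop(j))
                break
    return units


def process_line(line):
    """Reorder the units of one line while keeping every blank in its position."""
    units, blanks = split_stream(line)
    pieces = [blanks[0]]
    for unit, blank in zip(reorder_units(units), blanks[1:]):
        pieces.append(unit)
        pieces.append(blank)
    return "".join(pieces)
```

Applying process_line to each line read from standard input and writing the result to standard output would turn this into a filter that could sit between two pipeline stages.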
Project completed: typologies and fully-integrated module for processing discontiguous multiwords
- Week 13 (8/14): testing
- Week 14 (8/21): pencils down
List your skills and give evidence of your qualifications.
I’m a second-year Computer Science major and Linguistics minor at Swarthmore College (United States). English is my native language, and I studied Spanish for four years in high school.
- Relevant coursework: Data Structures/Algorithms, Computer Systems, Algorithm Analysis, Artificial Intelligence/Machine Learning, Syntax
- Technical skills: Python, C++, C, Java
- Coding challenges: https://github.com/irene-tang/discontiguous-multiwords (information is in the README)
List any non-Summer of Code plans you have for the summer.
If my project is accepted, my plan is to complete GSoC and take a light elective course, either online or at a community college.