User:Irene/proposal

Revision as of 20:53, 3 April 2017 by Irene (talk | contribs) (work plan suggestions)

Contact Info

Name: Irene Tang
E-mail: itang1@swarthmore.edu
IRC nick: irene_
Location: Pennsylvania, USA / California, USA
Time zone: UTC -05:00 / UTC-08:00

Why are you interested in machine translation? / Why are you interested in Apertium?

I became interested in machine translation earlier this school year, when I was introduced to an organisation that translates the Bible for people who want to read it, in particular speakers of minority languages in which the text is not yet available. The representative mentioned that translation would be far easier and faster if a computer program could produce a first-pass translation for linguists to reference, rather than having them start from scratch by hand. This is a cause that I care about, and I’m sure many other groups and individuals would appreciate machine translation as a supplement to their endeavors. I figured I could use my background in computer science and linguistics to help build machine translation tools for the public to use.

I am applying to Apertium because I believe in its success. Apertium is currently one of the more successful translation endeavors. While it lacks the data and traffic available to Google Translate, it stands out from corporate undertakings by being free/open-source and by catering to uncommon, lesser-resourced languages. From my interactions on IRC I’ve also noticed an active community of dedicated linguists and programmers, and I’ve read about how much Apertium has accomplished since its birth in 2004. I’m excited about Apertium’s mission.

Which of the published tasks are you interested in? What do you plan to do?

Discontiguous Multiwords
For an overview of Apertium’s discontiguous multiwords problem, consider the following set of sentences:

  1. I take out the rubbish.
  2. I take the rubbish out.
  3. Saco la basura.
  4. *Tomo la basura fuera.

Discontiguous multiwords are multi-word expressions whose parts are separated by intervening material. In the set of sentences above, take out is a multiword verb. When it is split by the noun phrase the rubbish, as in (2), it becomes a discontiguous multiword.

Apertium currently doesn’t support discontiguous multiwords, and this is a source of many translation errors. Take out is a multiword in English, but its Spanish translation sacar is not. Apertium translates (1) into (3) without trouble: in (1), the phrasal verb take out is contiguous, so Apertium recognises and translates it as one unit, and take out correctly becomes saco, the first-person singular form of sacar. However, Apertium mistranslates (2) as (4): in (2), the phrasal verb is split by the NP the rubbish, so Apertium doesn’t recognise it as a unit and translates its parts as two separate words. Take becomes tomo and out becomes fuera, independently, which is not what we want; tomar fuera and sacar cannot be used interchangeably. Discontiguous multiwords are therefore a significant source of errors in the translation process.

My plan is to eliminate such errors by extending the multiwords processor so that it recognises discontiguous multiwords in a sentence and reorders the sentence so that the whole verb phrase is placed together before bilingual dictionary lookup occurs. For the set of sentences above, the processor should recognise the discontiguous take___out in (2) and rearrange the sentence to match the contiguous take out___ in (1). In Apertium stream format, and using the past-tense form took, the analysis

^I/prpers<prn><subj><p1><mf><sg>$ ^took/take<vblex><past>$ ^the/the<det><def><sp>$ ^rubbish/rubbish<n><sg>$ ^out/out<pr>$

needs to become

^I/prpers<prn><subj><p1><mf><sg>$ ^took out/take<vblex><sep><past># out$ ^the/the<det><def><sp>$ ^rubbish/rubbish<n><sg>$

As noted on the wiki page for this project, this involves (1) creating a typology of discontiguous multiword expressions in some Germanic, Celtic, Romance, Turkic, and Uralic languages; (2) creating a module for recognising and reordering discontiguous multiword expressions; and (3) providing support for discontiguous multiwords in some existing language pairs. See the work plan below for details.
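The reordering illustrated by the two analyses above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the module this proposal will deliver: the SEPARABLE table, the MAX_GAP limit, and the function names are hypothetical, each lexical unit is assumed to carry exactly one analysis, the particle is assumed to be tagged <pr> or <adv>, and whitespace between units is normalised to single spaces.

```python
import re

# Hypothetical lexicon of separable verbs: (verb lemma, particle lemma).
SEPARABLE = {("take", "out")}

# Assumed limit on how many units may stand between verb and particle.
MAX_GAP = 4

# One lexical unit with a single analysis: ^surface/lemma<tag><tag>...$
UNIT = re.compile(r"\^([^/$]+)/([^<$#]+)((?:<[^<>]+>)*)\$")


def parse(stream):
    """Split a stream into [surface, lemma, tags, queue] entries."""
    return [[m.group(1), m.group(2), m.group(3), ""]
            for m in UNIT.finditer(stream)]


def render(units):
    """Write the entries back out, separated by single spaces."""
    return " ".join(f"^{s}/{l}{t}{q}$" for s, l, t, q in units)


def unify(stream):
    """Move a stranded particle next to its verb and mark the verb <sep>."""
    units = parse(stream)
    i = 0
    while i < len(units):
        surface, lemma, tags, _ = units[i]
        if tags.startswith("<vblex>"):
            # Start at i + 2: at least one unit must intervene.
            for j in range(i + 2, min(i + 2 + MAX_GAP, len(units))):
                p_surface, p_lemma, p_tags, _ = units[j]
                if ((lemma, p_lemma) in SEPARABLE
                        and p_tags.startswith(("<pr>", "<adv>"))):
                    units[i] = [
                        f"{surface} {p_surface}",
                        lemma,
                        tags.replace("<vblex>", "<vblex><sep>", 1),
                        f"# {p_lemma}",
                    ]
                    del units[j]
                    break
        i += 1
    return render(units)


if __name__ == "__main__":
    print(unify(
        "^I/prpers<prn><subj><p1><mf><sg>$ ^took/take<vblex><past>$ "
        "^the/the<det><def><sp>$ ^rubbish/rubbish<n><sg>$ ^out/out<pr>$"
    ))
```

Run on the first analysis above, the sketch produces the second one. A real module would also need to preserve the blanks between units and decide which intervening material is allowed, which is what the typologies in Part I are meant to establish.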

How and who will it benefit in society?

Discontiguous multiwords are common in Germanic, Celtic, Romance, Turkic, and Uralic languages, and these groups make up the majority of the languages Apertium supports. Apertium users of these five language groups stand to benefit from this project, in particular users of the nine languages covered in the work plan below (English plus the eight listed in Part I).

Why should Google and Apertium sponsor it?

This issue is rather large, but a solution is within reach and the rewards are substantial. Discontiguous multiwords are common in everyday speech in the languages that have them, so fixing the problem will improve translation quality across many language pairs. The problem should be addressed sooner rather than later; this project has been sitting among the GSoC ideas on the wiki since 2010.

Work plan

Community bonding period

  • Understand dix formats
  • Understand the current multiwords processor module
  • Understand the features of lt-toolbox
  • Devise a typology format and create a typology of English phrasal verbs / discontiguous multiword expressions
  • Devise a method for coding typologies into Apertium files

Part I: preparing data
Create a typology of the different types of discontiguous multiword expressions in some Germanic, Celtic, Romance, Turkic, and Uralic languages. This is necessary for getting an idea of how to build the module in Part II. The typologies should include an analysis of the expressions that the multiwords processor cannot currently handle. Each week's typologies should be coded into Apertium files by the end of that weekend. I estimate that a full investigation of multiword expressions will take 2-4 days per language, depending on how familiar I am with the language. I chose the following languages for their significance in Apertium and for my access to and familiarity with them.

  • Week 1 (5/22): Romance (Spanish, Portuguese)
  • Week 2 (5/29): Romance (Italian, Romanian)
  • Week 3 (6/5): Germanic (Swedish), Celtic (Welsh)
  • Week 4 (6/12): Turkic (Kyrgyz), Uralic (Finnish)

Deliverable #1: typologies of discontiguous multiword expressions for nine languages currently supported by Apertium (English plus the eight listed above), with at least one from each of the five language groups.

Part II: building the module
The module should respect discontiguous multiwords that may remain discontiguous in both languages. If we are translating a discontiguous multiword from xxx to yyy, and it is well-formed in language yyy for the expression to be discontiguous, then the output translation should allow it to remain discontiguous. Otherwise, the sentence should be reordered and the expression should be translated as a single unit.
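The rule in this paragraph can be written down as a small decision function. The sketch below is illustrative only: the Entry record, its target_may_split flag, and the action names are hypothetical and are not part of any existing Apertium format. The second entry uses the xxx/yyy placeholders rather than real language data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical record for one multiword in an xxx -> yyy pair."""
    source: str             # multiword in the source language
    target: str             # its translation in the target language
    target_may_split: bool  # is a discontiguous form well-formed in yyy?


def choose_action(entry, split_in_source):
    """Decide how to handle one occurrence of a multiword."""
    if not split_in_source:
        # Contiguous multiwords are translated as a single unit.
        return "translate-as-unit"
    if entry.target_may_split:
        # Well-formed in the target language: leave the word order alone.
        return "keep-discontiguous"
    # Otherwise unify the parts before bilingual lookup.
    return "reorder-then-translate"


take_out = Entry("take out", "sacar", target_may_split=False)
placeholder = Entry("xxx-multiword", "yyy-multiword", target_may_split=True)

if __name__ == "__main__":
    print(choose_action(take_out, split_in_source=True))
    print(choose_action(placeholder, split_in_source=True))
```

Keeping the decision separate from the reordering itself would let each language pair supply its own flags without changing the module's logic.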

  • Week 5 (6/19): devise a module for recognising multiword expressions in each of the languages that I created typologies for, and write unit tests to make sure it functions correctly
    • Devise a compatible format for integrating the expressions into monodix: figure out how to annotate the words with respect to the current multiwords processor
  • Week 6 (6/26): (cont.)
  • Week 7 (7/3): write a script that has the module reorder sentences to unify discontiguous multiwords, and write unit tests to make sure it functions correctly
    • Devise a method for integrating the script with Apertium
  • Week 8 (7/10): (cont.)

Deliverable #2: functioning discontiguous multiword processor, not yet integrated into Apertium

Part III: integrating the module into Apertium

  • Week 9 (7/17): insert the module between apertium-pretransfer and lt-proc -b and test it, as the wiki page for this project describes
  • Week 10 (7/24): (cont.)
  • Week 11 (7/31): include support for discontiguous multiwords in specific pairs
  • Week 12 (8/7): (cont.)

Project completed: fully-integrated typologies and module for processing discontiguous multiwords

  • Week 13 (8/14): testing
  • Week 14 (8/21): pencils down

List your skills and give evidence of your qualifications.

I’m a second-year Computer Science major and Linguistics minor at Swarthmore College (United States). English is my native language, and I studied Spanish for four years in high school.

  • Relevant coursework: Data Structures/Algorithms, Computer Systems, Algorithm Analysis, Artificial Intelligence/Machine Learning, Syntax
  • Technical skills: Python, C++, C, Java
  • Coding challenges: https://github.com/irene-tang/discontiguous-multiwords (information is in the README)

List any non-Summer of Code plans you have for the summer.

If my project is accepted, I plan to complete GSoC and take one light elective course, either online or at a community college.