User:Firespeaker/GSoC2014/Application draft

From Apertium
Revision as of 03:49, 20 March 2014 by Firespeaker (talk | contribs)
  • Name:
    Jonathan North Washington
  • E-mail address / gtalk:
    (fill in)
  • Other information that may be useful to contact you:
    cell phone: (fill in)
  • Why is it that you are interested in machine translation? / Why is it that you are interested in the Apertium project?
    I got interested in MT between Turkic languages in 2011 when I mentored Mirlan's tur-kir project. I found that with my familiarity with linguistics, my knowledge of the languages, and my bravery with new formalisms, I was able to learn quickly and do useful work. I've been active in Apertium ever since. I'm a Turkic linguist specialising in phonology, phonetics, and socio-historical linguistics, but because of my work with Apertium, I have started to consider myself a computational linguist as well.
  • Which of the published tasks are you interested in? What do you plan to do?
    I plan to "Adopt an unreleased language pair", or in this case three: tur-kir, kaz-kir, tur-uzb. These three pairs were developed originally as GSoC projects, but none of them made it to release quality (they are all currently in the nursery). My goal is to bring tur-kir and kaz-kir to release quality (trunk), and bring tur-uzb to at least "working" quality (staging).
  • Include a proposal, including
    • a title,
      Bringing tur-kir, kaz-kir, and tur-uzb out of nursery
    • reasons why Google and Apertium should sponsor it,
      These are three pairs that could be brought to production quality or something approaching it without too much work. There are probably not many other people who know these languages all well enough and are familiar enough with the pairs to accomplish this in one summer. If successful, this project would add three more production pairs to Apertium's inventory, quadrupling the number of Turkic pairs in production.
    • a description of how and who it will benefit in society,
      Over 80 million people speak the four languages involved as a first language, and they all stand to benefit from this project. Apertium also stands to benefit in the future, as the existence of these pairs for the public to interact with may attract more contributors.
    • and a detailed work plan (including, if possible, a brief schedule with milestones and deliverables). Include time needed to think, to program, to document and to disseminate.
      My overall goals are the following: a production-ready release of kaz-kir (consistent <10% WER, trimmed coverage ≥90%), a production-ready release of tur-kir (consistent <20% WER, trimmed coverage ≥90%), and a stable release of tur-uzb (consistent <25% WER, trimmed coverage ≥80%). Before coding begins, I plan to bring all the transducers involved to ≥90% coverage and get one text in each of the 6 directions to its target WER. For the first six weeks of coding, I plan to get two texts each week to target WER levels, mostly by adding CG, lrx, and transfer rules. I will also be expanding the lexicons considerably. Early on, I will start running testvoc for nouns in all directions, and I hope to have a clean set of noun testvocs by or shortly after the midterm evaluation. By the midterm evaluation I should have come close to hitting my goals for coverage and WER. I plan to spend most of my time after the midterm cleaning testvocs for all the pairs. As time allows, I will continue to add to the dictionaries and rulesets for each pair. For a detailed work plan (expected to continue to change), see
  • List your skills and give evidence of your qualifications. Tell us what is your current field of study, major, etc. Convince us that you can do the work. In particular we would like to know whether you have programmed before in open-source projects.
    I have been involved with Apertium since 2011, and have extensive experience writing and improving morphological transducers. I also have a reasonable amount of experience with bidix, CG, and lrx, and can manage transfer. I will need to develop my skills in running testvoc, as that will be another large focus of the work. As far as the languages go, I'm proficient in Kyrgyz and Kazakh, and can get by in and read Uzbek and Turkish (often with the help of a dictionary). I also work on the linguistics of all of these languages (especially Kazakh and Kyrgyz), and have a "deep" understanding of the way all of these languages work. I have dictionaries and grammars for all the languages at my disposal. I also know potential consultants for most of them, which will be important for obtaining the post-edited texts needed to measure WER.
  • List any non-Summer-of-Code plans you have for the Summer, especially employment, if you are applying for internships, and class-taking. Be specific about schedules and time commitments. We would like to be sure you have at least 30 free hours a week to develop for our project.
    I will essentially be free for the entire period of GSoC. My semester (and current work) end on May 2, and I will resume such activities on August 25. I will probably be doing minimal hourly tutoring-type work for a few weeks in May, and may lose a day here and there in June and July for attending a conference, domestic road travel, etc. I also plan to be working on my dissertation project over the summer, but I do not expect it to interfere with GSoC as I will not have any "crunch times" related to it during that period. During previous breaks from study/work, even with other looming deadlines, I have easily spent 30 hours a week on Apertium-related work.
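
As a rough illustration of the WER and coverage targets named in the work plan, the two measures can be sketched in Python as below. This is only a sketch under stated assumptions, not the evaluation tooling Apertium uses: the function names `word_error_rate` and `naive_coverage` are hypothetical, WER is taken as word-level edit distance against a post-edited reference, and coverage is approximated by counting output tokens that do not carry the `*` unknown-word mark. It does not model trimming.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    `reference` stands for the post-edited text and `hypothesis` for the
    raw MT output (assumed inputs, whitespace-tokenised).
    """
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing from output
                            curr[j - 1] + 1,      # extra word in output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def naive_coverage(output: str, unknown_marker: str = "*") -> float:
    """Share of tokens not flagged as unknown (assumed marker: leading '*')."""
    tokens = output.split()
    if not tokens:
        raise ValueError("output must contain at least one token")
    known = sum(1 for token in tokens if not token.startswith(unknown_marker))
    return known / len(tokens)
```

Under these assumptions, a pair meets the kaz-kir target when `word_error_rate(...)` stays below 0.10 across test texts and `naive_coverage(...)` is at least 0.90 on a corpus.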