User:Bandrandr/proposal


Project title

Chukchi morphological analyser using HFST

Contacts

Vasilisa Andriyanets
blindedbysunshine@gmail.com
github.com/basilisandr
bas_____ on irc
Moscow (GMT+3)

CCh

Link to github: [1]


Synopsis

Chukchi is a language with rich and complicated morphology and incorporation.
So far, morphological parsers based on regular expressions have not been able to handle it properly. The platforms they were built on were also not very user-friendly (no documentation whatsoever).
HFST offers more possibilities than regular expressions for both analysing and constructing forms of Chukchi.
Apertium is, on the one hand, a platform that uses HFST, and on the other hand, a community that is interested in minority languages.
Chukchi is a minority language of Russia that needs a transducer-based morphological parser, so this seems like a perfect match.

Deliverables

Anticipated result:

  • well-documented,
  • easy to use

morphological analyser for Chukchi that handles

  • nouns
  • verbs
  • incorporation (probably)

that occur in a collection of Chukchi texts.
Viewed another way, it will be a simple tool for automated glossing of Chukchi texts, with Russian as the metalanguage.

Benefits

The result of this work, if I get selected, would be of great use to linguists investigating Chukchi, and an important building block for a morphologically annotated corpus of Chukchi that could be easily updated with automated glosses.
It could also serve as a basis for future machine translation tools between Chukchi and Russian.


Timeline

Post-application period

Investigation time:

  • get to know HFST better
  • get a full picture of Chukchi morphology
  • improve skills in building finite-state transducers
  • make some test cases to aid further development

Community bonding period

  • start working with nouns

Work period

The most straightforward way to set weekly goals is by the percentage of forms in the corpus (that is, the collection of texts) that the analyser covers, so the timeline is roughly as follows:

  • Week 1 40% coverage of the corpus forms
  • Week 2 55%
  • Week 3 65%
  • Week 4 75%

Milestone #1 75% coverage of the corpus

  • Week 5 80%
  • Week 6 83%
  • Week 7 86%
  • Week 8 90%

Milestone #2 90% coverage of the corpus

  • Week 9 92%
  • Week 10 94%
  • Week 11 96%
  • Week 12 98% coverage

The corpus is not very large, so hopefully I will be able to analyse all or almost all of the forms.
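
The coverage figures above could be tracked with a small script. The following is a minimal sketch in Python, not part of the proposal itself: the function name `coverage`, the `analyse` callback, and the placeholder forms are all assumptions made for illustration. It treats a form as covered when the analyser returns at least one analysis, and reports coverage both per running token and per distinct form, since the two can differ noticeably on a small corpus.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def coverage(tokens: Iterable[str],
             analyse: Callable[[str], List[str]]) -> Tuple[float, float]:
    """Return (token coverage, type coverage) as percentages.

    A form counts as covered if `analyse` returns a non-empty list.
    `analyse` is a hypothetical callback standing in for the real analyser.
    """
    counts = Counter(tokens)
    if not counts:
        return (0.0, 0.0)
    # Analyse each distinct form once, then weight by frequency.
    known = {form for form in counts if analyse(form)}
    covered_tokens = sum(counts[form] for form in known)
    total_tokens = sum(counts.values())
    return (100 * covered_tokens / total_tokens,
            100 * len(known) / len(counts))


# Toy stand-in for the analyser: placeholder forms and tags, not real Chukchi data.
toy_lexicon = {"form_a": ["stem_a<n><abs>"]}
toy_corpus = ["form_a", "form_a", "form_b", "form_c"]

token_cov, type_cov = coverage(toy_corpus, lambda f: toy_lexicon.get(f, []))
print(f"token coverage: {token_cov:.1f}%")  # 2 of 4 tokens are analysed
print(f"type coverage:  {type_cov:.1f}%")   # 1 of 3 distinct forms is analysed
```

In actual use, the `analyse` callback would presumably wrap the compiled HFST transducer rather than a dictionary; how that is wired up is left open here.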

Personal information

Skills and Qualifications

Four years of study in Fundamental and Applied Linguistics; bachelor's degree in linguistics at NRU HSE, Moscow, Russia, almost completed.
Languages: Russian (native), English (advanced), German (intermediate), Yiddish (intermediate), Norwegian (intermediate), French (elementary)
Programming skills: Python, R, bash

Non-GSoC summer plans

I will be writing my bachelor's thesis until mid-June, so until then I will only be able to spend 10-15 hours per week on the project.
I am also going to a conference on 9-15 July, so I will only be able to spend 15-20 hours on the project that week.
Apart from that, I am going to work full-time, up to 50 hours a week.