User:Bandrandr/proposal
Project title
Chukchi morphological analyser using HFST
Contacts
Vasilisa Andriyanets
blindedbysunshine@gmail.com
github.com/basilisandr
bas_____ on irc
Moscow (GMT+3)
CCh
Link to github: [1]
Synopsis
Chukchi is a language with rich and complicated morphology and incorporation.
So far, morphological parsers based on regular expressions have not been able to handle it properly. The platforms themselves were also not very user-friendly (no documentation whatsoever).
HFST offers more possibilities than regular expressions for both analysing and constructing forms of Chukchi.
Apertium is, on the one hand, a platform that uses HFST, and on the other hand, a community that is interested in minority languages.
Chukchi is a minority language in Russia that needs a transducer-based morphological parser -- seems like a perfect match.
Deliverables
Anticipated result:
- well-documented,
- easy to use
morphological analyser for Chukchi that handles
- nouns
- verbs
- incorporation (probably)
that occur in a collection of Chukchi texts.
Viewed another way, it will be a simple tool for automated glossing of Chukchi texts, with Russian as the metalanguage.
Benefits
The result of this work, if I get selected, would be of great use to linguists investigating Chukchi and an important building block for a morphologically annotated corpus of Chukchi that could be easily updated with automated glosses.
It could also pave the way for future machine translation tools between Chukchi and Russian.
Timeline
Post-application period
Investigation time:
- get to know HFST better
- get a full picture of Chukchi morphology
- improve skills in building finite-state transducers
- make some test cases to aid further development
Community bonding period
- start working with nouns
Work period
The most straightforward way to set weekly goals is by the percentage of word forms in the corpus (the collection of texts) that the analyser covers, so the timeline goes roughly like this:
- Week 1 40% coverage of the corpus forms
- Week 2 55%
- Week 3 65%
- Week 4 75%
Milestone #1 75% coverage of the corpus
- Week 5 80%
- Week 6 83%
- Week 7 86%
- Week 8 90%
Milestone #2 90% coverage of the corpus
- Week 9 92%
- Week 10 94%
- Week 11 96%
- Week 12 98% coverage
The corpus is not very large, so hopefully I will be able to analyse all or almost all of the forms.
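The weekly coverage figures above could be tracked with a small script. The following is only a minimal sketch: it assumes the analyser is run over the corpus with `hfst-lookup`, that each input token produces a block of tab-separated lines (input form, analysis, weight) separated by blank lines, and that unanalysed tokens are marked with an analysis ending in `+?`. The function name `coverage`, the placeholder forms and the tags in the sample are hypothetical, not taken from the actual Chukchi data.

```python
def coverage(lookup_output: str) -> float:
    """Return the share of tokens that received at least one analysis.

    Assumes hfst-lookup-style output: one block per input token,
    blocks separated by blank lines, each line "form<TAB>analysis<TAB>weight",
    with unknown forms marked by an analysis ending in "+?".
    """
    total = 0
    covered = 0
    for block in lookup_output.strip().split("\n\n"):
        lines = [line for line in block.splitlines() if line.strip()]
        if not lines:
            continue  # skip empty input
        total += 1
        # The second tab-separated field is the analysis string.
        analyses = [
            fields[1]
            for fields in (line.split("\t") for line in lines)
            if len(fields) >= 2
        ]
        if any(not analysis.endswith("+?") for analysis in analyses):
            covered += 1
    return covered / total if total else 0.0


# Hypothetical sample: four tokens, two of them unknown to the analyser.
SAMPLE = (
    "form1\tlemma1<n><abs><sg>\t0.0\n"
    "\n"
    "form2\tform2+?\tinf\n"
    "\n"
    "form3\tlemma3<v><iv>\t0.0\n"
    "form3\tlemma3<n>\t0.0\n"
    "\n"
    "form4\tform4+?\tinf\n"
)

if __name__ == "__main__":
    print(f"Coverage: {coverage(SAMPLE):.0%}")
```

This counts running tokens rather than distinct word forms; if the weekly targets are meant per distinct form, the corpus would need to be deduplicated before lookup.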
Personal information
Skills and Qualifications
Four years of study in Fundamental and Applied Linguistics; bachelor's degree in linguistics at NRU HSE, Moscow, Russia, almost completed.
Languages: Russian (native), English (advanced), German (intermediate), Yiddish (intermediate), Norwegian (intermediate), French (elementary)
Programming skills: Python, R, bash
Non-GSoC summer plans
I will be writing my bachelor's thesis until mid-June, so until then I will only be able to spend 10-15 hours per week on the project.
I am also going to a conference on 9-15 July, so I will be able to spend 15-20 hours on the project that week.
Apart from that, I am going to work on the project full-time, up to 50 hours a week.