Project title

Chukchi morphological analyser using HFST

Contacts

Vasilisa Andriyanets
blindedbysunshine@gmail.com
github.com/basilisandr
bas_____ on irc
Moscow (GMT+3)

CCh

Link to github: [1]

Synopsis

Chukchi is a language with rich, complex morphology and incorporation.
So far, morphological parsers based on regular expressions have not been able to handle it properly, and the platforms themselves were not very user-friendly (no documentation whatsoever).
HFST offers more possibilities than regular expressions for both analysing and generating Chukchi forms.
Apertium is, on the one hand, a platform that uses HFST and, on the other, a community with an interest in minority languages.
Chukchi is a minority language of Russia that needs a transducer-based morphological parser, so this seems like a perfect match.

Deliverables

Anticipated result: a well-documented, easy-to-use morphological analyser for Chukchi that handles

nouns
verbs
incorporation (probably)

as they occur in a collection of Chukchi texts.
Seen from another angle, it will be a simple tool for automated glossing of Chukchi texts, with Russian as the metalanguage.
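To make the glossing use case concrete, here is a minimal sketch of turning one analysis string into a gloss line. It assumes analyses arrive in Apertium-style notation (a lemma followed by tags in angle brackets). The function name `gloss`, the placeholder lemma `stem_a`, and the Russian labels are all invented for illustration and are not real Chukchi data or an agreed tag set.

```python
import re
from typing import Dict

# Matches one tag in Apertium-style notation, e.g. "<n>" or "<pl>".
TAG_RE = re.compile(r"<([^<>]+)>")


def gloss(analysis: str, lemma_gloss: Dict[str, str], tag_gloss: Dict[str, str]) -> str:
    """Turn 'lemma<tag1><tag2>' into 'translation-LABEL1-LABEL2'.

    Lemmas without a translation are kept as they are; tags without a
    label are kept in upper case so that gaps in the tables stay visible.
    """
    lemma = analysis.split("<", 1)[0]
    tags = TAG_RE.findall(analysis)
    parts = [lemma_gloss.get(lemma, lemma)]
    parts += [tag_gloss.get(tag, tag.upper()) for tag in tags]
    return "-".join(parts)


# Hypothetical tables: "stem_a" is a placeholder, not a Chukchi stem.
lemma_table = {"stem_a": "дом"}
tag_table = {"n": "СУЩ", "pl": "МН"}

glossed = gloss("stem_a<n><pl>", lemma_table, tag_table)
print(glossed)
```

Keeping the lemma and tag tables separate from the function means the metalanguage could be changed later without touching the analyser itself.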

Benefits

If I am selected, the result of this work would be of great use to linguists investigating Chukchi, and an important building block for a morphologically annotated corpus of Chukchi that could be easily updated with automated glosses.
It could also serve as a basis for future machine translation tools between Chukchi and Russian.

Timeline

Post-application period

Investigation time:

get to know HFST better
get a full picture of Chukchi morphology
improve skills in building finite-state transducers
make some test cases to aid further development
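The test cases mentioned in the last item could be kept as pairs of a surface form and its expected analyses, and checked with a small harness like the sketch below. The analyser here is a stand-in dictionary lookup, and every form and tag in it is a placeholder rather than real Chukchi data; in practice `analyse` would wrap whatever lookup the compiled transducer provides.

```python
from typing import Callable, Dict, List, Set

# Hypothetical analyser type: maps a surface form to a set of analysis strings.
Analyser = Callable[[str], Set[str]]


def run_test_cases(analyse: Analyser, cases: Dict[str, Set[str]]) -> List[str]:
    """Return the surface forms for which some expected analysis is missing."""
    failures = []
    for form, expected in cases.items():
        produced = analyse(form)
        # A case passes when every expected analysis is among those produced.
        if not expected <= produced:
            failures.append(form)
    return failures


# Placeholder lexicon standing in for a real transducer lookup.
toy_lexicon = {
    "form_a": {"stem_a<n><sg>"},
    "form_b": {"stem_b<v><pres>"},
}


def toy_analyse(form: str) -> Set[str]:
    return toy_lexicon.get(form, set())


cases = {
    "form_a": {"stem_a<n><sg>"},
    "form_b": {"stem_b<v><past>"},  # deliberately wrong, so it gets reported
}

failures = run_test_cases(toy_analyse, cases)
print(failures)
```

The check is a subset test rather than strict equality, so an analyser that produces extra ambiguous readings still passes as long as the expected one is present.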

Community bonding period

start working with nouns

Work period

The most natural way to set weekly goals is by the percentage of word forms in the corpus (that is, the collection of texts) that the analyser covers, so the timeline is roughly as follows:

Week 1 40% coverage of the corpus forms
Week 2 55%
Week 3 65%
Week 4 75%

Milestone #1 75% coverage of the corpus

Week 5 80%
Week 6 83%
Week 7 86%
Week 8 90%

Milestone #2 90% coverage of the corpus

Week 9 92%
Week 10 94%
Week 11 96%
Week 12 98% coverage

The corpus is not very large, so hopefully I will be able to analyse all or almost all of the forms.
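The weekly coverage figures could be measured with a short script along these lines. This is only a sketch: it assumes a form counts as covered when the analyser returns at least one analysis for it, and since the timeline does not say whether coverage is counted over running tokens or over distinct forms, the hypothetical `unique` flag supports both readings.

```python
from collections import Counter
from typing import Callable, Iterable, Set


def coverage(tokens: Iterable[str],
             analyse: Callable[[str], Set[str]],
             unique: bool = False) -> float:
    """Percentage of forms that receive at least one analysis.

    unique=False counts every running token; unique=True counts each
    distinct form once.
    """
    counts = Counter(tokens)
    if unique:
        counts = Counter(dict.fromkeys(counts, 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for form, n in counts.items() if analyse(form))
    return 100.0 * covered / total


# Placeholder corpus and analyser; "a", "b", "c" are not real word forms.
corpus_tokens = ["a", "a", "b", "c"]
known_forms = {"a": {"a<n>"}, "b": {"b<v>"}}


def toy_analyse(form: str) -> Set[str]:
    return known_forms.get(form, set())


token_coverage = coverage(corpus_tokens, toy_analyse)
type_coverage = coverage(corpus_tokens, toy_analyse, unique=True)
print(token_coverage, type_coverage)
```

Token coverage usually comes out higher than type coverage because frequent forms tend to be handled first, so it is worth fixing one of the two as the milestone measure in advance.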

Personal information

Skills and Qualifications

Four years of study in Fundamental and Applied Linguistics; Bachelor's degree in linguistics at NRU HSE, Moscow, Russia, almost completed.
Languages: Russian (native), English (advanced), German (intermediate), Yiddish (intermediate), Norwegian (intermediate), French (elementary)
Programming skills: Python, R, bash

Non-GSoC summer plans

I will be writing my bachelor's thesis until mid-June, so until then I will only be able to spend 10-15 hours per week on the project.
I am also attending a conference on 9-15 July, so I will be able to spend 15-20 hours on the project that week.
Apart from that, I plan to work full-time, up to 50 hours a week.
