Difference between revisions of "User:Bandrandr/proposal"
Revision as of 02:15, 3 April 2017
Project title
Chukchi morphological analyser using HFST
Contacts
Vasilisa Andriyanets
blindedbysunshine@gmail.com
github.com/basilisandr
bas_____ on irc
Moscow (GMT+3)
CCh
Link to github: https://github.com/BasilisAndr/chkchn/blob/master/tables
Synopsis
Chukchi is a language with rich and complicated morphology and incorporation.
So far, morphological parsers based on regular expressions have not been able to handle it properly. The platforms themselves were not very user-friendly: they had no documentation.
HFST offers more possibilities than regular expressions for both analysing and constructing forms of Chukchi.
Apertium is, on the one hand, a platform that uses HFST and, on the other hand, a community interested in minority languages, one of which is Chukchi.
Deliverables
Anticipated result: a well-documented, easy-to-use morphological analyser for Chukchi that handles the following, as they occur in a collection of Chukchi texts:
- nouns
- verbs
- incorporation (probably)
From another point of view, it will be a simple tool for automated glossing of Chukchi texts, with Russian as the metalanguage.
Benefits
The result of this work, if it succeeds, would be of great use to linguists investigating Chukchi and an important building block for a morphologically annotated corpus of Chukchi.
It may also pave the way for future machine translation tools between Chukchi and Russian.
Timeline
Post-application period
Investigation time:
- get to know HFST better
- get a full picture of Chukchi morphology
- improve skills in building finite-state transducers
- make some test cases to aid further development
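One possible shape for such test cases is a table mapping each surface form to the analyses it must receive, checked against whatever analyser is under development. The sketch below is only an illustration: the function name `run_test_cases`, the placeholder forms, and the tag strings are hypothetical, and the analyser is represented by any Python callable rather than a real HFST transducer.

```python
from typing import Callable, Dict, Iterable, List, Set


def run_test_cases(
    cases: Dict[str, Set[str]],
    analyse: Callable[[str], Iterable[str]],
) -> List[str]:
    """Return one message per form whose expected analyses are not all produced."""
    failures = []
    for form, expected in cases.items():
        produced = set(analyse(form))
        missing = expected - produced
        if missing:
            failures.append(f"{form}: missing {sorted(missing)}")
    return failures


# Placeholder data, NOT real Chukchi: forms and tags are invented for illustration.
toy_lexicon = {
    "form_a": ["stem_a<n><sg>"],
    "form_b": ["stem_b<v><pres>"],
}


def toy_analyse(form: str) -> List[str]:
    """Stand-in analyser: a plain dictionary lookup."""
    return toy_lexicon.get(form, [])


toy_cases = {
    "form_a": {"stem_a<n><sg>"},
    "form_c": {"stem_c<n><pl>"},  # not in the toy lexicon, so it fails
}

for message in run_test_cases(toy_cases, toy_analyse):
    print(message)
```

Keeping the expected analyses as sets means a form may receive extra, ambiguous analyses without failing, which seems a reasonable default for an analyser that is still growing.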
Community bonding period
- start working with nouns
Work period
The most straightforward way to set weekly goals is by the percentage of corpus forms (that is, forms in the collection of texts) that the analyser covers, so the timeline is roughly as follows:
- Week 1 40% coverage of the corpus forms
- Week 2 55%
- Week 3 65%
- Week 4 75%
Milestone #1 75% coverage of the corpus
- Week 5 80%
- Week 6 83%
- Week 7 86%
- Week 8 90%
Milestone #2 90% coverage of the corpus
- Week 9 92%
- Week 10 94%
- Week 11 96%
- Week 12 98% coverage
The corpus is not very large, so hopefully I will be able to analyse all or almost all of the forms.
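The coverage figures above could be tracked with a small script along the following lines. This is a minimal sketch under stated assumptions: the proposal does not say whether coverage is counted over distinct forms or over running tokens, so both are reported; the name `coverage` is hypothetical; and the analyser is any callable that returns a (possibly empty) list of analyses, standing in for the real HFST transducer.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


def coverage(
    tokens: Iterable[str],
    analyse: Callable[[str], List[str]],
) -> Dict[str, object]:
    """Report the share of corpus forms that receive at least one analysis."""
    counts = Counter(tokens)
    covered = {form for form in counts if analyse(form)}
    total_tokens = sum(counts.values())
    covered_tokens = sum(counts[form] for form in covered)
    return {
        # share of distinct forms with an analysis
        "type_coverage": len(covered) / len(counts) if counts else 0.0,
        # share of running tokens with an analysis
        "token_coverage": covered_tokens / total_tokens if total_tokens else 0.0,
        # forms still to be handled, useful as a weekly to-do list
        "uncovered": sorted(form for form in counts if form not in covered),
    }


# Placeholder corpus and analyser, invented for illustration only.
toy_analyses = {"aaa": ["aaa<n>"], "bbb": ["bbb<v>"]}
report = coverage(
    ["aaa", "bbb", "aaa", "ccc"],
    lambda form: toy_analyses.get(form, []),
)
print(report)
```

The list of uncovered forms is returned alongside the percentages so that each week's work can start from the forms that still lack an analysis.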
Personal information
Skills and Qualifications
Four years of study in Fundamental and Applied Linguistics; bachelor's degree in linguistics at NRU HSE, Moscow, Russia, almost completed.
Languages: Russian (native), English (advanced), German (intermediate), Yiddish (intermediate), Norwegian (intermediate), French (elementary)
Programming skills: Python, R, bash
Non-GSoC summer plans
I will be writing my bachelor's thesis until mid-June, so until then I will only be able to spend 10-15 hours per week on the project.
I am also attending a conference on 9-15 July, so I will be able to spend 15-20 hours on the project that week.
Apart from that, I am going to work on the project full-time, up to 50 hours a week.