Difference between revisions of "User:Bandrandr/proposal"
Latest revision as of 00:30, 15 April 2017
Project title
Chukchi morphological analyser using HFST
Contacts
Vasilisa Andriyanets
blindedbysunshine@gmail.com
github.com/basilisandr
bas_____ on irc
Moscow (GMT+3)
CCh
Link to GitHub: https://github.com/BasilisAndr/chkchn/blob/master/tables
Synopsis
Chukchi is a language with rich and complicated morphology and incorporation.
So far, morphological parsers based on regular expressions have not been able to handle it properly. The platforms themselves were not very user-friendly (no documentation whatsoever).
HFST offers more possibilities than regular expressions for both analysing and generating Chukchi forms.
Apertium is, on the one hand, a platform that uses HFST and, on the other hand, a community that is interested in minority languages.
Chukchi is a minority language in Russia that needs a transducer-based morphological parser, so this seems like a perfect match.
Deliverables
Anticipated result: a well-documented, easy-to-use morphological analyser for Chukchi that handles the following, as they occur in a collection of Chukchi texts:
- nouns
- verbs
- incorporation (probably)
From another point of view, it will be a simple tool for automated glossing of Chukchi texts, with Russian as the metalanguage.
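To illustrate the glossing idea, here is a minimal sketch that turns one analysis into a gloss string. It assumes the analyser returns Apertium-style strings of the form `lemma<tag><tag>`; the lemma `lemma1`, its Russian translation and the tag-to-label table are placeholders invented for the example, not real Chukchi data or a fixed tagset.

```python
import re
from typing import Dict

# Matches one tag in an Apertium-style analysis such as "lemma1<n><pl>".
TAG_RE = re.compile(r"<([^<>]+)>")


def gloss(analysis: str,
          lemma_glosses: Dict[str, str],
          tag_glosses: Dict[str, str]) -> str:
    """Build a hyphen-separated gloss from a single analysis string.

    Unknown lemmas are kept as they are; unknown tags are upper-cased.
    """
    lemma = analysis.split("<", 1)[0]
    tags = TAG_RE.findall(analysis)
    parts = [lemma_glosses.get(lemma, lemma)]
    parts += [tag_glosses.get(tag, tag.upper()) for tag in tags]
    return "-".join(parts)


# Hypothetical dictionaries: a Russian translation for a placeholder lemma
# and a Russian label for the plural tag.
LEMMA_GLOSSES = {"lemma1": "олень"}
TAG_GLOSSES = {"pl": "МН"}

print(gloss("lemma1<n><pl>", LEMMA_GLOSSES, TAG_GLOSSES))
```

Keeping the two dictionaries separate from the function means the glossing layer could be extended without touching the transducer itself.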
Benefits
If I am selected, the result of this work would be of great use to linguists investigating Chukchi and an important building block for a morphologically annotated corpus of Chukchi that could easily be updated with automated glosses.
It would also provide a basis for future machine translation between Chukchi and Russian.
Timeline
Post-application period
Investigation time:
- get to know HFST better
- get a full picture of Chukchi morphology
- improve skills in building finite-state transducers
- make some test cases to aid further development
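The test cases mentioned above could be kept as a simple table of surface forms and the analyses expected for them. The sketch below is one possible shape, not a settled design: `form1`, `form2` and `lemma1` are placeholders rather than real Chukchi forms, and `toy_analyse` is a hypothetical stand-in for a lookup in the compiled transducer.

```python
from typing import Callable, Dict, List, Set

# Hypothetical test cases: surface form -> analyses that must be produced.
TEST_CASES: Dict[str, Set[str]] = {
    "form1": {"lemma1<n><sg>"},
    "form2": {"lemma1<n><pl>"},
}


def run_test_cases(analyse: Callable[[str], List[str]],
                   cases: Dict[str, Set[str]]) -> List[str]:
    """Return one message per form whose expected analyses are missing."""
    failures = []
    for form, expected in sorted(cases.items()):
        got = set(analyse(form))
        missing = expected - got
        if missing:
            failures.append(f"{form}: missing {sorted(missing)}")
    return failures


def toy_analyse(form: str) -> List[str]:
    """Stand-in analyser that only knows one form."""
    table = {"form1": ["lemma1<n><sg>"]}
    return table.get(form, [])


failures = run_test_cases(toy_analyse, TEST_CASES)
for message in failures:
    print(message)
```

Extra analyses returned by the analyser are not treated as failures here; whether overgeneration should also be flagged is a separate decision.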
Community bonding period
- start working with nouns
Work period
The most natural way to set weekly goals is by the percentage of word forms in the corpus (the collection of texts) that the analyser covers, so the timeline goes roughly like this:
- Week 1 40% coverage of the corpus forms
- Week 2 55%
- Week 3 65%
- Week 4 75%
Milestone #1 75% coverage of the corpus
- Week 5 80%
- Week 6 83%
- Week 7 86%
- Week 8 90%
Milestone #2 90% coverage of the corpus
- Week 9 92%
- Week 10 94%
- Week 11 96%
- Week 12 98% coverage
The corpus is not very large, so hopefully I will be able to analyse all or almost all of the forms.
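The coverage figures in this timeline could be tracked with a small script along the following lines. This is only a sketch: it counts a form as covered when the analyser returns at least one analysis, it leaves open whether coverage is measured over tokens or over unique forms (pass a list or a set accordingly), and `toy_analyse` is a hypothetical stand-in for the real transducer lookup.

```python
from typing import Callable, Dict, Iterable, List


def coverage(forms: Iterable[str],
             analyse: Callable[[str], List[str]]) -> float:
    """Percentage of forms for which the analyser returns any analysis."""
    forms = list(forms)
    if not forms:
        return 0.0
    analysed = sum(1 for form in forms if analyse(form))
    return 100.0 * analysed / len(forms)


# Milestone targets taken from the timeline: week number -> required percent.
MILESTONES: Dict[int, float] = {4: 75.0, 8: 90.0}


def milestone_met(week: int, percent: float) -> bool:
    """True if the week has no milestone or the target is reached."""
    return percent >= MILESTONES.get(week, 0.0)


def toy_analyse(form: str) -> List[str]:
    """Stand-in analyser: knows three of the four placeholder forms."""
    known = {"form1", "form2", "form3"}
    return [form + "<placeholder>"] if form in known else []


percent = coverage(["form1", "form2", "form3", "form4"], toy_analyse)
print(f"coverage: {percent:.1f}%, milestone met: {milestone_met(4, percent)}")
```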
Personal information
Skills and Qualifications
Four years of study in Fundamental and Applied Linguistics; nearly completed bachelor's degree in linguistics at NRU HSE, Moscow, Russia.
Languages: Russian (native), English (advanced), German (intermediate), Yiddish (intermediate), Norwegian (intermediate), French (elementary)
Programming skills: Python, R, bash
Non-GSoC summer plans
I am going to finish writing my bachelor's thesis by mid-June, so until then I will only be able to spend 10-15 hours per week on the project.
I am also going to a conference on 9-15 July, so I will be able to spend 15-20 hours on the project that week.
Apart from that, I will be able to work full-time, up to 50 hours a week.