Difference between revisions of "User:Bandrandr/proposal"

From Apertium
 

Latest revision as of 00:30, 15 April 2017

Project title

Chukchi morphological analyser using HFST

Contacts

Vasilisa Andriyanets
blindedbysunshine@gmail.com
github.com/basilisandr
bas_____ on IRC
Moscow (GMT+3)

CCh

Link to GitHub: https://github.com/BasilisAndr/chkchn/blob/master/tables


Synopsis

Chukchi has rich and complex morphology, including incorporation.
So far, morphological parsers based on regular expressions have not handled it properly, and the platforms they were built on were not user-friendly (they had no documentation whatsoever).
HFST offers more possibilities than regular expressions for both analysing and generating Chukchi forms.
Apertium is both a platform that uses HFST and a community interested in minority languages.
Chukchi is a minority language of Russia that needs a transducer-based morphological parser, so the two seem like a perfect match.

Deliverables

Anticipated result: a well-documented, easy-to-use morphological analyser for Chukchi that handles

  • nouns
  • verbs
  • incorporation (probably)

as they occur in a collection of Chukchi texts.
Seen from another angle, it will be a simple tool for automated glossing of Chukchi texts, with Russian as the metalanguage.
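
To make the glossing idea more concrete, here is a minimal sketch of how an analyser's output could be turned into a gloss line. It assumes the analyser returns Apertium-style strings of the form lemma<tag><tag>. The lemma "stem_x", the tag names, the tag labels and the Russian gloss are all placeholders invented for illustration; they are not real Chukchi data or the project's actual tagset.

```python
import re
from typing import Dict, List, Tuple

# Matches one tag written as <tag> in an Apertium-style analysis string.
TAG_RE = re.compile(r"<([^<>]+)>")


def split_analysis(analysis: str) -> Tuple[str, List[str]]:
    """Split 'lemma<t1><t2>' into ('lemma', ['t1', 't2'])."""
    lemma = analysis.split("<", 1)[0]
    tags = TAG_RE.findall(analysis)
    return lemma, tags


def gloss(analysis: str,
          lemma_glosses: Dict[str, str],
          tag_labels: Dict[str, str]) -> str:
    """Build a gloss: Russian translation of the lemma plus tag labels.

    Lemmas missing from the dictionary are left untranslated;
    unknown tags are simply upper-cased.
    """
    lemma, tags = split_analysis(analysis)
    head = lemma_glosses.get(lemma, lemma)
    labels = [tag_labels.get(tag, tag.upper()) for tag in tags]
    return "-".join([head] + labels)


# Hypothetical dictionaries (placeholders, not real project data).
LEMMA_GLOSSES = {"stem_x": "олень"}
TAG_LABELS = {"n": "N", "pl": "PL"}

glossed = gloss("stem_x<n><pl>", LEMMA_GLOSSES, TAG_LABELS)
fallback = gloss("stem_y<v>", LEMMA_GLOSSES, TAG_LABELS)
print(glossed)   # олень-N-PL
print(fallback)  # stem_y-V
```

Keeping the lemma dictionary and the tag labels as separate tables would let the Russian glosses grow independently of the transducer itself.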

Benefits

If I am selected, the result of this work would be of great use to linguists investigating Chukchi and an important building block for a morphologically annotated corpus of Chukchi that could easily be updated with automated glosses.
It would also make future machine translation between Chukchi and Russian possible.

Timeline

Post-application period

Investigation time:

  • get to know HFST better
  • get a full picture of Chukchi morphology
  • improve skills in building finite-state transducers
  • make some test cases to aid further development
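
One possible shape for such test cases is a table of surface forms paired with the analysis each is expected to receive, checked against whatever the analyser currently returns. The sketch below assumes the analyser can be wrapped as a Python function that takes a form and returns a list of analysis strings. The forms, lemmas and tags are placeholders, not real Chukchi data.

```python
from typing import Callable, Dict, List, Tuple

Analyser = Callable[[str], List[str]]


def failing_cases(cases: Dict[str, str],
                  analyse: Analyser) -> List[Tuple[str, str, List[str]]]:
    """Return (form, expected, got) for every case the analyser misses."""
    failures = []
    for form, expected in cases.items():
        got = analyse(form)
        if expected not in got:
            failures.append((form, expected, got))
    return failures


# Placeholder test cases: surface form -> expected analysis.
CASES = {
    "form_a": "lemma_a<n><sg>",
    "form_b": "lemma_b<v>",
}

# Stand-in for the real analyser: it only knows one form so far.
KNOWN = {"form_a": ["lemma_a<n><sg>"]}


def toy_analyse(form: str) -> List[str]:
    return KNOWN.get(form, [])


failures = failing_cases(CASES, toy_analyse)
for form, expected, got in failures:
    print(f"FAIL {form}: expected {expected}, got {got}")
```

Running such a check after every change to the lexicon would show immediately whether a new rule broke forms that were already analysed correctly.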

Community bonding period

  • start working with nouns

Work period

The most practical way to set weekly goals is by the percentage of word forms in the corpus (the collection of texts) that the analyser covers, so the timeline goes roughly like this:

  • Week 1: 40% coverage of the corpus forms
  • Week 2: 55%
  • Week 3: 65%
  • Week 4: 75%

Milestone #1: 75% coverage of the corpus

  • Week 5: 80%
  • Week 6: 83%
  • Week 7: 86%
  • Week 8: 90%

Milestone #2: 90% coverage of the corpus

  • Week 9: 92%
  • Week 10: 94%
  • Week 11: 96%
  • Week 12: 98% coverage

The corpus is not very large, so hopefully I will be able to analyse all or almost all of the forms.
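
A minimal sketch of how the weekly coverage figure could be measured is given below. The proposal does not say whether coverage is counted over running tokens or over distinct forms, so the sketch reports both; that split is an assumption. The analyser is again represented by a stand-in function, and the corpus tokens are placeholders rather than real data. The weekly targets are copied from the list above.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List

# Weekly coverage targets in percent, copied from the timeline.
WEEKLY_TARGETS: Dict[int, float] = {
    1: 40, 2: 55, 3: 65, 4: 75, 5: 80, 6: 83,
    7: 86, 8: 90, 9: 92, 10: 94, 11: 96, 12: 98,
}


def coverage(tokens: Iterable[str],
             analyse: Callable[[str], List[str]]) -> Dict[str, object]:
    """Compute token-level and type-level coverage in percent.

    A form counts as covered if the analyser returns at least one analysis.
    """
    counts = Counter(tokens)
    total_tokens = sum(counts.values())
    if total_tokens == 0:
        return {"token_coverage": 0.0, "type_coverage": 0.0, "unknown": []}
    known = {form for form in counts if analyse(form)}
    known_tokens = sum(counts[form] for form in known)
    return {
        "token_coverage": 100.0 * known_tokens / total_tokens,
        "type_coverage": 100.0 * len(known) / len(counts),
        "unknown": sorted(set(counts) - known),
    }


def on_track(week: int, percent: float) -> bool:
    """Check a measured coverage figure against that week's target."""
    return percent >= WEEKLY_TARGETS[week]


# Placeholder corpus and stand-in analyser (not real project data).
TOY_LEXICON = {"form_a": ["lemma_a<n><sg>"], "form_b": ["lemma_b<v>"]}
TOY_CORPUS = ["form_a", "form_a", "form_b", "form_c"]

report = coverage(TOY_CORPUS, lambda form: TOY_LEXICON.get(form, []))
print(report)
print(on_track(4, report["token_coverage"]))
```

The list of unknown forms is the useful by-product: sorted by frequency, it would show which stems or affixes to add next in order to reach the following week's target.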

Personal information

Skills and Qualifications

Four years of study in Fundamental and Applied Linguistics; nearly completed bachelor's degree in linguistics at NRU HSE, Moscow, Russia.
Languages: Russian (native), English (advanced), German (intermediate), Yiddish (intermediate), Norwegian (intermediate), French (elementary)
Programming skills: Python, R, bash

Non-GSoC summer plans

I will be writing my bachelor's thesis until mid-June, so until then I will only be able to spend 10-15 hours per week on the project.
I am also attending a conference on 9-15 July, so I will be able to spend 15-20 hours on the project that week.
Apart from that, I will work on the project full-time, up to 50 hours a week.