Difference between revisions of "User:Bandrandr/proposal"

From Apertium

Revision as of 02:15, 3 April 2017

Project title

Chukchi morphological analyser using HFST

Contacts

Vasilisa Andriyanets
blindedbysunshine@gmail.com
github.com/basilisandr
bas_____ on irc
Moscow (GMT+3)

CCh

Link to github: https://github.com/BasilisAndr/chkchn/blob/master/tables


Synopsis

Chukchi is a language with rich and complicated morphology, including incorporation.
So far, morphological parsers based on regular expressions have not been able to handle it properly. The platforms themselves were also not very user-friendly: they had no documentation.
HFST offers more possibilities than regular expressions for both analysing and generating Chukchi forms.
Apertium is, on the one hand, a platform that uses HFST and, on the other hand, a community that is interested in minority languages, one of which is Chukchi.

Deliverables

Anticipated result: a well-documented, easy-to-use morphological analyser for Chukchi that handles

  • nouns
  • verbs
  • incorporation (probably)

as they occur in a collection of Chukchi texts.
Viewed another way, it will be a simple tool for automated glossing of Chukchi texts, with Russian as the metalanguage.

Benefits

If it succeeds, this work would be of great use to linguists investigating Chukchi and an important building block for a morphologically annotated corpus of Chukchi.
It could also serve as a basis for future machine translation tools between Chukchi and Russian.


Timeline

Post-application period

Investigation time:

  • get to know HFST better
  • get a full picture of Chukchi morphology
  • improve skills in building finite-state transducers
  • make some test cases to aid further development
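
The test cases mentioned in the last item could be kept as pairs of a surface form and its expected analysis and checked automatically. The following is only a minimal sketch: the function names, the placeholder forms, and the tag strings are all hypothetical, and a real run would call the HFST analyser instead of the lookup table used here.

```python
def run_test_cases(analyse, cases):
    """Return (form, expected, got) for every case whose expected
    analysis is missing from the analyser's output."""
    failures = []
    for form, expected in cases:
        got = analyse(form)
        if expected not in got:
            failures.append((form, expected, got))
    return failures


# Placeholder data only: these are NOT real Chukchi forms or analyses.
PLACEHOLDER_LEXICON = {
    "form_a": ["lemma_a<n><sg>"],
    "form_b": ["lemma_b<v>"],
}


def placeholder_analyse(form):
    """Stand-in for the real analyser: return a list of analyses."""
    return PLACEHOLDER_LEXICON.get(form, [])


CASES = [
    ("form_a", "lemma_a<n><sg>"),  # analysed as expected
    ("form_b", "lemma_b<n>"),      # analysed, but with a different tag
    ("form_c", "lemma_c<v>"),      # not analysed at all
]

failures = run_test_cases(placeholder_analyse, CASES)
for form, expected, got in failures:
    print(f"{form}: expected {expected}, got {got}")
```

Keeping the cases as plain data makes it easy to add new ones while working through the texts.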

Community bonding period

  • start working with nouns

Work period

The most natural way to set weekly goals is by the percentage of word forms in the corpus (that is, the collection of texts) that the analyser covers, so the timeline is roughly as follows:

  • Week 1 40% coverage of the corpus forms
  • Week 2 55%
  • Week 3 65%
  • Week 4 75%

Milestone #1 75% coverage of the corpus

  • Week 5 80%
  • Week 6 83%
  • Week 7 86%
  • Week 8 90%

Milestone #2 90% coverage of the corpus

  • Week 9 92%
  • Week 10 94%
  • Week 11 96%
  • Week 12 98% coverage

The corpus is not very large, so hopefully I will be able to analyse all or almost all of the forms.
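
The coverage figures in the timeline can be measured with a few lines of Python. This is a minimal sketch under stated assumptions: `coverage`, `on_track`, and `toy_analyse` are hypothetical names, coverage is counted over running tokens (pass a set of forms to count unique forms instead), and the toy analyser stands in for the real HFST transducer.

```python
from typing import Callable, Iterable, List


def coverage(tokens: Iterable[str], analyse: Callable[[str], List[str]]) -> float:
    """Share of tokens (0.0 to 1.0) that receive at least one analysis."""
    total = 0
    covered = 0
    for token in tokens:
        total += 1
        if analyse(token):
            covered += 1
    return covered / total if total else 0.0


# Weekly targets copied from the timeline (week number -> target share).
WEEKLY_TARGETS = {
    1: 0.40, 2: 0.55, 3: 0.65, 4: 0.75,
    5: 0.80, 6: 0.83, 7: 0.86, 8: 0.90,
    9: 0.92, 10: 0.94, 11: 0.96, 12: 0.98,
}


def on_track(week: int, measured: float) -> bool:
    """True if the measured coverage meets the target for the given week."""
    return measured >= WEEKLY_TARGETS[week]


# Toy stand-in for the analyser; the forms are placeholders, not Chukchi.
KNOWN_FORMS = {"a", "b"}


def toy_analyse(token: str) -> List[str]:
    return [token + "<placeholder>"] if token in KNOWN_FORMS else []


measured = coverage(["a", "b", "c", "a"], toy_analyse)
print(measured, on_track(4, measured), on_track(5, measured))
```

In this toy run three of the four tokens are analysed, so the measured coverage meets the week 4 target but not the week 5 one.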

Personal information

Skills and Qualifications

Four years of study in Fundamental and Applied Linguistics; bachelor's degree in linguistics nearly completed at NRU HSE, Moscow, Russia.
Languages: Russian (native), English (advanced), German (intermediate), Yiddish (intermediate), Norwegian (intermediate), French (elementary)
Programming skills: Python, R, bash

Non-GSoC summer plans

I will be writing my bachelor's thesis until mid-June, so until then I will only be able to spend 10-15 hours per week on the project.
I am also attending a conference on 9-15 July, so I will be able to spend 15-20 hours on the project that week.
Apart from that, I am going to work on the project full-time, up to 50 hours a week.