Kazakh and Sakha/GSoC2018 report
Revision as of 18:24, 13 August 2018 by Eirien
This page summarizes the work done on the Kazakh and Sakha pair during Google Summer of Code 2018. The project consisted mainly of building a Kazakh-Sakha bilingual dictionary (bidix) and enriching the Sakha morphological analyser.
Commits
My commits can be found here. You can also download my work as a zip file.
Corpora and Coverage
Our corpora were Kazakh Wikipedia and Sakha Wikipedia articles.
Translator coverage
Corpus | Words | Stems before | Stems after | Coverage before | Coverage after |
---|---|---|---|---|---|
Wikipedia | 10000 | 150 | 2870 | 29.36% | 70.57% |
Sakha morphological analyser coverage
Corpus | Words | Stems before | Stems after | Coverage before | Coverage after |
---|---|---|---|---|---|
Wikipedia | 95654 | 4070 | 9015 | 73.16% | 88.75% |
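The coverage figures above are percentages of corpus tokens that receive at least one analysis. As a rough illustration only, the sketch below computes naive coverage from analyser output in the Apertium stream format, where an unknown token appears as `^form/*form$`. The function name `coverage` and the sample string (including its tags) are assumptions made for this example and are not taken from the project; the script actually used for the numbers above may differ.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def coverage(analysed_text: str) -> float:
    """Return the percentage of tokens that have at least one analysis.

    A token counts as unknown when its analysis part starts with '*',
    which is how the analyser marks forms it does not recognise.
    """
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return 100.0 * known / len(units)


# Hypothetical analyser output: one known token and one unknown token.
sample = "^мин/мин<prn><pers><p1><sg><nom>$ ^абракадабра/*абракадабра$"
print(f"{coverage(sample):.2f}%")  # prints "50.00%"
```

Counting per token (rather than per distinct word form) is what makes frequent stems matter most, which is why adding high-frequency stems raises coverage quickly.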
Most of the work consisted of adding stems to the dictionaries. The stems were taken from frequency lists.
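A frequency list of this kind can be sketched as follows. This is a simplified illustration, not the tooling used in the project: it counts surface forms rather than stems, the tokenisation is a plain regular expression, and the names `frequency_list` and `known_forms` as well as the sample words are hypothetical.

```python
import re
from collections import Counter


def frequency_list(text, known_forms=frozenset()):
    """Return (form, count) pairs for forms not yet covered, most frequent first.

    `known_forms` is an assumed set of lower-cased forms that the
    dictionary already handles; everything else is a candidate to add.
    """
    # \w+ is Unicode-aware for str patterns, so Cyrillic letters are matched.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(t for t in tokens if t not in known_forms)
    return counts.most_common()


# Tiny made-up corpus; real input would be the Wikipedia text.
corpus = "ат ат суол"
print(frequency_list(corpus))          # [('ат', 2), ('суол', 1)]
print(frequency_list(corpus, {"ат"}))  # [('суол', 1)]
```

Working down such a list from the top means each added entry covers as many corpus tokens as possible.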
References
- Ubryatova, E.I. (Ed.) (1982), Grammatika sovremennogo yakutskogo literaturnogo yazyka, Moscow.
- SakhaTyla.Ru - Sakha Dictionary
- Казахско-русский словарь (Kazakh-Russian dictionary)
- Online dictionary
Future work
- Add more stems to the Sakha monolingual dictionary
- Add more stems to the Kazakh-Sakha bilingual dictionary
- Add transfer rules, etc.