Kazakh and Sakha/GSoC2018 report

From Apertium
Revision as of 18:24, 13 August 2018 by Eirien

This page summarizes the work done on the Kazakh and Sakha pair during Google Summer of Code 2018. The project consisted mainly of building a Kazakh-Sakha bilingual dictionary (bidix) and enriching the Sakha morphological analyser.

Commits

My commits can be found here. You can also download my work as a zip file.

Corpora and Coverage

Our corpora were Kazakh Wikipedia and Sakha Wikipedia articles.

Translator coverage

Corpus    | Words | Stems before | Stems after | Coverage before | Coverage after
Wikipedia | 10000 | 150          | 2870        | 29.36%          | 70.57%

Sakha morphological analyser coverage

Corpus    | Words | Stems before | Stems after | Coverage before | Coverage after
Wikipedia | 95654 | 4070         | 9015        | 73.16%          | 88.75%
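
The report does not say how the coverage figures were computed. A common approach in Apertium is naive coverage: the share of tokens that receive at least one analysis. The sketch below assumes the corpus has already been run through the analyser and that the output is in Apertium stream format, where an unknown word appears as `^word/*word$`. The function name `naive_coverage` and the sample string are hypothetical, and escaped characters in the stream are ignored for simplicity.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped "^" or "$" inside a unit are not handled.
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def naive_coverage(analysed_text: str) -> float:
    """Return the percentage of lexical units that have at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # An unknown word is marked by "*" right after the first slash.
    known = sum(1 for unit in units if "/*" not in unit)
    return 100.0 * known / len(units)


if __name__ == "__main__":
    # Made-up analyser output: two known tokens, two unknown tokens.
    sample = "^a/a<n>$ ^b/*b$ ^c/c<v>$ ^d/*d$"
    print(f"{naive_coverage(sample):.2f}%")
```

Naive coverage only shows that a token received some analysis, not that the analysis is correct, so it is best read as an upper bound on usable coverage.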

Most of the work consisted of adding stems to the dictionaries. The stems were taken from frequency lists.
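
One way to build such a frequency list, offered as a minimal sketch and not necessarily the procedure used in this project, is to count the tokens the analyser does not recognise and work through them from most to least frequent. The code assumes analyser output in Apertium stream format, where unknown words look like `^word/*word$`. The function name `unknown_frequency_list` and the sample input are hypothetical.

```python
import re
from collections import Counter

# Captures the surface form of a lexical unit whose first analysis starts with "*",
# which is how Apertium stream format marks an unknown word.
UNKNOWN_UNIT = re.compile(r"\^([^/$]*)/\*")


def unknown_frequency_list(analysed_text: str, top_n: int = 10):
    """Return the top_n most frequent unknown surface forms with their counts."""
    counts = Counter(form.lower() for form in UNKNOWN_UNIT.findall(analysed_text))
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Made-up analyser output: "foo" is unknown twice, "bar" once, "a" is known.
    sample = "^foo/*foo$ ^a/a<n>$ ^Foo/*Foo$ ^bar/*bar$"
    for form, count in unknown_frequency_list(sample):
        print(count, form)
```

The list contains inflected surface forms, so each entry still has to be reduced to its stem by hand before it goes into a dictionary.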

Future work

  • Add more stems to the Sakha monolingual dictionary
  • Add more stems to the Kazakh-Sakha bilingual dictionary
  • Add transfer rules, etc.