Kazakh and Sakha/GSoC2018 report

From Apertium

Revision as of 07:40, 14 August 2018

This page summarises the work done on the Kazakh and Sakha pair during Google Summer of Code 2018. The project consisted mainly of building the Kazakh-Sakha bilingual dictionary (bidix) and enriching the Sakha morphological analyser.

Commits

My commits can be found here. You can also download my work as a zip file.

Corpora and Coverage

Our corpora were Kazakh Wikipedia and Sakha Wikipedia articles.

Most of the work consisted of adding stems to the dictionaries. The stems were taken from frequency lists.
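The report does not say how the frequency lists were produced. As a minimal sketch, assuming a plain-text corpus and a simple word-character tokeniser (both are assumptions, and `frequency_list` is a hypothetical helper name), such a list could be built like this:

```python
import re
from collections import Counter


def frequency_list(text, top_n=None):
    """Return (token, count) pairs sorted from most to least frequent.

    Tokenisation is a naive assumption: lowercase the text and take
    runs of Unicode word characters as tokens.
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common(top_n)


if __name__ == "__main__":
    # Tiny placeholder corpus; a real run would read the Wikipedia dump text.
    for token, count in frequency_list("a b a c a b"):
        print(count, token)
```

Working from the top of such a list means each added stem covers as many corpus tokens as possible.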

30 April - 6 July

Stems were added to the Kazakh-Sakha bilingual dictionary.

7 July - 14 August

We focused on the Sakha morphological analyser, aiming for 90% coverage. Stems were added, a grammatical reference[1] was reviewed, and test files were created.

Translator coverage

Corpus    | Words | Stems before | Stems after | Coverage before | Coverage after
Wikipedia | 10000 | 150          | 2870        | 29.36%          | 70.57%

Sakha morphological analyser coverage

Corpus    | Words | Stems before | Stems after | Coverage before | Coverage after
Wikipedia | 95654 | 4070         | 9015        | 73.16%          | 88.75%
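The report does not say how the coverage figures were computed. A common measure is naive coverage: the share of corpus tokens that receive at least one analysis. The sketch below assumes the analyser output is in the Apertium stream format, where each token appears as `^surface/analysis$` and an unknown token has an analysis starting with `*`. The function name `naive_coverage` and the tags in the sample string are invented for illustration.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$^]+)/([^$]*)\$")


def naive_coverage(analysed_text):
    """Return the percentage of tokens with at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed_text)
    if not units:
        return 0.0
    # Unknown tokens are marked by an analysis beginning with '*'.
    known = sum(1 for _surface, analyses in units
                if not analyses.startswith("*"))
    return 100.0 * known / len(units)


if __name__ == "__main__":
    # Illustrative sample only; the tags are not taken from the real analyser.
    sample = ("^бу/бу<prn><dem>$ ^кинигэ/кинигэ<n><nom>$ "
              "^ххх/*ххх$ ^дьиэ/*дьиэ$")
    print(f"{naive_coverage(sample):.2f}%")
```

This measure counts a token as covered even when its analysis is wrong, so it is an upper bound on the share of correctly analysed tokens.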

References

Future work

  • Add more stems to the Sakha monolingual dictionary
  • Add more stems to the Kazakh-Sakha bilingual dictionary
  • Add transfer rules, etc.