Kazakh and Sakha/GSoC2018 report

From Apertium
Latest revision as of 07:41, 14 August 2018

This page summarises the work done on the Kazakh and Sakha pair during Google Summer of Code 2018. The project consisted mainly of building a Kazakh-Sakha bilingual dictionary (bidix) and enriching the Sakha morphological analyser.

Commits

My commits can be found here. You can also download my work as a zip file.

Corpora and Coverage

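Our corpora were the articles in the 1 May 2018 dumps of Kazakh Wikipedia (https://dumps.wikimedia.org/kkwiki/20180501/kkwiki-20180501-pages-articles.xml.bz2) and Sakha Wikipedia (https://dumps.wikimedia.org/sahwiki/20180501/sahwiki-20180501-pages-articles.xml.bz2).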

Most of the work consisted of adding stems to the dictionaries. The stems were taken from frequency lists built from these corpora.
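The report does not say how the frequency lists were produced, so the following is only a minimal sketch of one way to build such a list from plain corpus text. The function names `frequency_list` and `uncovered_forms` and the whitespace/word-character tokenisation are assumptions made for illustration, not the tooling actually used in the project. Note that it ranks surface forms, not stems; turning a frequent form into a dictionary stem is still a manual step.

```python
import re
from collections import Counter


def frequency_list(text, top_n=None):
    """Return (form, count) pairs sorted by descending frequency.

    Tokenisation is deliberately naive: lowercase the text and take
    every run of word characters (this includes Cyrillic letters).
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common(top_n)


def uncovered_forms(text, known_forms, top_n=None):
    """Frequent forms that are not yet in a (hypothetical) set of known forms."""
    known = {form.lower() for form in known_forms}
    ranked = [(form, n) for form, n in frequency_list(text) if form not in known]
    return ranked if top_n is None else ranked[:top_n]


if __name__ == "__main__":
    sample = "Ат ат су су су уонна"
    print(frequency_list(sample))
    print(uncovered_forms(sample, known_forms={"су"}))
```

Working from the top of such a list means each added stem covers as many corpus tokens as possible, which is why frequency order is a natural way to prioritise dictionary work.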

30 April - 6 July

Stems were added to the Kazakh-Sakha bilingual dictionary.

7 July - 14 August

We focused on the Sakha morphological analyser, with the goal of reaching 90% coverage. Stems were added, a grammatical reference [1] was reviewed, and test files were created.

Translator coverage

Corpus     Words  Stems before  Stems after  Coverage before  Coverage after
Wikipedia  10000  150           2870         29.36%           70.57%

Sakha morphological analyser coverage

Corpus     Words  Stems before  Stems after  Coverage before  Coverage after
Wikipedia  95654  4070          9015         73.16%           88.75%
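The report does not describe how the coverage figures were measured. A common reading is naive coverage: the share of corpus tokens that receive at least one analysis. The sketch below computes that figure from analyser output in the Apertium stream format, where each lexical unit looks like `^surface/analysis$` and an unknown word is marked with a leading asterisk (`^surface/*surface$`). The function name `naive_coverage` and the sample tags are illustrative assumptions, and the regular expression ignores escaped characters, so it is a simplification rather than a full stream parser.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2...$
# Simplification: escaped "^", "/" and "$" inside a unit are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def naive_coverage(analysed):
    """Percentage of lexical units that have at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        return 0.0
    # Unknown words are marked by an analysis that starts with "*".
    known = sum(1 for _surface, analyses in units if not analyses.startswith("*"))
    return 100.0 * known / len(units)


if __name__ == "__main__":
    # Made-up sample output: two analysed tokens, two unknown tokens.
    sample = "^ат/ат<n><nom>$ ^су/*су$ ^уонна/уонна<cnjcoo>$ ^абв/*абв$"
    print(f"{naive_coverage(sample):.2f}%")
```

Under this definition a token counts as covered even if its analysis is wrong, so the figure is an upper bound on how much of the corpus is analysed correctly.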

References

  1. Ubryatova, E.I. (Ed.) (1982), Grammatika sovremennogo yakutskogo literaturnogo yazyka, Moscow.
  2. SakhaTyla.Ru - Sakha Dictionary, https://sakhatyla.ru/
  3. Казахско-русский словарь (Kazakh-Russian dictionary), https://sozdik.kz/
  4. Glosbe online dictionary, https://glosbe.com/

Future work

  • Add more stems to the Sakha monolingual dictionary
  • Add more stems to the Kazakh-Sakha bilingual dictionary
  • Add transfer rules, etc.