Difference between revisions of "Tatar and Bashkir/GSOC 2018"

Latest revision as of 13:17, 14 August 2018

This is the report for the Google Summer of Code 2018 project on Tatar-Bashkir machine translation.

List of commits

The list of all the commits can be found here: https://apertium.projectjj.com/gsoc2018/zu-ann/zu-ann.html.

A tar.gz archive of the commits can be downloaded here: https://apertium.projectjj.com/gsoc2018/zu-ann.tar.gz.

A zip archive of the commits can be downloaded here: https://apertium.projectjj.com/gsoc2018/zu-ann.zip.

What was done

The lexicons in bak.lexc were changed to correspond to the ones in tat.lexc, missing lexicons and tags were added to bak.lexc, and new rules were added to bak.twol.

The stems from tat.lexc were translated into Bashkir and added to bak.lexc and bidix.

Words from the Bashkir frequency list (http://lcph.bashedu.ru/index.php?go=wikilist_lemmas) were translated into Tatar and added to tat.lexc and bidix.

New stems were added to tat.lexc, bak.lexc and bidix using Russian-Tatar and Russian-Bashkir dictionaries.

New toponyms were added to tat.lexc, bak.lexc and bidix using Wikidata.

Statistics

	tat.lexc	bak.lexc	bidix	Bilingual Coverage
Before	~ 26 400 lemmas	~ 2 800 lemmas	~ 2 600 lemmas	tat-bak: 72.00%, bak-tat: 68.63%
After	~ 56 900 lemmas	~ 56 000 lemmas	~ 50 000 lemmas	tat-bak: 87.56%, bak-tat: 88.98%

Bilingual coverage was calculated using aq-covtest (http://wiki.apertium.org/wiki/Apertium-quality/Application_Documentation) and Wikipedia dumps (https://dumps.wikimedia.org/ttwiki/20180801/ and https://dumps.wikimedia.org/bawiki/20180801/).
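
How aq-covtest arrives at its figures is not described here. As a rough illustration only, a naive coverage figure can be taken as the share of tokens in the translator output that are not flagged as unknown; Apertium marks unknown words with a leading asterisk. The function name and the sample tokens below are hypothetical and are not taken from aq-covtest.

```python
def naive_coverage(tokens):
    """Return the share of tokens that carry no unknown-word mark.

    Assumption: each token comes from Apertium output, where an unknown
    word is prefixed with '*'. This is a sketch, not aq-covtest's code.
    """
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if not token.startswith("*"))
    return known / len(tokens)


# Hypothetical translator output: one unknown word out of four tokens.
sample = ["*foo", "bar", "baz", "qux"]
print(f"{naive_coverage(sample):.2%}")
```

Run over a tokenized Wikipedia dump, a count of this kind gives a percentage comparable in spirit to the table above, though the exact numbers depend on tokenization and on how the tool treats punctuation and numerals.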

Future work

Continue improving the Bashkir monolingual transducer so that more word forms are analyzed.
Continue improving the coverage.
Check (and fix if necessary) the words, mostly proper nouns, which were added using automatic translation.
Revise the dictionaries.

@@ Line 1: / Line 1: @@
-This is the report for Google Summer of Code 2018 project — Tatar-Bashkir machine translation.
+This is the report for [https://summerofcode.withgoogle.com/projects/#5878649350258688 Google Summer of Code 2018 project] — Tatar-Bashkir machine translation.
 ==List of commits==
-* The list of all my commits can be found here: https://apertium.projectjj.com/gsoc2018/zu-ann/zu-ann.html.
+The list of all the commits can be found here: https://apertium.projectjj.com/gsoc2018/zu-ann/zu-ann.html.
-* tar.gz with commits can be downloaded here: https://apertium.projectjj.com/gsoc2018/zu-ann.tar.gz.
-* zip with commits can be downloaded here: https://apertium.projectjj.com/gsoc2018/zu-ann.zip.
+tar.gz with commits can be downloaded here: https://apertium.projectjj.com/gsoc2018/zu-ann.tar.gz.
+zip with commits can be downloaded here: https://apertium.projectjj.com/gsoc2018/zu-ann.zip.
 ==What was done==
-* Lexicons in bak.lexc were changed to correspond to the ones in tat.lexc, missing lexicons and tags were added to bak.lexc and new rules were added to bak.twol.
+Lexicons in bak.lexc were changed to correspond to the ones in tat.lexc, missing lexicons and tags were added to bak.lexc and new rules were added to bak.twol.
-* The stems from tat.lexc were translated into Bashkir and added to bak.lexc and bidix.
-* Words from the Bashkir frequency list http://lcph.bashedu.ru/index.php?go=wikilist_lemmas were translated into Tatar and added to tat.lexc and bidix.
+The stems from tat.lexc were translated into Bashkir and added to bak.lexc and bidix.
-* Using Russian-Tatar and Russian-Bashkir dictionaries new stems were added to tat.lexc, bak.lexc and bidix.
-* Using Wikidata new toponyms were added to tat.lexc, bak.lexc and bidix.
+Words from the Bashkir frequency list http://lcph.bashedu.ru/index.php?go=wikilist_lemmas were translated into Tatar and added to tat.lexc and bidix.
+Using Russian-Tatar and Russian-Bashkir dictionaries new stems were added to tat.lexc, bak.lexc and bidix.
+Using Wikidata new toponyms were added to tat.lexc, bak.lexc and bidix.
 ==Statistics==
 {|class="wikitable"
-| ||'''tat.lexc''' || '''bak.lexc''' || '''bak.twol''' || '''bidix''' || '''Bilingual Coverage'''
+| ||'''tat.lexc''' || '''bak.lexc''' || '''bidix''' || '''Bilingual Coverage'''
 |-
+|Before|| ~ 26 400 lemmas || ~ 2 800 lemmas || ~ 2 600 lemmas || tat-bak: 72.00%, bak-tat: 68.63%
-|Before|| || || || ||
 |-
+|After|| ~ 56 900 lemmas || ~ 56 000 lemmas || ~ 50 000 lemmas ||  tat-bak: 87.56%, bak-tat: 88.98%
-|After|| || || || ||
 |}
+Bilingual coverage was calculated using aq-covtest (http://wiki.apertium.org/wiki/Apertium-quality/Application_Documentation) and Wikipedia dumps (https://dumps.wikimedia.org/ttwiki/20180801/ and https://dumps.wikimedia.org/bawiki/20180801/).
 ==Future work==
-* Continue improving Bashkir monolingual transducer to make more possible word forms analyzed.
+* Continue improving Bashkir monolingual transducer to make more word forms analyzed.
 * Continue improving the coverage.
 * Check (and fix if necessary) the words, mostly proper nouns, which were added using auto translation.
+* Revise the dictionaries.
-* Drop duplicates and sort each part in alphabetical order.
