Tatar and Bashkir/GSOC 2018
This is the report for the Google Summer of Code 2018 project on Tatar-Bashkir machine translation.
List of commits
- The list of all my commits can be found here: https://apertium.projectjj.com/gsoc2018/zu-ann/zu-ann.html.
- tar.gz with commits can be downloaded here: https://apertium.projectjj.com/gsoc2018/zu-ann.tar.gz.
- zip with commits can be downloaded here: https://apertium.projectjj.com/gsoc2018/zu-ann.zip.
What was done
- Lexicons in bak.lexc were changed to correspond to those in tat.lexc; missing lexicons and tags were added to bak.lexc, and new rules were added to bak.twol.
- The stems from tat.lexc were translated into Bashkir and added to bak.lexc and bidix.
- Words from the Bashkir frequency list (http://lcph.bashedu.ru/index.php?go=wikilist_lemmas) were translated into Tatar and added to tat.lexc and bidix.
- New stems were added to tat.lexc, bak.lexc and bidix using Russian-Tatar and Russian-Bashkir dictionaries.
- New toponyms were added to tat.lexc, bak.lexc and bidix using Wikidata.
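Adding a translated stem to the bidix amounts to writing one `<e>` entry per word pair. The following is a minimal sketch of that step, not the script used in the project: the helper name `bidix_entry`, the choice of Tatar as the left side, the `n` tag and the two sample pairs are all assumptions made for illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def bidix_entry(left, right, tags=("n",)):
    """Return one bidix <e> element pairing a left-side and a right-side lemma."""

    def side(lemma):
        # Spaces inside a multiword lemma are written as <b/> in Apertium dictionaries.
        text = "<b/>".join(escape(part) for part in lemma.split(" "))
        # Each tag becomes an <s n="..."/> symbol after the lemma.
        symbols = "".join(f"<s n={quoteattr(tag)}/>" for tag in tags)
        return text + symbols

    return f"<e><p><l>{side(left)}</l><r>{side(right)}</r></p></e>"


# Illustrative pairs only (Tatar on the left, Bashkir on the right).
pairs = [("су", "һыу"), ("китап", "китап")]
for tat, bak in pairs:
    print(bidix_entry(tat, bak))
```

Generated entries still need the matching stems in tat.lexc and bak.lexc, otherwise the bidix pair is never reached during translation.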
Statistics
| | tat.lexc | bak.lexc | bak.twol | bidix | Bilingual Coverage |
| Before | | | | | |
| After | | | | | |
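The report does not say how coverage was measured. As a rough illustration, naive coverage can be computed from analyser output in the Apertium stream format, where each lexical unit is written `^surface/analysis$` and an unknown word appears as `^word/*word$`. The function name `naive_coverage` and the sample string are hypothetical, and the sketch assumes the text contains no escaped `^` or `$` characters.

```python
import re

# A lexical unit in the Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def naive_coverage(analysed_text):
    """Return the share of lexical units that received at least one analysis."""
    total = 0
    known = 0
    for unit in LEXICAL_UNIT.findall(analysed_text):
        parts = unit.split("/")
        total += 1
        # An unknown word is emitted with a leading asterisk: ^word/*word$
        if len(parts) > 1 and not parts[1].startswith("*"):
            known += 1
    return known / total if total else 0.0


# Hypothetical analyser output: one known word and one unknown word.
sample = "^китап/китап<n><nom>$ ^ффф/*ффф$"
print(f"{naive_coverage(sample):.2%}")
```

The same counting approach would apply to bilingual lookup output if the unknown-word marker is adjusted accordingly.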
Future work
- Continue improving the Bashkir monolingual transducer so that it analyses more word forms.
- Continue improving coverage.
- Check, and fix where necessary, the words (mostly proper nouns) that were added using automatic translation.
- Revise the dictionaries and drop duplicates.
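Dropping duplicates from a lexc file could be partly automated. The sketch below is one possible approach under stated assumptions, not an existing Apertium tool: it treats two lines in the same LEXICON as duplicates when they are identical once comments and extra whitespace are ignored, and it assumes no entry contains an escaped `%!`. The function name `drop_duplicate_entries` and the continuation lexicon `N1` are hypothetical.

```python
def drop_duplicate_entries(lines):
    """Remove repeated entries within each LEXICON, keeping the first occurrence."""
    seen = set()
    kept = []
    for line in lines:
        # Everything after "!" is a comment in lexc; compare only the entry itself,
        # with runs of whitespace collapsed to single spaces.
        entry = " ".join(line.split("!", 1)[0].split())
        if entry.startswith("LEXICON "):
            seen = set()  # duplicates are tracked separately for each lexicon
        elif entry:
            if entry in seen:
                continue  # same entry already kept in this lexicon
            seen.add(entry)
        kept.append(line)
    return kept


# Hypothetical lexicon fragment with one repeated entry.
lexicon = [
    "LEXICON Nouns",
    'китап:китап N1 ; ! "book"',
    "китап:китап   N1 ;",
    'һыу:һыу N1 ; ! "water"',
]
print("\n".join(drop_duplicate_entries(lexicon)))
```

Entries that differ only in their continuation lexicon are kept, since they may be intentional; those still need manual review.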