Tatar and Bashkir/GSOC 2018


This is the final report for the Google Summer of Code 2018 project on Tatar-Bashkir machine translation.

List of commits

What was done

  • The lexicons in bak.lexc were restructured to match those in tat.lexc. Missing lexicons and tags were added to bak.lexc, and new rules were added to bak.twol.
  • The stems from tat.lexc were translated into Bashkir and added to bak.lexc and the bidix.
  • Words from the Bashkir frequency list (http://lcph.bashedu.ru/index.php?go=wikilist_lemmas) were translated into Tatar and added to tat.lexc and the bidix.
  • New stems taken from Russian-Tatar and Russian-Bashkir dictionaries were added to tat.lexc, bak.lexc and the bidix.
  • New toponyms taken from Wikidata were added to tat.lexc, bak.lexc and the bidix.
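
The step of adding translated word pairs to the bidix can be illustrated with a minimal sketch. It is not the script used in the project; it only shows how a list of Tatar-Bashkir pairs could be turned into entries in the Apertium bilingual dictionary format. The function name `bidix_entry`, the sample word pairs, the tags chosen for them, and the assumption that Tatar is the left side are all illustrative.

```python
from xml.sax.saxutils import escape


def bidix_entry(left, right, tags=("n",)):
    """Format one word pair as a bilingual dictionary <e> element.

    `left` and `right` are lemmas; `tags` are the symbol names attached
    to both sides (the same tags on both sides is an assumption made
    here for simplicity).
    """
    symbols = "".join(f'<s n="{tag}"/>' for tag in tags)
    # Spaces inside a lemma are written as <b/> in Apertium dictionaries.
    left_xml = escape(left).replace(" ", "<b/>")
    right_xml = escape(right).replace(" ", "<b/>")
    return f"<e><p><l>{left_xml}{symbols}</l><r>{right_xml}{symbols}</r></p></e>"


# Illustrative pairs only (Tatar on the left, Bashkir on the right).
pairs = [
    ("су", "һыу", ("n",)),
    ("сүз", "һүҙ", ("n",)),
    ("Казан", "Ҡазан", ("np", "top")),
]

for tat, bak, tags in pairs:
    print(bidix_entry(tat, bak, tags))
```

Generating entries from a reviewed word list rather than typing them by hand keeps the tags on both sides consistent, although each generated entry still needs a matching stem in tat.lexc and bak.lexc to be usable.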

Statistics

tat.lexc bak.lexc bak.twol bidix Bilingual Coverage
Before
After

Future work

  • Continue improving the Bashkir monolingual transducer so that it analyses more word forms.
  • Continue improving the coverage.
  • Check, and fix where necessary, the words (mostly proper nouns) that were added using automatic translation.
  • Revise the dictionaries and remove duplicate entries.
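
As a starting point for the last item, duplicate entries could be found with a small script. The sketch below assumes that duplicates are entries whose text is identical once whitespace is normalised; entries that differ only in a comment or in tag order would need a stricter comparison. The function name `drop_duplicates` and the sample lines are hypothetical.

```python
def drop_duplicates(entries):
    """Return entries with repeats removed, keeping the first occurrence.

    Two entries count as duplicates when they are equal after collapsing
    runs of whitespace. Blank lines are dropped from the result.
    """
    seen = set()
    unique = []
    for entry in entries:
        key = " ".join(entry.split())  # normalise whitespace
        if not key or key in seen:
            continue
        seen.add(key)
        unique.append(entry)
    return unique


# Hypothetical lexc-style lines; the second repeats the first with extra spaces.
lines = [
    "һыу:һыу N1 ;",
    "һыу:һыу   N1 ;",
    "һүҙ:һүҙ N1 ;",
]
print(drop_duplicates(lines))
```

Keeping the first occurrence preserves the original ordering of the file, which makes the resulting diff easier to review.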