Tatar and Bashkir/GSOC 2018

From Apertium
Revision as of 01:15, 10 August 2018 by Zu-ann

This is the report for the Google Summer of Code 2018 project on Tatar-Bashkir machine translation.

List of commits

What was done

  • The lexicons in bak.lexc were restructured to match those in tat.lexc; missing lexicons and tags were added to bak.lexc, and new rules were added to bak.twol.
  • The stems from tat.lexc were translated into Bashkir and added to bak.lexc and the bidix.
  • Words from the Bashkir frequency list (http://lcph.bashedu.ru/index.php?go=wikilist_lemmas) were translated into Tatar and added to tat.lexc and the bidix.
  • New stems taken from Russian-Tatar and Russian-Bashkir dictionaries were added to tat.lexc, bak.lexc and the bidix.
  • New toponyms taken from Wikidata were added to tat.lexc, bak.lexc and the bidix.
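
As an illustration of what adding a translated stem to the bidix involves, the following is a minimal sketch that builds one bilingual dictionary entry in the usual Apertium XML form. The function name bidix_entry, the tat-left/bak-right direction, and the example pair су / һыу with the tag n are assumptions chosen for illustration; they are not taken from the project's files.

```python
from xml.sax.saxutils import escape


def bidix_entry(tat_stem: str, bak_stem: str, pos: str) -> str:
    """Build one bidix <e> element pairing a Tatar stem with a Bashkir stem.

    Hypothetical helper: assumes Tatar is the left side and Bashkir the
    right side, and that both sides carry the same part-of-speech tag.
    """
    tag = f'<s n="{escape(pos)}"/>'
    # Inside a bidix entry a space within a multiword stem is written as <b/>.
    left = escape(tat_stem).replace(" ", "<b/>")
    right = escape(bak_stem).replace(" ", "<b/>")
    return f"<e><p><l>{left}{tag}</l><r>{right}{tag}</r></p></e>"


if __name__ == "__main__":
    # Illustrative pair only: "water" in Tatar and Bashkir.
    print(bidix_entry("су", "һыу", "n"))
```

The matching entries in tat.lexc and bak.lexc are a separate step, since each stem there also needs a continuation lexicon, which this sketch does not attempt to choose.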

Statistics

       | tat.lexc | bak.lexc | bak.twol | bidix | Bilingual Coverage
Before |          |          |          |       |
After  |          |          |          |       |
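
For readers unfamiliar with how a coverage figure of this kind is usually obtained, here is a rough sketch that computes naive coverage from text in the Apertium stream format, where each lexical unit is written as ^.../...$. It assumes that a unit whose first reading starts with * (unknown to the analyser) or @ (not found in the bidix) counts as uncovered. The function name naive_coverage and the sample string are illustrative assumptions, and this is not necessarily how the numbers for this project were computed.

```python
import re

# One lexical unit in Apertium stream format: ^left/reading1/reading2$
# (escaped characters such as \/ are ignored in this simplified pattern).
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def naive_coverage(stream: str) -> float:
    """Return the share of lexical units whose first reading is known."""
    units = LEXICAL_UNIT.findall(stream)
    if not units:
        return 0.0
    covered = 0
    for unit in units:
        parts = unit.split("/")
        # Assumption: '*' marks an unanalysed word, '@' a missing bidix entry.
        if len(parts) > 1 and not parts[1].startswith(("*", "@")):
            covered += 1
    return covered / len(units)


if __name__ == "__main__":
    sample = "^су<n><nom>/һыу<n><nom>$ ^foo<n>/@foo<n>$ ^bar/*bar$"
    print(f"{naive_coverage(sample):.2%}")
```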

Future work

  • Continue improving the Bashkir monolingual transducer so that it analyses more word forms.
  • Continue improving the coverage.
  • Check (and fix where necessary) the words, mostly proper nouns, that were added using automatic translation.
  • Remove duplicate entries and sort each part alphabetically.
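
The last item could be approached along the following lines. This is only a sketch: it assumes each part is available as a list of entry lines, and the function name dedupe_and_sort is hypothetical. Python's default string ordering follows code points, which does not match Tatar or Bashkir alphabetical order for letters such as ә, ө, ү or һ, so a language-specific sort key would need to be passed in for a proper alphabetical result.

```python
from typing import Callable, Iterable, List, Optional


def dedupe_and_sort(
    entries: Iterable[str],
    key: Optional[Callable[[str], object]] = None,
) -> List[str]:
    """Drop blank and repeated entries, then sort what remains.

    Entries are compared after stripping surrounding whitespace.
    `key` is an optional sort key; without it the order is by code point.
    """
    seen = set()
    unique = []
    for entry in entries:
        cleaned = entry.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return sorted(unique, key=key)


if __name__ == "__main__":
    # Apply to each part separately so entries never move between parts.
    print(dedupe_and_sort(["б", "а", "б ", ""]))
```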