Tatar and Bashkir/GSOC 2018

From Apertium

Revision as of 01:16, 10 August 2018

This is the report for the Google Summer of Code 2018 project on Tatar-Bashkir machine translation.

List of commits

What was done

  • Lexicons in bak.lexc were changed to correspond to those in tat.lexc; missing lexicons and tags were added to bak.lexc, and new rules were added to bak.twol.
  • The stems from tat.lexc were translated into Bashkir and added to bak.lexc and bidix.
  • Words from the Bashkir frequency list (http://lcph.bashedu.ru/index.php?go=wikilist_lemmas) were translated into Tatar and added to tat.lexc and bidix.
  • New stems taken from Russian-Tatar and Russian-Bashkir dictionaries were added to tat.lexc, bak.lexc and bidix.
  • New toponyms taken from Wikidata were added to tat.lexc, bak.lexc and bidix.

Statistics

tat.lexc bak.lexc bak.twol bidix Bilingual Coverage
Before
After
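
For readers unfamiliar with the coverage column, the following is a minimal sketch of how a naive coverage figure can be computed from analyser output. It assumes the text has already been run through a morphological analyser that emits the Apertium stream format, where each lexical unit is written as ^surface/analysis$ and an unknown word is marked with an asterisk (^word/*word$). The function name naive_coverage and the sample tags are illustrative only, and this is not necessarily how the figures in the table were measured.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT_RE = re.compile(r"\^([^/$^]*)((?:/[^/$^]*)*)\$")


def naive_coverage(analysed_text):
    """Return the fraction of lexical units that received at least one analysis.

    A unit whose first analysis starts with '*' is counted as unknown.
    """
    total = 0
    known = 0
    for match in LEXICAL_UNIT_RE.finditer(analysed_text):
        analyses = match.group(2)
        total += 1
        if not analyses.startswith("/*"):
            known += 1
    # Avoid division by zero on empty input.
    return known / total if total else 0.0


# Hypothetical analyser output: three known units and one unknown word.
sample_output = (
    "^алма/алма<n><nom>$ ^китап/китап<n><nom>$ "
    "^фываыва/*фываыва$ ^./.<sent>$"
)
print(naive_coverage(sample_output))
```

Counting per token, as here, weights frequent words more heavily than counting per distinct word form; which of the two is more useful depends on what the statistic is meant to show.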

Future work

  • Continue improving the Bashkir monolingual transducer so that more word forms can be analyzed.
  • Continue improving the coverage.
  • Check (and fix if necessary) the words, mostly proper nouns, that were added using automatic translation.
  • Revise the dictionaries and drop duplicates.
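
The last item could be started with a small script. The sketch below reports entries that appear more than once within the same LEXICON of a lexc file. It assumes simple one-line entries of the form lemma:stem Continuation ; and does not handle lexc escapes such as %; or %!, so its output should be treated as a list of candidates to review by hand. The function name find_duplicate_entries and the continuation lexicon names in the sample (N1, V-IV) are hypothetical.

```python
import re
from collections import Counter

# Everything before the terminating ';' on a non-comment line is the entry.
ENTRY_RE = re.compile(r"^\s*([^!\s][^;!]*?)\s*;")


def find_duplicate_entries(lexc_text):
    """Return sorted (lexicon, entry) pairs that occur more than once."""
    counts = Counter()
    current_lexicon = None
    for line in lexc_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and full-line comments.
        if not stripped or stripped.startswith("!"):
            continue
        if stripped.startswith("LEXICON"):
            parts = stripped.split()
            current_lexicon = parts[1] if len(parts) > 1 else None
            continue
        match = ENTRY_RE.match(line)
        if match and current_lexicon is not None:
            # Normalise whitespace so spacing differences do not hide duplicates.
            entry = " ".join(match.group(1).split())
            counts[(current_lexicon, entry)] += 1
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical lexc fragment with one duplicated stem.
sample_lexc = """LEXICON Nouns
алма:алма N1 ; ! "apple"
китап:китап N1 ; ! "book"
алма:алма   N1 ; ! same entry, different spacing
LEXICON Verbs
бар:бар V-IV ; ! "go"
"""
print(find_duplicate_entries(sample_lexc))
```

The script only reports duplicates and leaves the file unchanged, so that comments attached to the removed copy can be merged manually.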