User:Ilnar.salimzyan/GSoC2014


Apertium-tat-rus – machine translation system from Tatar to Russian

This page is used for organizing thoughts and documenting the development process. If you are only interested in the workplan and stats, refer to the 'Workplan' and 'Current state' sections of the Tatar and Russian page.

Post-application period

  • work on the 'James and Mary' translation
    • get rid of the debugging symbols
    • get the baseline WER
  • get permission to use one of the modern government-funded Tatar-Russian dictionaries under a free license and digitize it, or fall back to one of the dictionaries in the public domain and scan that
  • read documentation on chunking-based transfer and papers describing other Apertium pairs for distant languages
  • acceptance tests for an Apertium MT system are: regression tests on the wiki, a corpus test (WER and number of [*@#] errors) and testvoc. Unit testing an Apertium MT system means testing its modules (modes). Figure out how to unit test each module.
    • one should be able to run the tests without an internet connection. Keeping a copy of the 'regression tests' HTML page in /dev solves this problem, but it doesn't allow adding new tests while offline. One way to deal with that is to keep a local copy of the regression tests in wiki format, so that tests added while flying over the Atlantic can be copy-pasted to the wiki page of the pair later.
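
The corpus test mentioned above (WER plus the number of [*@#] errors) can be sketched roughly as follows. This is only an illustration, not the evaluation script used for the pair: the function names are hypothetical, WER is computed as word-level edit distance divided by the reference length, and a "[*@#] error" is assumed to be any whitespace-separated token that starts with one of those three characters.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of a hypothesis against a reference translation."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def count_debug_marks(text):
    """Count tokens starting with '*', '@' or '#'.

    Assumption: each such token is one error left in the MT output.
    """
    return sum(1 for token in text.split() if token[0] in "*@#")
```

For example, `wer("a b c d", "a x c")` gives 0.5 (one substitution and one deletion over four reference words). Both functions work on plain strings, so they can be run offline on the 'James and Mary' text and its post-edited translation.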

Community-bonding period

Deliverables 0:

  1. testvoc script(s) which don't take forever to run (consider footnote #5 in the proposal)
  2. OCR'd public domain dictionary
  3. parallel corpus in /corpa is expanded with texts which represent domains the system could potentially be applied to (500 sentences?)