User:Ilnar.salimzyan/GSoC2014/Application
Revision as of 13:14, 14 May 2014 by Ilnar.salimzyan (talk | contribs) (Ilnar.salimzyan moved page User:Ilnar.salimzyan/Application to User:Ilnar.salimzyan/GSoC2014/Application)
You can find my proposal for GSoC 2014 here:
Post-application period
- work on the 'James and Mary' translation
  - get rid of the debugging symbols
  - get the baseline WER
- get permission to use one of the modern government-funded Tatar-Russian dictionaries under a free license and digitize it, or fall back to one of the dictionaries in the public domain and scan that
- read documentation on chunking-based transfer and papers describing other Apertium pairs for distant languages
Chunking, Chunking: A full example, sme-nob paper, eus-eng paper, eng-kaz paper.
- acceptance tests for an Apertium MT system are: regression tests on the wiki, a corpus test (WER and number of [*@#] errors), and testvoc. Unit testing an Apertium MT system means testing its modules (modes). Figure out how to unit test each module.
- one should be able to run the tests without an internet connection. Keeping a copy of the 'regression tests' HTML page in /dev solves this problem, but it doesn't allow adding new tests while offline. One way to deal with that is to keep a local copy of the regression tests in wiki format, so that if you add new tests while flying over the Atlantic, you can copy and paste them into the pair's wiki page later.
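A local wiki-format copy of the regression tests could be read and extended with something like the sketch below. It assumes each test is written on the wiki as a `{{test|lang|source|expected}}` template; that format, and the function names `parse_wiki_tests` and `format_wiki_test`, are assumptions made for illustration, not something fixed by these notes. The sentences in the usage example are placeholders.

```python
import re

# Assumed wiki line format (hypothetical, adjust to the real page):
#   * {{test|tat|source sentence|expected translation}}
TEST_RE = re.compile(r"\{\{test\|([^|}]*)\|([^|}]*)\|([^|}]*)\}\}")


def parse_wiki_tests(text):
    """Return a (lang, source, expected) tuple for every test template in text."""
    return [
        tuple(part.strip() for part in match.groups())
        for match in TEST_RE.finditer(text)
    ]


def format_wiki_test(lang, source, expected):
    """Render one test as a wiki list item, ready to paste into the wiki page."""
    return "* {{test|%s|%s|%s}}" % (lang, source, expected)


if __name__ == "__main__":
    local_copy = "* {{test|tat|source one|target one}}\n"
    # A test written offline is appended in the same format as the wiki uses.
    local_copy += format_wiki_test("tat", "source two", "target two") + "\n"
    for lang, source, expected in parse_wiki_tests(local_copy):
        print(lang, source, expected, sep="\t")
```

Because the local file uses the same markup as the wiki page, tests added offline can be pasted into the wiki unchanged, and the same parser works on both copies.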
Community-bonding period
Deliverables 0:
- testvoc script(s) which don't take forever to run (consider footnote #5 in the proposal)
- OCR'd public domain dictionary
- parallel corpus in /corpa is expanded with texts which represent domains the system could potentially be applied to (500 sentences?)
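The corpus test mentioned earlier (WER and number of [*@#] errors) could be run over such a parallel corpus roughly as follows. This is a minimal sketch, not an existing Apertium tool: WER is computed as word-level edit distance divided by the reference length, and the marker count assumes that `*`, `@` and `#` appear at the start of a whitespace-separated token in the MT output. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of hypothesis against a non-empty reference sentence."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def count_error_marks(text):
    """Count tokens that start with '*', '@' or '#' (assumed marker position)."""
    counts = {"*": 0, "@": 0, "#": 0}
    for token in text.split():
        if token[0] in counts:
            counts[token[0]] += 1
    return counts


if __name__ == "__main__":
    print(wer("a b c d", "a x c"))
    print(count_error_marks("*foo bar #baz @qux *quux"))
```

Averaging `wer` over all sentence pairs, or dividing the total edit distance by the total reference length, gives a single baseline figure to track as the pair improves.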