Tatar and Russian

Current state

Last updated	Testvoc (clean or not)	Corpus testvoc (no *, no */@, no */@/#)	Stems in bidix	WER, PER on dev. corpus	Average WER, PER on unseen texts
08/07/2014	No	news(85.0, 85.0, 77.3) wp(79.7, 79.6, 73.0) aytmatov(87.3, 87.1, 80.3) NT(80.4, 80.3, 72.3) Quran(82.5, 82.3, 74.1)	5331	88.13 %, 71.53 %	--

Testvoc = apertium-tat-rus/testvoc/standard/testvoc.sh
Corpus testvoc = apertium-tat-rus/testvoc/corpus/trimmed-coverage.sh. Corpora can be found in the turkiccorpora repository.
- news = tat.news.2005-2011_300K-sentences.txt.bz2, wp = tat.wikipedia.2013-02-25.txt.bz2, NT = tat.NT.txt.bz2. Others are unambiguous.
The number of stems is taken from the header that "apertium-dixtools format-1line 10 55 apertium-tat-rus.tat-rus.dix" produces.
Development corpus = apertium-tat-rus/corpus (-test = tat-rus-nova.txt, -ref = tat-rus-posted.txt). WER / PER results are given when unknown-word marks (stars) are not removed.
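
The three corpus testvoc figures per corpus can be read as the share of tokens that come out of the translator without `*` (unknown word), without `*` or `@` (no bidix entry), and without `*`, `@` or `#` (generation failure). A minimal Python sketch of that calculation follows. It is not what trimmed-coverage.sh does internally; it assumes the translator output has already been split into tokens and that each debug mark is a prefix on the token. The function name `coverage` and the sample tokens are made up for illustration.

```python
def coverage(tokens):
    """Return three coverage percentages for a list of output tokens.

    Assumption: a problematic token starts with '*', '@' or '#'.
    The three values are the share of tokens without '*',
    without '*' or '@', and without '*', '@' or '#'.
    """
    total = len(tokens)
    if total == 0:
        return (0.0, 0.0, 0.0)
    no_star = sum(1 for t in tokens if not t.startswith("*"))
    no_star_at = sum(1 for t in tokens if not t.startswith(("*", "@")))
    no_marks = sum(1 for t in tokens if not t.startswith(("*", "@", "#")))
    return tuple(round(100.0 * n / total, 1)
                 for n in (no_star, no_star_at, no_marks))


# Hypothetical output tokens, one with each kind of mark.
sample = ["*foo", "@bar", "#baz", "ok"]
print(coverage(sample))
```

Each figure can only be lower than or equal to the one before it, which matches the pattern in the table above.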

This is a workplan for development efforts for the Tatar to Russian translator in Google Summer of Code 2014.

Clean testvoc
10000 top stems in bidix and at least 80% trimmed coverage
Constraint grammar of Tatar containing at least 1000 rules, which makes 90-95% of all words unambiguous, with at least 95% retaining the correct analysis.
Average WER on unseen texts below 50 %

Weeks 1-6		Weeks 7-12		Saturdays
get categor(y/ies) testvoc clean with one word ->	<- add more stems to categor(y/ies) while preserving testvoc clean	disambiguation	lexical selection	adding stems & cleaning testvoc
transfer rules for pending wiki tests (focus on phrases and clauses, not single words)				adding stems & cleaning testvoc

Week	Dates	Target		Achieved		Evaluation	Notes
Week	Dates	Testvoc (category, type)	Stems	Testvoc clean?	Stems	Evaluation	Notes
1	19/05—25/05	Nouns, lite	--	✓	236	--
2	26/05—01/06	Adjectives, lite	All nouns from tat.lexc	✓	2141	--	Not all nouns were added. Around 500 are pending. They require adding stems to rus.dix as well.
3	02/06—08/06	Verbs, lite	4106	95.27%	3258	--	Adjectives, verbs added. Some more pending in apertium-rus/dev/to_[add/check].txt
4	09/06—15/06	Adverbs, full	6071	✓	5331	--	Adverbs clean when I comment out everything except adverbs in Root lexicon. 96% if I don't. Probably something else gets translated with adverb(s).
5	16/06—22/06	Numerals, full	8036	✓	5488	--
6	23/06—29/06	Pronouns, full	10000			1. 500 words x 2 2. Try out assimilation evaluation toolkit if it's usable by that time.	Midterm evaluation
7	30/06—06/07	Manually disambiguate a Tatar corpus (in a way so that it will be usable in the cg3ide later)
12	04/08—10/08	All categories, full				Gisting	Final evaluation
13	11/08—18/08	Installation and usage documentation for end-users (in Tatar/Russian)

Testvoc-lite (apertium-tat-rus/testvoc/lite$ ./testvoc.sh) for a category means taking one word per sub-category and making the full paradigm of that word pass through the translator without debug symbols. E.g. "testvoc-lite for nouns clean" means that if we leave only one word in each of the N1, N-COMPOUND-PX etc. lexicons, their full paradigms pass through the translator without [@#] errors.
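
To make "clean" concrete, the check boils down to scanning the translator output for words marked with `@` or `#`. The sketch below is not the testvoc.sh script itself. It assumes the output is available as plain text lines and that a debug mark sits at the start of a word. The names `find_errors` and `is_clean` and the sample lines are hypothetical.

```python
import re

# Assumption: an error word starts with '@' or '#' at the beginning
# of a line or right after whitespace.
DEBUG_MARK = re.compile(r"(?<!\S)[@#]\S+")


def find_errors(lines):
    """Return (line_number, marked_word) pairs for every '@' or '#' word."""
    errors = []
    for number, line in enumerate(lines, start=1):
        for match in DEBUG_MARK.finditer(line):
            errors.append((number, match.group(0)))
    return errors


def is_clean(lines):
    """A category is clean when no line contains a debug-marked word."""
    return not find_errors(lines)


# Made-up output lines in a "source --> target" layout.
output = [
    "form1 --> word1",
    "form2 --> #word2",
    "form3 --> @form3",
]
for number, word in find_errors(output):
    print(f"line {number}: {word}")
```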
Until the midterm, the bilingual dictionary and, if necessary, the Russian transducer are expanded with translations of stems taken from apertium-tat.tat.lexc.
Evaluation (except for gisting evaluation) means taking $n$ words of text and measuring the post-edition word error rate (WER) on them. The output for those $n$ words should be clean.
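
As an illustration of the two metrics tracked on this page, here is a minimal Python sketch of WER and PER over tokenised text. It uses common textbook definitions: WER is the word-level edit distance divided by the reference length, and PER counts the words that cannot be matched when word order is ignored. The evaluation tool actually used for the figures above may differ in details such as tokenisation or case handling. The function names are hypothetical.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent (word order ignored)."""
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 100.0 * (max(len(ref), len(hyp)) - matches) / len(ref)


# Made-up example: the post-edited reference vs. raw translator output,
# with the unknown-word star left in place.
reference = "a b c d".split()
hypothesis = "b a c *x".split()
print(wer(reference, hypothesis), per(reference, hypothesis))
```

Because the star stays attached to the unknown word, that token can never match the reference, which is why leaving the marks in raises both scores.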