Difference between revisions of "Tatar and Russian"

Revision as of 19:39, 29 July 2014

Last updated	Testvoc (clean or not)	Corpus testvoc (no *, no */@, no */@/#)	Stems in bidix	WER, PER on dev. corpus	Average WER, PER on unseen texts
20/07/2014	No	news(86.3, 86.3, 78.0) wp(83.0, 83.0, 78.2) aytmatov(90.0, 90.0, 87.5) NT(83.0, 83.0, 77.8) Quran(85.3, 85.3, 80.4)	5934	75.64 %, 58.79 %	--

Testvoc = apertium-tat-rus/testvoc/standard/testvoc.sh
Corpus testvoc = apertium-tat-rus/testvoc/corpus/trimmed-coverage.sh. Corpora can be found in the turkiccorpora repository.
- news = tat.news.2005-2011_300K-sentences.txt.bz2, wp = tat.wikipedia.2013-02-25.txt.bz2, NT = tat.NT.txt.bz2. Others are unambiguous.
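The three figures reported per corpus can be read as cumulative coverage percentages. The sketch below is only an illustration and not the logic of trimmed-coverage.sh: it assumes the usual Apertium debug prefixes (`*` for a word unknown to the analyser, `@` for a lemma missing from the bidix, `#` for a generation failure) and assumes each figure is the share of tokens free of the listed marks. The function name and the sample tokens are hypothetical.

```python
def trimmed_coverage(tokens):
    """Return three percentages: tokens free of '*', of '*' and '@',
    and of '*', '@' and '#' (assumed meaning of the three figures)."""
    total = len(tokens)
    if total == 0:
        return (0.0, 0.0, 0.0)
    results = []
    for marks in ("*", "*@", "*@#"):
        # A token counts as covered if it carries none of the marks in this set.
        clean = sum(1 for tok in tokens if not any(m in tok for m in marks))
        results.append(round(100.0 * clean / total, 1))
    return tuple(results)


# Hypothetical translator output, one token per word.
sample = ["я", "*китапны", "@уку", "#читать", "дом"]
print(trimmed_coverage(sample))
```

Because each set of marks contains the previous one, the three numbers can only stay equal or decrease from left to right, which matches the pattern in the table.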
Number of stems is taken from the header that "apertium-dixtools format-1line 10 55 apertium-tat-rus.tat-rus.dix" produces.
Development corpus = apertium-tat-rus/corpus (-test = tat-rus-nova.txt, -ref = tat-rus-posted.txt). WER / PER results are given when unknown-word marks (stars) are not removed.
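To make the two metrics concrete, here is a minimal sketch of how WER and PER could be computed for one sentence pair. It is not the evaluation script used for the figures above: WER is taken as the word-level edit distance divided by the reference length, and PER uses one common position-independent definition (bag-of-words mismatches over reference length). The function names and the sample sentences are hypothetical.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent (word order ignored)."""
    matches = sum((Counter(ref) & Counter(hyp)).values())
    errors = max(len(ref), len(hyp)) - matches
    return 100.0 * errors / len(ref)


# Hypothetical post-edited reference and raw translator output.
ref = "я читаю эту книгу".split()
hyp = "эту книгу я *укыйм".split()
print(wer(ref, hyp), per(ref, hyp))
```

Since the stars are not removed, an unknown word such as `*укыйм` never matches the reference and always counts as an error, while pure reordering is penalised by WER but not by PER.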

This is a workplan for development efforts for the Tatar to Russian translator in Google Summer of Code 2014.

Clean testvoc
10000 top stems in bidix and at least 80% trimmed coverage
Constraint grammar of Tatar containing at least 1000 rules, which makes 90-95% of all words unambiguous, with at least 95% retaining the correct analysis.
Average WER on unseen texts below 50%

Weeks 1-6		Weeks 7-12		Saturdays
get categor(y/ies) testvoc clean with one word ->	<- add more stems to categor(y/ies) while preserving testvoc clean	disambiguation	lexical selection	adding stems & cleaning testvoc
transfer rules for pending wiki tests (focus on phrases and clauses, not single words)				adding stems & cleaning testvoc

Week	Dates	Target		Achieved		Evaluation	Notes
Week	Dates	Testvoc (category, type)	Stems	Testvoc clean?	Stems	Evaluation	Notes
1	19/05—25/05	Nouns, lite	--	✓	236	--
2	26/05—01/06	Adjectives, lite	All nouns from tat.lexc	✓	2141	--	Not all nouns were added. Around 500 are pending. They require adding stems to rus.dix as well.
3	02/06—08/06	Verbs, lite	4106	95.27%	3258	--	Adjectives, verbs added. Some more pending in apertium-rus/dev/to_[add/check].txt
4	09/06—15/06	Adverbs, full	6071	✓	5331	--	Adverbs clean when I comment out everything except adverbs in Root lexicon. 96% if I don't. Probably something else gets translated with adverb(s).
5	16/06—22/06	Numerals, full	8036	✓	5488	--
6	23/06—29/06	Pronouns, full	10000			1. 500 words x 2; 2. Try out the assimilation evaluation toolkit if it is usable by that time.	Midterm evaluation. Results when unknown-word marks (stars) are not removed: tat-rus/texts/text1.* (full coverage): WER 66.73%, PER 56.48%; tat-rus/texts/text2.* (not fully covered): WER 78.42%, PER 63.58%
7	30/06—06/07	Manually disambiguate a Tatar corpus (in a way so that it will be usable in the cg3ide later)		✓		--	See apertium-tat/texts/corpus.ana.txt
8	07/07—13/07					--
9	14/07—20/07					--
10	21/07—27/07					--
11	28/07—03/08	Corpus test clean on all of the available corpora				--
12	04/08—10/08	Write Constraint Grammar for Tatar				--
13	11/08—18/08	All categories, full	10000			Gisting	Final evaluation

Testvoc-lite (apertium-tat-rus/testvoc/lite$ ./testvoc.sh) for a category means taking one word per sub-category and making the full paradigm of that word pass through the translator without debug symbols. E.g. "testvoc-lite for nouns clean" means that if we leave only one word in each of the N1, N-COMPOUND-PX etc. lexicons, they will pass through the translator without [@#] errors.
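The "clean" criterion can be sketched as a simple scan of the translator output for debug symbols. This is an illustration only and not what testvoc.sh does internally; it assumes one translated form per line, and the function names and sample forms are hypothetical.

```python
import re

# '@' and '#' are the debug symbols named in the definition above.
DEBUG_MARK = re.compile(r"[@#]")


def failing_forms(lines):
    """Return the output lines that still contain a debug symbol."""
    return [line for line in lines if DEBUG_MARK.search(line)]


def is_clean(lines):
    """A category is testvoc-clean when no output line carries '@' or '#'."""
    return not failing_forms(lines)


# Hypothetical translations of one noun paradigm.
output = ["дом", "дома", "#дом", "@өй"]
print(failing_forms(output))
print(is_clean(output))
```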
Until the midterm, the bilingual dictionary and, if necessary, the Russian transducer are expanded with translations of stems taken from apertium-tat.tat.lexc.
Evaluation (except for the gisting evaluation) means taking $n$ words of text and measuring the post-edition word error rate (WER) on them. The output for those $n$ words should be clean.

@@ Line 16: / Line 16: @@
 |
 * news(86.3, 86.3, 78.0)
-* wp(83.0, 83.0, 77.3)
+* wp(83.0, 83.0, 78.2)
 * aytmatov(90.0, 90.0 87.5)
-* NT(83.0, 83.0, 78.2)
+* NT(83.0, 83.0, 77.8)
 * Quran(85.3, 85.3, 80.4)
 | 5934