Difference between revisions of "Tatar and Russian"

Revision as of 22:12, 10 June 2014

Last updated	Testvoc (clean or not)	Corpus testvoc (no *, no */@, no */@/#)	Stems in bidix	WER, PER on dev. corpus	Average WER, PER on unseen texts
10/06/2014	No	news(80.0, 79.8, 70.0) wp(77.1, 76.7, 70.2) aytmatov(84.4, 84.2, 78.8) NT(77.6, 77.3, 71.8) Quran(79.8, 79.6, 74.0)	3258	71.05 %, 53.42 %	--

Testvoc = apertium-tat-rus/testvoc/standard/testvoc.sh
Corpus testvoc = apertium-tat-rus/testvoc/corpus/trimmed-coverage.sh. Corpora can be found in the turkiccorpora repository.
- news = tat.news.2005-2011_300K-sentences.txt.bz2, wp = tat.wikipedia.2013-02-25.txt.bz2, NT = tat.NT.txt.bz2. The other corpus names are unambiguous.
The number of stems is taken from the header that "apertium-dixtools format-1line 10 55 apertium-tat-rus.tat-rus.dix" produces.
Development corpus = apertium-tat-rus/corpus (-test = tat-rus-nova.txt, -ref = tat-rus-posted.txt). WER / PER results are given when unknown-word marks (stars) are not removed.
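The three corpus testvoc figures per corpus can be read as the share of output tokens that carry no `*` mark, no `*` or `@` mark, and no `*`, `@` or `#` mark. The sketch below illustrates that reading only; it is not the code of trimmed-coverage.sh, and the function name `trimmed_coverage` and the whitespace tokenisation are assumptions made for the example.

```python
def trimmed_coverage(tokens):
    """Return three percentages for a list of translator output tokens:
    tokens without '*', without '*' or '@', and without '*', '@' or '#'.

    Assumption: a debug mark is the first character of a token.
    """
    total = len(tokens)
    if total == 0:
        return (0.0, 0.0, 0.0)

    def share(marks):
        # Count tokens that do not start with any of the given marks.
        clean = sum(1 for token in tokens if not token.startswith(marks))
        return round(100.0 * clean / total, 1)

    return (share(("*",)), share(("*", "@")), share(("*", "@", "#")))


if __name__ == "__main__":
    sample = "мин *китап @укыйм #бар өй су кеше ел ай көн".split()
    print(trimmed_coverage(sample))
```

Each figure can only be lower than or equal to the one before it, which matches the pattern in the table (for example 80.0, 79.8, 70.0 for news).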

This is a workplan for development efforts for the Tatar to Russian translator in Google Summer of Code 2014.

Clean testvoc
10000 top stems in bidix and at least 80% trimmed coverage
Constraint grammar of Tatar containing at least 1000 rules, which makes 90-95% of all words unambiguous, with at least 95% retaining the correct analysis.
Average WER on unseen texts below 50%

Weeks 1-6		Weeks 7-12		Saturdays
get categor(y/ies) testvoc clean with one word ->	<- add more stems to categor(y/ies) while preserving testvoc clean	disambiguation	lexical selection	adding stems
transfer rules for pending wiki tests (focus on phrases and clauses, not single words)				adding stems

Week	Dates	Target		Achieved		Evaluation	Notes
Week	Dates	Testvoc (category, type)	Stems	Testvoc clean?	Stems	Evaluation	Notes
1	19/05—25/05	Nouns, lite	--	✓	236	--
2	26/05—01/06	Adjectives, lite	All nouns from tat.lexc	✓	2141	--	Not all nouns were added. Around 500 are pending. They require adding stems to rus.dix as well.
3	02/06—08/06	Verbs, lite	4106			--
4	09/06—15/06	Adverbs, full	6071			--
5	16/06—22/06	Numerals, full	8036			--
6	23/06—29/06	Pronouns, full	10000			500 words	Midterm evaluation
12	04/08—10/08	All categories, full				Gisting	Final evaluation
13	11/08—18/08	Installation and usage documentation for end-users (in Tatar/Russian)
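The stem targets for weeks 3 to 6 look like an even split of the gap between the 2141 stems reached in week 2 and the 10000-stem goal. The sketch below reproduces them under that assumption; the function name `stem_targets` and the round-up-then-cap rule are illustrative guesses, not something the workplan states.

```python
import math


def stem_targets(achieved, goal, weeks_left):
    """Spread the remaining stems evenly over the remaining weeks.

    The weekly step is rounded up and the running total is capped at the
    goal (an assumed rule that happens to match the table).
    """
    step = math.ceil((goal - achieved) / weeks_left)
    return [min(achieved + step * week, goal) for week in range(1, weeks_left + 1)]


if __name__ == "__main__":
    # 2141 stems after week 2, 10000 wanted by week 6, four weeks to go.
    print(stem_targets(2141, 10000, 4))
```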

Testvoc-lite (apertium-tat-rus/testvoc/lite$ ./testvoc.sh) for a category means taking one word per sub-category and making the full paradigm of that word pass through the translator without debug symbols. E.g. "testvoc-lite for nouns clean" means that if we leave only one word in each of the N1, N-COMPOUND-PX etc. lexicons, they will pass through the translator without [@#] errors.
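The "clean" criterion can be sketched as a scan of the translator output for tokens that start with `@` or `#`. This is a simplified illustration, not the logic of testvoc.sh; the names `debug_tokens` and `is_clean` are hypothetical, and ordinary text that begins with `@` or `#` would be flagged as well.

```python
import re

# A token counts as a debug token if it starts with '@' or '#'
# at the beginning of the text or right after whitespace.
DEBUG_TOKEN = re.compile(r"(?<!\S)[@#]\S+")


def debug_tokens(output):
    """Return every whitespace-delimited token that starts with '@' or '#'."""
    return DEBUG_TOKEN.findall(output)


def is_clean(output):
    """True if the output contains no '@' or '#' debug tokens."""
    return not debug_tokens(output)


if __name__ == "__main__":
    print(debug_tokens("дом @китап #домов *unknown"))
    print(is_clean("дом книга"))
```

Unknown-word stars (`*`) are deliberately not counted here, since the text names only the `@` and `#` errors.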
Until the midterm, the bilingual dictionary and, if necessary, the Russian transducer are expanded with translations of stems taken from apertium-tat.tat.lexc.
Evaluation (except for the gisting evaluation) means taking $n$ words and measuring the post-edition word error rate (WER) on them. The output for those $n$ words should be clean.
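For orientation, WER and the PER figure reported in the status table can be sketched as follows. This uses the common textbook definitions (word-level edit distance for WER, bag-of-words matching for PER), both divided by the reference length; the evaluation script actually used for the table may differ in details such as tokenisation or case handling, and all function names here are illustrative.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent (word order is ignored)."""
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return 100.0 * (max(len(ref), len(hyp)) - matches) / len(ref)


if __name__ == "__main__":
    reference = "a b c d".split()   # post-edited translation
    hypothesis = "b a c d".split()  # raw translator output
    print(wer(reference, hypothesis), per(reference, hypothesis))
```

Because PER ignores word order, it is never higher than WER for the same pair of texts, which is consistent with the 71.05 % and 53.42 % reported for the development corpus.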

@@ Line 12: / Line 12: @@
 ! Average WER, PER on unseen texts
 |-
-! 02/06/2014
+! 10/06/2014
 |   No
 |
-* news(67.6, 67.4, 59.4)
+* news(80.0, 79.8, 70.0)
-* wp(66.9, 66.6, 61.1)
+* wp(77.1, 76.7, 70.2)
-* aytmatov(71.7, 71.5 66.8)
+* aytmatov(84.4, 84.2 78.8)
-* NT(67.1, 66.8, 62.0)
+* NT(77.6, 77.3, 71.8)
-* Quran(68.9, 68.7, 63.3)
+* Quran(79.8, 79.6, 74.0)
-| 2141
+| 3258
-| 71.05 %, 53.68 %
+| 71.05 %, 53.42 %
 | --
 |-