Difference between revisions of "Tatar and Russian"

Revision as of 18:46, 2 June 2014

Last updated	Testvoc (clean or not)	Corpus testvoc (no *, no */@, no */@/#)	Stems in bidix	WER, PER on dev. corpus	WER, PER on unseen texts
02/06/2014	No	news(67.6, 67.4, 59.4) wp(66.9, 66.6, 61.1) aytmatov(71.7, 71.5, 66.8) NT(67.1, 66.8, 62.0) Quran(68.9, 68.7, 63.3)	2141	71.05 %, 53.68 %	--

Testvoc = apertium-tat-rus/testvoc/standard/testvoc.sh
Corpus testvoc = apertium-tat-rus/testvoc/corpus/trimmed-coverage.sh. Corpora can be found in the turkiccorpora repository.
- news = tat.news.2005-2011_300K-sentences.txt.bz2, wp = tat.wikipedia.2013-02-25.txt.bz2, NT = tat.NT.txt.bz2. The others are unambiguous.
The number of stems is taken from the header which "apertium-dixtools format-1line 10 55 apertium-tat-rus.tat-rus.dix" produces.
Development corpus = apertium-tat-rus/corpus (-test = tat-rus-nova.txt, -ref = tat-rus-posted.txt). WER / PER results are given when unknown-word marks (stars) are not removed.
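To make the WER / PER figures above concrete, here is a minimal Python sketch of both metrics. It is not the evaluation script used for the table; the function names are hypothetical, and the PER formula (bag-of-words matches, normalised by reference length) is one common definition that the actual tool may not share. Tokens are compared as-is, so a starred unknown word such as "*китап" counts as an error, matching the note that stars are not removed.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # drop a reference word
                            curr[j - 1] + 1,      # extra word in the hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent (assumed definition)."""
    if not ref:
        raise ValueError("reference must not be empty")
    # Multiset intersection: words shared regardless of their position.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    errors = max(len(ref), len(hyp)) - matches
    return 100.0 * errors / len(ref)


# Hypothetical post-edited reference and raw translator output.
reference = "я читаю книгу".split()
hypothesis = "я *китап читаю".split()
print(f"WER {wer(reference, hypothesis):.2f} %, PER {per(reference, hypothesis):.2f} %")
```

Because PER ignores word order, it can never exceed WER for the same sentence pair, which is consistent with the 71.05 % / 53.68 % pair reported in the table.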

This is a workplan for development efforts for the Tatar to Russian translator in Google Summer of Code 2014.

Clean testvoc
10000 top stems in bidix and at least 80% trimmed coverage
Constraint grammar of Tatar containing at least 1000 rules, which makes 90-95% of all words unambiguous, with at least 95% retaining the correct analysis.
Average WER on unseen texts below 50 %
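The coverage goal can be illustrated with a short Python sketch. It assumes that the three figures reported per corpus are the share of output tokens carrying no `*` mark, no `*` or `@` mark, and no `*`, `@` or `#` mark, and that a mark always appears at the start of a token. Both points are assumptions; trimmed-coverage.sh may compute its numbers differently, and the function names here are hypothetical.

```python
def coverage(tokens, marks):
    """Percentage of tokens that do not start with any of the given marks."""
    if not tokens:
        raise ValueError("no tokens to measure")
    clean = sum(1 for tok in tokens if not tok.startswith(tuple(marks)))
    return 100.0 * clean / len(tokens)


def coverage_triple(tokens):
    """Coverage without *, without * or @, and without *, @ or # (assumed reading)."""
    return (coverage(tokens, "*"),
            coverage(tokens, "*@"),
            coverage(tokens, "*@#"))


# Hypothetical translator output, already split into tokens.
output = ["я", "*китап", "@уку", "#читать", "дом"]
print(coverage_triple(output))
```

Each successive figure excludes more tokens, so the three values can only decrease from left to right, as they do for every corpus in the status table.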

Weeks 1-6		Weeks 7-12		Saturdays
transfer rules for pending wiki tests (focus on phrases and clauses, not single words)				adding stems
get categor(y/ies) testvoc clean with one word ->	<- add more stems to categor(y/ies) while preserving testvoc clean	disambiguation	lexical selection	adding stems

Week	Dates	Goals	Reached
1	19/05–25/05	Testvoc-lite for nouns clean	✓
2	26/05–01/06	Testvoc-lite for adjectives clean; at least 5 new phrase types supported; all nouns from tat.lexc added to bidix	✓
3	02/06–08/06	Testvoc-lite for verbs clean
4	09/06–15/06	Testvoc-lite for adverbs clean
5	16/06–22/06	Testvoc-lite for numerals clean
6	23/06–29/06	Testvoc for pronouns clean
12	04/08–10/08	Gisting evaluation
13	11/08–18/08	Installation and usage documentation for end-users (in Tatar/Russian)

Testvoc-lite (apertium-tat-rus/testvoc/lite$ ./testvoc.sh) for a category means taking one word from each sub-category and passing its full paradigm through the translator without producing debug symbols. E.g. "testvoc-lite for nouns clean" means that if we leave only one word in each of the N1, N-COMPOUND-PX etc. lexicons, their full paradigms will pass through the translator without [@#] errors.
Evaluation means taking $n$ words of text and measuring the post-edition word error rate (WER) on their translation. The output for those $n$ words should be clean.
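The "clean" criterion in the testvoc-lite definition above can be sketched in Python as a scan for tokens that begin with a debug symbol. This is only an illustration, not the logic of testvoc.sh; it assumes the translator output is available as plain text lines and that `@` and `#` always prefix the affected token. The function names are hypothetical.

```python
def find_debug_errors(lines):
    """Return (line_number, token) pairs for tokens starting with @ or #."""
    errors = []
    for number, line in enumerate(lines, start=1):
        for token in line.split():
            if token[0] in "@#":
                errors.append((number, token))
    return errors


def is_clean(lines):
    """True when no token in the output carries an @ or # mark."""
    return not find_debug_errors(lines)


# Hypothetical output for a few forms of one noun paradigm.
sample = ["китап книга", "китапның #книга", "@китаплар"]
for number, token in find_debug_errors(sample):
    print(f"line {number}: {token}")
```

Reporting the line number along with the token makes it easier to trace a failure back to the paradigm form that triggered it.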

@@ Line 50: / Line 50: @@
 |-
 | get categor(y/ies) testvoc clean<br/>with one word ->
-| <- (Saturdays) add more stems to categor(y/ies)<br/>while preserving testvoc clean
+| <- add more stems to categor(y/ies)<br/>while preserving testvoc clean
 | disambiguation
 |lexical selection