Difference between revisions of "Tatar and Russian"

Revision as of 11:37, 19 May 2014

TODO: add a stats table here, as was done on the pages for monolingual modules. Essential things to track:

testvoc (clean or not)
corpus testvoc (ratio of *, @ and # errors to the number of tokens in corpus) => trimmed coverage
number of stems in bidix
WER on the development corpus
WER on unseen text(s)

Testvoc	Corpus testvoc (no *, no */@, no */@/#)	stems	WER	WER*
No	news(40.4, 40.2, 34.1) Quran(50.0, 49.9, 44.8)
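The three corpus testvoc figures could be computed along the following lines. This is a minimal sketch, not the pair's actual tooling: it assumes that the translator output is split on whitespace, that debug marks appear as the first character of a token, and that each figure is the percentage of tokens free of the given marks. The function name `clean_ratios` is hypothetical.

```python
def clean_ratios(tokens):
    """Return three percentages: tokens without '*', without '*' or '@',
    and without '*', '@' or '#' as their first character.

    Assumption: a debug mark, if present, is the first character of a token.
    """
    total = len(tokens)
    if total == 0:
        return (0.0, 0.0, 0.0)
    results = []
    for marks in ("*", "*@", "*@#"):
        prefixes = tuple(marks)  # e.g. ('*', '@')
        clean = sum(1 for tok in tokens if not tok.startswith(prefixes))
        results.append(100.0 * clean / total)
    return tuple(results)


if __name__ == "__main__":
    # Toy output line with one token per kind of error and one clean token.
    sample = "*foo @bar #baz qux".split()
    print(clean_ratios(sample))
```

Under these assumptions the figures can only decrease from left to right, which matches the pattern in the table row above.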

This is a workplan for development efforts for the Tatar to Russian translator in Google Summer of Code 2014.

Clean testvoc
10000 top stems in bidix and at least 80% trimmed coverage
A constraint grammar for Tatar containing at least 1000 rules, which leaves 90-95% of all words unambiguous, with at least 95% of words retaining the correct analysis.
Average WER on unseen texts below 50%

get categor(y/ies) testvoc clean with one word ->	add more stems to categor(y/ies) while preserving testvoc clean ->	disambiguation ->	lexical selection
transfer rules for pending wiki tests

Trimmed coverage means the coverage of the morphological analyser after it has been trimmed according to the bilingual dictionary of the pair, that is, when it contains only stems which are also in the bilingual dictionary.
Testvoc-lite for a category means taking one word from each sub-category and checking that the full paradigm of that word passes through the translator without debug symbols. E.g. "testvoc-lite for nouns clean" means that if we leave only one word in each of the N1, N-COMPOUND-PX etc. lexicons, they will pass through the translator without [@#] errors.
Evaluation means taking $n$ words and measuring the post-edition word error rate (WER) on them. The translator's output for those $n$ words should be clean.
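For illustration, WER can be computed as the word-level edit distance between the raw translation and its post-edited version, divided by the length of the post-edited version. The sketch below assumes whitespace tokenisation and takes the post-edited text as the reference; the function name `wer` is hypothetical and this is not necessarily the script used for the figures on this page.

```python
def wer(reference, hypothesis):
    """Word error rate (in percent) of hypothesis against reference.

    Assumptions: whitespace tokenisation; reference is the post-edited text,
    hypothesis is the raw translator output.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # word missing from hypothesis
                           cur[j - 1] + 1,       # extra word in hypothesis
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    print(wer("a b c d", "a x c"))
```

Because the distance is normalised by the reference length, a hypothesis much longer than the reference can score above 100.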

Week	Dates	Goal
1	19/05—25/05	Testvoc-lite for nouns clean
12	04/08—10/08	Gisting evaluation
13	11/08—18/08	Installation and usage documentation for end-users (in Tatar/Russian)

@@ Line 11: / Line 11: @@
 * WER on the development corpus
 * WER on unseen text(s)
+{|class=wikitable
+! Testvoc !! Corpus testvoc (no *, no */@, no */@/#) !! stems !! WER !! WER<nowiki>*</nowiki>
+|-
+|   No    || news(40.4, 40.2, 34.1)<br />Quran(50.0, 49.9, 44.8)
+|-
+|}
 == Workplan (GSoC 2014) ==