Difference between revisions of "Tatar and Russian"

Revision as of 17:07, 26 May 2014

Current state

Last updated	Testvoc (clean or not)	Corpus testvoc (no *, no */@, no */@/#)	Stems in bidix	WER, PER on dev. corpus	WER, PER on unseen texts
26/05/2014	No	news(40.5, 40.3, 34.2) wp(40.6, 40.3, 37.1) aytmatov(56.2, 56.1, 50.7) NT(52.4, 52.1, 46.5) Quran(50.1, 50.0, 44.8)	236	71.84 %, 55.00 %	--

Testvoc = apertium-tat-rus/testvoc/standard/testvoc.sh
Corpus testvoc = apertium-tat-rus/testvoc/corpus/trimmed-coverage.sh. Corpora can be found in the turkiccorpora repository.
- news = tat.news.2005-2011_300K-sentences.txt.bz2, wp = tat.wikipedia.2013-02-25.txt.bz2, NT = tat.NT.txt.bz2. The other corpus names are unambiguous.
The number of stems is taken from the header that "apertium-dixtools format-1line 10 55 apertium-tat-rus.tat-rus.dix" produces.
Development corpus = apertium-tat-rus/corpus (-test = tat-rus-nova.txt, -ref = tat-rus-posted.txt). WER / PER results are given when unknown-word marks (stars) are not removed.
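
For readers unfamiliar with the two metrics in the table, the following is a minimal sketch of how WER and PER can be computed for one test/reference pair. It assumes whitespace tokenisation and one common formulation of PER (bag-of-words matches, ignoring word order); the function names are hypothetical, and the evaluation script actually used for the figures above may differ in details. Because unknown-word stars are not removed, a token such as `*b` counts as an error against the reference token `b`.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent (word order ignored)."""
    matches = sum((Counter(ref) & Counter(hyp)).values())
    errors = max(len(ref), len(hyp)) - matches
    return 100.0 * errors / len(ref)


# Illustrative tokens only, not taken from the development corpus.
reference = "the cat sat on the mat".split()
hypothesis = "the cat on the mat sat".split()
print(round(wer(reference, hypothesis), 2), per(reference, hypothesis))
```

PER is never higher than WER for the same pair, since it forgives reordering; this is consistent with the 71.84 % and 55.00 % reported for the development corpus.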

Workplan (GSoC 2014)

This is a workplan for development efforts for the Tatar to Russian translator in Google Summer of Code 2014.

Major goals

Clean testvoc
10000 top stems in bidix and at least 80% trimmed coverage
Constraint grammar of Tatar containing at least 1000 rules, which makes 90-95% of all words unambiguous, with at least 95% retaining the correct analysis.
Average WER on unseen texts below 50 %
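
To make the coverage goal concrete, here is a rough sketch of how the three coverage figures reported per corpus in the current-state table could be derived from translator output. It assumes that problem tokens begin with one of the Apertium debug marks (`*` for a word unknown to the analyser, `@` for a lemma missing from the bilingual dictionary, `#` for a generation failure) and that tokens are separated by whitespace. The function names and sample tokens are hypothetical; `trimmed-coverage.sh` itself may count differently.

```python
def coverage(tokens, bad_marks):
    """Percentage of tokens that do not start with any mark in bad_marks."""
    marks = tuple(bad_marks)  # "*@" becomes ("*", "@")
    clean = sum(1 for tok in tokens if not tok.startswith(marks))
    return 100.0 * clean / len(tokens)


def coverage_levels(tokens):
    """Coverage with no *, with no */@, and with no */@/#, in that order."""
    return (coverage(tokens, "*"),
            coverage(tokens, "*@"),
            coverage(tokens, "*@#"))


# Illustrative output tokens only.
sample = ["я", "*китап", "@уку", "#читать", "хорошо"]
print(coverage_levels(sample))
```

Each stricter level can only lower the figure, which matches the decreasing triples in the table.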

Overview

Weeks 1-6		Weeks 7-12
get categor(y/ies) testvoc clean with one word ->	<- add more stems to categor(y/ies) while preserving testvoc clean	disambiguation	lexical selection
transfer rules for pending wiki tests (phrases and clauses, not single words)

Weekly schedule

Week	Dates	Goal	Reached
1	19/05—25/05	Testvoc-lite for nouns clean	✓
12	04/08—10/08	Gisting evaluation
13	11/08—18/08	Installation and usage documentation for end-users (in Tatar/Russian)

Testvoc-lite (apertium-tat-rus/testvoc/lite$ ./testvoc.sh) for a category means taking one word per sub-category and making the full paradigm of the word pass through the translator without debug symbols. E.g. "testvoc-lite for nouns clean" means that if we leave only one word in each of the N1, N-COMPOUND-PX etc. lexicons, they will pass through the translator without [@#] errors.
Evaluation is taking $n$ words and performing an evaluation for post-edition word error rate (WER). The output for those $n$ words should be clean.
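
The "clean" criterion for testvoc-lite can be illustrated with a small check that flags any output line containing a token that starts with `@` or `#`. This is only a sketch: the line format used here (source form, an arrow, then the translation) and the function names are assumptions made for illustration, not the actual output format of `testvoc.sh`.

```python
import re

# A debug mark counts only at the start of a token.
DEBUG_MARK = re.compile(r"(?:^|\s)[@#]")


def dirty_lines(lines):
    """Return the lines that contain at least one @ or # debug mark."""
    return [line for line in lines if DEBUG_MARK.search(line)]


def is_clean(lines):
    """True when no line contains a debug mark."""
    return not dirty_lines(lines)


# Illustrative lines only.
output = [
    "китаплар -> книги",
    "китапның -> #книга",
    "өйдә -> @өй",
]
for line in dirty_lines(output):
    print(line)
```

A category would count as testvoc-lite clean once `is_clean` holds for the full paradigm of each remaining word.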

@@ Line 45: / Line 45: @@
 {|class=wikitable
 |-
+!colspan="2"| Weeks 1-6 !!colspan="2"| Weeks 7-12
-| get categor(y/ies) testvoc clean <br/>with one word ->|| add more stems to categor(y/ies)<br/>while preserving testvoc clean ->|| disambiguation ->|| lexical selection
 |-
+| get categor(y/ies) testvoc clean<br/>with one word -> || <- add more stems to categor(y/ies)<br/>while preserving testvoc clean || disambiguation || lexical selection
-|colspan="4" style="text-align:center"| transfer rules for pending wiki tests
+|-
+|colspan="4" style="text-align:center"| transfer rules for pending wiki tests (phrases and clauses, not single words)
 |}
-=== Terminology ===
+=== Weekly schedule ===
-* Trimmed coverage means the coverage the morphological analyzer after being trimmed according to the bilingual dictionary of the pair, that is, only containing stems which are also in the bilingual dictionary.
-* Testvoc-lite (<code>apertium-tat-rus/testvoc/lite$ ./testvoc.sh</code>) for a category means taking one word per each sub-category and making the full paradigm of the word pass through the translator without debug symbols. E.g. "testvoc-lite for nouns clean" means that if we leave only one word in each of the N1, N-COMPOUND-PX etc. lexicons, they will pass through translator without [@#] errors.
-* Evaluation is taking <math>n</math> words and performing an [[evaluation]] for post-edition word error rate (WER). The output for those <math>n</math> words should be clean.
 {|class=wikitable
 |-
@@ Line 65: / Line 63: @@
 | 13      || 11/08&mdash;18/08  || Installation and usage documentation for end-users (in Tatar/Russian)
 |}
+* Testvoc-lite (<code>apertium-tat-rus/testvoc/lite$ ./testvoc.sh</code>) for a category means taking one word per each sub-category and making the full paradigm of the word pass through the translator without debug symbols. E.g. "testvoc-lite for nouns clean" means that if we leave only one word in each of the N1, N-COMPOUND-PX etc. lexicons, they will pass through translator without [@#] errors.
+* Evaluation is taking <math>n</math> words and performing an [[evaluation]] for post-edition word error rate (WER). The output for those <math>n</math> words should be clean.
 [[Category:Tatar and Russian|*]]
