Difference between revisions of "Tatar and Russian"

From Apertium

Revision as of 19:42, 29 July 2014

This is a language pair translating from Tatar to Russian. The pair is currently located in nursery.

Current state

Last updated: 20/07/2014
Testvoc (clean or not): No
Corpus testvoc (no *, no */@, no */@/#):
  • news(86.3, 86.3, 78.0)
  • wp(83.0, 83.0, 78.2)
  • aytmatov(90.0, 90.0, 87.5)
  • NT(83.0, 83.0, 77.8)
  • Quran(85.3, 85.3, 80.4)
Stems in bidix: 5956
WER, PER on dev. corpus: 75.64 %, 58.79 %
Average WER, PER on unseen texts: --
  • Testvoc = apertium-tat-rus/testvoc/standard/testvoc.sh
  • Corpus testvoc = apertium-tat-rus/testvoc/corpus/trimmed-coverage.sh. Corpora can be found in the turkiccorpora repository.
    • news = tat.news.2005-2011_300K-sentences.txt.bz2, wp = tat.wikipedia.2013-02-25.txt.bz2, NT = tat.NT.txt.bz2. Others are unambiguous.
  • Number of stems = taken from the header that "apertium-dixtools format-1line 10 55 apertium-tat-rus.tat-rus.dix" produces.
  • Development corpus = apertium-tat-rus/corpus (-test = tat-rus-nova.txt, -ref = tat-rus-posted.txt). WER / PER results are given when unknown-word marks (stars) are not removed.
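The three corpus testvoc figures per corpus can be read as the percentage of output tokens free of `*`, free of `*` and `@`, and free of `*`, `@` and `#`. The following is only a minimal sketch of that calculation, not the contents of trimmed-coverage.sh: it assumes whitespace-separated translator output in which a debug mark appears as the first character of a token, and the function name `coverage` is hypothetical.

```python
def coverage(tokens):
    """Return three percentages: tokens without '*', without '*'/'@',
    and without '*'/'@'/'#'.

    Assumption: a debug mark, if present, is the first character of the token.
    """
    total = len(tokens)
    if total == 0:
        raise ValueError("no tokens to evaluate")
    figures = []
    for marks in ("*", "*@", "*@#"):
        # Count tokens that do not start with any mark in the current set.
        clean = sum(1 for tok in tokens if tok[0] not in marks)
        figures.append(round(100.0 * clean / total, 1))
    return tuple(figures)


if __name__ == "__main__":
    # Hypothetical translator output, split on whitespace.
    sample = "*foo bar @baz #qux quux".split()
    print(coverage(sample))
```

Each successive figure can only stay equal or drop, which matches the pattern in the table above (e.g. 86.3, 86.3, 78.0 for news).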

Workplan (GSoC 2014)

This is a workplan for development efforts for the Tatar to Russian translator in Google Summer of Code 2014.

Major goals

  • Clean testvoc
  • 10000 top stems in bidix and at least 80% trimmed coverage
  • Constraint grammar of Tatar containing at least 1000 rules, which makes 90-95% of all words unambiguous, with at least 95% retaining the correct analysis.
  • Average WER on unseen texts below 50%

Overview

Weeks 1-6: get categor(y/ies) testvoc clean with one word -> <- add more stems to categor(y/ies) while preserving testvoc clean
Weeks 7-12: disambiguation, lexical selection, transfer rules for pending wiki tests (focus on phrases and clauses, not single words)
Saturdays: adding stems & cleaning testvoc

Weekly schedule

Week | Dates | Target: testvoc (category, type) | Target: stems | Achieved: testvoc clean? | Achieved: stems | Evaluation | Notes
1 | 19/05–25/05 | Nouns, lite | -- | | 236 | -- |
2 | 26/05–01/06 | Adjectives, lite | All nouns from tat.lexc | | 2141 | -- | Not all nouns were added. Around 500 are pending. They require adding stems to rus.dix as well.
3 | 02/06–08/06 | Verbs, lite | 4106 | 95.27% | 3258 | -- | Adjectives, verbs added. Some more pending in apertium-rus/dev/to_[add/check].txt
4 | 09/06–15/06 | Adverbs, full | 6071 | | 5331 | -- | Adverbs clean when I comment out everything except adverbs in Root lexicon. 96% if I don't. Probably something else gets translated with adverb(s).
5 | 16/06–22/06 | Numerals, full | 8036 | | 5488 | -- |
6 | 23/06–29/06 | Pronouns, full | 10000 | | | 1. 500 words x 2; 2. Try out assimilation evaluation toolkit if it's usable by that time. | Midterm evaluation. Results when unknown-word marks (stars) are not removed: tat-rus/texts/text1.* (full coverage): WER 66.73%, PER 56.48%; tat-rus/texts/text2.* (not fully covered): WER 78.42%, PER 63.58%
7 | 30/06–06/07 | Manually disambiguate a Tatar corpus (in such a way that it will be usable in the cg3ide later) | | | | -- | See apertium-tat/texts/corpus.ana.txt
8 | 07/07–13/07 | | | | | -- |
9 | 14/07–20/07 | | | | | -- |
10 | 21/07–27/07 | | | | | -- |
11 | 28/07–03/08 | Corpus test clean on all of the available corpora | | | | -- |
12 | 04/08–10/08 | Write Constraint Grammar for Tatar | | | | -- |
13 | 11/08–18/08 | All categories, full | 10000 | | | Gisting | Final evaluation
  • Testvoc-lite (apertium-tat-rus/testvoc/lite$ ./testvoc.sh) for a category means taking one word per sub-category and making the full paradigm of that word pass through the translator without debug symbols. E.g. "testvoc-lite for nouns clean" means that if we leave only one word in each of the N1, N-COMPOUND-PX etc. lexicons, they will pass through the translator without [@#] errors.
  • Until the midterm, the bilingual dictionary and, if necessary, the Russian transducer are expanded with translations of stems taken from apertium-tat.tat.lexc.
  • Evaluation (except for the gisting evaluation) means taking words and measuring the post-edition word error rate (WER) on them. The output for those words should be clean.
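For readers unfamiliar with the two metrics reported on this page, here is a minimal sketch of how WER and PER can be computed for a machine-translated text against its post-edited reference. It is not the evaluation tool used for the figures above. It assumes whitespace tokenisation, word-level Levenshtein distance for WER, and one common definition of PER (bag-of-words mismatch normalised by reference length); the function names are hypothetical.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("empty reference")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent (assumed definition)."""
    if not ref:
        raise ValueError("empty reference")
    # Words shared by both texts, ignoring order but respecting counts.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    errors = max(len(ref), len(hyp)) - matches
    return 100.0 * errors / len(ref)


if __name__ == "__main__":
    reference = "a b c d".split()
    hypothesis = "b a c d".split()
    print(wer(reference, hypothesis), per(reference, hypothesis))
```

Because PER ignores word order, it is never higher than WER under these definitions. Leaving unknown-word stars in the output, as done for the figures on this page, makes every starred word count as an error.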

See also