Difference between revisions of "Tatar and Russian"
Revision as of 21:45, 2 June 2014
This is a language pair translating from Tatar to Russian. The pair is currently located in nursery.
Current state
Last updated | Testvoc (clean or not) | Corpus testvoc (no *, no */@, no */@/#) | Stems in bidix | WER, PER on dev. corpus | Average WER, PER on unseen texts
---|---|---|---|---|---
02/06/2014 | No | | 2141 | 71.05 %, 53.68 % | --
- Testvoc = apertium-tat-rus/testvoc/standard/testvoc.sh
- Corpus testvoc = apertium-tat-rus/testvoc/corpus/trimmed-coverage.sh. Corpora can be found in the turkiccorpora repository.
- news = tat.news.2005-2011_300K-sentences.txt.bz2, wp = tat.wikipedia.2013-02-25.txt.bz2, NT = tat.NT.txt.bz2. Others are unambiguous.
- The number of stems is taken from the header that "apertium-dixtools format-1line 10 55 apertium-tat-rus.tat-rus.dix" produces.
- Development corpus = apertium-tat-rus/corpus (-test = tat-rus-nova.txt, -ref = tat-rus-posted.txt). WER / PER results are given when unknown-word marks (stars) are not removed.
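To make the two metrics concrete, here is a minimal sketch of how WER and PER can be computed for one hypothesis/reference pair. It assumes whitespace tokenisation, word-level Levenshtein distance for WER, and one common bag-of-words formulation of PER; the function names are illustrative and not taken from the Apertium evaluation scripts, which may differ in detail. Because unknown-word stars are not removed, a token such as `*word` never matches `word` and counts as an error.

```python
from collections import Counter


def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # drop a hypothesis word
                           cur[j - 1] + 1,           # add a reference word
                           prev[j - 1] + (h != r)))  # substitute or match
        prev = cur
    return prev[-1]


def wer(hyp, ref):
    """Word error rate in percent, relative to the reference length."""
    return 100.0 * word_edit_distance(hyp, ref) / len(ref)


def per(hyp, ref):
    """Position-independent error rate in percent (word order is ignored)."""
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return 100.0 * (max(len(hyp), len(ref)) - matches) / len(ref)


# Hypothetical example: same words, two of them swapped.
hyp = "a b c d".split()
ref = "a c b d".split()
print(wer(hyp, ref), per(hyp, ref))
```

Reordered output is penalised by WER but not by PER, which is why PER is usually the lower of the two figures.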
Workplan (GSoC 2014)
This is a workplan for development efforts for the Tatar to Russian translator in Google Summer of Code 2014.
Major goals
- Clean testvoc
- 10000 top stems in bidix and at least 80% trimmed coverage
- Constraint grammar of Tatar containing at least 1000 rules, which makes 90-95% of all words unambiguous, with at least 95% retaining the correct analysis.
- Average WER on unseen texts below 50 %
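The coverage goal can be illustrated with a short sketch. It assumes that coverage is the share of output tokens that do not start with a given set of Apertium debug marks (`*` for an unknown word, `@` for a missing bidix entry, `#` for a generation failure); the `coverage` function and the sample tokens are hypothetical, and the actual trimmed-coverage.sh script may compute the figure differently.

```python
def coverage(tokens, marks):
    """Percentage of tokens that do not start with any of the given marks."""
    bad_prefixes = tuple(marks)  # "*@" becomes ("*", "@")
    clean = sum(1 for t in tokens if not t.startswith(bad_prefixes))
    return 100.0 * clean / len(tokens)


# Hypothetical translator output, one token per list item.
tokens = ["дом", "*китап", "@өй", "#книга", "в"]

print(coverage(tokens, "*"))    # tokens without *
print(coverage(tokens, "*@"))   # tokens without * or @
print(coverage(tokens, "*@#"))  # tokens without *, @ or #
```

The three calls correspond to the three strictness levels listed in the "Corpus testvoc" column of the status table.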
Overview
Weeks 1-6 | Weeks 7-12 | Saturdays | ||
---|---|---|---|---|
get categor(y/ies) testvoc clean with one word -> |
<- add more stems to categor(y/ies) while preserving testvoc clean |
disambiguation | lexical selection | adding stems |
transfer rules for pending wiki tests (focus on phrases and clauses, not single words) |
Weekly schedule
Week | Dates | Target: testvoc (category, type) | Target: stems | Achieved: testvoc clean? | Achieved: stems | Evaluation | Notes
---|---|---|---|---|---|---|---
1 | 19/05—25/05 | Nouns, lite | -- | ✓ | 236 | -- |
2 | 26/05—01/06 | Adjectives, lite | All nouns from tat.lexc | ✓ | 2141 | -- | Not all nouns were added. Around 500 are pending. They require adding stems to rus.dix as well.
3 | 02/06—08/06 | Verbs, lite | 4106 | | | -- |
4 | 09/06—15/06 | Adverbs, lite | 6071 | | | -- |
5 | 16/06—22/06 | Numerals, full | 8036 | | | -- |
6 | 23/06—29/06 | Pronouns, full | 10000 | | | 500 words | Midterm evaluation
12 | 04/08—10/08 | All categories, full | | | | Gisting | Final evaluation
13 | 11/08—18/08 | | | | | | Installation and usage documentation for end-users (in Tatar/Russian)
- Testvoc-lite (apertium-tat-rus/testvoc/lite$ ./testvoc.sh) for a category means taking one word from each sub-category and checking that the full paradigm of that word passes through the translator without debug symbols. E.g. "testvoc-lite for nouns clean" means that if we leave only one word in each of the N1, N-COMPOUND-PX etc. lexicons, they will pass through the translator without [@#] errors.
- Until the midterm, the bilingual dictionary and, if necessary, the Russian transducer are expanded with translations of stems taken from apertium-tat.tat.lexc.
- Evaluation (except for the gisting evaluation) means taking n words and measuring the post-edition word error rate (WER). The output for those n words should be clean.
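The "clean" criterion above can be sketched as a simple filter over translator output. This is only an illustration: the line format in the example is hypothetical and is not the actual output of testvoc.sh, and the helper names are made up for this sketch. A paradigm counts as clean when no line contains `@` or `#`.

```python
import re

# Debug symbols that testvoc-lite must not produce: @ (no bidix entry)
# and # (target-language generation failed).
DEBUG_MARK = re.compile(r"[@#]")


def dirty_forms(outputs):
    """Return the output lines that still contain a debug symbol."""
    return [line for line in outputs if DEBUG_MARK.search(line)]


def is_clean(outputs):
    """True when every form passed through the translator without @ or #."""
    return not dirty_forms(outputs)


# Hypothetical output for three forms of one noun paradigm.
outputs = [
    "китап<n><nom> -> книга",
    "китап<n><dat> -> #книга",
    "китап<n><pl><nom> -> книги",
]

print(is_clean(outputs))
print(dirty_forms(outputs))
```

Listing the failing forms, instead of returning only a yes/no answer, shows which part of the paradigm still needs a transfer rule or a dictionary entry.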