Difference between revisions of "Tatar and Russian"

From Apertium
(→‎Workplan (GSoC 2014): update the workplan)
Line 45: Line 45:
{|class=wikitable
{|class=wikitable
|-
|-
!colspan="2"| Weeks 1-6 !!colspan="2"| Weeks 7-12
!colspan="2"| Weeks 1-6 !!colspan="2"| Weeks 7-12 !! Saturdays
|-
|-
| get categor(y/ies) testvoc clean<br/>with one word -> || <- add more stems to categor(y/ies)<br/>while preserving testvoc clean || disambiguation || lexical selection
| get categor(y/ies) testvoc clean<br/>with one word ->
| <- (Saturdays) add more stems to categor(y/ies)<br/>while preserving testvoc clean
| disambiguation
|lexical selection
|rowspan="2" | adding stems
|-
|-
|colspan="4" style="text-align:center"| transfer rules for pending wiki tests (phrases and clauses, not single words)
|colspan="4" style="text-align:center"| transfer rules for pending wiki tests (focus on phrases and clauses, not single words)
|}
|}


Line 55: Line 59:
{|class=wikitable
{|class=wikitable
|-
|-
! Week !! Dates !! Goal !! Reached
! Week !! Dates !! Goals !! Reached
|-
|-
| 1
| 1 || 19/05&mdash;25/05 || Testvoc-lite for nouns clean ||align=center|
| 19/05&mdash;25/05
| Testvoc-lite for nouns clean ||align=center| ✓
|-
| 2
| 26/05&mdash;01/06
|
* Testvoc-lite for adjectives clean
* At least 5 new phrase types supported
* All nouns from tat.lexc added to bidix
|
|-
| 3 || 02/06&mdash;08/06 || Testvoc-lite for verbs clean ||align=center|
|-
| 4 || 09/06&mdash;15/06 || Testvoc-lite for adverbs clean ||align=center|
|-
| 5 || 16/06&mdash;22/06 || Testvoc-lite for numerals clean ||align=center|
|-
| 6 || 23/06&mdash;29/06 || Testvoc for pronouns clean ||align=center|
|-
|-
| 12 || 04/08&mdash;10/08 || Gisting evaluation
| 12 || 04/08&mdash;10/08 || Gisting evaluation
|-
|-
| 13 || 11/08&mdash;18/08 || Installation and usage documentation for end-users (in Tatar/Russian)
| 13 || 11/08&mdash;18/08 || Installation and usage documentation<br/>for end-users (in Tatar/Russian)
|}
|}



Revision as of 17:55, 26 May 2014

This is a language pair translating from Tatar to Russian. The pair is currently located in nursery.

Current state

{|class=wikitable
|-
! Last updated !! Testvoc (clean or not) !! Corpus testvoc<br/>(no *, no */@, no */@/#) !! Stems in bidix !! WER, PER on dev. corpus !! WER, PER on unseen texts
|-
| 26/05/2014
| No
|
* news(40.5, 40.3, 34.2)
* wp(40.6, 40.3, 37.1)
* aytmatov(56.2, 56.1, 50.7)
* NT(52.4, 52.1, 46.5)
* Quran(50.1, 50.0, 44.8)
| 236
| 71.84 %, 55.00 %
| --
|}
  • Testvoc = apertium-tat-rus/testvoc/standard/testvoc.sh
  • Corpus testvoc = apertium-tat-rus/testvoc/corpus/trimmed-coverage.sh. Corpora can be found in the turkiccorpora repository.
    • news = tat.news.2005-2011_300K-sentences.txt.bz2, wp = tat.wikipedia.2013-02-25.txt.bz2, NT = tat.NT.txt.bz2. Others are unambiguous.
  • The number of stems is taken from the header that "apertium-dixtools format-1line 10 55 apertium-tat-rus.tat-rus.dix" produces.
  • Development corpus = apertium-tat-rus/corpus (-test = tat-rus-nova.txt, -ref = tat-rus-posted.txt). WER / PER results are computed without removing the unknown-word marks (stars).
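
To make the WER / PER columns concrete, here is a minimal sketch of how the two figures can be computed from a machine translation and its post-edited reference. It is not the evaluation script used for this pair; the function names and the sample tokens are hypothetical, and tokenisation is assumed to be plain whitespace splitting. WER is the word-level edit distance divided by the reference length; PER ignores word order and compares the two sentences as bags of words.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent (assumes a non-empty reference)."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent (assumes a non-empty reference)."""
    matches = sum((Counter(ref) & Counter(hyp)).values())
    errors = max(len(ref), len(hyp)) - matches
    return 100.0 * errors / len(ref)


# Hypothetical tokens; "*e" stands for an unknown word whose star is kept,
# so it counts as an error against the reference token "e".
reference = "a b c e".split()
hypothesis = "a c b *e".split()
print(wer(reference, hypothesis), per(reference, hypothesis))
```

Because the stars are not stripped, every unknown word is counted as an error even when the untranslated form happens to match the reference.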

Workplan (GSoC 2014)

This is a workplan for development efforts for the Tatar to Russian translator in Google Summer of Code 2014.

Major goals

  • Clean testvoc
  • 10000 top stems in bidix and at least 80% trimmed coverage
  • A constraint grammar for Tatar containing at least 1000 rules, which makes 90-95% of all words unambiguous, with at least 95% of them retaining the correct analysis.
  • Average WER on unseen texts below 50 %

Overview

{|class=wikitable
|-
!colspan="2"| Weeks 1-6 !!colspan="2"| Weeks 7-12 !! Saturdays
|-
| get categor(y/ies) testvoc clean<br/>with one word ->
| <- (Saturdays) add more stems to categor(y/ies)<br/>while preserving testvoc clean
| disambiguation
| lexical selection
|rowspan="2" | adding stems
|-
|colspan="4" style="text-align:center"| transfer rules for pending wiki tests (focus on phrases and clauses, not single words)
|}

Weekly schedule

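{|class=wikitable
|-
! Week !! Dates !! Goals !! Reached
|-
| 1
| 19/05&mdash;25/05
| Testvoc-lite for nouns clean ||align=center| ✓
|-
| 2
| 26/05&mdash;01/06
|
* Testvoc-lite for adjectives clean
* At least 5 new phrase types supported
* All nouns from tat.lexc added to bidix
|
|-
| 3 || 02/06&mdash;08/06 || Testvoc-lite for verbs clean ||align=center|
|-
| 4 || 09/06&mdash;15/06 || Testvoc-lite for adverbs clean ||align=center|
|-
| 5 || 16/06&mdash;22/06 || Testvoc-lite for numerals clean ||align=center|
|-
| 6 || 23/06&mdash;29/06 || Testvoc for pronouns clean ||align=center|
|-
| 12 || 04/08&mdash;10/08 || Gisting evaluation
|-
| 13 || 11/08&mdash;18/08 || Installation and usage documentation<br/>for end-users (in Tatar/Russian)
|}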
  • Testvoc-lite (apertium-tat-rus/testvoc/lite$ ./testvoc.sh) for a category means taking one word from each sub-category and passing its full paradigm through the translator without any debug symbols in the output. E.g. "testvoc-lite for nouns clean" means that if we leave only one word in each of the N1, N-COMPOUND-PX etc. lexicons, they will pass through the translator without [@#] errors.
  • Evaluation means taking words and measuring the post-edition word error rate (WER) on them. The output for those words should be clean.
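
As an illustration of what "clean" means here, the sketch below scans translator output for words carrying a debug symbol. It is not the testvoc.sh script itself: the function names and sample lines are hypothetical, and it assumes the symbol (*, @ or #) is attached directly to the front of the affected word in the output text.

```python
import re

# Assumption: a debug symbol is prefixed to the word it marks, e.g. "#книга".
MARKED_WORD = re.compile(r"(?<!\w)([*@#])\w+")


def count_marks(lines):
    """Count output words marked with each debug symbol."""
    counts = {"*": 0, "@": 0, "#": 0}
    for line in lines:
        for match in MARKED_WORD.finditer(line):
            counts[match.group(1)] += 1
    return counts


def is_testvoc_clean(lines):
    """True when no word in the output carries an @ or # mark."""
    counts = count_marks(lines)
    return counts["@"] == 0 and counts["#"] == 0


# Hypothetical translator output, one translated form per line.
sample = ["книга", "#книга", "@китап", "*китапханә в доме"]
print(count_marks(sample), is_testvoc_clean(sample))
```

The star count is reported separately because unknown words are tracked by the corpus testvoc figures, while the testvoc-lite goal above is stated in terms of [@#] errors only.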