Difference between revisions of "Tatar and Russian"

Revision as of 17:15, 13 March 2015

This is a language pair translating from Tatar to Russian. The pair is currently located in nursery.

Current state

Last updated: 13/03/2015
Testvoc (clean or not): No
Corpus testvoc (no *, no */@, no */@/#):
  • news (86.3, 86.3, 82.3)
  • wp (83.0, 83.0, 80.0)
  • Quran (85.4, 85.4, 82.3)
  • NT (83.0, 83.0, 80.0)
  • aytmatov (90.1, 90.1, 88.3)
Stems in bidix: 6000
WER, PER on dev. corpus: 71.03%, 54.02%
Average WER, PER on unseen texts: see GSoC2014 evaluation results below.

  • Testvoc = apertium-tat-rus/testvoc/standard/testvoc.sh
  • Corpus testvoc = apertium-tat-rus/testvoc/corpus/trimmed-coverage.sh. Corpora can be found in the turkiccorpora repository.
    • news = tat.news.2005-2011_300K-sentences.txt.bz2, wp = tat.wikipedia.2013-02-25.txt.bz2, NT = tat.NT.txt.bz2. Others are unambiguous.
  • Stems in bidix = the number of stems reported in the header that "apertium-dixtools format-1line 10 55 apertium-tat-rus.tat-rus.dix" produces.
  • Development corpus = apertium-tat-rus/corpus (-test = tat-rus-nova.txt, -ref = tat-rus-posted.txt). WER / PER results are given when unknown-word marks (stars) are not removed.
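
The three corpus testvoc figures can be read as the share of output tokens free of *, free of * and @, and free of *, @ and #. In Apertium output these marks prefix unknown words (*), lemmas missing from the bilingual dictionary (@) and forms that could not be generated (#). The following is only a rough sketch of that counting: coverage_stats is a hypothetical helper, it assumes one output token per line on standard input, and it is not taken from trimmed-coverage.sh, which may count differently.

```shell
# Hypothetical helper (not part of apertium-tat-rus).
# Reads one translator output token per line on stdin and prints three
# percentages: tokens without *, without * or @, without *, @ or #.
coverage_stats() {
  awk '
    NF == 0 { next }                      # skip empty lines
    {
      total++
      if ($0 !~ /^\*/)     no_star++
      if ($0 !~ /^[*@]/)   no_star_at++
      if ($0 !~ /^[*@#]/)  no_star_at_hash++
    }
    END {
      if (total == 0) { print "0.0 0.0 0.0"; exit }
      printf "%.1f %.1f %.1f\n",
        100 * no_star / total,
        100 * no_star_at / total,
        100 * no_star_at_hash / total
    }'
}
```

For example, feeding it the five tokens "Я", "*foo", "@bar", "#baz" and "книгу" prints "80.0 60.0 40.0".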

Installation

You will need HFST, lttoolbox, vislcg3, apertium and apertium-lex-tools installed on your computer to be able to compile and use apertium-tat-rus.

If you are using a Debian-based distro, the easiest way to get those dependencies is to install them with apt-get from Tino Didriksen's repository.

If you are using the Apertium VirtualBox image, all those tools are already installed.

Apertium-tat-rus fetches morphological transducers and CG-disambiguators of Tatar and Russian from the apertium-tat and apertium-rus directories in the /languages module. So you have to check out and compile these two monolingual packages first:

mkdir languages
cd languages
svn co http://svn.code.sf.net/p/apertium/svn/languages/apertium-tat/
cd apertium-tat
./autogen.sh
make
cd ..

svn co http://svn.code.sf.net/p/apertium/svn/languages/apertium-rus/
cd apertium-rus
./autogen.sh
make
cd ../..

After you're done with that, check out and compile apertium-tat-rus itself, specifying where the monolingual packages you've just compiled are located:

mkdir nursery
cd nursery
svn co http://svn.code.sf.net/p/apertium/svn/nursery/apertium-tat-rus/
cd apertium-tat-rus
./autogen.sh --with-lang1=../../languages/apertium-tat/ --with-lang2=../../languages/apertium-rus/
make

You can test the translator now:

echo "Мин китап укыйм." | apertium -d . tat-rus
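Я читаю книгу.

To run the regression tests, check out apertium-eval-translator into trunk and run ./qa from the apertium-tat-rus directory:

cd ../..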
mkdir trunk
cd trunk
svn co http://svn.code.sf.net/p/apertium/svn/trunk/apertium-eval-translator/
cd ..
cd nursery/apertium-tat-rus
./qa

./qa runs the whole regression test suite. It requires the apertium-eval-translator.pl script from trunk and assumes that your directory structure follows that of the apertium repository:

                        ..
         /                   |                 \
    languages             nursery             trunk
    /        \                \                  |
apertium-tat  apertium-rus  apertium-tat-rus  apertium-eval-translator
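
Before running ./qa it can help to confirm that the four checkouts sit where the diagram above puts them. The function below is a minimal sketch of such a check. check_layout is a hypothetical helper, not part of the repository, and it only tests that the directories exist, not that they are compiled.

```shell
# Hypothetical helper (not part of apertium-tat-rus).
# check_layout ROOT: report whether ROOT contains the four checkouts
# in the layout that ./qa assumes. Returns non-zero if any is missing.
check_layout() {
  root=${1:-.}
  status=0
  for dir in \
    languages/apertium-tat \
    languages/apertium-rus \
    nursery/apertium-tat-rus \
    trunk/apertium-eval-translator
  do
    if [ -d "$root/$dir" ]; then
      echo "ok       $dir"
    else
      echo "missing  $dir"
      status=1
    fi
  done
  return "$status"
}
```

From inside nursery/apertium-tat-rus, the common root is two levels up, so the call would be "check_layout ../..".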

Workplan (GSoC 2014)

This was the workplan for developing the Tatar-to-Russian translator during Google Summer of Code 2014.

Major goals

  • Clean testvoc
  • 10000 top stems in bidix and at least 80% trimmed coverage
  • Constraint grammar of Tatar containing at least 1000 rules, which makes 90-95% of all words unambiguous, with at least 95% retaining the correct analysis.
  • Average WER on unseen texts below 50%
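
The share of unambiguous words in the constraint grammar goal can be estimated from analyser or disambiguator output in the Apertium stream format, where each lexical unit looks like ^surface/analysis1/analysis2$. The sketch below counts units that carry exactly one analysis. ambiguity_stats is a hypothetical helper, it relies on grep -o (available in GNU and BSD grep), and it makes two simplifying assumptions: escaped slashes inside a unit are ignored, and unknown words (a single *-marked reading) count as unambiguous.

```shell
# Hypothetical helper (not part of apertium-tat).
# Reads Apertium stream format on stdin and prints:
#   <lexical units> <units with exactly one analysis> <percentage>
ambiguity_stats() {
  # Put every ^...$ lexical unit on its own line, then split on "/":
  # field 1 is the surface form, each further field is one analysis.
  grep -o '\^[^$]*\$' | awk -F/ '
    { total++; if (NF == 2) unambiguous++ }
    END {
      if (total) printf "%d %d %.1f\n", total, unambiguous, 100 * unambiguous / total
    }'
}
```

For the two-unit input "^a/a<n>$ ^b/b<n>/b<v>$" this prints "2 1 50.0". It says nothing about whether the remaining analysis is the correct one, which the goal above also requires; that needs a hand-disambiguated corpus.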

Overview

  • Weeks 1-6: get categor(y/ies) testvoc clean with one word -> <- add more stems to categor(y/ies) while preserving testvoc clean
  • Weeks 7-12: disambiguation, lexical selection
  • Saturdays: adding stems & cleaning testvoc
  • transfer rules for pending wiki tests (focus on phrases and clauses, not single words)

Weekly schedule

Columns: Week (dates); Target: testvoc (category, type), stems; Achieved: testvoc clean?, stems; Evaluation; Notes. Cells that carry no value are omitted.

  • Week 1 (19/05-25/05). Target: Nouns, lite; stems: --. Achieved: stems: 236. Evaluation: --.
  • Week 2 (26/05-01/06). Target: Adjectives, lite; stems: all nouns from tat.lexc. Achieved: stems: 2141. Evaluation: --. Notes: Not all nouns were added. Around 500 are pending. They require adding stems to rus.dix as well.
  • Week 3 (02/06-08/06). Target: Verbs, lite; stems: 4106. Achieved: testvoc: 95.27%; stems: 3258. Evaluation: --. Notes: Adjectives, verbs added. Some more pending in apertium-rus/dev/to_[add/check].txt
  • Week 4 (09/06-15/06). Target: Adverbs, full; stems: 6071. Achieved: stems: 5331. Evaluation: --. Notes: Adverbs clean when I comment out everything except adverbs in Root lexicon. 96% if I don't. Probably something else gets translated with adverb(s).
  • Week 5 (16/06-22/06). Target: Numerals, full; stems: 8036. Achieved: stems: 5488. Evaluation: --.
  • Week 6 (23/06-29/06). Target: Pronouns, full; stems: 10000. Achieved: stems: 5529. Evaluation: 1. 500 words x 2; 2. Try out assimilation evaluation toolkit if it's usable by that time. Notes: Midterm evaluation. Results when unknown-word marks (stars) are not removed:
    • tat-rus/texts/text1.* (full coverage): WER 66.73%, PER 56.48%
    • tat-rus/texts/text2.* (not fully covered): WER 78.42%, PER 63.58%
  • Week 7 (30/06-06/07). Target: Manually disambiguate a Tatar corpus (in a way so that it will be usable in the cg3ide later). Evaluation: --. Notes: See apertium-tat/corpus/corpus.ana.txt
  • Week 8 (07/07-13/07). Target: Corpus testvoc clean on all of the available corpora. Evaluation: --. Notes: Difference between error-free coverage of the analyser and error-free coverage of the full translator is between 1.6% and 4% (see stats above).
  • Week 9 (14/07-20/07).
  • Week 10 (21/07-27/07).
  • Week 11 (28/07-03/08).
  • Week 12 (04/08-10/08). Target: Write Constraint Grammar for Tatar. Achieved: 111 rules in total. Evaluation: --.
  • Week 13 (11/08-18/08). Target: All categories, full; stems: 10000. Evaluation: Gisting. Notes: Final evaluation.

  • Testvoc-lite (apertium-tat-rus/testvoc/lite$ ./testvoc.sh) for a category means taking one word per sub-category and making the full paradigm of that word pass through the translator without debug symbols. E.g. "testvoc-lite for nouns clean" means that if we leave only one word in each of the N1, N-COMPOUND-PX etc. lexicons, they will pass through the translator without [@#] errors.
  • Until the midterm, the bilingual dictionary and, if necessary, the Russian transducer are expanded with translations of stems taken from apertium-tat.tat.lexc.
  • Evaluation (except for the gisting evaluation) means taking a set of words and measuring the post-edition word error rate (WER). The output for those words should be clean.

See also