Tatar and Russian
This is a language pair translating from [[Tatar]] to [[Russian]]. The pair is currently located in [https://github.com/apertium/apertium-tat-rus GitHub].
Latest revision as of 13:48, 9 March 2018
 Current state
|Last updated||Testvoc (clean or not)||Corpus testvoc (no *, no */@, no */@/# errors; i.e. coverage of the trimmed Tatar morphological analyser, coverage of the Tatar morphological analyser and of the bilingual dictionary, coverage of the whole translator)||Stems in the bilingual dictionary||WER, PER on dev. corpus||Average WER, PER on unseen texts|
||6000||71.03%, 54.02%||See GSoC2014 evaluation results below.|
- Testvoc = apertium-tat-rus/testvoc/standard/testvoc.sh
- Corpus testvoc = apertium-tat-rus/testvoc/corpus/trimmed-coverage.sh. Corpora can be found in the turkiccorpora repository.
- news = tat.news.2005-2011_300K-sentences.txt.bz2, wp = tat.wikipedia.2013-02-25.txt.bz2, NT = tat.NT.txt.bz2. Others are unambiguous.
- The number of stems is taken from the header that "apertium-dixtools format-1line 10 55 apertium-tat-rus.tat-rus.dix" produces.
- Development corpus = apertium-tat-rus/corpus (-test = tat-rus-nova.txt, -ref = tat-rus-posted.txt). WER / PER results are given when unknown-word marks (stars) are not removed.
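The corpus-testvoc figures above are coverage percentages at three levels of strictness (no *, no */@, no */@/#). The following is only a rough sketch of how such a figure can be computed from translator output; it is not the actual implementation of trimmed-coverage.sh. The function name `coverage` and the naive whitespace tokenisation are assumptions made for illustration.

```shell
# coverage MARKS: read translator output on stdin and print the percentage of
# whitespace-separated tokens that contain none of the characters in MARKS.
# Hypothetical helper; tokenisation by whitespace is a simplification.
coverage() {
    awk -v marks="$1" '
        {
            for (i = 1; i <= NF; i++) {
                total++
                if ($i !~ ("[" marks "]")) clean++
            }
        }
        END { if (total) printf "%.2f\n", 100 * clean / total }'
}

# Example: one token carries *, another carries #.
printf 'Я читаю *китап #книгу\n' | coverage '*'     # prints 75.00
printf 'Я читаю *китап #книгу\n' | coverage '*@#'   # prints 50.00
```

Passing '*' counts only unknown-word marks, while '*@#' also counts tokens carrying @ or # debug symbols, which mirrors the three columns of the corpus-testvoc cell.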
You will need HFST, lttoolbox, vislcg3, apertium and apertium-lex-tools installed on your computer to be able to compile and use apertium-tat-rus.
If you are using the Apertium VirtualBox image, all of these tools will already be installed.
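Before compiling, it can help to confirm that the required tools are on your PATH. The snippet below is a minimal sketch: the function name `missing_tools` is hypothetical, and the choice of one representative binary per package (hfst-lookup for HFST, lt-comp for lttoolbox, lrx-comp for apertium-lex-tools) is an assumption; your installation may expose other binaries as well.

```shell
# missing_tools CMD...: print the name of every command that is not on PATH.
# Hypothetical helper, not part of Apertium.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed representative binaries for the five required packages.
missing_tools hfst-lookup lt-comp vislcg3 apertium lrx-comp
```

If the last command prints nothing, all five binaries were found.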
Apertium-tat-rus fetches the morphological transducers and CG disambiguators of Tatar and Russian from the apertium-tat and apertium-rus directories in the /languages module. So you have to check out and compile these two monolingual packages first:
<pre>
mkdir languages
cd languages
git clone https://github.com/apertium/apertium-tat.git
cd apertium-tat
./autogen.sh
make
cd ..
git clone https://github.com/apertium/apertium-rus.git
cd apertium-rus
./autogen.sh
make
cd ../..
</pre>
After you're done with that, you have to check out and compile apertium-tat-rus itself, specifying where the monolingual packages you've just compiled are located:
<pre>
mkdir nursery
cd nursery
git clone https://github.com/apertium/apertium-tat-rus.git
cd apertium-tat-rus
./autogen.sh --with-lang1=../../languages/apertium-tat/ --with-lang2=../../languages/apertium-rus/
make
</pre>
You can test the translator now:
<pre>
echo "Мин китап укыйм." | apertium -d . tat-rus
Я читаю книгу.
</pre>
<pre>
cd ../..
mkdir trunk
cd trunk
git clone https://github.com/apertium/apertium-eval-translator.git
cd ..
cd nursery/apertium-tat-rus
./qa
</pre>
./qa runs the whole regression test suite. It requires the apertium-eval-translator.pl script from trunk and assumes that your directory structure follows that of the Apertium repository:
<pre>
                                     ..
                    /                |                \
         languages                nursery                 trunk
        /         \                   \                     |
apertium-tat   apertium-rus   apertium-tat-rus   apertium-eval-translator
</pre>
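Since ./qa depends on this layout, a quick check before running it can save some confusion. The following is a minimal sketch under the assumption that all four checkouts live under one common root directory, as in the steps on this page; `check_layout` is a hypothetical helper, not something shipped with apertium-tat-rus.

```shell
# check_layout ROOT: list every expected checkout that is missing under ROOT.
# Hypothetical helper; the directory names follow the steps on this page.
check_layout() {
    root=$1
    for dir in languages/apertium-tat languages/apertium-rus \
               nursery/apertium-tat-rus trunk/apertium-eval-translator; do
        [ -d "$root/$dir" ] || printf 'missing: %s\n' "$dir"
    done
}

# Run from the directory that contains languages, nursery and trunk.
check_layout .
```

No output means that all four directories are in place.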
 Workplan (GSoC 2014)
This was a workplan for development efforts for the Tatar to Russian translator in Google Summer of Code 2014.
 Major goals
- Clean testvoc
- 10000 top stems in bidix and at least 80% trimmed coverage
- Constraint grammar of Tatar containing at least 1000 rules, which makes 90-95% of all words unambiguous, with at least 95% retaining the correct analysis.
- Average WER on unseen texts below 50%
|Weeks 1-6||Weeks 7-12||Saturdays|
|get categor(y/ies) testvoc clean with one word ->||<- add more stems to categor(y/ies) while preserving testvoc clean|
|disambiguation||lexical selection||adding stems & cleaning testvoc|
|transfer rules for pending wiki tests (focus on phrases and clauses, not single words)|
 Weekly schedule
|Testvoc (category, type)||Stems||Testvoc clean?||Stems|
|2||26/05—01/06||Adjectives, lite||All nouns||✓||2141||--||Not all nouns were added. Around 500 are pending. They require adding stems to rus.dix as well.|
|3||02/06—08/06||Verbs, lite||4106||95.27%||3258||--||Adjectives, verbs added. Some more pending in apertium-rus/dev/to_[add/check].txt|
|4||09/06—15/06||Adverbs, full||6071||✓||5331||--||Adverbs clean when I comment out everything except adverbs in Root lexicon. 96% if I don't. Probably something else gets translated with adverb(s).|
|6||23/06—29/06||Pronouns, full||10000||✗||5529||1. 500 words x 2; 2. Try out assimilation evaluation toolkit if it's usable by that time.||Midterm evaluation. Results when unknown-word marks (stars) are not removed: tat-rus/texts/text1.* (full coverage): WER 66.73%, PER 56.48%; tat-rus/texts/text2.* (not fully covered): WER 78.42%, PER 63.58%.|
|7||30/06—06/07||Manually disambiguate a Tatar corpus (in a way so that it will be usable in the cg3ide later)||✓||--||See apertium-tat/corpus/corpus.ana.txt|
|8||07/07—13/07||Corpus testvoc clean on all of the available corpora||✗||--||Difference between error-free coverage of the analyser and error-free coverage of the full translator is between 1.6% and 4% (see stats above).|
|12||04/08—10/08||Write Constraint Grammar for Tatar||111 rules||--||
apertium-tat$ wc -l corpus/corpus.ana.txt
15090 corpus/corpus.ana.txt
apertium-tat$ ./qa cg
False negatives: 589; False positives: 126
|13||11/08—18/08||All categories, full||10000||✗||6000||Gisting||Final evaluation|
- Testvoc-lite (apertium-tat-rus/testvoc/lite$ ./testvoc.sh) for a category means taking one word from each sub-category and making the full paradigm of that word pass through the translator without debug symbols. E.g. "testvoc-lite for nouns clean" means that if we leave only one word in each of the N1, N-COMPOUND-PX etc. lexicons, they will pass through the translator without [@#] errors.
- Until the midterm, the bilingual dictionary and, if necessary, the Russian transducer are expanded with translations of stems taken from apertium-tat.tat.lexc.
- Evaluation (except for gisting evaluation) means taking n words and measuring the post-edition word error rate (WER). The output for those n words should be clean.
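To make the WER and PER figures on this page more concrete, here is a small sketch of both measures on single sentences. It is not the apertium-eval-translator.pl implementation. WER is taken here as the word-level edit distance divided by the reference length. PER uses one common position-independent definition, (max(reference length, hypothesis length) - matched words) / reference length, which is an assumption about the exact formula. The function names `wer` and `per` are hypothetical, and both assume a non-empty reference.

```shell
# wer REF HYP: word-level edit distance over reference length, in percent.
wer() {
    awk -v ref="$1" -v hyp="$2" 'BEGIN {
        n = split(ref, r, " "); m = split(hyp, h, " ")
        for (i = 0; i <= n; i++) d[i, 0] = i
        for (j = 0; j <= m; j++) d[0, j] = j
        for (i = 1; i <= n; i++)
            for (j = 1; j <= m; j++) {
                cost = (r[i] == h[j]) ? 0 : 1
                best = d[i-1, j-1] + cost                        # substitution or match
                if (d[i-1, j] + 1 < best) best = d[i-1, j] + 1   # deletion
                if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1   # insertion
                d[i, j] = best
            }
        printf "%.2f\n", 100 * d[n, m] / n
    }'
}

# per REF HYP: position-independent error rate, in percent (word order ignored).
per() {
    awk -v ref="$1" -v hyp="$2" 'BEGIN {
        n = split(ref, r, " "); m = split(hyp, h, " ")
        for (i = 1; i <= n; i++) count[r[i]]++
        matches = 0
        for (j = 1; j <= m; j++)
            if (count[h[j]] > 0) { count[h[j]]--; matches++ }
        longer = (n > m) ? n : m
        printf "%.2f\n", 100 * (longer - matches) / n
    }'
}

wer "a b c d" "b a c d"   # prints 50.00: two words are out of place
per "a b c d" "b a c d"   # prints 0.00: same words, order ignored
```

The example shows why PER is never higher than WER for the same pair of sentences: a reordering counts as an error for WER but not for PER.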