Difference between revisions of "Tatar and Russian"
Revision as of 16:38, 13 March 2015
This is a language pair translating from Tatar to Russian. The pair is currently located in nursery.
Current state
| Last updated | Testvoc (clean or not) | Corpus testvoc (no *, no */@, no */@/#) | Stems in bidix | WER, PER on dev. corpus | Average WER, PER on unseen texts | 
|---|---|---|---|---|---|
| 13/03/2015 | No | | 6000 | 71.03%, 54.02% | See GSoC2014 evaluation results below. |
- Testvoc = apertium-tat-rus/testvoc/standard/testvoc.sh
- Corpus testvoc = apertium-tat-rus/testvoc/corpus/trimmed-coverage.sh. Corpora can be found in the turkiccorpora repository.
- news = tat.news.2005-2011_300K-sentences.txt.bz2, wp = tat.wikipedia.2013-02-25.txt.bz2, NT = tat.NT.txt.bz2. Others are unambiguous.
 
- The number of stems is taken from the header that "apertium-dixtools format-1line 10 55 apertium-tat-rus.tat-rus.dix" produces.
- Development corpus = apertium-tat-rus/corpus (-test = tat-rus-nova.txt, -ref = tat-rus-posted.txt). WER / PER results are given when unknown-word marks (stars) are not removed.
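The three corpus testvoc figures (no *, no */@, no */@/#) can be approximated by counting debug marks in translator output. The following is a minimal sketch, not the trimmed-coverage.sh script itself. It assumes the usual Apertium convention that a token starting with * is an unknown word, @ a word missing from the bidix and # a generation failure, and it splits tokens on whitespace only. The function name mark_coverage is hypothetical.

```shell
#!/bin/sh
# Read translator output on stdin and report the share of whitespace-separated
# tokens that do not start with a debug mark (assumed meaning: * unknown word,
# @ missing bidix entry, # generation failure).
mark_coverage() {
    awk '
        { for (i = 1; i <= NF; i++) {
              n++
              c = substr($i, 1, 1)
              if (c == "*") star++
              else if (c == "@") at++
              else if (c == "#") hash++
          } }
        END {
            if (n == 0) { print "no tokens"; exit }
            printf "tokens=%d no*=%.2f%% no*/@=%.2f%% no*/@/#=%.2f%%\n",
                   n, 100 * (n - star) / n, 100 * (n - star - at) / n,
                   100 * (n - star - at - hash) / n
        }'
}

# Made-up sample line, not real translator output.
echo "*foo bar @baz #qux quux" | mark_coverage
```

On a real corpus the input would be the translator's output for that corpus. Punctuation attached to a token is not handled here, so the figures are only a rough estimate.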
Installation
You will need HFST, lttoolbox, vislcg3, apertium and apertium-lex-tools installed on your computer to be able to compile and use apertium-tat-rus.
If you are using a Debian-based distro, the easiest way to get those dependencies is to install them with apt-get from Tino Didriksen's repository.
If you are using the Apertium VirtualBox image, all those tools will be already installed.
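Before compiling, it can help to confirm that the dependencies are actually on your PATH. The sketch below checks a handful of representative binaries. Which binaries stand for which package is an assumption made here (hfst-lexc for HFST, lt-comp and lt-proc for lttoolbox, cg-comp and cg-proc for vislcg3, lrx-comp and lrx-proc for apertium-lex-tools), and the function name missing_tools is hypothetical.

```shell
#!/bin/sh
# Print every command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Representative binaries (an assumption): one or two per required package.
missing=$(missing_tools hfst-lexc lt-comp lt-proc cg-comp cg-proc apertium lrx-comp lrx-proc)
if [ -n "$missing" ]; then
    echo "Missing tools:"
    echo "$missing"
else
    echo "All checked tools were found."
fi
```

A tool being present does not guarantee that its version is recent enough. The ./autogen.sh step remains the authoritative check.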
Apertium-tat-rus fetches morphological transducers and CG-disambiguators of Tatar and Russian from the apertium-tat and apertium-rus directories in the /languages module. So you have to check out and compile these two monolingual packages first:
    mkdir languages
    cd languages
    svn co http://svn.code.sf.net/p/apertium/svn/languages/apertium-tat/
    cd apertium-tat
    ./autogen.sh
    make
    cd ..
    svn co http://svn.code.sf.net/p/apertium/svn/languages/apertium-rus/
    cd apertium-rus
    ./autogen.sh
    make
    cd ../..
After you're done with that, check out and compile apertium-tat-rus itself, specifying where the monolingual packages you've just compiled are located:
    mkdir nursery
    cd nursery
    svn co http://svn.code.sf.net/p/apertium/svn/nursery/apertium-tat-rus/
    cd apertium-tat-rus
    ./autogen.sh --with-lang1=../../languages/apertium-tat/ --with-lang2=../../languages/apertium-rus/
    make
You can test the translator now:
    echo "Мин китап укыйм." | apertium -d . tat-rus
    Я читаю книгу.
    cd ../..
    mkdir trunk
    cd trunk
    svn co http://svn.code.sf.net/p/apertium/svn/trunk/apertium-eval-translator/
    cd ..
    cd nursery/apertium-tat-rus
    ./qa
./qa runs the whole regression test suite. It requires the apertium-eval-translator.pl script from trunk and assumes that your directory structure follows that of the Apertium repository:
                        ..
         /                   |                 \
    languages             nursery             trunk
    /        \                \                  |
apertium-tat  apertium-rus  apertium-tat-rus  apertium-eval-translator
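Since ./qa depends on this layout, a quick check before running it can save some confusion. The sketch below lists any of the four directories from the tree above that are missing under a given root. The function name check_layout is hypothetical, and the example call assumes you are inside nursery/apertium-tat-rus.

```shell
#!/bin/sh
# Print each expected directory that is missing under the given root.
# The four paths mirror the directory tree shown above.
check_layout() {
    root=$1
    for dir in languages/apertium-tat languages/apertium-rus \
               nursery/apertium-tat-rus trunk/apertium-eval-translator; do
        [ -d "$root/$dir" ] || printf '%s\n' "$dir"
    done
}

# Example: run from nursery/apertium-tat-rus, so the root is two levels up.
check_layout ../..
```

No output means that all four directories are in place.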
Workplan (GSoC 2014)
This was the workplan for developing the Tatar-to-Russian translator during Google Summer of Code 2014.
Major goals
- Clean testvoc
- 10000 top stems in bidix and at least 80% trimmed coverage
- A constraint grammar for Tatar with at least 1000 rules that leaves 90-95% of all words unambiguous, with at least 95% of them retaining the correct analysis
- Average WER on unseen texts below 50%
Overview
| Weeks 1-6 | Weeks 7-12 | Saturdays | ||
|---|---|---|---|---|
| get categor(y/ies) testvoc clean with one word -> | <- add more stems to categor(y/ies) while preserving testvoc clean | disambiguation | lexical selection | adding stems & cleaning testvoc | 
| transfer rules for pending wiki tests (focus on phrases and clauses, not single words) | ||||
Weekly schedule
| Week | Dates | Target: testvoc (category, type) | Target: stems | Achieved: testvoc clean? | Achieved: stems | Evaluation | Notes | |
|---|---|---|---|---|---|---|---|---|
| 1 | 19/05—25/05 | Nouns, lite | -- | ✓ | 236 | -- | ||
| 2 | 26/05—01/06 | Adjectives, lite | All nouns from tat.lexc | ✓ | 2141 | -- | Not all nouns were added. Around 500 are pending. They require adding stems to rus.dix as well. | |
| 3 | 02/06—08/06 | Verbs, lite | 4106 | 95.27% | 3258 | -- | Adjectives, verbs added. Some more pending in apertium-rus/dev/to_[add/check].txt | |
| 4 | 09/06—15/06 | Adverbs, full | 6071 | ✓ | 5331 | -- | Adverbs clean when I comment out everything except adverbs in Root lexicon. 96% if I don't. Probably something else gets translated with adverb(s). | |
| 5 | 16/06—22/06 | Numerals, full | 8036 | ✓ | 5488 | -- | ||
| 6 | 23/06—29/06 | Pronouns, full | 10000 | ✗ | 5529 | 1. 500 words x 2; 2. Try out the assimilation evaluation toolkit if it is usable by that time. | Midterm evaluation. Results when unknown-word marks (stars) are not removed: tat-rus/texts/text1.* (full coverage): WER 66.73%, PER 56.48%; tat-rus/texts/text2.* (not fully covered): WER 78.42%, PER 63.58% | |
| 7 | 30/06—06/07 | Manually disambiguate a Tatar corpus (in a way so that it will be usable in the cg3ide later) | ✓ | -- | See apertium-tat/texts/corpus.ana.txt | |||
| 8 | 07/07—13/07 | Corpus testvoc clean on all of the available corpora | ✗ | -- | Difference between error-free coverage of the analyser and error-free coverage of the full translator is between 1.6% and 4% (see stats above). | |||
| 9 | 14/07—20/07 | |||||||
| 10 | 21/07—27/07 | |||||||
| 11 | 28/07—03/08 | |||||||
| 12 | 04/08—10/08 | Write Constraint Grammar for Tatar | -- | |||||
| 13 | 11/08—18/08 | All categories, full | 10000 | Gisting | Final evaluation | |||
- Testvoc-lite (apertium-tat-rus/testvoc/lite$ ./testvoc.sh) for a category means taking one word per sub-category and making the full paradigm of that word pass through the translator without debug symbols. E.g. "testvoc-lite for nouns clean" means that if we leave only one word in each of the N1, N-COMPOUND-PX etc. lexicons, they will pass through the translator without [@#] errors.
- Until the midterm, the bilingual dictionary and, if necessary, the Russian transducer are expanded with translations of stems taken from apertium-tat.tat.lexc.
- Evaluation (except for the gisting evaluation) means taking a sample of words and measuring the post-edition word error rate (WER) on it. The output for those words should be clean.

