{{TOCD}}
This is a language pair translating from [[Tatar]] to [[Russian]]. The pair is currently hosted on [https://github.com/apertium/apertium-tat-rus GitHub].
== Current state ==
{|class=wikitable
! Last updated
! Testvoc (clean or not)
! Corpus testvoc<br/>(no *, no */@, no */@/# errors)<br/>(coverage of the trimmed Tatar morphological analyser; coverage of the Tatar morphological analyser and of the bilingual dictionary; coverage of the whole translator)
! Stems in the bilingual dictionary
! WER, PER on dev. corpus
! Average WER, PER on unseen texts
|-
! 13/03/2015
| No
|
* news (86.3, 86.3, 82.3)
* wp (83.0, 83.0, 80.0)
* Quran (85.4, 85.4, 82.3)
* NT (83.0, 83.0, 80.0)
* aytmatov (90.1, 90.1, 88.3)
| 6000
| 71.03%, 54.02%
| See the GSoC 2014 evaluation results below.
|}

* Testvoc = <code>apertium-tat-rus/testvoc/standard/testvoc.sh</code>
* Corpus testvoc = <code>apertium-tat-rus/testvoc/corpus/trimmed-coverage.sh</code>. Corpora can be found in the turkiccorpora repository (see the coverage sketch after this list).
** news = tat.news.2005-2011_300K-sentences.txt.bz2, wp = tat.wikipedia.2013-02-25.txt.bz2, NT = tat.NT.txt.bz2. The other names are unambiguous.
* The number of stems is taken from the header that <code>apertium-dixtools format-1line 10 55 apertium-tat-rus.tat-rus.dix</code> produces.
* Development corpus = apertium-tat-rus/corpus (-test = tat-rus-nova.txt, -ref = tat-rus-posted.txt). WER/PER results are given with unknown-word marks (stars) left in place.
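Coverage numbers like those in the table can be approximated with a one-off script. The sketch below is '''not''' the pair's <code>trimmed-coverage.sh</code>; it is a minimal approximation that assumes the trimmed analyser has been compiled as <code>tat-rus.automorf.hfst</code> in the pair's directory, and it counts the share of tokens that receive at least one analysis:

<pre>
#!/bin/sh
# Naive coverage sketch (an approximation, not the pair's trimmed-coverage.sh).
# Usage: ./coverage.sh tat.news.2005-2011_300K-sentences.txt.bz2
bzcat "$1" | apertium-destxt | hfst-proc tat-rus.automorf.hfst |
  grep -o '\^[^$]*\$' > /tmp/lus.txt          # one lexical unit per line
total=$(wc -l < /tmp/lus.txt)
unknown=$(grep -c '/\*' /tmp/lus.txt)         # unknown words look like ^foo/*foo$
echo "coverage: $(echo "scale=2; ($total - $unknown) * 100 / $total" | bc)%"
</pre>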
== Installation ==

You will need HFST, lttoolbox, vislcg3, apertium and apertium-lex-tools installed on your computer to be able to compile and use apertium-tat-rus.

If you are using a Debian-based distro, the easiest way to get those dependencies is to install them with apt-get from [[User:Tino Didriksen]]'s [[Prerequisites for Debian|repository]].

If you are using the [[Apertium VirtualBox]] image, all of those tools are already installed.

Apertium-tat-rus fetches the morphological transducers and CG disambiguators for Tatar and Russian from the <code>apertium-tat</code> and <code>apertium-rus</code> directories in the <code>languages</code> module, so you have to check out and compile these two monolingual packages first:
<pre>
mkdir languages
cd languages
git clone https://github.com/apertium/apertium-tat.git
cd apertium-tat
./autogen.sh
make
cd ..
git clone https://github.com/apertium/apertium-rus.git
cd apertium-rus
./autogen.sh
make
cd ../..
</pre>
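Optionally, you can smoke-test the Tatar analyser on its own before building the pair. The file name below follows the usual Apertium naming convention, and the analysis shown is illustrative only; the exact tags depend on the current apertium-tat:

<pre>
$ echo "китап" | hfst-proc languages/apertium-tat/tat.automorf.hfst
^китап/китап<n><nom>$
</pre>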
After you're done with that, you have to check out and compile <code>apertium-tat-rus</code> itself, specifying where the monolingual packages you've just compiled are located:
<pre>
mkdir nursery
cd nursery
git clone https://github.com/apertium/apertium-tat-rus.git
cd apertium-tat-rus
./autogen.sh --with-lang1=../../languages/apertium-tat/ --with-lang2=../../languages/apertium-rus/
make
</pre>
You can test the translator now:
<pre>
echo "Мин китап укыйм." | apertium -d . tat-rus
Я читаю книгу.
</pre>

To run the regression test suite, you also need apertium-eval-translator:

<pre>
cd ../..
mkdir trunk
cd trunk
git clone https://github.com/apertium/apertium-eval-translator.git
cd ..
cd nursery/apertium-tat-rus
./qa
</pre>
<code>./qa</code> runs the whole regression test suite. It requires the <code>apertium-eval-translator.pl</code> script from trunk and assumes that your directory structure follows that of the apertium repository:
<pre>
                          ..
             /            |            \
      languages        nursery        trunk
       /     \            |             |
apertium-tat  apertium-rus   apertium-tat-rus   apertium-eval-translator
</pre>
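The WER/PER figures in the stats table above can also be reproduced by hand with apertium-eval-translator. A minimal sketch, run from inside <code>nursery/apertium-tat-rus</code>, with the <code>-test</code>/<code>-ref</code> files taken from the notes under the "Current state" table:

<pre>
# A sketch; the file names are the development-corpus files listed in
# the notes under the "Current state" table.
perl ../../trunk/apertium-eval-translator/apertium-eval-translator.pl \
     -test corpus/tat-rus-nova.txt \
     -ref corpus/tat-rus-posted.txt
</pre>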
== Workplan (GSoC 2014) ==

This was the workplan for development of the Tatar-to-Russian translator in [[Google Summer of Code]] 2014.
=== Major goals ===

* Clean testvoc
* 10000 top stems in bidix and at least 80% trimmed coverage
* A constraint grammar for Tatar containing at least 1000 rules, making 90-95% of all words unambiguous, with the correct analysis retained at least 95% of the time (see the sketch after this list)
* Average WER on unseen texts below 50%
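For a sense of what such constraint grammar rules look like, here is a minimal CG-3 sketch. It is illustrative only and is not taken from apertium-tat:

<pre>
# Illustrative CG-3 rule shape only; not from apertium-tat.tat.rlx.
LIST Det = det ;
LIST N = n ;

SECTION

# Keep only the noun reading of an ambiguous word when the word
# immediately to its left is unambiguously a determiner.
SELECT N IF (-1C Det) ;
</pre>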
=== Overview ===

{|class=wikitable
|-
!colspan="2"| Weeks 1-6 !!colspan="2"| Weeks 7-12 !! Saturdays
|-
| get categor(y/ies) testvoc clean<br/>with one word ->
| <- add more stems to categor(y/ies)<br/>while preserving testvoc clean
| disambiguation
| lexical selection
|rowspan="2"| adding stems & cleaning testvoc
|-
|colspan="4" style="text-align:center"| transfer rules for pending wiki tests (focus on phrases and clauses, not single words)
|}
=== Weekly schedule ===
{|class=wikitable
|-
!rowspan="2"| Week !!rowspan="2"| Dates !!colspan="2"| Target !! !!colspan="2"| Achieved !!rowspan="2"| Evaluation !!rowspan="2"| Notes
|-
! Testvoc (category, type) !! Stems !! !! Testvoc clean? !! Stems
|-
| 1 || 19/05—25/05
| Nouns, lite || -- || ||align=center| ✓ || 236 || -- ||
|-
| 2 || 26/05—01/06
| Adjectives, lite || All nouns<br/>from tat.lexc || ||align=center| ✓ || 2141 || -- || Not all nouns were added. Around 500 are pending. They require adding stems to rus.dix as well.
|-
| 3 || 02/06—08/06
| Verbs, lite || 4106 || ||align=center| 95.27% || 3258 || -- || Adjectives and verbs added. Some more are pending in apertium-rus/dev/to_[add/check].txt
|-
| 4 || 09/06—15/06
| Adverbs, full || 6071 || ||align=center| ✓ || 5331 || -- || Adverbs are clean when I comment out everything except adverbs in the Root lexicon; 96% if I don't. Probably something else gets translated with adverb(s).
|-
| 5 || 16/06—22/06
| Numerals, full || 8036 || ||align=center| ✓ || 5488 || -- ||
|-
| 6 || 23/06—29/06
| Pronouns, full || 10000 || ||align=center| ✗ || 5529 || 1. 500 words x 2<br/>2. Try out the assimilation evaluation toolkit if it's usable by that time. || '''Midterm evaluation'''<br/>Results when unknown-word marks (stars) are not removed<br/>tat-rus/texts/text1.* (full coverage):<br/>WER 66.73%, PER 56.48%<br/>tat-rus/texts/text2.* (not fully covered):<br/>WER 78.42%, PER 63.58%
|-
| 7 || 30/06—06/07
|colspan="2"| Manually disambiguate a Tatar corpus (in a way that will make it usable in cg3ide later) || ||colspan="2" style="text-align: center"| ✓ || -- || See apertium-tat/corpus/corpus.ana.txt
|-
| 8 || 07/07—13/07
|colspan="2" rowspan="4"| Corpus testvoc clean on all of the available corpora ||rowspan="4"| ||rowspan="4" colspan="2" style="text-align: center"| ✗ ||rowspan="4"| -- ||rowspan="4"| The difference between the error-free coverage of the analyser and the error-free coverage of the full translator is between 1.6% and 4% (see stats above).
|-
| 9 || 14/07—20/07
|-
| 10 || 21/07—27/07
|-
| 11 || 28/07—03/08
|-
| 12 || 04/08—10/08
|colspan="2"| Write Constraint Grammar for Tatar ||
|colspan="2" align=center| 111 rules || --
|
<pre>
apertium-tat$ wc -l corpus/corpus.ana.txt
15090 corpus/corpus.ana.txt
apertium-tat$ ./qa cg
False negatives: 589; False positives: 126
</pre>
|-
| 13 || 11/08—18/08
| All categories, full || 10000 || ||align=center| ✗ || 6000 || [https://svn.code.sf.net/p/apertium/svn/branches/papers/2015-eamt-assim/tat-rus/ Gisting] || '''Final evaluation'''
|}

* Testvoc-lite (<code>apertium-tat-rus/testvoc/lite$ ./testvoc.sh</code>) for a category means taking one word for each sub-category and making the full paradigm of that word pass through the translator without debug symbols. E.g. "testvoc-lite for nouns clean" means that if we leave only one word in each of the N1, N-COMPOUND-PX etc. lexicons, they will pass through the translator without [@#] errors (see the sketch after this list).
* Until midterm, the bilingual dictionary and, where necessary, the Russian transducer were expanded with translations of stems taken from apertium-tat.tat.lexc.
* Evaluation (except for the gisting evaluation) means taking <math>n</math> words and performing an [[evaluation]] for post-edition word error rate (WER). The output for those <math>n</math> words should be clean.
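The pass/fail criterion itself is easy to reproduce by hand. A minimal sketch, assuming a file <code>nouns-paradigm.txt</code> (hypothetical name) that holds the expanded surface forms of the one remaining test word, one per line; the real scripts live in <code>apertium-tat-rus/testvoc/</code>:

<pre>
# Sketch of the testvoc check only; nouns-paradigm.txt is a hypothetical
# file with one surface form of the test word's paradigm per line.
apertium -d . tat-rus < nouns-paradigm.txt |
  grep -E '[@#]' && echo "testvoc NOT clean"
</pre>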

== See also ==

* [[/Pending tests|Pending tests]]
* [[/Regression tests|Regression tests]]
* [[Apertium-tat-rus/stats|Stats]]

[[Category:Tatar and Russian|*]]