Difference between revisions of "Tatar and Russian"
{{TOCD}}

This is a language pair translating from [[Tatar]] to [[Russian]]. The pair is currently hosted on [https://github.com/apertium/apertium-tat-rus GitHub].

== Current state ==

{|class=wikitable
! Last updated
! Testvoc (clean or not)
! Corpus testvoc<br/>(no *, no */@, no */@/# errors)<br/>(coverage of the trimmed Tatar morphological analyser, coverage of the Tatar morphological analyser and of the bilingual dictionary, coverage of the whole translator)
! Stems in the bilingual dictionary
! WER, PER on dev. corpus
! Average WER, PER on unseen texts
|-
! 13/03/2015
| No
|
* news(86.3, 86.3, 82.3)
* wp(83.0, 83.0, 80.0)
* Quran(85.4, 85.4, 82.3)
* NT(83.0, 83.0, 80.0)
* aytmatov(90.1, 90.1, 88.3)
| 6000
| 71.03%, 54.02%
| See GSoC2014 evaluation results below.
|-
|}

* Testvoc = apertium-tat-rus/testvoc/standard/testvoc.sh
* Corpus testvoc = apertium-tat-rus/testvoc/corpus/trimmed-coverage.sh. Corpora can be found in the turkiccorpora repository.
** news = tat.news.2005-2011_300K-sentences.txt.bz2, wp = tat.wikipedia.2013-02-25.txt.bz2, NT = tat.NT.txt.bz2. Others are unambiguous.
* The number of stems is taken from the header that <code>apertium-dixtools format-1line 10 55 apertium-tat-rus.tat-rus.dix</code> produces.
* Development corpus = apertium-tat-rus/corpus (-test = tat-rus-nova.txt, -ref = tat-rus-posted.txt). WER / PER results are given when unknown-word marks (stars) are not removed.

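The three corpus testvoc figures in the table can be read as the share of tokens that carry no <code>*</code> mark, no <code>*</code> or <code>@</code> mark, and no <code>*</code>, <code>@</code> or <code>#</code> mark. The following is a minimal sketch of that calculation, not the code of <code>trimmed-coverage.sh</code>. It assumes that the translator output is split on whitespace and that a debug mark appears as the first character of a token. The function name <code>coverage</code> and the sample sentence are made up for illustration.

```python
def coverage(tokens):
    """Return three percentages: tokens without '*', without '*' or '@',
    and without '*', '@' or '#' as their first character."""
    total = len(tokens)
    if total == 0:
        return (0.0, 0.0, 0.0)
    result = []
    # Each level adds one more debug mark that disqualifies a token.
    for marks in ("*", "*@", "*@#"):
        clean = sum(1 for t in tokens if not t.startswith(tuple(marks)))
        result.append(round(100.0 * clean / total, 1))
    return tuple(result)


# Hypothetical translator output with one token of each error type.
sample = "Я читаю *китап и @укы а #книга тут".split()
print(coverage(sample))
```

Each successive figure can only stay the same or drop, which matches the pattern of the numbers reported in the table.
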
== Installation ==

You will need HFST, lttoolbox, vislcg3, apertium and apertium-lex-tools installed on your computer to be able to compile and use apertium-tat-rus.

If you are using a Debian-based distro, the easiest way to get those dependencies is to install them with apt-get from [[User:Tino Didriksen]]'s [[Prerequisites for Debian|repository]].

If you are using the [[Apertium VirtualBox]] image, all those tools will already be installed.

Apertium-tat-rus takes the morphological transducers and CG disambiguators for Tatar and Russian from the <code>apertium-tat</code> and <code>apertium-rus</code> directories in the <code>/languages</code> module, so you have to check out and compile these two monolingual packages first:

<pre>
mkdir languages
cd languages
git clone https://github.com/apertium/apertium-tat.git
cd apertium-tat
./autogen.sh
make
cd ..
git clone https://github.com/apertium/apertium-rus.git
cd apertium-rus
./autogen.sh
make
cd ../..
</pre>

After that, check out and compile <code>apertium-tat-rus</code> itself, specifying where the monolingual packages you have just compiled are located:

<pre>
mkdir nursery
cd nursery
git clone https://github.com/apertium/apertium-tat-rus.git
cd apertium-tat-rus
./autogen.sh --with-lang1=../../languages/apertium-tat/ --with-lang2=../../languages/apertium-rus/
make
</pre>

You can now test the translator and run the regression tests:

<pre>
# Translate a test sentence; the expected output is shown on the next line
echo "Мин китап укыйм." | apertium -d . tat-rus
Я читаю книгу.

# Fetch apertium-eval-translator into trunk/, next to languages/ and nursery/
cd ../..
mkdir trunk
cd trunk
git clone https://github.com/apertium/apertium-eval-translator.git
cd ..

# Run the regression test suite
cd nursery/apertium-tat-rus
./qa
</pre>

<code>./qa</code> runs the whole regression test suite. It requires the <code>apertium-eval-translator.pl</code> script from <code>trunk</code> and assumes that your directory structure follows that of the apertium repository:

<pre>
                                   ..
                        /           |          \
         languages               nursery                 trunk
        /         \                 \                     |
apertium-tat  apertium-rus  apertium-tat-rus  apertium-eval-translator
</pre>

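Before running <code>./qa</code>, it may help to confirm that the checkout really has this layout. The following is a small sketch that lists the expected directories that are missing under a given root. The helper name <code>missing_dirs</code> is hypothetical, and the directory names are simply the ones used in the commands on this page.

```python
from pathlib import Path

# Directories that ./qa is described as expecting, relative to the common root.
EXPECTED = [
    "languages/apertium-tat",
    "languages/apertium-rus",
    "nursery/apertium-tat-rus",
    "trunk/apertium-eval-translator",
]


def missing_dirs(root):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_dir()]


# Check the current directory; an empty list means the layout is complete.
print(missing_dirs("."))
```
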
== Workplan (GSoC 2014) ==

This was a workplan for development efforts for the Tatar to Russian translator in [[Google Summer of Code]] 2014.

=== Major goals ===

* Clean testvoc
* 10000 top stems in bidix and at least 80% trimmed coverage
* A constraint grammar of Tatar containing at least 1000 rules, which makes 90-95% of all words unambiguous, with at least 95% of them retaining the correct analysis.
* Average WER on unseen texts below 50%

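The constraint grammar goal combines two percentages: how many words are left with exactly one analysis, and how many words still keep the correct analysis after disambiguation. The following is a minimal sketch of both measures, assuming a hand-disambiguated corpus is available as pairs of remaining analyses and the gold analysis. The function name <code>cg_scores</code>, the data layout and the sample analyses are illustrative assumptions, not the format of <code>corpus.ana.txt</code>.

```python
def cg_scores(tokens):
    """tokens: list of (remaining_analyses, gold_analysis) pairs.

    Returns (percent unambiguous, percent retaining the gold analysis)."""
    total = len(tokens)
    if total == 0:
        return (0.0, 0.0)
    unambiguous = sum(1 for remaining, _ in tokens if len(remaining) == 1)
    retained = sum(1 for remaining, gold in tokens if gold in remaining)
    return (100.0 * unambiguous / total, 100.0 * retained / total)


# Hypothetical analyses left after disambiguation, paired with the gold analysis.
tokens = [
    (["мин<prn>"], "мин<prn>"),                           # unambiguous, correct
    (["китап<n><nom>", "китап<n><attr>"], "китап<n><nom>"),  # still ambiguous
    (["укы<v>"], "укы<v>"),                               # unambiguous, correct
    (["ул<det>"], "ul<prn>"),                             # unambiguous, gold removed
]
print(cg_scores(tokens))
```

The two numbers pull in opposite directions: removing more readings raises the first figure but risks lowering the second, which is why the goal states both.
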
=== Overview ===

{|class=wikitable
|-
!colspan="2"| Weeks 1-6 !!colspan="2"| Weeks 7-12 !! Saturdays
|-
| get categor(y/ies) testvoc clean<br/>with one word ->
| <- add more stems to categor(y/ies)<br/>while preserving testvoc clean
| disambiguation
|lexical selection||rowspan="2" | adding stems & cleaning testvoc
|-
|colspan="4" style="text-align:center"| transfer rules for pending wiki tests (focus on phrases and clauses, not single words)
|}

=== Weekly schedule ===

This was the weekly schedule of the development efforts for the Tatar-to-Russian translator in Google Summer of Code 2014.

{|class=wikitable
|-
!rowspan="2"| Week !!rowspan="2"| Dates !!colspan="2"| Target !! !!colspan="2"| Achieved !!rowspan="2"| Evaluation !!rowspan="2"| Notes
|-
! Testvoc (category, type) !! Stems !! !! Testvoc clean? !! Stems
|-
| 1 || 19/05–25/05
| Nouns, lite || -- || ||align=center| ✓ || 236 || -- ||
|-
| 2 || 26/05–01/06
| Adjectives, lite || All nouns<br/>from tat.lexc || ||align=center| ✓ || 2141 || -- || Not all nouns were added. Around 500 are pending. They require adding stems to rus.dix as well.
|-
| 3 || 02/06–08/06
| Verbs, lite || 4106 || ||align=center| 95.27% || 3258 || -- || Adjectives, verbs added. Some more pending in apertium-rus/dev/to_[add/check].txt
|-
| 4 || 09/06–15/06
| Adverbs, full || 6071 || ||align=center| ✓ || 5331 || -- || Adverbs clean when I comment out everything except adverbs in Root lexicon. 96% if I don't. Probably something else gets translated with adverb(s).
|-
| 5 || 16/06–22/06
| Numerals, full || 8036 || ||align=center| ✓ || 5488 || -- ||
|-
| 6 || 23/06–29/06
| Pronouns, full || 10000 || ||align=center| ✗|| 5529 || 1. 500 words x 2<br/>2. Try out assimilation evaluation toolkit if it's usable by that time. || '''Midterm evaluation'''<br/>Results when unknown-word marks (stars) are not removed<br/>tat-rus/texts/text1.* (full coverage):<br/>WER 66.73%, PER 56.48%<br/>tat-rus/texts/text2.* (not fully covered):<br/>WER 78.42%, PER 63.58%
|-
| 7 || 30/06–06/07
|colspan="2"|Manually disambiguate a Tatar corpus (in a way so that it will be usable in the cg3ide later) || ||colspan="2" style="text-align: center"| ✓ || -- || See apertium-tat/corpus/corpus.ana.txt
|-
| 8 || 07/07–13/07
|colspan="2" rowspan="4"| Corpus testvoc clean on all of the available corpora ||rowspan="4"| ||rowspan="4" colspan="2" style="text-align: center"| ✗||rowspan="4"| -- ||rowspan="4"| Difference between error-free coverage of the analyser and error-free coverage of the full translator is between 1.6% and 4% (see stats above).
|-
| 9 || 14/07–20/07
|-
| 10|| 21/07–27/07
|-
| 11|| 28/07–03/08
|-
| 12|| 04/08–10/08
|colspan="2"| Write Constraint Grammar for Tatar ||
|colspan="2" align=center| 111 rules || --
|
<pre>
apertium-tat$ wc -l corpus/corpus.ana.txt
15090 corpus/corpus.ana.txt
apertium-tat$ ./qa cg
False negatives: 589; False positives: 126
</pre>
|-
| 13 || 11/08–18/08
| All categories, full || 10000 || ||align=center| ✗ || 6000|| [https://svn.code.sf.net/p/apertium/svn/branches/papers/2015-eamt-assim/tat-rus/ Gisting] || '''Final evaluation'''
|}

* Testvoc-lite (<code>apertium-tat-rus/testvoc/lite$ ./testvoc.sh</code>) for a category means taking one word per sub-category and making the full paradigm of that word pass through the translator without debug symbols. E.g. "testvoc-lite for nouns clean" means that if we leave only one word in each of the N1, N-COMPOUND-PX etc. lexicons, they will pass through the translator without [@#] errors.
* Until the midterm, the bilingual dictionary and, if necessary, the Russian transducer are expanded with translations of stems taken from apertium-tat.tat.lexc.
* Evaluation (except for the gisting evaluation) consists of taking <math>n</math> words and measuring the post-edition word error rate (WER). The output for those <math>n</math> words should be clean.

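The WER and PER figures quoted on this page come from <code>apertium-eval-translator</code>. As a rough illustration of what the two measures capture, the sketch below computes a word-level edit distance for WER and a bag-of-words comparison for PER. It is not the script's code: normalising by the reference length and the particular PER formula are assumptions, and the function names and sample sentences are made up. Unknown-word stars are left in place, so a starred word counts as an error.

```python
from collections import Counter


def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def wer(hyp, ref):
    """Word error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(hyp, ref) / len(ref)


def per(hyp, ref):
    """Position-independent error rate in percent: word order is ignored."""
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return 100.0 * (max(len(hyp), len(ref)) - matches) / len(ref)


ref = "я читаю книгу".split()
unknown = "я читаю *китап".split()    # one untranslated word, star kept
reordered = "книгу я читаю".split()   # right words, wrong order
print(wer(unknown, ref), per(unknown, ref))
print(wer(reordered, ref), per(reordered, ref))
```

A reordered but otherwise correct translation is penalised by WER and not by PER, which is why PER is the lower of the two figures in the results reported here.
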
==See also==

* [[/Pending tests|Pending tests]]
* [[/Regression tests|Regression tests]]
* [[Apertium-tat-rus/stats|Stats]]

[[Category:Tatar and Russian|*]]
Latest revision as of 12:48, 9 March 2018