Tatar and Russian

This is a language pair translating from Tatar to Russian. The pair is currently located in nursery.

Current state

Last updated: 13/03/2015
Testvoc clean: No
Corpus testvoc (no *, no */@, no */@/#):
  • news (86.3, 86.3, 82.3)
  • wp (83.0, 83.0, 80.0)
  • Quran (85.4, 85.4, 82.3)
  • NT (83.0, 83.0, 80.0)
  • aytmatov (90.1, 90.1, 88.3)
Stems in bidix: 6000
WER, PER on dev. corpus: 71.03%, 54.02%
Average WER, PER on unseen texts: see GSoC 2014 evaluation results below.
  • Testvoc = apertium-tat-rus/testvoc/standard/testvoc.sh
  • Corpus testvoc = apertium-tat-rus/testvoc/corpus/trimmed-coverage.sh. Corpora can be found in the turkiccorpora repository.
    • news = tat.news.2005-2011_300K-sentences.txt.bz2, wp = tat.wikipedia.2013-02-25.txt.bz2, NT = tat.NT.txt.bz2. The other names are unambiguous.
  • The number of stems is taken from the header that "apertium-dixtools format-1line 10 55 apertium-tat-rus.tat-rus.dix" produces.
  • Development corpus = apertium-tat-rus/corpus (-test = tat-rus-nova.txt, -ref = tat-rus-posted.txt). WER / PER results are given when unknown-word marks (stars) are not removed.

Installation

You will need HFST, lttoolbox, vislcg3, apertium and apertium-lex-tools installed on your computer to be able to compile and use apertium-tat-rus.

If you are using a Debian-based distro, the easiest way to get those dependencies is to install them with apt-get from User:Tino Didriksen's repository.
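
For example, assuming the repository is already added to your apt sources, something like the following should pull everything in at once (apertium-all-dev is the meta-package that repository provides; adjust if your setup differs):

sudo apt-get update                    # refresh the package lists
sudo apt-get install apertium-all-dev  # should install HFST, lttoolbox, vislcg3, apertium and apertium-lex-tools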

If you are using the Apertium VirtualBox image, all those tools will be already installed.

Apertium-tat-rus fetches morphological transducers and CG-disambiguators of Tatar and Russian from the apertium-tat and apertium-rus directories in the /languages module. So you have to check out and compile these two monolingual packages first:

mkdir languages
cd languages
svn co http://svn.code.sf.net/p/apertium/svn/languages/apertium-tat/
cd apertium-tat
./autogen.sh
make
cd ..

svn co http://svn.code.sf.net/p/apertium/svn/languages/apertium-rus/
cd apertium-rus
./autogen.sh
make
cd ../..

After you're done with that, you have to check out and compile apertium-tat-rus itself, specifying where the monolingual packages you've just compiled are located:

mkdir nursery
cd nursery
svn co http://svn.code.sf.net/p/apertium/svn/nursery/apertium-tat-rus/
cd apertium-tat-rus
./autogen.sh --with-lang1=../../languages/apertium-tat/ --with-lang2=../../languages/apertium-rus/
make

You can test the translator now:

echo "Мин китап укыйм." | apertium -d . tat-rus
Я читаю книгу.

cd ../..
mkdir trunk
cd trunk
svn co http://svn.code.sf.net/p/apertium/svn/trunk/apertium-eval-translator/
cd ..
cd nursery/apertium-tat-rus
./qa

./qa runs the whole regression test suite. It requires the apertium-eval-translator.pl script from trunk and assumes that your directory structure follows that of the Apertium repository:

                        ..
         /                   |                 \
    languages             nursery             trunk
    /        \                \                  |
apertium-tat  apertium-rus  apertium-tat-rus  apertium-eval-translator
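
If you want to run the evaluation script by hand, outside of ./qa, a minimal invocation looks like this (the two file names here are placeholders; -test is the raw translator output, -ref the post-edited reference):

perl ../../trunk/apertium-eval-translator/apertium-eval-translator.pl \
    -test translated.txt -ref reference.txt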

Workplan (GSoC 2014)

This was a workplan for development efforts for the Tatar to Russian translator in Google Summer of Code 2014.

Major goals

  • Clean testvoc
  • 10000 top stems in bidix and at least 80% trimmed coverage
  • Constraint grammar of Tatar containing at least 1000 rules, disambiguating 90-95% of all words, with at least 95% of them retaining the correct analysis (see the sketch after this list)
  • Average WER on unseen texts below 50%
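
To give a flavour of what such rules look like, here is a minimal, hypothetical vislcg3 rule; the set names and the context are purely illustrative and not taken from apertium-tat:

LIST Noun = n ;
LIST FiniteVerb = v ;
# Prefer the noun reading of an ambiguous word when a finite verb follows.
SELECT Noun IF (1 FiniteVerb) ;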

Overview

Weeks 1-6: get categor(y/ies) testvoc clean with one word, then add more stems to the categor(y/ies) while preserving a clean testvoc.
Weeks 7-12: disambiguation, lexical selection, and transfer rules for pending wiki tests (focus on phrases and clauses, not single words).
Saturdays: adding stems & cleaning testvoc.

Weekly schedule


Each week lists its target (testvoc category and type; stems in bidix), what was achieved (testvoc clean?; stems), evaluation tasks, and notes.

Week 1 (19/05—25/05)
  Target: Nouns, lite
  Achieved: 236 stems

Week 2 (26/05—01/06)
  Target: Adjectives, lite; all nouns from tat.lexc
  Achieved: 2141 stems
  Notes: Not all nouns were added; around 500 are pending. They require adding stems to rus.dix as well.

Week 3 (02/06—08/06)
  Target: Verbs, lite; 4106 stems
  Achieved: 95.27% clean, 3258 stems
  Notes: Adjectives and verbs added. Some more are pending in apertium-rus/dev/to_[add/check].txt.

Week 4 (09/06—15/06)
  Target: Adverbs, full; 6071 stems
  Achieved: 5331 stems
  Notes: Adverbs are clean when I comment out everything except adverbs in the Root lexicon; 96% if I don't. Probably something else gets translated with adverb(s).

Week 5 (16/06—22/06)
  Target: Numerals, full; 8036 stems
  Achieved: 5488 stems

Week 6 (23/06—29/06)
  Target: Pronouns, full; 10000 stems
  Achieved: testvoc not clean, 5529 stems
  Evaluation: 1. 500 words x 2. 2. Try out the assimilation evaluation toolkit if it's usable by that time.
  Notes: Midterm evaluation. Results when unknown-word marks (stars) are not removed:
    tat-rus/texts/text1.* (full coverage): WER 66.73%, PER 56.48%
    tat-rus/texts/text2.* (not fully covered): WER 78.42%, PER 63.58%

Week 7 (30/06—06/07)
  Target: Manually disambiguate a Tatar corpus (in a way that makes it usable in the cg3ide later)
  Notes: See apertium-tat/texts/corpus.ana.txt.

Week 8 (07/07—13/07)
  Target: Corpus testvoc clean on all of the available corpora
  Notes: The difference between the error-free coverage of the analyser and the error-free coverage of the full translator is between 1.6% and 4% (see stats above).

Week 9 (14/07—20/07)
Week 10 (21/07—27/07)
Week 11 (28/07—03/08)

Week 12 (04/08—10/08)
  Target: Write a constraint grammar for Tatar

Week 13 (11/08—18/08)
  Target: All categories, full; 10000 stems
  Evaluation: Gisting
  Notes: Final evaluation
  • Testvoc-lite (apertium-tat-rus/testvoc/lite$ ./testvoc.sh) for a category means taking one word per sub-category and making the full paradigm of that word pass through the translator without debug symbols. E.g. "testvoc-lite for nouns is clean" means that if we leave only one word in each of the N1, N-COMPOUND-PX etc. lexicons, they will pass through the translator without [@#] errors. (A rough sketch of the idea follows this list.)
  • Until midterm, the bilingual dictionary and, if necessary, the Russian transducer are expanded with translations of stems taken from apertium-tat.tat.lexc.
  • Evaluation (except for the gisting evaluation) consists of taking words and measuring the post-edition word error rate (WER). The output for those words should be clean.
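
For illustration, the core of such a check can be sketched in a few lines of shell; the word-list file name is hypothetical, and the real logic lives in the testvoc scripts mentioned above:

# Run every surface form of a paradigm through the translator and
# flag debug symbols; any @ or # in the output is a testvoc error.
cat noun-paradigm-forms.txt |   # one surface form per line (hypothetical file)
  apertium -d . tat-rus |       # translate with the full pipeline
  grep -E '[@#]'                # matches indicate dictionary/transfer gaps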

See also