Difference between revisions of "Crimean Tatar and Turkish/Work plan"

Revision as of 17:08, 7 June 2017

What selimcan expects:

a bidirectional Crimean Tatar-Turkish translator for translating Wikipedia articles, with:
- >90% bidix-trimmed coverage on both Wikipedias,
- Wikipedia-corpus-testvoc and single-stem-per-lexicon-testvoc clean in both directions,
- WER < 25% in both directions.

Week	Dates	Target coverage	Achieved
3	22nd May — 28th May	40%	43.9%
* Add all non-inflecting words
* Finish challenge text (no *,#)
* Do baseline evaluation (WER)
Official start
4	29th May — 4th June	40%
* Break
5	5th June — 11th June	65%
* ?
6	12th June — 18th June	70%
* ?
* ?
7	19th June — 25th June	80%
Phase 1 evaluation
Deliverable: All closed classes + numerals testvoc clean
8	26th June — 2nd July	84%
* ?
* ?
9	3rd July — 9th July	82%
* ?
10	10th July — 16th July	84%
* ?
* ?
11	17th July — 23rd July	86%
Phase 2 evaluation
Deliverable: Nouns, adjectives testvoc clean
* ?
12	24th July — 30th July	88%
* ?
13	31st July — 6th August	89%
* ?
14	7th August — 13th August	90%
* ?
15	14th August — 20th August	91%
* ?
16	21st August — 27th August	92%
Final evaluation
Final deliverable: Full MT system, testvoc clean.
* Evaluation
* Write paper
17	28th August — 3rd September
* Write paper
18	4th September — 6th September
* Write paper

Coverage

To measure the bidix-trimmed coverage, use apertium-crh-tur/testvoc/corpus/trimmed-coverage.sh:

apertium-crh-tur$ bzcat ~/src/turkiccorpora/crh.wpdump.20151123.txt.bz2 | \
                  bash testvoc/corpus/trimmed-coverage.sh | less

Number of tokenised words in the corpus:         148013
Number of tokenised words unknown to analyser:    63730  —  43.1 % of tokens had *
                          unknown to bidix:         112  —   0.1 % of tokens had @
     w/transfer errors or unknown to generator:    2473  —   1.7 % of tokens had #

Error-free coverage of analyser only:             84283  —  56.9 % of tokens had no *
Error-free coverage of analyser and bidix:        84171  —  56.9 % of tokens had no */@
Error-free coverage of the full translator:       81698  —  55.2 % of tokens had no */@/#

Top unknown words in the corpus:
    972 ^*Ukrainanıñ$
    939 ^*vilâyetinde$
    631 ^*şeklinde$
    607 ^*qasaba$
    508 ^*merkezi$
    434 ^*rayonınıñ$
    329 ^*da$
    283 ^*de$
    235 ^*adı$
    221 ^*vilâyeti$

Tokens needed to get 65.0 % bidix-trimmed coverage (no */@/#): 12037
Storing corresponding wordlist in /tmp/corpus-stat-all-needed.txt
        
        
^Baş<n><nom>$   Baş
^*Saife$        *Saife

...
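
The percentages in the report follow directly from the four token counts. As a minimal sketch (the function name coverage_report is hypothetical and is not part of trimmed-coverage.sh), the arithmetic can be re-checked like this:

```shell
# Recompute the coverage figures from raw token counts.
# Arguments: total tokens, tokens with *, tokens with @, tokens with #.
# coverage_report is an illustrative helper, not part of the Apertium tools.
coverage_report() {
    awk -v total="$1" -v star="$2" -v at="$3" -v hash="$4" 'BEGIN {
        analyser = total - star    # tokens with no *
        bidix    = analyser - at   # tokens with no * or @
        full     = bidix - hash    # tokens with no *, @ or #
        printf "analyser: %d (%.1f %%)\n", analyser, 100 * analyser / total
        printf "analyser+bidix: %d (%.1f %%)\n", bidix, 100 * bidix / total
        printf "full: %d (%.1f %%)\n", full, 100 * full / total
    }'
}

# Counts taken from the report above.
coverage_report 148013 63730 112 2473
```

With the counts from the report this gives 84283 (56.9 %), 84171 (56.9 %) and 81698 (55.2 %), matching the three "Error-free coverage" lines.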

Testvoc

Requirements for testvoc in week 1:

1. all pronouns from Crimean Tatar corpora are translated without debug symbols
2. all pronouns the transducer generates pass without debug symbols (lower priority; work on this only once 1 is done)

To achieve 1:

- analyse the corpora with the crh-morph mode
- grep the output for pronouns
- make sure they pass through the rest of the pipeline without getting @ or #
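
The filtering and checking steps above can be sketched as two small shell filters. This is an illustration under assumptions, not the project's actual testvoc script: the function names are hypothetical, pronoun analyses are assumed to carry a `<prn>` tag, and debug symbols are assumed to appear at the start of a whitespace-separated token. The sample inputs are made up.

```shell
# Print the surface forms that have at least one pronoun analysis.
# Input: Apertium stream format, e.g. ^surface/analysis$ (assumes a <prn> tag).
pronoun_surface_forms() {
    grep -oE '\^[^$]*\$' |                     # one lexical unit per line
        grep -F '<prn>' |                      # keep units with a pronoun reading
        sed -E 's/^\^([^/]*)\/.*$/\1/' |       # keep only the surface form
        sort -u
}

# Print every token in translator output that starts with @ or #.
find_debug_tokens() {
    tr -s '[:space:]' '\n' | grep -E '^[@#]' || true
}

# Toy inputs, made up for illustration only.
printf '%s\n' '^men/men<prn><pers><p1><sg><nom>$ ^keldi/kel<v><iv><past><p3><sg>$' |
    pronoun_surface_forms
printf '%s\n' 'ben @sen geldi #o' | find_debug_tokens
```

In practice the first filter would read the output of the crh-morph mode, and the second would read the output of the full translation pipeline; an empty result from find_debug_tokens means no pronoun was flagged.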

To achieve 2:

- in the 'Root' lexicon of the .lexc file, comment out everything except Pronouns
- generate pronouns with hfst-fst2strings crh.automorf.hfst
- make sure they pass through the rest of the pipeline without getting @ or #
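
To feed the generated forms back into the pipeline, only the surface side of each generated pair is needed. A minimal sketch, assuming hfst-fst2strings prints one "surface:analysis" pair per line for this transducer and that surface forms contain no colon (surface_side is a hypothetical helper; the sample pairs are made up):

```shell
# Keep the surface side of "surface:analysis" pairs and drop duplicates.
# Assumes the surface form itself contains no ':' character.
surface_side() {
    cut -d: -f1 | sort -u
}

# Toy pairs, made up for illustration, in the assumed output shape.
# Real use would pipe in: hfst-fst2strings crh.automorf.hfst
printf '%s\n' \
    'men:men<prn><pers><p1><sg><nom>' \
    'sen:sen<prn><pers><p2><sg><nom>' \
    'men:men<prn><pers><p1><sg><nom>' |
    surface_side
```

If the pronoun-only transducer still contains cycles, the number of strings it can print is unbounded, so some limit on the output would be needed before this filter is useful.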

We don't want to spend too much time on forms that the transducer may merely over-generate, which is why requirement 1 comes first.
