Difference between revisions of "Crimean Tatar and Turkish/Work plan"

From Apertium

Revision as of 01:37, 7 June 2017

What selimcan expects:

  • a bidirectional Crimean Tatar-Turkish translator for translating Wikipedia articles, with:
      • Wikipedia-corpus-testvoc and single-stem-per-lexicon-testvoc clean in both directions,
      • WER < 25% in both directions.

Weekly Schedule

Week  Dates        Target                                Achieved                              Evaluation
                   crh-tur cov.  tur-crh cov.  testvoc   crh-tur cov.  tur-crh cov.  testvoc
1     07/06—11/06  65%           65%           pronouns
12    21/08—27/08  90%           90%           all categories

Coverage

To measure the bidix-trimmed coverage, use apertium-crh-tur/testvoc/corpus/trimmed-coverage.sh:

apertium-crh-tur$ bzcat ~/src/turkiccorpora/crh.wpdump.20151123.txt.bz2 | \
                  bash testvoc/corpus/trimmed-coverage.sh | less

Number of tokenised words in the corpus:         148013
Number of tokenised words unknown to analyser:    63730  —  43.1 % of tokens had *
                          unknown to bidix:         112  —   0.1 % of tokens had @
     w/transfer errors or unknown to generator:    2473  —   1.7 % of tokens had #

Error-free coverage of analyser only:             84283  —  56.9 % of tokens had no *
Error-free coverage of analyser and bidix:        84171  —  56.9 % of tokens had no */@
Error-free coverage of the full translator:       81698  —  55.2 % of tokens had no */@/#

Top unknown words in the corpus:
    972 ^*Ukrainanıñ$
    939 ^*vilâyetinde$
    631 ^*şeklinde$
    607 ^*qasaba$
    508 ^*merkezi$
    434 ^*rayonınıñ$
    329 ^*da$
    283 ^*de$
    235 ^*adı$
    221 ^*vilâyeti$

Tokens needed to get 65.0 % bidix-trimmed coverage (no */@/#): 12037
Storing corresponding wordlist in /tmp/corpus-stat-all-needed.txt
        
        
^Baş<n><nom>$   Baş
^*Saife$        *Saife

...
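
The percentages in the report follow directly from the token counts: for example, the full-translator figure is 100 × 81698 / 148013 ≈ 55.2 %, where 81698 = 148013 − 63730 − 112 − 2473. The following is a minimal sketch of that bookkeeping, not the actual contents of trimmed-coverage.sh. It assumes one token per line in the ^...$ form shown above, with the debug symbol directly after the caret; the function name coverage_summary is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of apertium-crh-tur): count tokens by the
# debug symbol that follows the opening caret and report clean coverage.
# Assumes one "^...$" token per line on standard input.
coverage_summary() {
    awk '
        /^\^\*/ { star++ }   # unknown to the analyser
        /^\^@/  { at++ }     # unknown to the bidix
        /^\^#/  { hash++ }   # transfer error or unknown to the generator
        { total++ }
        END {
            if (total == 0) { print "no tokens"; exit }
            clean = total - star - at - hash
            printf "total=%d *=%d @=%d #=%d clean=%d (%.1f %%)\n",
                   total, star, at, hash, clean, 100 * clean / total
        }'
}
```

Fed the two sample tokens from the wordlist above (^Baş<n><nom>$ and ^*Saife$), it would report one clean token out of two.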

Testvoc

Requirements for testvoc in week 1:

  1. all pronouns from Wikipedia are translated without debug symbols
  2. all pronouns that the transducers generate must pass without debug symbols (this is less important; focus on it only once 1 is done)

To achieve 1:

  • analyse corpora with crh-morph/tur-morph mode
  • grep pronouns
  • make sure they pass through the rest of the pipeline without getting @ or #
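
The three steps above could be sketched as two small filters. This is an illustration under assumptions, not the project's own script: it assumes the morph mode prints lexical units as ^surface/analysis$, that pronoun readings carry the <prn> tag, and the function names are hypothetical. As a simplification, it re-translates the surface forms with the full crh-tur mode instead of resuming the pipeline after analysis.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the surface forms that have at least one
# pronoun reading in morphological analysis output read from stdin.
pronoun_surface_forms() {
    grep -o '\^[^$]*\$' |       # one lexical unit per line
        grep '<prn>' |          # keep units with a pronoun reading (assumed tag)
        cut -d/ -f1 |           # keep only the surface form
        tr -d '^' |             # drop the opening caret
        LC_ALL=C sort -u
}

# Hypothetical helper: print the translated lines that still contain @ or #.
debug_symbol_failures() {
    grep '[@#]' || true
}

# Possible use from the apertium-crh-tur directory (not run here):
#   bzcat ~/src/turkiccorpora/crh.wpdump.20151123.txt.bz2 |
#       apertium -d . crh-morph | pronoun_surface_forms |
#       apertium -d . crh-tur | debug_symbol_failures
```

An empty result from the last filter would mean that every pronoun found in the corpus translated cleanly.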

To achieve 2:

  • in the 'Root' lexicon of the .lexc files, comment out everything except Pronouns
  • generate pronouns with hfst-fst2strings from crh.automorf.hfst or tur.automorf.hfst
  • make sure they pass through the rest of the pipeline without getting @ or #
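
For the second approach, the generated strings have to be turned into lexical units before they can go through the rest of the pipeline. A minimal sketch, assuming that hfst-fst2strings prints one surface:analysis pair per line and that surface forms contain no colon; the function name to_lexical_units is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn "surface:analysis" pairs from stdin into
# "^analysis$" lexical units. Lines without a colon are skipped.
# Assumes the first colon separates the surface form from the analysis.
to_lexical_units() {
    awk -F: 'NF >= 2 { sub(/^[^:]*:/, ""); print "^" $0 "$" }'
}

# Possible use after rebuilding with only Pronouns left in Root (not run here):
#   hfst-fst2strings crh.automorf.hfst | to_lexical_units
# A cyclic transducer may need the tool's options for limiting cycles or
# the number of strings printed.
```

The resulting units can then be fed to the stages after analysis, and the output checked for @ and #.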

We don't want to spend too much time on forms that the transducers probably over-generate, which is why we focus on 1 first.


The plan below might change.

Week Dates Coverage Achieved Evaluation
3 22nd May — 28th May 40% 43.9%
* Add all non-inflecting words
* Finish challenge text (no *,#)
* Do baseline evaluation (WER)
Official start
4 29th May — 4th June 40%
* Break
5 5th June — 11th June 65%
* ?
6 12th June — 18th June 70%
* ?
* ?
7 19th June — 25th June 80%
Phase 1 evaluation
Deliverable: All closed classes + numerals testvoc clean
8 26th June — 2nd July 84%
* ?
* ?
9 3rd July — 9th July 82%
* ?
10 10th July — 16th July 84%
* ?
* ?
11 17th July — 23rd July 86%
Phase 2 evaluation
Deliverable: Nouns, adjectives testvoc clean
* ?
12 24th July — 30th July 88%
* ?
13 1st August — 6th August 89%
* ?
14 7th August — 13th August 90%
* ?
15 14th August — 20th August 91%
* ?
16 21st August — 27th August 92%
Final evaluation
Final deliverable: Full MT system, testvoc clean.
* Evaluation
* Write paper
17 28th August — 3rd September
* Write paper
18 4th September — 6th September
* Write paper