Difference between revisions of "Crimean Tatar and Turkish/Work plan"

From Apertium
Line 6: Line 6:
** [[WER]] < 25% in both directions.


== Weekly Schedule ==

{|class=wikitable
|-
!rowspan="2"| Week !!rowspan="2"| Dates !!colspan="3"| Target !! !!colspan="3"| Achieved !!rowspan="2"| Evaluation
|-
! crh-tur cov. !! tur-crh cov. !! testvoc !! !! crh-tur cov. !! tur-crh cov. !! testvoc
|-
| 1 || 07/06&mdash;11/06
| 65% || 65% || pronouns || || || || ||
|-
|-
| 12 || 21/08&mdash;27/08
| 90% || 90% || all categories || || || || ||

|}

=== Coverage ===

To measure the bidix-trimmed coverage, use <code>apertium-crh-tur/testvoc/corpus/trimmed-coverage.sh</code>:

<pre>
apertium-crh-tur$ bzcat ~/src/turkiccorpora/crh.wpdump.20151123.txt.bz2 | \
bash testvoc/corpus/trimmed-coverage.sh | less

Number of tokenised words in the corpus: 148013
Number of tokenised words unknown to analyser: 63730 — 43.1 % of tokens had *
unknown to bidix: 112 — 0.1 % of tokens had @
w/transfer errors or unknown to generator: 2473 — 1.7 % of tokens had #

Error-free coverage of analyser only: 84283 — 56.9 % of tokens had no *
Error-free coverage of analyser and bidix: 84171 — 56.9 % of tokens had no */@
Error-free coverage of the full translator: 81698 — 55.2 % of tokens had no */@/#

Top unknown words in the corpus:
972 ^*Ukrainanıñ$
939 ^*vilâyetinde$
631 ^*şeklinde$
607 ^*qasaba$
508 ^*merkezi$
434 ^*rayonınıñ$
329 ^*da$
283 ^*de$
235 ^*adı$
221 ^*vilâyeti$

Tokens needed to get 65.0 % bidix-trimmed coverage (no */@/#): 12037
Storing corresponding wordlist in /tmp/corpus-stat-all-needed.txt
^Baş<n><nom>$ Baş
^*Saife$ *Saife

...
</pre>

=== Testvoc ===

Requirements for testvoc in week 1:

# all pronouns from Wikipedia are translated without debug symbols
# all pronouns transducers generate must pass without debug symbols (this is less important, and only to focus on if done with 1)

To achieve 1:

* analyse corpora with crh-morph/tur-morph mode
* grep pronouns
* make sure they pass through the rest of the pipeline without getting @ or #

To achieve 2:

* in 'Root' lexicon of the .lexc files, comment out everything except Pronouns
* generate pronouns with <code>hfst-fst2string crh/tur.automorf.hfst</code>
* make sure they pass through the rest of the pipeline without getting @ or #

We don't want to spend too much time on forms which are probably over-generated by the transducers. This is the reason why we focus on 1 first.

----

This plan below might change.


{|class=wikitable
Line 218: Line 138:
|-
|}

=== Coverage ===

To measure the bidix-trimmed coverage, use <code>apertium-crh-tur/testvoc/corpus/trimmed-coverage.sh</code>:

<pre>
apertium-crh-tur$ bzcat ~/src/turkiccorpora/crh.wpdump.20151123.txt.bz2 | \
bash testvoc/corpus/trimmed-coverage.sh | less

Number of tokenised words in the corpus: 148013
Number of tokenised words unknown to analyser: 63730 — 43.1 % of tokens had *
unknown to bidix: 112 — 0.1 % of tokens had @
w/transfer errors or unknown to generator: 2473 — 1.7 % of tokens had #

Error-free coverage of analyser only: 84283 — 56.9 % of tokens had no *
Error-free coverage of analyser and bidix: 84171 — 56.9 % of tokens had no */@
Error-free coverage of the full translator: 81698 — 55.2 % of tokens had no */@/#

Top unknown words in the corpus:
972 ^*Ukrainanıñ$
939 ^*vilâyetinde$
631 ^*şeklinde$
607 ^*qasaba$
508 ^*merkezi$
434 ^*rayonınıñ$
329 ^*da$
283 ^*de$
235 ^*adı$
221 ^*vilâyeti$

Tokens needed to get 65.0 % bidix-trimmed coverage (no */@/#): 12037
Storing corresponding wordlist in /tmp/corpus-stat-all-needed.txt
^Baş<n><nom>$ Baş
^*Saife$ *Saife

...
</pre>

=== Testvoc ===

Requirements for testvoc in week 1:

# all pronouns from Crimean Tatar corpora are translated without debug symbols
# all pronouns that the transducer generates must pass without debug symbols (lower priority; work on this only once 1 is done)

To achieve 1:

* analyse corpora with crh-morph mode
* grep pronouns
* make sure they pass through the rest of the pipeline without getting @ or #

To achieve 2:

* in 'Root' lexicon of the .lexc file, comment out everything except Pronouns
* generate pronouns with <code>hfst-fst2strings crh.automorf.hfst</code>
* make sure they pass through the rest of the pipeline without getting @ or #

We don't want to spend too much time on forms that the transducer may be over-generating, which is why we focus on 1 first.


[[Category:Crimean Tatar and Turkish|Work plan]]

Revision as of 17:08, 7 June 2017

What selimcan expects:

  • a bidirectional Crimean Tatar-Turkish translator for translating Wikipedia articles, with:


Week Dates Coverage Achieved Evaluation
3 22nd May — 28th May 40% 43.9%
* Add all non-inflecting words
* Finish challenge text (no *,#)
* Do baseline evaluation (WER)
Official start
4 29th May — 4th June 40%
* Break
5 5th June — 11th June 65%
* ?
6 12th June — 18th June 70%
* ?
* ?
7 19th June — 25th June 80%
Phase 1 evaluation
Deliverable: All closed classes + numerals testvoc clean
8 26th June — 2nd July 84%
* ?
* ?
9 3rd July — 9th July 82%
* ?
10 10th July — 16th July 84%
* ?
* ?
11 17th July — 23rd July 86%
Phase 2 evaluation
Deliverable: Nouns, adjectives testvoc clean
* ?
12 24th July — 30th July 88%
* ?
13 1st August — 6th August 89%
* ?
14 7th August — 13th August 90%
* ?
15 14th August — 20th August 91%
* ?
16 21st August — 27th August 92%
Final evaluation
Final deliverable: Full MT system, testvoc clean.
* Evaluation
* Write paper
17 28th August — 3rd September
* Write paper
18 4th September — 6th September
* Write paper

Coverage

To measure the bidix-trimmed coverage, use apertium-crh-tur/testvoc/corpus/trimmed-coverage.sh:

apertium-crh-tur$ bzcat ~/src/turkiccorpora/crh.wpdump.20151123.txt.bz2 | \
                  bash testvoc/corpus/trimmed-coverage.sh | less

Number of tokenised words in the corpus:         148013
Number of tokenised words unknown to analyser:    63730  —  43.1 % of tokens had *
                          unknown to bidix:         112  —   0.1 % of tokens had @
     w/transfer errors or unknown to generator:    2473  —   1.7 % of tokens had #

Error-free coverage of analyser only:             84283  —  56.9 % of tokens had no *
Error-free coverage of analyser and bidix:        84171  —  56.9 % of tokens had no */@
Error-free coverage of the full translator:       81698  —  55.2 % of tokens had no */@/#

Top unknown words in the corpus:
    972 ^*Ukrainanıñ$
    939 ^*vilâyetinde$
    631 ^*şeklinde$
    607 ^*qasaba$
    508 ^*merkezi$
    434 ^*rayonınıñ$
    329 ^*da$
    283 ^*de$
    235 ^*adı$
    221 ^*vilâyeti$

Tokens needed to get 65.0 % bidix-trimmed coverage (no */@/#): 12037
Storing corresponding wordlist in /tmp/corpus-stat-all-needed.txt
        
        
^Baş<n><nom>$   Baş
^*Saife$        *Saife

...
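
The percentages in the report can be cross-checked from the raw counts. The following is a minimal sketch and not part of trimmed-coverage.sh; the helper name pct is made up for illustration. The last command only shows that the reported 12037 is consistent with taking 65 % of the corpus size and subtracting the analyser-and-bidix count (84171); it does not claim that the script computes the figure this way.

```shell
# Hypothetical helper (not part of trimmed-coverage.sh):
# print PART as a percentage of TOTAL, rounded to one decimal place.
pct() {
    LC_ALL=C awk -v part="$1" -v total="$2" \
        'BEGIN { printf "%.1f\n", 100 * part / total }'
}

total=148013                # tokenised words in the corpus
pct 63730 "$total"          # unknown to analyser (*)
pct 112 "$total"            # unknown to bidix (@)
pct 2473 "$total"           # transfer errors or unknown to generator (#)
pct 81698 "$total"          # error-free through the full translator

# Tokens still missing for 65 % coverage, counted from the
# analyser-and-bidix figure (an assumption that matches the report).
LC_ALL=C awk -v total="$total" -v ok=84171 \
    'BEGIN { printf "%d\n", 0.65 * total - ok }'
```

Run as written, this prints 43.1, 0.1, 1.7, 55.2 and 12037, which agree with the figures in the report above.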

Testvoc

Requirements for testvoc in week 1:

  1. all pronouns from Crimean Tatar corpora are translated without debug symbols
  2. all pronouns that the transducer generates must pass without debug symbols (lower priority; work on this only once 1 is done)

To achieve 1:

  • analyse corpora with crh-morph mode
  • grep pronouns
  • make sure they pass through the rest of the pipeline without getting @ or #
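
The three steps above could be sketched as two small filters. This is only an illustration under stated assumptions: the function names are hypothetical, the analyser output is assumed to be in Apertium stream format (^surface/analysis$), and pronoun readings are assumed to carry the <prn> tag. The mode names in the usage comment follow the text above and the pair name; check them against the modes actually installed.

```shell
# Hypothetical helper: read stream-format analyses on stdin and print
# the unique surface forms that have at least one <prn> reading.
pronoun_surface_forms() {
    grep -o '\^[^$]*\$' |                  # one lexical unit per line
        grep '<prn>' |                     # keep units with a pronoun reading
        sed 's/^\^\([^/]*\)\/.*$/\1/' |    # keep the surface form only
        LC_ALL=C sort -u
}

# Hypothetical helper: print the lines of translator output that
# still contain an @ or # debug symbol.
debug_marked() {
    grep '[@#]'
}

# Assumed usage (corpus.txt.bz2 is a placeholder file name):
#   bzcat corpus.txt.bz2 | apertium -d . crh-morph | pronoun_surface_forms \
#       | apertium -d . crh-tur | debug_marked
```

If debug_marked prints nothing, every pronoun found in the corpus went through the pipeline without @ or #. Grepping for <prn> on whole lexical units also keeps ambiguous forms where only one of several readings is a pronoun, which errs on the side of testing more forms.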

To achieve 2:

  • in 'Root' lexicon of the .lexc file, comment out everything except Pronouns
  • generate pronouns with hfst-fst2strings crh.automorf.hfst
  • make sure they pass through the rest of the pipeline without getting @ or #

We don't want to spend too much time on forms that the transducer may be over-generating, which is why we focus on 1 first.