Difference between revisions of "Crimean Tatar and Turkish/Work plan"
(5 intermediate revisions by one other user not shown) | |||
Line 1: | Line 1: | ||
− | What [[User:Ilnar.salimzyan|selimcan]] expects: |
||
− | |||
− | * '''a bidirectional Crimean Tatar-Turkish translator for translating Wikipedia articles''', with: |
||
− | ** >90% [[Calculating coverage|bidix-trimmed coverage]] on both Wikipedias, |
||
− | ** [[Testvoc#Corpus testvoc|Wikipedia-corpus-testvoc]] and single-stem-per-lexicon-testvoc clean in both directions, |
||
− | ** [[WER]] < 25% in both directions. |
||
− | |||
− | {|class=wikitable |
||
− | |- |
||
− | !rowspan="2"| Week !!rowspan="2"| Dates !!colspan="3"| Target !! !!colspan="3"| Achieved !!rowspan="2"| Evaluation |
||
− | |- |
||
− | ! crh-tur cov. !! tur-crh cov. !! testvoc !! !! crh-tur cov. !! tur-crh cov. !! testvoc |
||
− | |- |
||
− | | 1 || 07/06—11/06 |
||
− | | 65% || 65% || pronouns || || || || || |
||
− | |- |
||
− | |- |
||
− | | 12 || 21/08—27/08 |
||
− | | 90% || 90% || all categories || || || || || |
||
− | |||
− | |} |
||
− | |||
− | Requirements for testvoc in week 1: |
||
− | |||
− | # all pronouns from Wikipedia corpora are translated without debug symbols |
||
− | # all pronouns transducers generate must pass without debug symbols (this is less important, and only to focus on if done with 1) |
||
− | |||
− | To achieve 1: |
||
− | |||
− | * analyse corpora with crh-morph/tur-morph mode |
||
− | * grep pronouns |
||
− | * make sure they pass through the rest of the pipeline without getting @ or # |
||
− | |||
− | To achieve 2: |
||
− | |||
− | * in 'Root' lexicon of the .lexc files, comment out everything except Pronouns |
||
− | * generate pronouns with <code>hfst-fst2string crh/tur.automorf.hfst</code> |
||
− | * make sure they pass through the rest of the pipeline without getting @ or # |
||
− | |||
− | We don't want to spend too much time on forms which are probably over-generated by the transducers. This is the reason why we focus on 1 first. |
||
− | |||
− | ---- |
||
− | |||
− | This plan below might change later. |
||
− | |||
{|class=wikitable |
{|class=wikitable |
||
! Week !! Dates !! Coverage !! Achieved !! Evaluation |
! Week !! Dates !! Coverage !! Achieved !! Evaluation |
||
|- |
|- |
||
− | | 3 ||22nd May — 28th May || 40% || 43.9% || |
+ | | 3 ||22nd May — 28th May || 40% || 43.9% || '''✔''' |
|- |
|- |
||
Line 60: | Line 15: | ||
|- |
|- |
||
− | | 4 ||29th May — 4th June || 40% || || |
+ | | 4 ||29th May — 4th June || 40% || || '''✔''' |
|- |
|- |
||
Line 66: | Line 21: | ||
|- |
|- |
||
− | | 5 ||5th June — 11th June || 65% || || |
+ | | 5 ||5th June — 11th June || 65% || || '''✔''' |
|- |
|- |
||
Line 72: | Line 27: | ||
|- |
|- |
||
− | | 6 ||12th June — 18th June || |
+ | | 6 ||12th June — 18th June || 75% || || '''✔''' |
|- |
|- |
||
Line 96: | Line 51: | ||
|- |
|- |
||
− | | 9 ||3rd July — 9th July || |
+ | | 9 ||3rd July — 9th July || 84% || || |
|- |
|- |
||
Line 175: | Line 130: | ||
|- |
|- |
||
|} |
|} |
||
+ | |||
+ | === Coverage === |
||
+ | |||
+ | To measure the bidix-trimmed coverage, use <code>apertium-crh-tur/testvoc/corpus/trimmed-coverage.sh</code>: |
||
+ | |||
+ | <pre> |
||
+ | apertium-crh-tur$ bzcat ~/src/turkiccorpora/crh.wpdump.20151123.txt.bz2 | \ |
||
+ | bash testvoc/corpus/trimmed-coverage.sh | less |
||
+ | |||
+ | Number of tokenised words in the corpus: 148013 |
||
+ | Number of tokenised words unknown to analyser: 63730 — 43.1 % of tokens had * |
||
+ | unknown to bidix: 112 — 0.1 % of tokens had @ |
||
+ | w/transfer errors or unknown to generator: 2473 — 1.7 % of tokens had # |
||
+ | |||
+ | Error-free coverage of analyser only: 84283 — 56.9 % of tokens had no * |
||
+ | Error-free coverage of analyser and bidix: 84171 — 56.9 % of tokens had no */@ |
||
+ | Error-free coverage of the full translator: 81698 — 55.2 % of tokens had no */@/# |
||
+ | |||
+ | Top unknown words in the corpus: |
||
+ | 972 ^*Ukrainanıñ$ |
||
+ | 939 ^*vilâyetinde$ |
||
+ | 631 ^*şeklinde$ |
||
+ | 607 ^*qasaba$ |
||
+ | 508 ^*merkezi$ |
||
+ | 434 ^*rayonınıñ$ |
||
+ | 329 ^*da$ |
||
+ | 283 ^*de$ |
||
+ | 235 ^*adı$ |
||
+ | 221 ^*vilâyeti$ |
||
+ | |||
+ | Tokens needed to get 65.0 % bidix-trimmed coverage (no */@/#): 12037 |
||
+ | Storing corresponding wordlist in /tmp/corpus-stat-all-needed.txt |
||
+ | |||
+ | |||
+ | ^Baş<n><nom>$ Baş |
||
+ | ^*Saife$ *Saife |
||
+ | |||
+ | ... |
||
+ | </pre> |
||
+ | |||
+ | === Testvoc === |
||
+ | |||
+ | Requirements for testvoc in week 1: |
||
+ | |||
+ | # all pronouns from Crimean Tatar corpora are translated without debug symbols |
||
+ | # all pronouns the transducer generates must pass without debug symbols (this is less important, and only to focus on if done with 1) |
||
+ | |||
+ | To achieve 1: |
||
+ | |||
+ | * analyse corpora with crh-morph mode |
||
+ | * grep pronouns |
||
+ | * make sure they pass through the rest of the pipeline without getting @ or # |
||
+ | |||
+ | To achieve 2: |
||
+ | |||
+ | * in 'Root' lexicon of the .lexc file, comment out everything except Pronouns |
||
+ | * generate pronouns with <code>hfst-fst2string crh.automorf.hfst</code> |
||
+ | * make sure they pass through the rest of the pipeline without getting @ or # |
||
+ | |||
+ | We don't want to spend too much time on forms which might be over-generated by the transducer. This is the reason why we focus on 1 first. |
||
[[Category:Crimean Tatar and Turkish|Work plan]] |
[[Category:Crimean Tatar and Turkish|Work plan]] |
Latest revision as of 18:27, 19 June 2017
Week | Dates | Coverage | Achieved | Evaluation |
---|---|---|---|---|
3 | 22nd May – 28th May | 40% | 43.9% | ✔ |
* Add all non-inflecting words | | | | |
* Finish challenge text (no *,#) | | | | |
* Do baseline evaluation (WER) | | | | |
Official start | | | | |
4 | 29th May – 4th June | 40% | | ✔ |
* Break | | | | |
5 | 5th June – 11th June | 65% | | ✔ |
* ? | | | | |
6 | 12th June – 18th June | 75% | | ✔ |
* ? | | | | |
* ? | | | | |
7 | 19th June – 25th June | 80% | | |
Phase 1 evaluation | | | | |
Deliverable: All closed classes + numerals testvoc clean | | | | |
8 | 26th June – 2nd July | 84% | | |
* ? | | | | |
* ? | | | | |
9 | 3rd July – 9th July | 84% | | |
* ? | | | | |
10 | 10th July – 16th July | 84% | | |
* ? | | | | |
* ? | | | | |
11 | 17th July – 23rd July | 86% | | |
Phase 2 evaluation | | | | |
Deliverable: Nouns, adjectives testvoc clean | | | | |
* ? | | | | |
12 | 24th July – 30th July | 88% | | |
* ? | | | | |
13 | 1st August – 6th August | 89% | | |
* ? | | | | |
14 | 7th August – 13th August | 90% | | |
* ? | | | | |
15 | 14th August – 20th August | 91% | | |
* ? | | | | |
16 | 21st August – 27th August | 92% | | |
Final evaluation | | | | |
Final deliverable: Full MT system, testvoc clean. | | | | |
* Evaluation | | | | |
* Write paper | | | | |
17 | 28th August – 3rd September | | | |
* Write paper | | | | |
18 | 4th September – 6th September | | | |
* Write paper | | | | |
Coverage
To measure the bidix-trimmed coverage, use apertium-crh-tur/testvoc/corpus/trimmed-coverage.sh:

    apertium-crh-tur$ bzcat ~/src/turkiccorpora/crh.wpdump.20151123.txt.bz2 | \
    bash testvoc/corpus/trimmed-coverage.sh | less

    Number of tokenised words in the corpus: 148013
    Number of tokenised words unknown to analyser: 63730 — 43.1 % of tokens had *
    unknown to bidix: 112 — 0.1 % of tokens had @
    w/transfer errors or unknown to generator: 2473 — 1.7 % of tokens had #

    Error-free coverage of analyser only: 84283 — 56.9 % of tokens had no *
    Error-free coverage of analyser and bidix: 84171 — 56.9 % of tokens had no */@
    Error-free coverage of the full translator: 81698 — 55.2 % of tokens had no */@/#

    Top unknown words in the corpus:
    972 ^*Ukrainanıñ$
    939 ^*vilâyetinde$
    631 ^*şeklinde$
    607 ^*qasaba$
    508 ^*merkezi$
    434 ^*rayonınıñ$
    329 ^*da$
    283 ^*de$
    235 ^*adı$
    221 ^*vilâyeti$

    Tokens needed to get 65.0 % bidix-trimmed coverage (no */@/#): 12037
    Storing corresponding wordlist in /tmp/corpus-stat-all-needed.txt

    ^Baş<n><nom>$ Baş
    ^*Saife$ *Saife

    ...
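The three coverage figures in the sample output follow from the raw counts by plain subtraction and division. The sketch below redoes that arithmetic with the numbers copied from the output; it is not the script itself, and the variable names and the `pct` helper are made up for illustration.

```shell
#!/bin/sh
# Counts copied from the sample output of trimmed-coverage.sh.
total=148013     # tokenised words in the corpus
unknown=63730    # tokens marked * (unknown to analyser)
nobidix=112      # tokens marked @ (unknown to bidix)
generr=2473      # tokens marked # (transfer errors or unknown to generator)

# Hypothetical helper: print "count percentage" relative to the corpus size.
pct() {
    awk -v n="$1" -v t="$total" 'BEGIN { printf "%d %.1f\n", n, 100 * n / t }'
}

analyser_ok=$((total - unknown))       # no *
bidix_ok=$((analyser_ok - nobidix))    # no */@
full_ok=$((bidix_ok - generr))         # no */@/#

pct "$analyser_ok"
pct "$bidix_ok"
pct "$full_ok"

# Observation only (an assumption about the script, not confirmed by it):
# 65% of the corpus minus the */@-free tokens gives the reported 12037.
awk -v t="$total" -v ok="$bidix_ok" 'BEGIN { printf "%d\n", int(0.65 * t) - ok }'
```

The last line reproduces the "tokens needed" figure only when measured against the 84171 tokens free of * and @, even though the label in the output mentions */@/#. This is an observation about the numbers, not a statement of how the script works.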
Testvoc
Requirements for testvoc in week 1:
1. all pronouns from Crimean Tatar corpora are translated without debug symbols
2. all pronouns the transducer generates pass through without debug symbols (this is less important; work on it only once requirement 1 is met)
To achieve 1:
- analyse corpora with crh-morph mode
- grep pronouns
- make sure they pass through the rest of the pipeline without getting @ or #
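The "grep pronouns" and "check for @ or #" steps can be sketched with standard text tools. The example assumes the analyser output is in Apertium stream format and that pronoun readings carry the `<prn>` tag. The sample words, the remaining tags, the sample translator output and both function names are hypothetical, not taken from the real crh-tur data.

```shell
#!/bin/sh
# Hypothetical analyser output in Apertium stream format (^surface/analysis$).
# The words and tags are illustrative, not taken from the real crh analyser.
analysed='^men/men<prn><pers><p1><sg><nom>$ ^kitap/kitap<n><nom>$ ^o/o<prn><dem><nom>$ ^men/men<prn><pers><p1><sg><nom>$'

# Split the stream into one lexical unit per line and keep the distinct
# units that have a pronoun (<prn>) reading.
pronoun_units() {
    grep -o '\^[^$]*\$' | grep '<prn>' | sort -u
}

# Count output tokens that start with a debug symbol (@ or #).
count_flagged() {
    tr ' ' '\n' | grep -c '^[@#]'
}

printf '%s\n' "$analysed" | pronoun_units

# Hypothetical translator output: one clean token, one @ token, one # token.
translated='ben @sen #o'
printf '%s\n' "$translated" | count_flagged
```

In real use, the input to `pronoun_units` would come from the crh-morph mode and the input to `count_flagged` from the rest of the pipeline. A count of 0 means the pronouns passed cleanly.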
To achieve 2:
- in 'Root' lexicon of the .lexc file, comment out everything except Pronouns
- generate pronouns with hfst-fst2string crh.automorf.hfst
- make sure they pass through the rest of the pipeline without getting @ or #
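The first step above, commenting out everything in the Root lexicon except Pronouns, could be automated along these lines. This is only a sketch: it assumes that lexc comments start with `!`, that each Root entry sits on its own line, and that the pronoun entry mentions the name Pronouns. The lexicon names in the sample and the `comment_out_root` function are hypothetical.

```shell
#!/bin/sh
# Comment out every entry of LEXICON Root except lines mentioning Pronouns.
# Reads lexc source on stdin and writes the modified source to stdout.
comment_out_root() {
    awk '
        # A LEXICON header switches "inside Root" on or off.
        $1 == "LEXICON" { in_root = ($2 == "Root"); print; next }
        # Inside Root: comment out non-empty, uncommented, non-Pronouns lines.
        in_root && NF > 0 && $1 !~ /^!/ && $0 !~ /Pronouns/ { print "! " $0; next }
        { print }
    '
}

# Hypothetical lexc fragment; the real lexicon names will differ.
cat <<'EOF' | comment_out_root
LEXICON Root
Nouns ;
Pronouns ;
Verbs ;

LEXICON Pronouns
men:men PRON-PERS ;
EOF
```

Writing the result to a separate file leaves the original .lexc untouched, so the full lexicon can be restored without having to undo the comments by hand.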
We don't want to spend too much time on forms that the transducer may be over-generating, which is why requirement 1 takes priority.