Crimean Tatar and Turkish/Work plan
What selimcan expects:
- Bidix-trimmed coverage of 90% on average.
- Corpus testvoc clean on all Crimean Tatar corpora we have.
- Tests in Pending tests pass and are therefore moved to Regression tests.
Latest revision as of 18:27, 19 June 2017
| Week | Dates | Coverage | Achieved | Evaluation |
|---|---|---|---|---|
| 3 | 22nd May — 28th May | 40% | 43.9% | ✔ |
| * Add all non-inflecting words | ||||
| * Finish challenge text (no *,#) | ||||
| * Do baseline evaluation (WER) | ||||
| Official start | ||||
| 4 | 29th May — 4th June | 40% | | ✔ |
| * Break | ||||
| 5 | 5th June — 11th June | 65% | | ✔ |
| * ? | ||||
| 6 | 12th June — 18th June | 75% | | ✔ |
| * ? | ||||
| * ? | ||||
| 7 | 19th June — 25th June | 80% | ||
| Phase 1 evaluation | ||||
| Deliverable: All closed classes + numerals testvoc clean | ||||
| 8 | 26th June — 2nd July | 84% | ||
| * ? | ||||
| * ? | ||||
| 9 | 3rd July — 9th July | 84% | ||
| * ? | ||||
| 10 | 10th July — 16th July | 84% | ||
| * ? | ||||
| * ? | ||||
| 11 | 17th July — 23rd July | 86% | ||
| Phase 2 evaluation | ||||
| Deliverable: Nouns, adjectives testvoc clean | ||||
| * ? | ||||
| 12 | 24th July — 30th July | 88% | ||
| * ? | ||||
| 13 | 1st August — 6th August | 89% | ||
| * ? | ||||
| 14 | 7th August — 13th August | 90% | ||
| * ? | ||||
| 15 | 14th August — 20th August | 91% | ||
| * ? | ||||
| 16 | 21st August — 27th August | 92% | ||
| Final evaluation | ||||
| Final deliverable: Full MT system, testvoc clean. | ||||
| * Evaluation | ||||
| * Write paper | ||||
| 17 | 28th August — 3rd September | |||
| * Write paper | ||||
| 18 | 4th September — 6th September | |||
| * Write paper | ||||
Coverage
To measure the bidix-trimmed coverage, use apertium-crh-tur/testvoc/corpus/trimmed-coverage.sh:
apertium-crh-tur$ bzcat ~/src/turkiccorpora/crh.wpdump.20151123.txt.bz2 | \
bash testvoc/corpus/trimmed-coverage.sh | less
Number of tokenised words in the corpus: 148013
Number of tokenised words unknown to analyser: 63730 — 43.1 % of tokens had *
unknown to bidix: 112 — 0.1 % of tokens had @
w/transfer errors or unknown to generator: 2473 — 1.7 % of tokens had #
Error-free coverage of analyser only: 84283 — 56.9 % of tokens had no *
Error-free coverage of analyser and bidix: 84171 — 56.9 % of tokens had no */@
Error-free coverage of the full translator: 81698 — 55.2 % of tokens had no */@/#
Top unknown words in the corpus:
972 ^*Ukrainanıñ$
939 ^*vilâyetinde$
631 ^*şeklinde$
607 ^*qasaba$
508 ^*merkezi$
434 ^*rayonınıñ$
329 ^*da$
283 ^*de$
235 ^*adı$
221 ^*vilâyeti$
Tokens needed to get 65.0 % bidix-trimmed coverage (no */@/#): 12037
Storing corresponding wordlist in /tmp/corpus-stat-all-needed.txt
^Baş<n><nom>$ Baş
^*Saife$ *Saife
...
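The three error-free coverage figures in this report follow from the marker counts by simple subtraction. A minimal shell sketch of that arithmetic, using the numbers from the run above; the `pct` helper is a hypothetical name introduced here for illustration and is not taken from `trimmed-coverage.sh`, whose internals may differ:

```shell
# Hypothetical helper (not part of trimmed-coverage.sh):
# print a count together with its share of the total, to one decimal place.
pct() {
  awk -v n="$1" -v total="$2" 'BEGIN { printf "%d (%.1f %%)\n", n, 100 * n / total }'
}

total=148013    # tokenised words in the corpus
unknown=63730   # tokens marked * (unknown to the analyser)
no_bidix=112    # tokens marked @ (unknown to the bidix)
gen_err=2473    # tokens marked # (transfer error or unknown to the generator)

# Each stage can only lose tokens that survived the previous one.
analyser_ok=$((total - unknown))
bidix_ok=$((analyser_ok - no_bidix))
full_ok=$((bidix_ok - gen_err))

echo "analyser only:      $(pct "$analyser_ok" "$total")"   # 84283 (56.9 %)
echo "analyser and bidix: $(pct "$bidix_ok" "$total")"      # 84171 (56.9 %)
echo "full translator:    $(pct "$full_ok" "$total")"       # 81698 (55.2 %)
```

The last figure is the bidix-trimmed coverage that the weekly targets in the table refer to.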
Testvoc
Requirements for testvoc in week 1:
1. all pronouns from Crimean Tatar corpora are translated without debug symbols
2. all pronouns the transducer generates pass without debug symbols (lower priority; work on this only once requirement 1 is done)
To achieve 1:
- analyse corpora with crh-morph mode
- grep pronouns
- make sure they pass through the rest of the pipeline without getting @ or #
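The steps for requirement 1 can be sketched as two small shell filters. This is an illustration under assumptions, not the project's actual tooling: the function names `grep_pronouns` and `debug_marked` are hypothetical, the sample strings are made up, and the pronoun tag is assumed to be `<prn>`. In practice the first filter would read the output of the crh-morph mode (for example via something like `apertium -d . crh-morph`, an assumed invocation) and the second would read the output of the remaining pipeline stages.

```shell
# Keep only lexical units (^...$) whose analysis contains the <prn> tag.
# Reads analyser output in Apertium stream format on stdin.
grep_pronouns() {
  grep -o '\^[^$]*\$' | grep '<prn>'
}

# Print every whitespace-separated token of translated text that starts
# with @ or # (optionally preceded by ^), i.e. every remaining debug symbol.
debug_marked() {
  tr -s '[:space:]' '\n' | grep -E '^\^?[@#]'
}

# Illustrative sample data only; tag names are assumptions.
analysed='^men/men<prn><pers><p1><sg><nom>$ ^kitap/kitap<n><nom>$'
printf '%s\n' "$analysed" | grep_pronouns

translated='ben @men #kitap'
printf '%s\n' "$translated" | debug_marked
```

An empty result from `debug_marked` on the translated pronouns would mean that requirement 1 holds for that corpus.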
To achieve 2:
- in the 'Root' lexicon of the .lexc file, comment out everything except Pronouns
- generate pronouns with hfst-fst2strings crh.automorf.hfst
- make sure they pass through the rest of the pipeline without getting @ or #
We don't want to spend too much time on forms that the transducer may over-generate, which is why requirement 1 comes first.