Difference between revisions of "Crimean Tatar and Turkish/Work plan"
Revision as of 00:56, 7 June 2017
What selimcan expects:
- a bidirectional Crimean Tatar-Turkish translator for translating Wikipedia articles, with:
- >90% bidix-trimmed coverage on both Wikipedias,
- Wikipedia-corpus-testvoc and single-stem-per-lexicon-testvoc clean in both directions,
- WER < 25% in both directions.
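To make the WER target concrete, here is a minimal sketch of word error rate as word-level edit distance divided by the reference length. It assumes plain whitespace tokenisation and one reference translation; the function name `word_error_rate` is hypothetical, and the evaluation tool actually used in the project may tokenise or count differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not real translations: one substitution in four words.
print(word_error_rate("a b c d", "a x c d"))
```

Under this definition, a WER below 25% means fewer than one edit per four reference words on average.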
Week | Dates | Target crh-tur cov. | Target tur-crh cov. | Target testvoc | Achieved crh-tur cov. | Achieved tur-crh cov. | Achieved testvoc | Evaluation
---|---|---|---|---|---|---|---|---
1 | 07/06–11/06 | 65% | 65% | pronouns | | | |
12 | 21/08–27/08 | 90% | 90% | all categories | | | |
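As a rough illustration of the coverage figures in the table, the sketch below computes naive coverage from analyser output in the Apertium stream format, where an unknown word appears as `^word/*word$`. It assumes the analyser has already been trimmed to the bilingual dictionary, ignores escaped characters, and uses a hypothetical function name; the tag sequences in the sample string are invented for illustration.

```python
import re

# One lexical unit is everything between ^ and $ (escaped characters are ignored here).
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def naive_coverage(analysed: str) -> float:
    """Fraction of lexical units that received at least one analysis."""
    units = LEXICAL_UNIT.findall(analysed)
    if not units:
        raise ValueError("no lexical units found")
    # An unknown word is marked with an asterisk right after the slash.
    known = sum(1 for unit in units if "/*" not in unit)
    return known / len(units)


# Invented sample: three analysed units and one unknown unit.
sample = "^a/a<n><nom>$ ^b/b<prn><nom>$ ^c/*c$ ^d/d<v><inf>$"
print(naive_coverage(sample))
```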
Requirements for testvoc in week 1:
1. all pronouns from the Wikipedia corpora are translated without debug symbols
2. all pronoun forms that the transducers generate pass through without debug symbols (lower priority; work on this only once requirement 1 is met)
To achieve 1:
- analyse corpora with crh-morph/tur-morph mode
- grep pronouns
- make sure they pass through the rest of the pipeline without getting @ or #
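The steps above can be sketched as two small helpers: one that picks pronoun forms out of analyser output, and one that flags translations still carrying `@` or `#`. This assumes the Apertium stream format and the `<prn>` tag for pronouns; the function names, sample analyses, and sample translations are all invented for illustration, and running the actual pipeline is left out.

```python
import re

LEXICAL_UNIT = re.compile(r"\^(.*?)\$")
DEBUG_SYMBOLS = ("@", "#")


def pronoun_surface_forms(analysed: str, tag: str = "<prn>") -> list[str]:
    """Surface forms of units with at least one analysis carrying `tag`."""
    forms = []
    for unit in LEXICAL_UNIT.findall(analysed):
        surface, *analyses = unit.split("/")
        if any(tag in analysis for analysis in analyses):
            forms.append(surface)
    return forms


def failing_translations(pairs):
    """Keep (source, translation) pairs whose translation has a debug symbol."""
    return [(src, out) for src, out in pairs
            if any(symbol in out for symbol in DEBUG_SYMBOLS)]


# Invented analyser output: one pronoun and one noun.
analysed = "^men/men<prn><pers><p1><sg><nom>$ ^kitap/kitap<n><nom>$"
print(pronoun_surface_forms(analysed))

# Invented pipeline output for three pronoun forms.
print(failing_translations([("men", "ben"), ("sen", "@sen"), ("o", "#o")]))
```

An empty result from `failing_translations` over all pronoun forms in the corpus would correspond to requirement 1 being met.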
To achieve 2:
- in 'Root' lexicon of the .lexc files, comment out everything except Pronouns
- generate pronouns with hfst-fst2string crh/tur.automorf.hfst
- make sure they pass through the rest of the pipeline without getting @ or #
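The first step for requirement 2 can be done by hand, but a small script makes it repeatable. The sketch below comments out every entry of the `Root` lexicon except those mentioning `Pronouns`, using `!`, the lexc comment character. It assumes one entry per line under `LEXICON Root`; the function name, the other lexicon names, and the sample entry are hypothetical.

```python
def keep_only_in_root(lexc_text: str, keep: str = "Pronouns") -> str:
    """Comment out all Root lexicon entries that do not mention `keep`."""
    out = []
    in_root = False
    for line in lexc_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("LEXICON"):
            # Track whether the lexicon that starts here is Root.
            in_root = stripped.split()[1:2] == ["Root"]
            out.append(line)
        elif in_root and stripped and not stripped.startswith("!") and keep not in stripped:
            out.append("! " + line)
        else:
            out.append(line)
    return "\n".join(out)


# Invented lexc fragment for illustration.
source = "\n".join([
    "LEXICON Root",
    "Nouns ;",
    "Pronouns ;",
    "Verbs ;",
    "",
    "LEXICON Nouns",
    "kitap:kitap N1 ;",
])
print(keep_only_in_root(source))
```

The transducer then has to be recompiled before generating forms with hfst-fst2string, and the original .lexc file restored afterwards.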
We don't want to spend too much time on forms that the transducers probably over-generate, which is why we focus on requirement 1 first.
The plan below might change later.
Week | Dates | Coverage | Achieved | Evaluation | Tasks and milestones
---|---|---|---|---|---
3 | 22nd May – 28th May | 40% | 43.9% | | Add all non-inflecting words; finish challenge text (no *, #); do baseline evaluation (WER)
 | | | | | Official start
4 | 29th May – 4th June | 40% | | | Break
5 | 5th June – 11th June | 65% | | | ?
6 | 12th June – 18th June | 70% | | | ?; ?
7 | 19th June – 25th June | 80% | | | Phase 1 evaluation. Deliverable: All closed classes + numerals testvoc clean
8 | 26th June – 2nd July | 84% | | | ?; ?
9 | 3rd July – 9th July | 82% | | | ?
10 | 10th July – 16th July | 84% | | | ?; ?
11 | 17th July – 23rd July | 86% | | | Phase 2 evaluation. Deliverable: Nouns, adjectives testvoc clean; ?
12 | 24th July – 30th July | 88% | | | ?
13 | 1st August – 6th August | 89% | | | ?
14 | 7th August – 13th August | 90% | | | ?
15 | 14th August – 20th August | 91% | | | ?
16 | 21st August – 27th August | 92% | | | Final evaluation. Final deliverable: Full MT system, testvoc clean. Evaluation; write paper
17 | 28th August – 3rd September | | | | Write paper
18 | 4th September – 6th September | | | | Write paper