Difference between revisions of "Crimean Tatar and Turkish/Work plan"

Latest revision as of 18:27, 19 June 2017

Week	Dates	Coverage	Achieved	Evaluation
3	22nd May — 28th May	40%	43.9%	✔
* Add all non-inflecting words
* Finish challenge text (no *,#)
* Do baseline evaluation (WER)
Official start
4	29th May — 4th June	40%		✔
* Break
5	5th June — 11th June	65%		✔
* ?
6	12th June — 18th June	75%		✔
* ?
* ?
7	19th June — 25th June	80%
Phase 1 evaluation
Deliverable: All closed classes + numerals testvoc clean
8	26th June — 2nd July	84%
* ?
* ?
9	3rd July — 9th July	84%
* ?
10	10th July — 16th July	84%
* ?
* ?
11	17th July — 23rd July	86%
Phase 2 evaluation
Deliverable: Nouns, adjectives testvoc clean
* ?
12	24th July — 30th July	88%
* ?
13	31st July — 6th August	89%
* ?
14	7th August — 13th August	90%
* ?
15	14th August — 20th August	91%
* ?
16	21st August — 27th August	92%
Final evaluation
Final deliverable: Full MT system, testvoc clean.
* Evaluation
* Write paper
17	28th August — 3rd September
* Write paper
18	4th September — 6th September
* Write paper

Coverage

To measure the bidix-trimmed coverage, use apertium-crh-tur/testvoc/corpus/trimmed-coverage.sh:

apertium-crh-tur$ bzcat ~/src/turkiccorpora/crh.wpdump.20151123.txt.bz2 | \
                  bash testvoc/corpus/trimmed-coverage.sh | less

Number of tokenised words in the corpus:         148013
Number of tokenised words unknown to analyser:    63730  —  43.1 % of tokens had *
                          unknown to bidix:         112  —   0.1 % of tokens had @
     w/transfer errors or unknown to generator:    2473  —   1.7 % of tokens had #

Error-free coverage of analyser only:             84283  —  56.9 % of tokens had no *
Error-free coverage of analyser and bidix:        84171  —  56.9 % of tokens had no */@
Error-free coverage of the full translator:       81698  —  55.2 % of tokens had no */@/#

Top unknown words in the corpus:
    972 ^*Ukrainanıñ$
    939 ^*vilâyetinde$
    631 ^*şeklinde$
    607 ^*qasaba$
    508 ^*merkezi$
    434 ^*rayonınıñ$
    329 ^*da$
    283 ^*de$
    235 ^*adı$
    221 ^*vilâyeti$

Tokens needed to get 65.0 % bidix-trimmed coverage (no */@/#): 12037
Storing corresponding wordlist in /tmp/corpus-stat-all-needed.txt
        
        
^Baş<n><nom>$   Baş
^*Saife$        *Saife

...
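
The percentages in the report above can be cross-checked from the raw token counts. The following is a minimal Python sketch, not part of trimmed-coverage.sh: the function name and dictionary keys are hypothetical, and it assumes that every percentage is a count divided by the total number of tokenised words, rounded to one decimal place.

```python
def coverage_report(total, unknown_analyser, unknown_bidix, transfer_or_gen):
    """Recompute the coverage figures from raw counts.

    total            -- tokenised words in the corpus
    unknown_analyser -- tokens marked with *
    unknown_bidix    -- tokens marked with @
    transfer_or_gen  -- tokens marked with #
    (All names here are illustrative, not taken from the script.)
    """
    def pct(count):
        return round(100.0 * count / total, 1)

    # Each stage of the pipeline can only lose tokens the previous stage kept.
    analyser_ok = total - unknown_analyser
    bidix_ok = analyser_ok - unknown_bidix
    full_ok = bidix_ok - transfer_or_gen

    return {
        "analyser_only": (analyser_ok, pct(analyser_ok)),
        "analyser_and_bidix": (bidix_ok, pct(bidix_ok)),
        "full_translator": (full_ok, pct(full_ok)),
    }


# Counts copied from the sample run above.
report = coverage_report(148013, 63730, 112, 2473)
for stage, (count, percentage) in report.items():
    print(f"{stage}: {count} tokens, {percentage} %")
```

With the counts from the sample run, this gives 84283, 84171 and 81698 error-free tokens for the three stages, matching the report.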

Testvoc

Requirements for testvoc in week 1:

1. All pronouns from the Crimean Tatar corpora are translated without debug symbols.
2. All pronouns the transducer generates pass without debug symbols (this is less important; focus on it only once 1 is done).

To achieve 1:

* analyse corpora with crh-morph mode
* grep pronouns
* make sure they pass through the rest of the pipeline without getting @ or #
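
The "grep pronouns" and "no @ or #" checks can be sketched in Python as follows. This is only an illustration under stated assumptions: pronoun analyses are assumed to carry the conventional Apertium `<prn>` tag, lexical units are assumed to be delimited by `^` and `$`, and debug symbols are assumed to appear at the start of a whitespace-separated token in the translated output. The function names and the sample strings are hypothetical.

```python
import re

# Assumption: pronouns are tagged <prn> in the analyser output.
PRONOUN_TAG = "<prn>"

# A lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def pronoun_units(morph_output):
    """Return the lexical units that have at least one pronoun analysis."""
    return [lu for lu in LEXICAL_UNIT.findall(morph_output) if PRONOUN_TAG in lu]


def flagged_tokens(translated):
    """Return the tokens of translated text that start with @ or #."""
    return [tok for tok in translated.split() if tok.startswith(("@", "#"))]


# Illustrative input only; not taken from a real run.
analysed = "^men/men<prn><pers><p1><sg><nom>$ ^kitap/kitap<n><nom>$"
print(pronoun_units(analysed))

# An empty list means the translated text contains no debug symbols.
print(flagged_tokens("ben @kitap"))
```

In practice the input would come from the crh-morph mode and from the end of the translation pipeline, with each pronoun form checked in turn.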

To achieve 2:

* in the 'Root' lexicon of the .lexc file, comment out everything except Pronouns
* generate pronouns with hfst-fst2string crh.automorf.hfst
* make sure they pass through the rest of the pipeline without getting @ or #

We don't want to spend too much time on forms that the transducer might over-generate, which is why we focus on 1 first.

@@ Line 1: / Line 1: @@
-What [[User:IlnarSalimzyan|selimcan]] expects:
-* [[Calculating coverage|Bidix-trimmed coverage]] 90% on average.
-* [[Testvoc#Corpus testvoc|Corpus testvoc]] clean on all Crimean Tatar corpora we have.
-* Tests in [[Crimean Tatar and Turkish/Pending tests|Pending tests]] pass and thus are moved to [[Crimean Tatar and Turkish/Regression tests|Regression tests]]
 {|class=wikitable
 ! Week !! Dates                 !!  Coverage !! Achieved !! Evaluation
 |-
-| 3  ||22nd May &mdash; 28th May || 40% || 43.9% ||
+| 3  ||22nd May &mdash; 28th May || 40% || 43.9% || '''✔'''
 |-
@@ Line 21: / Line 15: @@
 |-
-| 4  ||29th May &mdash; 4th June || 40% || ||
+| 4  ||29th May &mdash; 4th June || 40% || || '''✔'''
 |-
@@ Line 27: / Line 21: @@
 |-
-| 5  ||5th June &mdash; 11th June || 65%  || ||
+| 5  ||5th June &mdash; 11th June || 65%  || || '''✔'''
 |-
@@ Line 33: / Line 27: @@
 |-
-| 6  ||12th June &mdash; 18th June || 70%  || ||
+| 6  ||12th June &mdash; 18th June || 75%  || || '''✔'''
 |-
@@ Line 57: / Line 51: @@
 |-
-| 9  ||3rd July &mdash; 9th July || 82%  || ||
+| 9  ||3rd July &mdash; 9th July || 84%  || ||
 |-
@@ Line 136: / Line 130: @@
 |-
 |}
+=== Coverage ===
+To measure the bidix-trimmed coverage, use <code>apertium-crh-tur/testvoc/corpus/trimmed-coverage.sh</code>:
+<pre>
+apertium-crh-tur$ bzcat ~/src/turkiccorpora/crh.wpdump.20151123.txt.bz2 | \
+                  bash testvoc/corpus/trimmed-coverage.sh | less
+Number of tokenised words in the corpus:         148013
+Number of tokenised words unknown to analyser:    63730  —  43.1 % of tokens had *
+                          unknown to bidix:         112  —   0.1 % of tokens had @
+     w/transfer errors or unknown to generator:    2473  —   1.7 % of tokens had #
+Error-free coverage of analyser only:             84283  —  56.9 % of tokens had no *
+Error-free coverage of analyser and bidix:        84171  —  56.9 % of tokens had no */@
+Error-free coverage of the full translator:       81698  —  55.2 % of tokens had no */@/#
+Top unknown words in the corpus:
+^*Ukrainanıñ$
+^*vilâyetinde$
+^*şeklinde$
+^*qasaba$
+^*merkezi$
+^*rayonınıñ$
+^*da$
+^*de$
+^*adı$
+^*vilâyeti$
+Tokens needed to get 65.0 % bidix-trimmed coverage (no */@/#): 12037
+Storing corresponding wordlist in /tmp/corpus-stat-all-needed.txt
+^Baş<n><nom>$   Baş
+^*Saife$        *Saife
+...
+</pre>
+=== Testvoc ===
+Requirements for testvoc in week 1:
+# all pronouns from Crimean Tatar corpora  are translated without debug symbols
+# all pronouns the transducer generates must pass without debug symbols  (this is less important, and only to focus on if done with 1)
+To achieve 1:
+* analyse corpora with crh-morph mode
+* grep pronouns
+* make sure they pass through the rest of the pipeline without getting @ or #
+To achieve 2:
+* in 'Root' lexicon of the .lexc file, comment out everything except Pronouns
+* generate pronouns with <code>hfst-fst2string crh.automorf.hfst</code>
+* make sure they pass through the rest of the pipeline without getting @ or #
+We don't want to spend too much time on forms which might be over-generated by the transducer. This is the reason why we focus on 1 first.
 [[Category:Crimean Tatar and Turkish|Work plan]]
