Difference between revisions of "User:Kamush/GSoC2021ProgresReport"

== DONE (NOTES) ==
* Made a script to calculate DixCount, Coverage, WER/PER at once
* Calculating WER/PER:
** Apertium-eval-translator:
*** https://github.com/apertium/apertium-eval-translator
*** apertium-eval-translator -ref uzb.txt -test kaz-uzb.txt
** Parallel text:
*** JaM Story:
*** “Azamat va Oygul” in our case;
*** kaz-uzb/texts/[kaz|uzb].txt
** Astana article from Kazakh Wiki
* Calculated dix Coverage:
** (kaz-uzb/texts): bash ../../coverage-ltproc-new.sh ../docs/kaz-wiki.txt ../kaz-uzb.automorf.bin
** coverage: 26752095 / 32305875 (~0.82808761564266561423)
** remaining unknown forms: 5553780
** kaz-wiki.txt Sun Jul 11 11:55:25 CEST 2021
* Counting Dix elements:
** Apertium-Eval: dixcounter.py:
*** python3 ../dixcounter.py apertium-kaz-uzb.kaz-uzb.dix
** July 09: 11008 dix elements before deduplication.
* Translating kaz-uig.dix into kaz-uzb.dix
* Translating kaz-kaa.dix into kaz-uzb.dix
** Removing entries already covered by crossdic
** Replacing each Karakalpak translation with its Uzbek equivalent by comparing the Kazakh and Karakalpak words
* Translating kaz-tur.dix into kaz-uzb.dix
** Removing entries already covered by crossdic
** Replacing each Turkish translation with its Uzbek equivalent by comparing the Kazakh and Turkish words
** Added 3200 more words from this.
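
As a rough illustration of the two metrics reported by apertium-eval-translator, here is a minimal Python sketch. It assumes whitespace tokenization and one common definition of PER (bag-of-words overlap); the function names are hypothetical and this is not the tool's own implementation, so its numbers may differ slightly from the tool's.

```python
from collections import Counter


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def position_independent_error_rate(reference: str, hypothesis: str) -> float:
    """Like WER, but word order is ignored (bag-of-words comparison)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Multiset intersection counts words shared by both sentences.
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)
```

Because PER ignores word order, it is never higher than WER for the same sentence pair, which matches the pattern in the reported figures.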

Revision as of 10:23, 12 July 2021

Progress Report

{| class="wikitable"
! rowspan="2" | Time Period
! rowspan="2" | Goal
! Bidix
! Coverage
! colspan="2" | WER, PER
! rowspan="2" | Details/Comments
|-
! kaz-uzb
! kaz-uzb
! kaz-uzb
! uzb-kaz
|-
| Community Bonding Period (May 17-June 5)
|
* Install Apertium
* Initialize the kaz-uzb pair
* Collect data in both languages
| 426 (+426)
| 43.80%
| -
| -
|
* Installed Apertium and the necessary tools
* Cloned apertium-kaz and apertium-uzb, initialized the kaz-uzb pair
* Translated a small sample text
* Extracted Uzbek and Kazakh wiki corpora
* Collected a Kazakh-Uzbek dictionary and parallel corpora
|-
| Week 1 (June 6-12)
| Make Uzbek better
| 2220 (+1794)
| 52.11%
| -
| -
|
* Went through all Uzbek and Kazakh stems
* Initialized the pair with apertium-recursive
* Collected dictionaries from other pairs for crossdic
* Obtained crossdic results in two ways
|-
| Week 2 (June 13-19)
| Expand bilingual dictionary
| 5262 (+3042)
| 77.03%
| 74.77% / 67.57%
| 64.23% / 54.37%
|
* Started adding bilingual dictionary elements
|-
| Week 3 (June 20-26)
| More on .dix and .lrx
| 8543 (+3281)
| 81.55%
| 74.77% / 67.57%
| 64.23% / 54.37%
|
* Expanded the bilingual dictionary
* Started writing sample lexical selection rules
|-
| Week 4 (June 27-July 3)
| Focus on transfer rules
| 9432 (+889)
| 81.85%
| 74.77% / 67.57%
| 64.23% / 54.37%
|
* Expanded the bilingual dictionary further
|-
| Week 5 (July 4-10)
| Test translator and expand more
| 11008 (+1576)
| 82.81%
| 74.77% / 67.57%
| 64.23% / 54.37%
|
* Expanded the bilingual dictionary
* Collected texts for lexical selection rules, tried a small script
* Translated a large Kazakh text into Uzbek for better WER/PER calculation
|-
| Week 6 (July 11-17)
| Focus more on transfer rules
| -
| -
| -
| -
| -
|-
| Week 7 (July 18-24)
| Test the kaz-uzb translator
| -
| -
| -
| -
| -
|-
| Week 8 (July 25-31)
| Focus on transfer rules
| -
| -
| -
| -
| -
|-
| Week 9 (August 1-7)
| Focus on testvoc
| -
| -
| -
| -
| -
|-
| Week 10 (August 8-14)
| Finalize work
| -
| -
| -
| -
| -
|}
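
The bracketed increments in the Bidix column are differences between consecutive cumulative totals. A small Python sketch of that arithmetic, using the totals from the table (the dictionary and function names are hypothetical):

```python
# Cumulative bidix entry counts, copied from the progress table.
cumulative_bidix = {
    "Community Bonding": 426,
    "Week 1": 2220,
    "Week 2": 5262,
    "Week 3": 8543,
    "Week 4": 9432,
    "Week 5": 11008,
}


def gains_per_period(totals: dict[str, int]) -> dict[str, int]:
    """Return how many entries were added in each period."""
    gains = {}
    previous = 0  # the dictionary started empty
    for period, total in totals.items():
        gains[period] = total - previous
        previous = total
    return gains


for period, gain in gains_per_period(cumulative_bidix).items():
    print(f"{period}: +{gain}")
```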

TODO

  • Writing lexical selection rules for uzb-kaz
  • Transfer rules
  • Testvoc


ONGOING

  • Lexical selection rules for kaz-uzb
  • Translating a large Kazakh text into Uzbek
    • For better WER/PER calculation
    • For checking transfer rules
    • Chose the Nur-Sultan (capital city) article from Kazakh Wiki for this.
    • Extracted 112 sentences from the Nur-Sultan article.
  • Collecting more bidix


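
The coverage figure in the notes is the share of corpus tokens that the analyser recognises. The internals of coverage-ltproc-new.sh are not shown here, so the following Python sketch is only an assumption about the general approach: it counts lexical units in analyser output in the Apertium stream format, where an unknown form is marked with a leading `*` in its analysis. The function name and sample string are hypothetical, and escaped `^`, `/` and `$` characters are not handled.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/analysis1/...$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def count_coverage(analysed_text: str) -> tuple[int, int]:
    """Return (known, total) lexical units found in analyser output."""
    known = total = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        # Unknown forms are emitted as ^word/*word$.
        if not match.group(2).startswith("*"):
            known += 1
    return known, total


# Hypothetical analyser output: two known forms and one unknown form.
sample = "^Астана/Астана<np><top><nom>$ ^қала/қала<n><nom>$ ^xyz/*xyz$"
known, total = count_coverage(sample)
print(f"coverage: {known} / {total} (~{known / total:.4f})")

# The ratio reported in the notes, as a percentage.
print(f"{26752095 / 32305875:.2%}")
```

The last line reproduces the 82.81% shown for Week 5, and 32305875 - 26752095 gives the 5553780 remaining unknown forms.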