Difference between revisions of "User:Kamush/GSoC2021ProgresReport"

Line 159: Line 159:
== DONE (+Notes & Comments) ==
== DONE (+Notes & Comments) ==
* Made a script to calculate DixCount, Coverage, WER/PER at once
* Made a script to calculate DixCount, Coverage, WER/PER at once
* Calculating WER/PER:
* '''Calculating WER/PER:'''
** Apertium-eval-translator:
** Apertium-eval-translator:
*** https://github.com/apertium/apertium-eval-translator
*** ''https://github.com/apertium/apertium-eval-translator''
*** apertium-eval-translator -ref uzb.txt -test kaz-uzb.txt
apertium-eval-translator -ref uzb.txt -test kaz-uzb.txt
** Parallel text:
** '''Parallel text:'''
*** JaM Story:
*** JaM Story:
*** “Azamat va Oygul” in our case;
*** “Azamat va Oygul” in our case;
*** kaz-uzb/texts/[kaz|uzb].txt
*** kaz-uzb/texts/[kaz|uzb].txt
** Astana article from Kazakh Wiki
** Astana article from Kazakh Wiki
* Calculated dix Coverage:
* '''Calculated dix Coverage:'''
** (kaz-uzb/texts): bash ../../coverage-ltproc-new.sh ../docs/kaz-wiki.txt ../kaz-uzb.automorf.bin
** (kaz-uzb/texts): bash ../../coverage-ltproc-new.sh ../docs/kaz-wiki.txt ../kaz-uzb.automorf.bin
** coverage: 26752095 / 32305875 (~0.82808761564266561423)
** coverage: 26752095 / 32305875 (~0.82808761564266561423)
** remaining unknown forms: 5553780
** remaining unknown forms: 5553780
** kaz-wiki.txt Sun Jul 11 11:55:25 CEST 2021
** kaz-wiki.txt Sun Jul 11 11:55:25 CEST 2021
* Counting Dix elements:
* '''Counting Dix elements:'''
** Apertium-Eval: dixcounter.py:
** Apertium-Eval: dixcounter.py:
*** python3 ../dixcounter.py apertium-kaz-uzb.kaz-uzb.dix
python3 ../dixcounter.py apertium-kaz-uzb.kaz-uzb.dix
** July 09: 11008 dix elements before deduplication.
** July 09: 11008 dix elements before deduplication.
* Translating kaz-uig.dix into kaz-uzb
* Translating kaz-uig.dix into kaz-uzb
Line 185: Line 185:
** Changing turkish translation into uzbek one by looking at both kazakh and turkish words
** Changing turkish translation into uzbek one by looking at both kazakh and turkish words
** Added 3200 more words from this.
** Added 3200 more words from this.
* Extract Kazakh wikipedia:
* '''Extract Kazakh wikipedia:'''
** Kazakh wiki date: 01.05.2021
** Kazakh wiki date: 01.05.2021
*** https://dumps.wikimedia.org/kkwiki/20210501/
*** https://dumps.wikimedia.org/kkwiki/20210501/
** Apertium-tools: WikiExtractor:
** Apertium-tools: WikiExtractor:
*** Wiki: https://wiki.apertium.org/wiki/WikiExtractor
*** Wiki: ''https://wiki.apertium.org/wiki/WikiExtractor''
*** Code: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py
*** Code: ''https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py''
*** python3 WikiExtractor.py --infn kkwiki-20210501-pages-articles-multistream.xml.bz2
python3 WikiExtractor.py --infn kkwiki-20210501-pages-articles-multistream.xml.bz2
** Saved resulting corpus in docs/kaz-wiki.txt, ~24M tokens
** Saved resulting corpus in docs/kaz-wiki.txt, ~24M tokens
* Big Crossdix action:
* '''Big Crossdix action:'''
** Crossing dictionaries: kaz-kaa and kaa-uzb
** Crossing dictionaries: kaz-kaa and kaa-uzb
** Crossing dictionaries: kaz-tur and tur-uzb
** Crossing dictionaries: kaz-tur and tur-uzb
Line 202: Line 202:
** Added 5000+ words from this.
** Added 5000+ words from this.
* Start collecting the kaz-uzb bilingual dictionary
* Start collecting the kaz-uzb bilingual dictionary
* Convert the pair to apertium-recursive
* '''Convert the pair to apertium-recursive'''
** Done!!!, but I had to remove the entire bilingual repo and recreate it from scratch.
** Done!!!, but I had to remove the entire bilingual repo and recreate it from scratch.
* Write some lexical selection rules in kaz-uzb.
* Write some lexical selection rules in kaz-uzb.
** Wiki: https://wiki.apertium.org/wiki/How_to_get_started_with_lexical_selection_rules
** Wiki: ''https://wiki.apertium.org/wiki/How_to_get_started_with_lexical_selection_rules''
** Definitely needed samples:
** Definitely needed samples:
*** The closest sample is apertium-kaz-tur
*** The closest sample is apertium-kaz-tur
*** https://github.com/apertium/apertium-kaz-tur
*** ''https://github.com/apertium/apertium-kaz-tur''
* Translate small text (James & Mary story)
* '''Translate small text (James & Mary story)'''
** Source: https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora/jam
** Source: https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora/jam
** James&Mary story was downloaded from the source.
** James&Mary story was downloaded from the source.
Line 215: Line 215:
** Updated as Azamat & Oygul story
** Updated as Azamat & Oygul story
* Translating a text file:
* Translating a text file:
** cat texts/kaz.txt | apertium -d . -f line kaz-uzb
cat texts/kaz.txt | apertium -d . -f line kaz-uzb
* Translation of a sentence:
* Translation of a sentence:
** echo 'Сәлем Әлем' | apertium -d . kaz-uzb
echo 'Сәлем Әлем' | apertium -d . kaz-uzb
* Bootstrapping a new language pair apertium-kaz-uzb
* Bootstrapping a new language pair apertium-kaz-uzb
** Installed Apertium-init from pip
** Installed Apertium-init from pip
Line 223: Line 223:
** Downloaded apertium-init.py
** Downloaded apertium-init.py
*** Did not work
*** Did not work
** python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2
python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2
** cd apertium-kaz-uzb
cd apertium-kaz-uzb
** ./autogen.sh --with-lang1=../apertium-kaz --with-lang2=../apertium-uzb && make
./autogen.sh --with-lang1=../apertium-kaz --with-lang2=../apertium-uzb && make
** I had to convert it to apertium-recursive, so further flags to be added:
** I had to convert it to apertium-recursive, so further flags to be added:
*** -t rtx
*** ''-t rtx''
* Final initialization command:
* Final initialization command:
** python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2 -t rtx
python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2 -t rtx
* Forked(&Installed) necessary repos on GitHub:
* '''Forked(&Installed) necessary repos on GitHub:'''
** Apertium-kaz
** '''''Apertium-kaz'''''
*** git clone git@github.com:kamush901/apertium-kaz.git
*** ''git clone git@github.com:kamush901/apertium-kaz.git''
*** Works well
*** Works well
** Apertium-uzb
** '''''Apertium-uzb'''''
*** git@github.com:kamush901/apertium-uzb.git
*** ''git@github.com:kamush901/apertium-uzb.git''
*** Works well
*** Works well
* Installed Apertium and necessary tools:
* '''Installed Apertium and necessary tools:'''
** Installed Apertium core using packaging
** Installed Apertium core using packaging
*** wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
*** sudo apt-get -f install apertium-all-dev
sudo apt-get -f install apertium-all-dev

Revision as of 10:47, 12 July 2021

Progress Report

{| class="wikitable"
! Time period !! Goal !! Bidix (kaz-uzb) !! Coverage (kaz-uzb) !! WER / PER (kaz-uzb) !! WER / PER (uzb-kaz) !! Details/Comments
|-
| Community Bonding Period, May 17-June 5
| Installed Apertium; initialize kaz-uzb pair; collect data in both languages
| 426 (+426)
| 43.80%
| -
| -
| Installed Apertium and necessary tools; cloned apertium-kaz and apertium-uzb, initialized the kaz-uzb pair; translated a small sample text; extracted Uzbek and Kazakh wiki corpus; collected Kazakh-Uzbek dictionary and parallel corpora
|-
| Week 1, June 6-12
| Make Uzbek better
| 2220 (+1794)
| 52.11%
| -
| -
| Went through all Uzbek and Kazakh stems; initialized the pair with apertium-recursive; collected dictionaries from other pairs for crossdic; obtained crossdic results in two ways
|-
| Week 2, June 13-19
| Expand bilingual dictionary
| 5262 (+3042)
| 77.03%
| 74.77% / 67.57%
| 64.23% / 54.37%
| Started adding bilingual dictionary elements
|-
| Week 3, June 20-26
| More on .dix and .lrx
| 8543 (+3281)
| 81.55%
| 74.77% / 67.57%
| 64.23% / 54.37%
| Expanded bilingual dictionary; started sample lexical selection rules
|-
| Week 4, June 27-July 3
| Focus on transfer rules
| 9432 (+889)
| 81.85%
| 74.77% / 67.57%
| 64.23% / 54.37%
| Expanded bilingual dictionary more
|-
| Week 5, July 4-10
| Test translator and expand more
| 11008 (+1576)
| 82.81%
| 74.77% / 67.57%
| 64.23% / 54.37%
| Expanded bilingual dictionary; collected texts for lexical selection rules, tried a small script; translated a big Kazakh text into Uzbek for better WER/PER calculation
|-
| Week 6, July 11-17
| Focus more on transfer rules
| -
| -
| -
| -
| -
|-
| Week 7, July 18-24
| Test the kaz-uzb translator
| -
| -
| -
| -
| -
|-
| Week 8, July 25-31
| Focus on transfer rules
| -
| -
| -
| -
| -
|-
| Week 9, August 1-7
| Focus on testvoc
| -
| -
| -
| -
| -
|-
| Week 10, August 8-14
| Finalize work
| -
| -
| -
| -
| -
|}
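
For readers unfamiliar with the WER and PER columns, here is a minimal sketch of how the two metrics are commonly defined. It is not the code of apertium-eval-translator, which may tokenize and normalize text differently; the function names and the toy token lists are illustrative assumptions only.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def per(ref, hyp):
    """Position-independent error rate in percent (word order is ignored)."""
    matches = sum((Counter(ref) & Counter(hyp)).values())
    errors = max(len(ref), len(hyp)) - matches
    return 100.0 * errors / len(ref)


# Toy tokens (made up): the hypothesis has the right words in the wrong order.
reference = "a b c d".split()
hypothesis = "a c b d".split()
print(wer(reference, hypothesis), per(reference, hypothesis))
```

Because PER ignores word order, it is never higher than WER for the same pair of texts, which is consistent with the figures in the table.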

TODO

  • Writing lexical selection rules for uzb-kaz
  • Transfer rules
  • Testvoc


ONGOING

  • Lexical selection rules for kaz-uzb
  • Translating big Kaz text into Uzb
    • For better WER/PER calculation
    • For checking transfer rules
    • Chose the Nur-Sultan (capital city) article of Kazakh Wiki for that.
    • Extracted 112 sentences from the Nur-Sultan article.
  • Collecting more bidix


DONE (+Notes & Comments)

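  • Made a script to calculate DixCount, Coverage, WER/PER at once
  • Calculating WER/PER:
    • Apertium-eval-translator:
      • https://github.com/apertium/apertium-eval-translator
   apertium-eval-translator -ref uzb.txt -test kaz-uzb.txt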
    • Parallel text:
      • JaM Story:
      • “Azamat va Oygul” in our case;
      • kaz-uzb/texts/[kaz|uzb].txt
    • Astana article from Kazakh Wiki
  • Calculated dix Coverage:
    • (kaz-uzb/texts): bash ../../coverage-ltproc-new.sh ../docs/kaz-wiki.txt ../kaz-uzb.automorf.bin
    • coverage: 26752095 / 32305875 (~0.82808761564266561423)
    • remaining unknown forms: 5553780
    • kaz-wiki.txt Sun Jul 11 11:55:25 CEST 2021
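
The coverage figure above is simply known forms divided by all forms. The sketch below reproduces that arithmetic and shows one way such counts could be obtained from analyser output in the Apertium stream format, where an unknown form is analysed as `^form/*form$`. It is an assumption-laden illustration, not the contents of coverage-ltproc-new.sh; `count_forms` and the sample string are hypothetical.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def count_forms(analysed_text):
    """Return (known, total) for analyser output; unknown analyses start with '*'."""
    known = total = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        if not analyses.startswith("*"):
            known += 1
    return known, total


def coverage(known, total):
    """Share of forms that received at least one analysis."""
    if total <= 0:
        raise ValueError("total must be positive")
    return known / total


# Made-up sample: one analysed form and one unknown form.
print(count_forms("^abc/abc<n>$ ^xyz/*xyz$"))

# Figures reported above for kaz-wiki.txt.
known, total = 26752095, 32305875
print(total - known)                          # remaining unknown forms
print(f"{100 * coverage(known, total):.2f}%")
```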
  • Counting Dix elements:
    • Apertium-Eval: dixcounter.py:
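   python3 ../dixcounter.py apertium-kaz-uzb.kaz-uzb.dix
    • July 09: 11008 dix elements before deduplication.
  • Translating kaz-uig.dix into kaz-uzb
    • Changing the Turkish translation into an Uzbek one by looking at both the Kazakh and Turkish words
    • Added 3200 more words from this.
  • Extract Kazakh wikipedia:
    • Kazakh wiki date: 01.05.2021
      • https://dumps.wikimedia.org/kkwiki/20210501/
    • Apertium-tools: WikiExtractor:
      • Wiki: https://wiki.apertium.org/wiki/WikiExtractor
      • Code: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py
   python3 WikiExtractor.py --infn kkwiki-20210501-pages-articles-multistream.xml.bz2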
    • Saved resulting corpus in docs/kaz-wiki.txt, ~24M tokens
  • Big Crossdix action:
    • Crossing dictionaries: kaz-kaa and kaa-uzb
    • Crossing dictionaries: kaz-tur and tur-uzb
    • Merge the obtained crossdix outputs from two
    • Sort the merged file, removing duplicates
    • Align the result for better visibility
    • Manually check every translation, remove if bad, correct/add if necessary
    • Added 5000+ words from this.
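
The merge, sort, deduplicate and align steps listed above can be sketched as follows. This is only an illustration under simplifying assumptions: entries are treated as plain (Kazakh, Uzbek) string pairs, whereas real crossdix output is .dix XML, and the function names and placeholder pairs are hypothetical.

```python
def merge_crossdix(*entry_lists):
    """Merge (source, target) pairs from several crossings, dropping duplicates."""
    merged = set()
    for entries in entry_lists:
        merged.update(entries)
    return sorted(merged)


def align(entries):
    """Pad the source column so that targets line up for manual checking."""
    width = max((len(source) for source, _ in entries), default=0)
    return [f"{source.ljust(width)}  {target}" for source, target in entries]


# Placeholder pairs standing in for the kaz-kaa/kaa-uzb and kaz-tur/tur-uzb results.
via_kaa = [("b", "y"), ("a", "x")]
via_tur = [("a", "x"), ("c", "z")]

for line in align(merge_crossdix(via_kaa, via_tur)):
    print(line)
```

The aligned list is only a convenience for the manual review step; every pair still needs to be checked by hand before it goes into the bidix.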
  • Start collecting the kaz-uzb bilingual dictionary
  • Convert the pair to apertium-recursive
    • Done!!!, but I had to remove the entire bilingual repo and recreate it from scratch.
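  • Write some lexical selection rules in kaz-uzb.
    • Wiki: https://wiki.apertium.org/wiki/How_to_get_started_with_lexical_selection_rules
    • Definitely needed samples:
      • The closest sample is apertium-kaz-tur
      • https://github.com/apertium/apertium-kaz-tur
  • Translate small text (James & Mary story)
    • Source: https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora/jam
    • James&Mary story was downloaded from the source.
    • Updated as Azamat & Oygul story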
  • Translating a text file:
   cat texts/kaz.txt | apertium -d . -f line kaz-uzb
  • Translation of a sentence:
   echo 'Сәлем Әлем' | apertium -d . kaz-uzb
  • Bootstrapping a new language pair apertium-kaz-uzb
    • Installed Apertium-init from pip
      • Had a problem, solved it (thanks to @popcorndude).
    • Downloaded apertium-init.py
      • Did not work
   python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2
   cd apertium-kaz-uzb
   ./autogen.sh --with-lang1=../apertium-kaz --with-lang2=../apertium-uzb && make
    • I had to convert it to apertium-recursive, so further flags to be added:
      • -t rtx
  • Final initialization command:
   python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2 -t rtx
  • Forked(&Installed) necessary repos on GitHub:
    • Apertium-kaz
      • git clone git@github.com:kamush901/apertium-kaz.git
      • Works well
    • Apertium-uzb
      • git clone git@github.com:kamush901/apertium-uzb.git
      • Works well
  • Installed Apertium and necessary tools:
    • Installed Apertium core using packaging
   wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
   sudo apt-get -f install apertium-all-dev