Difference between revisions of "User:Kamush/GSoC2021ProgresReport"

 
(29 intermediate revisions by the same user not shown)
Line 20: Line 20:
 
* Initialize kaz-uzb pair
 
* Initialize kaz-uzb pair
 
* Collect data in both languages
 
* Collect data in both languages
  +
| style = "text-align: center;" | 426
| 426
 
  +
(+426)
| -
 
  +
| style = "text-align: center;" | 43.80 %
 
| -
 
| -
 
| -
 
| -
Line 34: Line 35:
 
June 6-12
 
June 6-12
 
|Make Uzbek better
 
|Make Uzbek better
  +
| style = "text-align: center;" | 2220
| -
 
  +
(+1794)
| -
 
  +
| style = "text-align: center;" | 52.11 %
 
| -
 
| -
 
| -
 
| -
Line 47: Line 49:
 
June 13-19
 
June 13-19
 
| Expand bilingual dictionary
 
| Expand bilingual dictionary
  +
| style = "text-align: center;" | 5262
| -
 
  +
(+3042)
| -
 
  +
| style = "text-align: center;" | 77.03 %
 
| 74.77% / 67.57%
 
| 74.77% / 67.57%
 
| 64.23% / 54.37%
 
| 64.23% / 54.37%
Line 57: Line 60:
 
June 20-26
 
June 20-26
 
| More on .dix and .lrx
 
| More on .dix and .lrx
  +
| style = "text-align: center;" | 8543
| -
 
  +
(+3281)
| -
 
  +
| style = "text-align: center;" | 81.55 %
 
| 74.77% / 67.57%
 
| 74.77% / 67.57%
 
| 64.23% / 54.37%
 
| 64.23% / 54.37%
Line 68: Line 72:
 
June 27-July 3
 
June 27-July 3
 
|Focus on transfer rules
 
|Focus on transfer rules
  +
| style = "text-align: center;" | 9432
| -
 
  +
(+889)
| -
 
  +
| style = "text-align: center;" | 81.85 %
 
| 74.77% / 67.57%
 
| 74.77% / 67.57%
 
| 64.23% / 54.37%
 
| 64.23% / 54.37%
Line 78: Line 83:
 
July 4-10
 
July 4-10
 
|Test translator and expand more
 
|Test translator and expand more
  +
| style = "text-align: center;" | 11008
| 11008
 
  +
(+1576)
| 82.81%
 
  +
| style = "text-align: center;" | 82.81%
 
| 74.77% / 67.57%
 
| 74.77% / 67.57%
 
| 64.23% / 54.37%
 
| 64.23% / 54.37%
Line 90: Line 96:
 
July 11-17
 
July 11-17
 
|Focus more on transfer rules
 
|Focus more on transfer rules
  +
| style = "text-align: center;" | +191
| -
 
  +
| style = "text-align: center;" | 84.46 %
| -
 
  +
| 71.83% / 63.52%
| -
 
  +
| 67.24% / 60.31%
| -
 
| -
+
|
  +
* Added more stems to the bidix to reach 85% coverage
  +
* Added 1500+ Lexical Selection Rules to kaz-uzb.lrx for ambiguous translations
  +
* Added 1200+ Lexical Selection Rules to uzb-kaz.lrx for ambiguous translations
  +
* Prepared 200 sentences of parallel corpora.
 
|-
 
|-
 
|Week 7
 
|Week 7
 
July 18-24
 
July 18-24
 
|Test the kaz-uzb translator
 
|Test the kaz-uzb translator
  +
| style = "text-align: center;" | +219
| -
 
  +
| style = "text-align: center;" | 84.58 %
| -
 
  +
| 69.21% / 60.69%
| -
 
  +
| 67.10% / 59.89%
| -
 
| -
+
|
  +
* Started the transfer rules
  +
* Some more bidix
  +
* Some Lexical selection rules
 
|-
 
|-
 
|Week 8
 
|Week 8
 
July 25-31
 
July 25-31
 
|Focus on transfer rules
 
|Focus on transfer rules
  +
| style = "text-align: center;" | +500
| -
 
  +
| style = "text-align: center;" | 85.17 %
| -
 
  +
| 69.21% / 60.69%
| -
 
  +
| 67.10% / 59.89%
| -
 
| -
+
|
  +
* Some additions to apertium-kaz & apertium-uzb
  +
* Some transfer rules
  +
* Some more lexical selection rules & bidix
 
|-
 
|-
 
|Week 9
 
|Week 9
 
August 1-7
 
August 1-7
 
|Focus on testvoc
 
|Focus on testvoc
  +
| style = "text-align: center;" | +1000
| -
 
  +
| style = "text-align: center;" | 86.02 %
| -
 
  +
| 71.37% / 64.81%
| -
 
  +
| 61.93% / 56.05%
| -
 
| -
+
|
  +
* More frequent words added to the dictionary
 
|-
 
|-
 
|Week 10
 
|Week 10
 
August 8-14
 
August 8-14
 
|Finalize work
 
|Finalize work
  +
| style = "text-align: center;" | +400
| -
 
  +
| style = "text-align: center;" | 86.02 %
| -
 
  +
| 71.37% / 64.81%
| -
 
  +
| 61.93% / 56.05%
| -
 
| -
+
|
  +
* More post-editing done
 
|-
 
|-
 
|}
 
|}
  +
  +
== TODO ==
  +
* More transfer rules
  +
* Testvoc
  +
  +
== ONGOING ==
  +
  +
* Transfer rules
  +
* Some lexical selection rules
  +
* Collecting more bidix
  +
** Adding only the most frequent words from the Kazakh wiki hitparade until 85% coverage is reached
  +
  +
== DONE (+Notes & Comments) ==
  +
* Made a script to calculate WER for each line of the kaz-uzb.txt file to see what sentences are causing the most problems.
  +
* Translating big Kaz text into Uzb
  +
** Aside from those small texts obtained from parallel corpora
  +
** For better WER/PER calculation
  +
** For checking transfer rules
  +
** Started with readily available small JaM Story
  +
*** Azamat and Oygul story in our case
  +
*** Added 47 sentences from that.
  +
** Chose the Nur-Sultan (capital city) article of Kazakh Wiki for that.
  +
*** Split the article into sentences, translated into uzb manually.
  +
*** Added 112 sentences out of Nur-Sultan text.
  +
** Used the QED parallel corpus from the internet.
  +
*** Manually selected sentences and corrected their translations.
  +
*** Added 90 sentences from this.
  +
** Made 250 sentences in total, stopping this here.
  +
* Tried finding parallel corpora:
  +
** ''https://opus.nlpl.eu/''
  +
*** KDE4: Not good: ''https://opus.nlpl.eu/KDE4/v2/kk-uz_sample.html''
  +
*** MozillaI10: Nope
  +
*** '''QED''':
  +
**** kk-uz.tmx file
  +
**** Good; will use this, but it needs manual selection and corrections to the translations.
  +
*** TED: the same file as QED.
  +
*** The rest is also not good, just one-word translation mostly.
  +
* Lexical selection rules for kaz-uzb
  +
** Created a script to analyse bidix and find ambiguous stems.
  +
** Created a script that generates lexical selection rules from manually chosen stems
  +
** Rules added in kaz-uzb.lrx for 1565 words.
  +
** Rules added in uzb-kaz.lrx for 1295 words.
  +
* '''Apertium-dixtools:'''
  +
** ''https://wiki.apertium.org/wiki/Apertium-dixtools''
  +
** Fixing bidix(deduplicating, removing empty lines):
  +
apertium-dixtools fix apertium-kaz-uzb.kaz-uzb.dix apertium-kaz-uzb.kaz-uzb.dix.fixed
  +
** Sorting bidix(aligning too):
  +
apertium-dixtools sort -alignBidix -ignorecase apertium-kaz-uzb.kaz-uzb.dix apertium-kaz-uzb.kaz-uzb.dix.sorted
  +
* Sorting+Deduplicating the bidix for future purposes:
  +
** Bidix count before deduplication: 11008
  +
** Fixing the bidix:
  +
*** Removed 1501 entries in section main
  +
** Bidix count after deduplication: 9507
  +
** Sorted+Regrouped the dix.
  +
* Made a script to calculate DixCount, Coverage, WER/PER at once
  +
* '''Calculating WER/PER:'''
  +
** Apertium-eval-translator:
  +
*** ''https://github.com/apertium/apertium-eval-translator''
  +
$ cat kaz-big.txt | apertium -d ../ kaz-uzb > kaz-uzb.txt
  +
$ cat uzb-big.txt | apertium -d ../ uzb-kaz > uzb-kaz.txt
  +
$ apertium-eval-translator -ref uzb-big.txt -test kaz-uzb.txt
  +
$ apertium-eval-translator -ref kaz-big.txt -test uzb-kaz.txt
  +
** '''Parallel text:'''
  +
*** JaM Story:
  +
*** “Azamat va Oygul” in our case;
  +
*** kaz-uzb/texts/[kaz|uzb].txt
  +
** Astana article from Kazakh Wiki
  +
* '''Calculated dix Coverage:'''
  +
** (kaz-uzb/texts): bash ../../coverage-ltproc-new.sh ../docs/kaz-wiki.txt ../kaz-uzb.automorf.bin
  +
** coverage: 26752095 / 32305875 (~0.82808761564266561423)
  +
** remaining unknown forms: 5553780
  +
** kaz-wiki.txt Sun Jul 11 11:55:25 CEST 2021
  +
* '''Counting Dix elements:'''
  +
** Apertium-Eval: dixcounter.py:
  +
python3 ../dixcounter.py apertium-kaz-uzb.kaz-uzb.dix
  +
  +
** July 09: 11008 dix elements before deduplication.
  +
* Translating kaz-uig.dix into kaz-uzb
  +
* Translating kaz-kaa.dix into kaz-uzb.dix
  +
** Removing those that were already done by crossdic
  +
** Changing karakalpak translation into uzbek one by looking at both kazakh and karakalpak words
  +
* Translating kaz-tur.dix into kaz-uzb.dix
  +
** Removing those that were already done by crossdic
  +
** Changing turkish translation into uzbek one by looking at both kazakh and turkish words
  +
** Added 3200 more words from this.
  +
* '''Extract Kazakh wikipedia:'''
  +
** Kazakh wiki date: 01.05.2021
  +
*** https://dumps.wikimedia.org/kkwiki/20210501/
  +
** Apertium-tools: WikiExtractor:
  +
*** Wiki: ''https://wiki.apertium.org/wiki/WikiExtractor''
  +
*** Code: ''https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py''
  +
python3 WikiExtractor.py --infn kkwiki-20210501-pages-articles-multistream.xml.bz2
  +
  +
** Saved resulting corpus in docs/kaz-wiki.txt, ~24M tokens
  +
* '''Big Crossdix action:'''
  +
** Crossing dictionaries: kaz-kaa and kaa-uzb
  +
** Crossing dictionaries: kaz-tur and tur-uzb
  +
** Merge the obtained crossdix outputs from two
  +
** Sort the merged file, removing duplicates
  +
** Align the result for better visibility
  +
** Manually check every translation, remove if bad, correct/add if necessary
  +
** Added 5000+ words from this.
  +
* Start collecting the kaz-uzb bilingual dictionary
  +
* '''Convert the pair to apertium-recursive'''
  +
** Done!!!, but I had to remove the entire bilingual repo and recreate it from scratch.
  +
* Write some lexical selection rules in kaz-uzb.
  +
** Wiki: ''https://wiki.apertium.org/wiki/How_to_get_started_with_lexical_selection_rules''
  +
** Definitely needed samples:
  +
*** The closest sample is apertium-kaz-tur
  +
*** ''https://github.com/apertium/apertium-kaz-tur''
  +
* '''Translate small text (James & Mary story)'''
  +
** Source: https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora/jam
  +
** James&Mary story was downloaded from the source.
  +
** apertium-kaz-uzb/texts/
  +
** Updated as Azamat & Oygul story
  +
* Translating a text file:
  +
cat texts/kaz.txt | apertium -d . -f line kaz-uzb
  +
* Translation of a sentence:
  +
echo 'Сәлем Әлем' | apertium -d . kaz-uzb
  +
* Bootstrapping a new language pair apertium-kaz-uzb
  +
** Installed Apertium-init from pip
  +
*** Had a problem, solved it (thanks to @popcorndude).
  +
** Downloaded apertium-init.py
  +
*** Did not work
  +
python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2
  +
cd apertium-kaz-uzb
  +
./autogen.sh --with-lang1=../apertium-kaz --with-lang2=../apertium-uzb && make
  +
** I had to convert it to apertium-recursive, so further flags to be added:
  +
*** ''-t rtx''
  +
* Final initialization command:
  +
python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2 -t rtx
  +
* '''Forked(&Installed) necessary repos on GitHub:'''
  +
** '''''Apertium-kaz'''''
  +
*** ''git clone git@github.com:kamush901/apertium-kaz.git''
  +
*** Works well
  +
** '''''Apertium-uzb'''''
  +
*** ''git clone git@github.com:kamush901/apertium-uzb.git''
  +
*** Works well
  +
* '''Installed Apertium and necessary tools:'''
  +
** Installed Apertium core using packaging
  +
wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
  +
sudo apt-get -f install apertium-all-dev

Latest revision as of 11:51, 31 August 2021

Progress Report

{|
! Time Period
! Goal
! Bidix (kaz-uzb)
! Coverage (kaz-uzb)
! WER / PER (kaz-uzb)
! WER / PER (uzb-kaz)
! Details/Comments
|-
|Community Bonding Period

May 17-June 5
|
* Installed Apertium
* Initialize kaz-uzb pair
* Collect data in both languages
| style = "text-align: center;" | 426

(+426)
| style = "text-align: center;" | 43.80 %
| -
| -
|
* Installed Apertium and necessary tools;
* Cloned apertium-kaz and apertium-uzb, initialized the kaz-uzb pair;
* Translated a small sample text;
* Extracted Uzbek and Kazakh wiki corpus;
* Collected Kazakh-Uzbek dictionary and parallel corpora;
|-
|Week 1

June 6-12
|Make Uzbek better
| style = "text-align: center;" | 2220

(+1794)
| style = "text-align: center;" | 52.11 %
| -
| -
|
* Went through all Uzbek and Kazakh stems;
* Initialized the pair with apertium-recursive;
* Collected dictionaries from other pairs for crossdic;
* Obtained crossdic results in two ways.
|-
|Week 2

June 13-19
| Expand bilingual dictionary
| style = "text-align: center;" | 5262

(+3042)
| style = "text-align: center;" | 77.03 %
| 74.77% / 67.57%
| 64.23% / 54.37%
|
* Started adding bilingual dictionary elements;
|-
|Week 3

June 20-26
| More on .dix and .lrx
| style = "text-align: center;" | 8543

(+3281)
| style = "text-align: center;" | 81.55 %
| 74.77% / 67.57%
| 64.23% / 54.37%
|
* Expanded bilingual dictionary;
* Started sample lexical selection rules;
|-
|Week 4

June 27-July 3
|Focus on transfer rules
| style = "text-align: center;" | 9432

(+889)
| style = "text-align: center;" | 81.85 %
| 74.77% / 67.57%
| 64.23% / 54.37%
|
* Expanded bilingual dictionary more;
|-
|Week 5

July 4-10
|Test translator and expand more
| style = "text-align: center;" | 11008

(+1576)
| style = "text-align: center;" | 82.81%
| 74.77% / 67.57%
| 64.23% / 54.37%
|
* Expanded bilingual dictionary;
* Collected texts for lexical selection rules, tried a small script;
* Translated a big Kazakh text into Uzbek for better WER/PER calculation.
|-
|Week 6

July 11-17
|Focus more on transfer rules
| style = "text-align: center;" | +191
| style = "text-align: center;" | 84.46 %
| 71.83% / 63.52%
| 67.24% / 60.31%
|
* Added more stems to the bidix to reach 85% coverage
* Added 1500+ Lexical Selection Rules to kaz-uzb.lrx for ambiguous translations
* Added 1200+ Lexical Selection Rules to uzb-kaz.lrx for ambiguous translations
* Prepared 200 sentences of parallel corpora.
|-
|Week 7

July 18-24
|Test the kaz-uzb translator
| style = "text-align: center;" | +219
| style = "text-align: center;" | 84.58 %
| 69.21% / 60.69%
| 67.10% / 59.89%
|
* Started the transfer rules
* Some more bidix
* Some Lexical selection rules
|-
|Week 8

July 25-31
|Focus on transfer rules
| style = "text-align: center;" | +500
| style = "text-align: center;" | 85.17 %
| 69.21% / 60.69%
| 67.10% / 59.89%
|
* Some additions to apertium-kaz & apertium-uzb
* Some transfer rules
* Some more lexical selection rules & bidix
|-
|Week 9

August 1-7
|Focus on testvoc
| style = "text-align: center;" | +1000
| style = "text-align: center;" | 86.02 %
| 71.37% / 64.81%
| 61.93% / 56.05%
|
* More frequent words added to the dictionary
|-
|Week 10

August 8-14
|Finalize work
| style = "text-align: center;" | +400
| style = "text-align: center;" | 86.02 %
| 71.37% / 64.81%
| 61.93% / 56.05%
|
* More post-editing done
|}

TODO

  • More transfer rules
  • Testvoc

ONGOING

  • Transfer rules
  • Some lexical selection rules
  • Collecting more bidix
    • Adding only the most frequent words from the Kazakh wiki hitparade until 85% coverage is reached
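
The hitparade mentioned above is a frequency list of the forms the analyser does not yet know. A minimal sketch of building one is given below. It assumes the corpus has already been analysed into the Apertium stream format, where an unknown form appears as `^form/*form$`; the function name `unknown_hitparade` and the sample string are illustrative and are not taken from the project's own scripts. Escaped characters in the stream format are not handled.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/^$]+)/([^$]*)\$")


def unknown_hitparade(analysed_text):
    """Return (form, count) pairs for unknown forms, most frequent first.

    A form counts as unknown when its analysis starts with '*'.
    Forms are lower-cased so that capitalised variants are merged.
    """
    counts = Counter()
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        if analyses.startswith("*"):
            counts[surface.lower()] += 1
    return counts.most_common()


# Illustrative input only, not real analyser output from the pair.
sample = (
    "^Бұл/бұл<prn><dem><nom>$ ^қала/қала<n><nom>$ "
    "^әдемі/*әдемі$ ^Әдемі/*Әдемі$ ^көл/*көл$"
)
for form, count in unknown_hitparade(sample):
    print(count, form)
```

Working down such a list from the top adds the stems that raise coverage fastest, which is the idea behind adding only the most frequent words.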

DONE (+Notes & Comments)

  • Made a script to calculate WER for each line of the kaz-uzb.txt file to see what sentences are causing the most problems.
  • Translating big Kaz text into Uzb
    • Aside from those small texts obtained from parallel corpora
    • For better WER/PER calculation
    • For checking transfer rules
    • Started with readily available small JaM Story
      • Azamat and Oygul story in our case
      • Added 47 sentences from that.
    • Chose the Nur-Sultan (capital city) article of Kazakh Wiki for that.
      • Split the article into sentences, translated into uzb manually.
      • Added 112 sentences out of Nur-Sultan text.
    • Used the QED parallel corpus from the internet.
      • Manually selected sentences and corrected their translations.
      • Added 90 sentences from this.
    • Made 250 sentences in total, stopping this here.
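  • Tried finding parallel corpora:
    • https://opus.nlpl.eu/
      • KDE4: Not good: https://opus.nlpl.eu/KDE4/v2/kk-uz_sample.html
      • MozillaI10: Nope
      • QED:
        • kk-uz.tmx file
        • Good; will use this, but it needs manual selection and corrections to the translations.
      • TED: the same file as QED.
      • The rest is also not good, just one-word translation mostly.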
  • Lexical selection rules for kaz-uzb
    • Created a script to analyse bidix and find ambiguous stems.
    • Created a script that generates lexical selection rules from manually chosen stems
    • Rules added in kaz-uzb.lrx for 1565 words.
    • Rules added in uzb-kaz.lrx for 1295 words.
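
The bidix analysis step described above can be sketched roughly as follows. This is not the author's script: it assumes the usual bidix layout of `<e><p><l>…</l><r>…</r></p></e>` entries, and the function names and the toy dictionary are made up for illustration. A stem is treated as ambiguous when the same left-side lemma and first tag map to more than one right-side lemma.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def side_text(node):
    """Lemma of an <l> or <r> element: <b/> becomes a space, <s/> tags are skipped."""
    parts = [node.text or ""]
    for child in node:
        if child.tag == "b":
            parts.append(" ")
        parts.append(child.tail or "")
    return "".join(parts).strip()


def first_tag(node):
    """Name of the first <s n="..."/> tag, or an empty string if there is none."""
    symbol = node.find("s")
    return symbol.get("n") if symbol is not None else ""


def ambiguous_stems(bidix_xml):
    """Map (left lemma, first tag) to its translations when there are several."""
    translations = defaultdict(set)
    root = ET.fromstring(bidix_xml)
    for pair in root.iter("p"):
        left, right = pair.find("l"), pair.find("r")
        if left is None or right is None:
            continue
        translations[(side_text(left), first_tag(left))].add(side_text(right))
    return {key: sorted(val) for key, val in translations.items() if len(val) > 1}


# Toy entries for illustration, not copied from apertium-kaz-uzb.
toy_bidix = """
<dictionary>
  <section id="main" type="standard">
    <e><p><l>бет<s n="n"/></l><r>bet<s n="n"/></r></p></e>
    <e><p><l>бет<s n="n"/></l><r>yuz<s n="n"/></r></p></e>
    <e><p><l>су<s n="n"/></l><r>suv<s n="n"/></r></p></e>
  </section>
</dictionary>
"""
print(ambiguous_stems(toy_bidix))
```

The resulting list is what would then be reviewed by hand to choose a default translation for each stem.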
  • Apertium-dixtools:
    • Fixing bidix (deduplicating, removing empty lines):
   apertium-dixtools fix apertium-kaz-uzb.kaz-uzb.dix apertium-kaz-uzb.kaz-uzb.dix.fixed
    • Sorting bidix (aligning too):
   apertium-dixtools sort -alignBidix -ignorecase apertium-kaz-uzb.kaz-uzb.dix apertium-kaz-uzb.kaz-uzb.dix.sorted
  • Sorting+Deduplicating the bidix for future purposes:
    • Bidix count before deduplication: 11008
    • Fixing the bidix:
      • Removed 1501 entries in section main
    • Bidix count after deduplication: 9507
    • Sorted+Regrouped the dix.
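
The counts above are consistent: 11008 − 1501 = 9507. For illustration, the sketch below shows one simple way duplicate entries could be removed from a section while reporting the same three numbers. It is an assumption-laden stand-in, not what apertium-dixtools does internally: two entries are treated as duplicates only when their serialised XML is identical, and the function name and toy dictionary are invented.

```python
import xml.etree.ElementTree as ET


def deduplicate_section(section):
    """Remove repeated <e> entries from a section in place; return how many were removed."""
    seen = set()
    removed = 0
    for entry in list(section.findall("e")):
        # Serialise the entry (whitespace around it is ignored) to use as its identity.
        key = ET.tostring(entry, encoding="unicode").strip()
        if key in seen:
            section.remove(entry)
            removed += 1
        else:
            seen.add(key)
    return removed


# Toy dictionary for illustration only.
toy_bidix = """
<dictionary>
  <section id="main" type="standard">
    <e><p><l>су<s n="n"/></l><r>suv<s n="n"/></r></p></e>
    <e><p><l>тас<s n="n"/></l><r>tosh<s n="n"/></r></p></e>
    <e><p><l>су<s n="n"/></l><r>suv<s n="n"/></r></p></e>
  </section>
</dictionary>
"""
root = ET.fromstring(toy_bidix)
main = root.find("section")
before = len(main.findall("e"))
removed = deduplicate_section(main)
after = len(main.findall("e"))
print(f"before: {before}, removed: {removed}, after: {after}")
```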
  • Made a script to calculate DixCount, Coverage, WER/PER at once
  • Calculating WER/PER:
   $ cat kaz-big.txt | apertium -d ../ kaz-uzb > kaz-uzb.txt
   $ cat uzb-big.txt | apertium -d ../ uzb-kaz > uzb-kaz.txt
   $ apertium-eval-translator -ref uzb-big.txt -test kaz-uzb.txt
   $ apertium-eval-translator -ref kaz-big.txt -test uzb-kaz.txt
    • Parallel text:
      • JaM Story:
      • “Azamat va Oygul” in our case;
      • kaz-uzb/texts/[kaz|uzb].txt
    • Astana article from Kazakh Wiki
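
The per-line WER script mentioned at the top of this list is not reproduced here, so the following is only a sketch of the idea: compute a word-level edit distance between each reference line and the corresponding translated line, divide by the reference length, and list the worst lines first. The tokenisation is plain whitespace splitting, which may differ from what apertium-eval-translator does, and the sample sentences are invented.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of one hypothesis line against one reference line."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def worst_lines(ref_lines, test_lines, top=3):
    """Return (wer, line number, test line) tuples, highest WER first."""
    scored = [
        (wer(ref, test), number, test)
        for number, (ref, test) in enumerate(zip(ref_lines, test_lines), start=1)
    ]
    return sorted(scored, key=lambda item: -item[0])[:top]


# Invented sample lines; the second test line contains an untranslated word.
refs = ["men kitob o'qidim", "u uyga ketdi"]
tests = ["men kitob o'qidim", "u *үйге ketdi"]
for score, number, line in worst_lines(refs, tests):
    print(f"{score:.2f}\tline {number}\t{line}")
```

Sorting by score puts the sentences that contribute most to the overall WER at the top, which is where missing stems or transfer rules are easiest to spot.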
  • Calculated dix Coverage:
    • (kaz-uzb/texts): bash ../../coverage-ltproc-new.sh ../docs/kaz-wiki.txt ../kaz-uzb.automorf.bin
    • coverage: 26752095 / 32305875 (~0.82808761564266561423)
    • remaining unknown forms: 5553780
    • kaz-wiki.txt Sun Jul 11 11:55:25 CEST 2021
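
The coverage figure above is the share of analysed forms that received at least one analysis: 26752095 / 32305875 ≈ 0.828, with 32305875 − 26752095 = 5553780 forms left unknown. The contents of coverage-ltproc-new.sh are not shown in this report, so the sketch below only illustrates that calculation on text already in the Apertium stream format, where unknown forms are marked with `*`. The function name and sample string are hypothetical.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/^$]+)/([^$]*)\$")


def coverage(analysed_text):
    """Return (known, total) counts of lexical units in analysed text."""
    known = total = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        if not analyses.startswith("*"):
            known += 1
    return known, total


# Illustrative input only, not real analyser output.
sample = "^Бұл/бұл<prn><dem><nom>$ ^үлкен/үлкен<adj>$ ^қала/қала<n><nom>$ ^көл/*көл$"
known, total = coverage(sample)
print(f"coverage: {known} / {total} (~{known / total:.4f})")
print(f"remaining unknown forms: {total - known}")

# The figures reported above, checked the same way.
reported_known, reported_total = 26752095, 32305875
print(reported_total - reported_known, reported_known / reported_total)
```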
  • Counting Dix elements:
    • Apertium-Eval: dixcounter.py:
   python3 ../dixcounter.py apertium-kaz-uzb.kaz-uzb.dix
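  • Extract Kazakh wikipedia:
    • Kazakh wiki date: 01.05.2021
      • https://dumps.wikimedia.org/kkwiki/20210501/
    • Apertium-tools: WikiExtractor:
   python3 WikiExtractor.py --infn kkwiki-20210501-pages-articles-multistream.xml.bz2
    • Saved resulting corpus in docs/kaz-wiki.txt, ~24M tokens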
  • Big Crossdix action:
    • Crossing dictionaries: kaz-kaa and kaa-uzb
    • Crossing dictionaries: kaz-tur and tur-uzb
    • Merge the obtained crossdix outputs from two
    • Sort the merged file, removing duplicates
    • Align the result for better visibility
    • Manually check every translation, remove if bad, correct/add if necessary
    • Added 5000+ words from this.
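
The crossing steps above can be pictured with a small sketch. It is a conceptual illustration only and does not reproduce the crossdics tool: each dictionary is reduced to plain (source, target) lemma pairs, two dictionaries are joined on the shared pivot language, and the two crossings are merged, deduplicated and sorted. All word pairs and function names are invented for the example, and every candidate would still need the manual check described above.

```python
from collections import defaultdict


def cross(src_pivot, pivot_tgt):
    """Join (source, pivot) pairs with (pivot, target) pairs on the pivot word."""
    by_pivot = defaultdict(set)
    for pivot, target in pivot_tgt:
        by_pivot[pivot].add(target)
    return {
        (source, target)
        for source, pivot in src_pivot
        for target in by_pivot.get(pivot, ())
    }


def merge_sorted(*crossings):
    """Merge several crossing results, dropping duplicates and sorting the pairs."""
    return sorted(set().union(*crossings))


# Toy word lists standing in for the kaz-kaa, kaa-uzb, kaz-tur and tur-uzb bidixes.
kaz_kaa = [("су", "suw"), ("кітап", "kitap")]
kaa_uzb = [("suw", "suv"), ("kitap", "kitob")]
kaz_tur = [("су", "su"), ("тас", "taş")]
tur_uzb = [("su", "suv"), ("taş", "tosh")]

via_kaa = cross(kaz_kaa, kaa_uzb)
via_tur = cross(kaz_tur, tur_uzb)
candidates = merge_sorted(via_kaa, via_tur)
for kaz, uzb in candidates:
    print(kaz, uzb, sep="\t")
```

A pair found through both pivots, such as the first toy entry, appears only once after merging.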
  • Start collecting the kaz-uzb bilingual dictionary
  • Convert the pair to apertium-recursive
    • Done!!!, but I had to remove the entire bilingual repo and recreate it from scratch.
  • Write some lexical selection rules in kaz-uzb.
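
For reference, a lexical selection rule in an .lrx file pairs a `<match>` on the source lemma with a `<select>` of the preferred target lemma. The sketch below generates such rules from a table of manually chosen translations, in the spirit of the rule-generating script mentioned earlier in this list. It is an assumption about how such a script might look, not the author's code; the function name, the tag pattern `n.*` and the chosen pair are illustrative.

```python
import xml.etree.ElementTree as ET


def build_lrx(choices):
    """Build a <rules> element from {(source lemma, tag pattern): chosen target lemma}."""
    rules = ET.Element("rules")
    for (src_lemma, tags), tgt_lemma in sorted(choices.items()):
        rule = ET.SubElement(rules, "rule")
        match = ET.SubElement(rule, "match", lemma=src_lemma, tags=tags)
        ET.SubElement(match, "select", lemma=tgt_lemma)
    return rules


# Hypothetical manual choice for one ambiguous stem.
choices = {("бет", "n.*"): "yuz"}
lrx_text = ET.tostring(build_lrx(choices), encoding="unicode")
print(lrx_text)
```

Each generated rule makes the chosen translation the default for that stem; context-dependent rules would still have to be written by hand.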
  • Translate small text (James & Mary story)
  • Translating a text file:
   cat texts/kaz.txt | apertium -d . -f line kaz-uzb
  • Translation of a sentence:
   echo 'Сәлем Әлем' | apertium -d . kaz-uzb
  • Bootstrapping a new language pair apertium-kaz-uzb
    • Installed Apertium-init from pip
      • Had a problem, solved it (thanks to @popcorndude).
    • Downloaded apertium-init.py
      • Did not work
   python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2
   cd apertium-kaz-uzb
   ./autogen.sh --with-lang1=../apertium-kaz --with-lang2=../apertium-uzb && make
    • I had to convert it to apertium-recursive, so further flags to be added:
      • -t rtx
  • Final initialization command:
   python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2 -t rtx
  • Forked(&Installed) necessary repos on GitHub:
    • Apertium-kaz
      • git clone git@github.com:kamush901/apertium-kaz.git
      • Works well
    • Apertium-uzb
      • git clone git@github.com:kamush901/apertium-uzb.git
      • Works well
  • Installed Apertium and necessary tools:
    • Installed Apertium core using packaging
   wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
   sudo apt-get -f install apertium-all-dev