Difference between revisions of "User:Kamush/GSoC2021ProgresReport"

Latest revision as of 11:51, 31 August 2021

Progress Report

Time Period	Goal	Bidix (kaz-uzb)	Coverage (kaz-uzb)	WER / PER (kaz-uzb)	WER / PER (uzb-kaz)	Details/Comments
Community Bonding Period, May 17-June 5	Install Apertium; initialize the kaz-uzb pair; collect data in both languages	426 (+426)	43.80%	-	-	Installed Apertium and necessary tools; cloned apertium-kaz and apertium-uzb and initialized the kaz-uzb pair; translated a small sample text; extracted Uzbek and Kazakh wiki corpora; collected a Kazakh-Uzbek dictionary and parallel corpora.
Week 1, June 6-12	Make Uzbek better	2220 (+1794)	52.11%	-	-	Went through all Uzbek and Kazakh stems; initialized the pair with apertium-recursive; collected dictionaries from other pairs for crossdic; obtained crossdic results in two ways.
Week 2, June 13-19	Expand bilingual dictionary	5262 (+3042)	77.03%	74.77% / 67.57%	64.23% / 54.37%	Started adding bilingual dictionary entries.
Week 3, June 20-26	More on .dix and .lrx	8543 (+3281)	81.55%	74.77% / 67.57%	64.23% / 54.37%	Expanded the bilingual dictionary; started sample lexical selection rules.
Week 4, June 27-July 3	Focus on transfer rules	9432 (+889)	81.85%	74.77% / 67.57%	64.23% / 54.37%	Expanded the bilingual dictionary further.
Week 5, July 4-10	Test translator and expand more	11008 (+1576)	82.81%	74.77% / 67.57%	64.23% / 54.37%	Expanded the bilingual dictionary; collected texts for lexical selection rules and tried a small script; translated a big Kazakh text into Uzbek for better WER/PER calculation.
Week 6, July 11-17	Focus more on transfer rules	+191	84.46%	71.83% / 63.52%	67.24% / 60.31%	Added more stems to the bidix to reach 85% coverage; added 1500+ lexical selection rules to kaz-uzb.lrx for ambiguous translations; added 1200+ lexical selection rules to uzb-kaz.lrx for ambiguous translations; prepared 200 sentences of parallel corpora.
Week 7, July 18-24	Test the kaz-uzb translator	+219	84.58%	69.21% / 60.69%	67.10% / 59.89%	Started the transfer rules; some more bidix; some lexical selection rules.
Week 8, July 25-31	Focus on transfer rules	+500	85.17%	69.21% / 60.69%	67.10% / 59.89%	Some additions to apertium-kaz and apertium-uzb; some transfer rules; some more lexical selection rules and bidix.
Week 9, August 1-7	Focus on testvoc	+1000	86.02%	71.37% / 64.81%	61.93% / 56.05%	More frequent words added to the dictionary.
Week 10, August 8-14	Finalize work	+400	86.02%	71.37% / 64.81%	61.93% / 56.05%	More post-editing done.

TODO

More transfer rules
Testvoc

ONGOING

Transfer rules
Some lexical selection rules
Collecting more bidix
- Adding only most-frequent words from kazakh wiki hitparade until 85% coverage is reached
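The hitparade approach can be sketched roughly as follows. This is a minimal illustration, not the script used in the project: the function name `forms_needed` and its inputs are hypothetical, and it assumes that an unknown form becomes fully covered once it is added (in practice the bidix holds stems, so one stem can cover several surface forms).

```python
from typing import List, Tuple


def forms_needed(known: int, total: int,
                 unknown_counts: List[Tuple[str, int]],
                 target: float = 0.85) -> List[str]:
    """Pick the most frequent unknown forms until coverage reaches `target`.

    known          -- number of corpus tokens that already get an analysis
    total          -- total number of corpus tokens
    unknown_counts -- (form, frequency) pairs for the unknown forms
    """
    chosen = []
    # Most frequent forms first: each one buys the largest coverage gain.
    for form, count in sorted(unknown_counts, key=lambda fc: fc[1], reverse=True):
        if known / total >= target:
            break
        chosen.append(form)
        known += count  # assumption: the form is fully covered once added
    return chosen
```

Working down a frequency-sorted list keeps the number of new entries small for a given coverage target.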

DONE (+Notes & Comments)

Made a script to calculate WER for each line of the kaz-uzb.txt file to see what sentences are causing the most problems.
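A per-line WER script of this kind might look like the sketch below. It is not the original script: the function names are hypothetical, tokens are split on whitespace only, and WER is taken as the word-level edit distance divided by the reference length.

```python
from typing import Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def wer_per_line(ref_lines: Iterable[str],
                 hyp_lines: Iterable[str]) -> List[Tuple[float, int]]:
    """Return (WER, line number) pairs, worst lines first."""
    scores = []
    for n, (ref, hyp) in enumerate(zip(ref_lines, hyp_lines), start=1):
        ref_tok, hyp_tok = ref.split(), hyp.split()
        if not ref_tok:
            continue  # skip empty reference lines to avoid dividing by zero
        scores.append((edit_distance(ref_tok, hyp_tok) / len(ref_tok), n))
    return sorted(scores, reverse=True)
```

Sorting worst-first puts the sentences that cause the most problems at the top of the list.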
Translating big Kaz text into Uzb
- Aside from those small texts obtained from parallel corpora
- For better WER/PER calculation
- For checking transfer rules
- Started with readily available small JaM Story
  - Azamat and Oygul story in our case
  - Added 47 sentences from that.
- Chose Nur-Sultan(capital city) article of Kazakh Wiki for that.
  - Split the article into sentences, translated into uzb manually.
  - Added 112 sentences out of Nur-Sultan text.
- Used the QED parallel corpus from the internet.
  - Manually selected sentences and corrected their translations.
  - Added 90 sentences from this.
- Made 250 sentences in total, stopping this here.
Tried finding parallel corpora:
- https://opus.nlpl.eu/
  - KDE4: Not good: https://opus.nlpl.eu/KDE4/v2/kk-uz_sample.html
  - Mozilla-I10n: Nope
  - QED:
    - kk-uz.tmx file
    - Good; going to use this, but it needs manual sentence selection and translation corrections.
  - TED: the same file as QED.
  - The rest is also not good, just one-word translation mostly.
Lexical selection rules for kaz-uzb
- Created a script to analyse bidix and find ambiguous stems.
- Created a script that generates lexical selection rules from manually chosen stems
- Rules added in kaz-uzb.lrx for 1565 words.
- Rules added in uzb-kaz.lrx for 1295 words.
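The bidix analysis step can be sketched as below. This is a hedged illustration rather than the project's script: `side_key` and `ambiguous_stems` are hypothetical names, the code assumes plain `<e><p><l>…</l><r>…</r></p></e>` entries, and it only skips entries marked `r="RL"` when looking in the left-to-right direction.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def side_key(node):
    """Return (lemma, tags) for an <l> or <r> element; <b/> becomes a space."""
    parts, tags = [node.text or ""], []
    for child in node:
        if child.tag == "b":
            parts.append(" ")
        elif child.tag == "s":
            tags.append(child.get("n", ""))
        parts.append(child.tail or "")
    return "".join(parts), ".".join(tags)


def ambiguous_stems(dix_xml: str):
    """Map each left-side (lemma, tags) to its translations if it has several."""
    root = ET.fromstring(dix_xml)
    translations = defaultdict(set)
    for e in root.iter("e"):
        if e.get("r") == "RL":
            continue  # entry is only used right-to-left
        left, right = e.find("p/l"), e.find("p/r")
        if left is None or right is None:
            continue  # e.g. <i> or regex entries, ignored in this sketch
        translations[side_key(left)].add(side_key(right))
    return {k: sorted(v) for k, v in translations.items() if len(v) > 1}
```

The stems returned here are only candidates; which translation should become the default still has to be chosen manually, as described above.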
Apertium-dixtools:
- https://wiki.apertium.org/wiki/Apertium-dixtools
- Fixing bidix(deduplicating, removing empty lines):

   apertium-dixtools fix apertium-kaz-uzb.kaz-uzb.dix apertium-kaz-uzb.kaz-uzb.dix.fixed

- Sorting bidix(aligning too):

   apertium-dixtools sort -alignBidix -ignorecase apertium-kaz-uzb.kaz-uzb.dix apertium-kaz-uzb.kaz-uzb.dix.sorted

Sorting+Deduplicating the bidix for future purposes:
- Bidix count before deduplication: 11008
- Fixing the bidix:
  - Removed 1501 entries in section main
- Bidix count after deduplication: 9507
- Sorted+Regrouped the dix.
Made a script to calculate DixCount, Coverage, WER/PER at once
Calculating WER/PER:
- Apertium-eval-translator:
  - https://github.com/apertium/apertium-eval-translator

   $ cat kaz-big.txt | apertium -d ../ kaz-uzb > kaz-uzb.txt
   $ cat uzb-big.txt | apertium -d ../ uzb-kaz > uzb-kaz.txt
   $ apertium-eval-translator -ref uzb-big.txt -test kaz-uzb.txt
   $ apertium-eval-translator -ref kaz-big.txt -test uzb-kaz.txt

- Parallel text:
  - JaM Story:
  - “Azamat va Oygul” in our case;
  - kaz-uzb/texts/[kaz|uzb].txt
- Astana article from Kazakh Wiki
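For reference, PER (position-independent error rate) compares the two sentences as bags of words, which is why it is never higher than WER in the table above. The sketch below uses one common definition, (max(N_ref, N_hyp) − matches) / N_ref; this is an assumption for illustration, and apertium-eval-translator may compute the figure slightly differently.

```python
from collections import Counter


def per(ref: str, hyp: str) -> float:
    """Position-independent error rate of `hyp` against `ref`."""
    ref_tok, hyp_tok = ref.split(), hyp.split()
    if not ref_tok:
        raise ValueError("reference sentence is empty")
    # Multiset intersection: words shared by both sides, ignoring order.
    matches = sum((Counter(ref_tok) & Counter(hyp_tok)).values())
    errors = max(len(ref_tok), len(hyp_tok)) - matches
    return errors / len(ref_tok)
```

A translation with the right words in the wrong order scores 0 here but is penalised by WER, so a large gap between the two figures points to word-order problems.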
Calculated dix Coverage:
- (kaz-uzb/texts): bash ../../coverage-ltproc-new.sh ../docs/kaz-wiki.txt ../kaz-uzb.automorf.bin
- coverage: 26752095 / 32305875 (~0.82808761564266561423)
- remaining unknown forms: 5553780
- kaz-wiki.txt Sun Jul 11 11:55:25 CEST 2021
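The coverage figure is the share of corpus tokens that receive at least one analysis. A naive version of that count over analyser output in the Apertium stream format (`^surface/analysis$`, with unknown words shown as `^word/*word$`) could look like this. It is only a sketch: the function name is hypothetical, escaped characters in the stream are not handled, and the actual coverage-ltproc-new.sh script may work differently.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(stream: str):
    """Return (known, total) token counts for analyser output."""
    known = total = 0
    for _surface, analyses in LEXICAL_UNIT.findall(stream):
        total += 1
        if not analyses.startswith("*"):  # '*' marks an unknown word
            known += 1
    return known, total


# The figures reported above: 26752095 known out of 32305875 tokens.
ratio = 26752095 / 32305875
```

Here `ratio` is about 0.828, matching the 82.81% coverage reported for Week 5.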
Counting Dix elements:
- Apertium-Eval: dixcounter.py:

   python3 ../dixcounter.py apertium-kaz-uzb.kaz-uzb.dix

- July 09: 11008 dix elements before deduplication.
Translating kaz-uig.dix into kaz-uzb
Translating kaz-kaa.dix into kaz-uzb.dix
- Removing those that were already done by crossdic
- Changing karakalpak translation into uzbek one by looking at both kazakh and karakalpak words
Translating kaz-tur.dix into kaz-uzb.dix
- Removing those that were already done by crossdic
- Changing turkish translation into uzbek one by looking at both kazakh and turkish words
- Added 3200 more words from this.
Extract Kazakh wikipedia:
- Kazakh wiki date: 01.05.2021
  - https://dumps.wikimedia.org/kkwiki/20210501/
- Apertium-tools: WikiExtractor:
  - Wiki: https://wiki.apertium.org/wiki/WikiExtractor
  - Code: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py

   python3 WikiExtractor.py --infn kkwiki-20210501-pages-articles-multistream.xml.bz2

- Saved resulting corpus in docs/kaz-wiki.txt, ~24M tokens
Big Crossdix action:
- Crossing dictionaries: kaz-kaa and kaa-uzb
- Crossing dictionaries: kaz-tur and tur-uzb
- Merge the crossdix outputs obtained from the two crossings
- Sort the merged file, removing duplicates
- Align the result for better visibility
- Manually check every translation, remove if bad, correct/add if necessary
- Added 5000+ words from this.
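The crossing idea can be illustrated on plain word pairs. This is a simplified sketch, not how the crossdic tool itself works: it ignores tags and direction restrictions, and the function names and sample words are illustrative only.

```python
from collections import defaultdict


def cross(src_pivot, pivot_tgt):
    """Join (source, pivot) pairs with (pivot, target) pairs on the pivot word."""
    by_pivot = defaultdict(set)
    for pivot, tgt in pivot_tgt:
        by_pivot[pivot].add(tgt)
    return {(src, tgt)
            for src, pivot in src_pivot
            for tgt in by_pivot.get(pivot, ())}


def merge(*crossed):
    """Union several crossing results, deduplicated and sorted."""
    return sorted(set().union(*crossed))


# Illustrative entries only, not taken from the real dictionaries.
via_kaa = cross([("кітап", "kitap"), ("су", "suw")],
                [("kitap", "kitob"), ("suw", "suv")])
via_tur = cross([("кітап", "kitap"), ("су", "su")],
                [("kitap", "kitob"), ("su", "suv")])
candidates = merge(via_kaa, via_tur)
```

Pairs found through both pivots collapse into one candidate; every candidate still needs the manual check described above, since a pivot word can be ambiguous.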
Start collecting the kaz-uzb bilingual dictionary
Convert the pair to apertium-recursive
- Done!!!, but I had to remove the entire bilingual repo and recreate it from scratch.
Write some lexical selection rules in kaz-uzb.
- Wiki: https://wiki.apertium.org/wiki/How_to_get_started_with_lexical_selection_rules
- Definitely needed samples:
  - The closest sample is apertium-kaz-tur
  - https://github.com/apertium/apertium-kaz-tur
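Following the rule format described on the wiki page above, a generator for simple "always select this translation" rules could be sketched as below. The function names and the sample stem are hypothetical, and real rules often need context patterns that this sketch does not produce.

```python
from xml.sax.saxutils import quoteattr


def lrx_rule(lemma: str, tags: str, chosen: str) -> str:
    """One rule: for source `lemma` with `tags`, select translation `chosen`."""
    return (
        "  <rule>\n"
        f"    <match lemma={quoteattr(lemma)} tags={quoteattr(tags)}>\n"
        f"      <select lemma={quoteattr(chosen)}/>\n"
        "    </match>\n"
        "  </rule>"
    )


def lrx_file(choices) -> str:
    """Wrap rules for (lemma, tags, chosen) triples in a <rules> element."""
    body = "\n".join(lrx_rule(lemma, tags, chosen)
                     for lemma, tags, chosen in choices)
    return f"<rules>\n{body}\n</rules>\n"


# Illustrative choice only: prefer "sahifa" for the noun "бет".
print(lrx_file([("бет", "n.*", "sahifa")]))
```

`quoteattr` handles the quoting and escaping of attribute values, so lemmas containing characters such as `&` or quotes still produce well-formed XML.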
Translate small text (James & Mary story)
- Source: https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora/jam
- James&Mary story was downloaded from the source.
- apertium-kaz-uzb/texts/
- Updated as Azamat & Oygul story
Translating a text file:

   cat texts/kaz.txt | apertium -d . -f line kaz-uzb

Translation of a sentence:

   echo 'Сәлем Әлем' | apertium -d . kaz-uzb

Bootstrapping a new language pair apertium-kaz-uzb
- Installed Apertium-init from pip
  - Had a problem, solved it (thanks to @popcorndude).
- Downloaded apertium-init.py
  - Did not work

   python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2
   cd apertium-kaz-uzb
   ./autogen.sh --with-lang1=../apertium-kaz --with-lang2=../apertium-uzb && make

- I had to convert it to apertium-recursive, so further flags to be added:
  - -t rtx
Final initialization command:

   python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2 -t rtx

Forked(&Installed) necessary repos on GitHub:
- Apertium-kaz
  - git clone git@github.com:kamush901/apertium-kaz.git
  - Works well
- Apertium-uzb
  - git clone git@github.com:kamush901/apertium-uzb.git
  - Works well
Installed Apertium and necessary tools:
- Installed Apertium core using packaging

   wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
   sudo apt-get -f install apertium-all-dev

@@ Line 1: / Line 1: @@
+==Progress Report==
-Report is coming really soon
+{| class="wikitable" border="1"
+!rowspan="2"|Time Period
+!rowspan="2"|Goal
+!Bidix
+!Coverage
+!colspan="2"|WER,PER
+!rowspan="2"|Details/Comments
+|-
+!kaz-uzb
+!kaz-uzb
+!kaz-uzb
+!uzb-kaz
+|-
+|Community Bonding Period
+May 17-June 5
+|
+* Installed Apertium
+* Initialize kaz-uzb pair
+* Collect data in both languages
+| style = "text-align: center;" | 426
+(+426)
+| style = "text-align: center;" | 43.80 %
+| -
+| -
+|
+* Installed Apertium and necessary tools;
+* Cloned Apertium-kaz and apertium-uzb, initialized the kaz-uzb pair
+* Translated a small sample text;
+* Extracted Uzbek and Kazakh wiki corpus;
+* Collected Kazakh-Uzbek dictionary and parallel corpora;
+|-
+|Week 1
+June 6-12
+|Make Uzbek better
+| style = "text-align: center;" | 2220
+(+1794)
+| style = "text-align: center;" | 52.11 %
+| -
+| -
+|
+* Went through all Uzbek and Kazakh stems;
+* Initialized the pair with apertium-recursive;
+* Collected dictionaries from other pairs for crossdic;
+* Obtained crossdic results in two ways.
+|-
+|Week 2
+June 13-19
+| Expand bilingual dictionary
+| style = "text-align: center;" | 5262
+(+3042)
+| style = "text-align: center;" | 77.03 %
+| 74.77% / 67.57%
+| 64.23% / 54.37%
+|
+* Started adding bilingual dictionary elements;
+|-
+|Week 3
+June 20-26
+| More on .dix and .lrx
+| style = "text-align: center;" | 8543
+(+3281)
+| style = "text-align: center;" | 81.55 %
+| 74.77% / 67.57%
+| 64.23% / 54.37%
+|
+* Expanded bilingual dictionary;
+* Started sample Lexical selection rules;
+|-
+|Week 4
+June 27-July 3
+|Focus on transfer rules
+| style = "text-align: center;" | 9432
+(+889)
+| style = "text-align: center;" | 81.85 %
+| 74.77% / 67.57%
+| 64.23% / 54.37%
+|
+* Expanded bilingual dictionary more;
+|-
+|Week 5
+July 4-10
+|Test translator and expand more
+| style = "text-align: center;" | 11008
+(+1576)
+| style = "text-align: center;" | 82.81%
+| 74.77% / 67.57%
+| 64.23% / 54.37%
+|
+* Expanded bilingual dictionary;
+* Collected texts for lexical selection rules, tried a small script;
+* Translated a big Kazakh text into Uzbek for better WER/PER calculation.
+|-
+|Week 6
+July 11-17
+|Focus more on transfer rules
+| style = "text-align: center;" | +191
+| style = "text-align: center;" | 84.46 %
+| 71.83% / 63.52%
+| 67.24% / 60.31%
+|
+* Added more stems to the bidix to reach 85% coverage
+* Added 1500+ Lexical Selection Rules to kaz-uzb.lrx for ambiguous translations
+* Added 1200+ Lexical Selection Rules to uzb-kaz.lrx for ambiguous translations
+* Prepared 200 sentences of parallel corpora.
+|-
+|Week 7
+July 18-24
+|Test the kaz-uzb translator
+| style = "text-align: center;" | +219
+| style = "text-align: center;" | 84.58 %
+| 69.21% / 60.69%
+| 67.10% / 59.89%
+|
+* Started the transfer rules
+* Some more bidix
+* Some Lexical selection rules
+|-
+|Week 8
+July 25-31
+|Focus on transfer rules
+| style = "text-align: center;" | +500
+| style = "text-align: center;" | 85.17 %
+| 69.21% / 60.69%
+| 67.10% / 59.89%
+|
+* Some additions to apertium-kaz & apertium-uzb
+* Some transfer rules
+* Some more lexical selection rules & bidix
+|-
+|Week 9
+August 1-7
+|Focus on testvoc
+| style = "text-align: center;" | +1000
+| style = "text-align: center;" | 86.02 %
+| 71.37% / 64.81%
+| 61.93% / 56.05%
+|
+* More frequent words added to the dictionary
+|-
+|Week 10
+August 8-14
+|Finalize work
+| style = "text-align: center;" | +400
+| style = "text-align: center;" | 86.02 %
+| 71.37% / 64.81%
+| 61.93% / 56.05%
+|
+* More post-editing done
+|-
+|}
+== TODO ==
+* More transfer rules
+* Testvoc
+== ONGOING ==
+* Transfer rules
+* Some lexical selection rules
+* Collecting more bidix
+** Adding only most-frequent words from kazakh wiki hitparade until  85% coverage is reached
+== DONE (+Notes & Comments) ==
+* Made a script to calculate WER for each line of the kaz-uzb.txt file to see what sentences are causing the most problems.
+* Translating big Kaz text into Uzb
+** Aside from those small texts obtained from parallel corpora
+** For better WER/PER calculation
+** For checking transfer rules
+** Started with readily available small JaM Story
+*** Azamat and Oygul story in our case
+*** Added 47 sentences from that.
+** Chose Nur-Sultan(capital city) article of Kazakh Wiki for that.
+*** Split the article into sentences, translated into uzb manually.
+*** Added 112 sentences out of Nur-Sultan text.
+** Used the QED parallel corpus from the internet.
+*** Manually selected sentences and corrected their translations.
+*** Added 90 sentences from this.
+** Made 250 sentences in total, stopping this here.
+* Tried finding parallel corpora:
+** ''https://opus.nlpl.eu/''
+*** KDE4: Not good: ''https://opus.nlpl.eu/KDE4/v2/kk-uz_sample.html''
+*** Mozilla-I10n: Nope
+*** '''QED''':
+**** kk-uz.tmx file
+**** Good; going to use this, but it needs manual sentence selection and translation corrections.
+*** TED: the same file as QED.
+*** The rest is also not good, just one-word translation mostly.
+* Lexical selection rules for kaz-uzb
+** Created a script to analyse bidix and find ambiguous stems.
+** Created a script that generates lexical selection rules from manually chosen stems
+** Rules added in kaz-uzb.lrx for 1565 words.
+** Rules added in uzb-kaz.lrx for 1295 words.
+* '''Apertium-dixtools:'''
+** ''https://wiki.apertium.org/wiki/Apertium-dixtools''
+** Fixing bidix(deduplicating, removing empty lines):
+    apertium-dixtools fix apertium-kaz-uzb.kaz-uzb.dix apertium-kaz-uzb.kaz-uzb.dix.fixed
+** Sorting bidix(aligning too):
+    apertium-dixtools sort -alignBidix -ignorecase apertium-kaz-uzb.kaz-uzb.dix apertium-kaz-uzb.kaz-uzb.dix.sorted
+* Sorting+Deduplicating the bidix for future purposes:
+** Bidix count before deduplication: 11008
+** Fixing the bidix:
+*** Removed 1501 entries in section main
+** Bidix count after deduplication: 9507
+** Sorted+Regrouped the dix.
+* Made a script to calculate DixCount, Coverage, WER/PER at once
+* '''Calculating WER/PER:'''
+** Apertium-eval-translator:
+*** ''https://github.com/apertium/apertium-eval-translator''
+    $ cat kaz-big.txt | apertium -d ../ kaz-uzb > kaz-uzb.txt
+    $ cat uzb-big.txt | apertium -d ../ uzb-kaz > uzb-kaz.txt
+    $ apertium-eval-translator -ref uzb-big.txt -test kaz-uzb.txt
+    $ apertium-eval-translator -ref kaz-big.txt -test uzb-kaz.txt
+** '''Parallel text:'''
+*** JaM Story:
+*** “Azamat va Oygul” in our case;
+*** kaz-uzb/texts/[kaz|uzb].txt
+** Astana article from Kazakh Wiki
+* '''Calculated dix Coverage:'''
+** (kaz-uzb/texts): bash ../../coverage-ltproc-new.sh ../docs/kaz-wiki.txt ../kaz-uzb.automorf.bin
+** coverage: 26752095 / 32305875 (~0.82808761564266561423)
+** remaining unknown forms: 5553780
+** kaz-wiki.txt Sun Jul 11 11:55:25 CEST 2021
+* '''Counting Dix elements:'''
+** Apertium-Eval: dixcounter.py:
+    python3 ../dixcounter.py apertium-kaz-uzb.kaz-uzb.dix
+** July 09: 11008 dix elements before deduplication.
+* Translating kaz-uig.dix into kaz-uzb
+* Translating kaz-kaa.dix into kaz-uzb.dix
+** Removing those that were already done by crossdic
+** Changing karakalpak translation into uzbek one by looking at both kazakh and karakalpak words
+* Translating kaz-tur.dix into kaz-uzb.dix
+** Removing those that were already done by crossdic
+** Changing turkish translation into uzbek one by looking at both kazakh and turkish words
+** Added 3200 more words from this.
+* '''Extract Kazakh wikipedia:'''
+** Kazakh wiki date: 01.05.2021
+*** https://dumps.wikimedia.org/kkwiki/20210501/
+** Apertium-tools: WikiExtractor:
+*** Wiki: ''https://wiki.apertium.org/wiki/WikiExtractor''
+*** Code: ''https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py''
+    python3 WikiExtractor.py --infn kkwiki-20210501-pages-articles-multistream.xml.bz2
+** Saved resulting corpus in docs/kaz-wiki.txt, ~24M tokens
+* '''Big Crossdix action:'''
+** Crossing dictionaries: kaz-kaa and kaa-uzb
+** Crossing dictionaries: kaz-tur and tur-uzb
+** Merge the crossdix outputs obtained from the two crossings
+** Sort the merged file, removing duplicates
+** Align the result for better visibility
+** Manually check every translation, remove if bad, correct/add if necessary
+** Added 5000+ words from this.
+* Start collecting the kaz-uzb bilingual dictionary
+* '''Convert the pair to apertium-recursive'''
+** Done!!!, but I had to remove the entire bilingual repo and recreate it from scratch.
+* Write some lexical selection rules in kaz-uzb.
+** Wiki: ''https://wiki.apertium.org/wiki/How_to_get_started_with_lexical_selection_rules''
+** Definitely needed samples:
+*** The closest sample is apertium-kaz-tur
+*** ''https://github.com/apertium/apertium-kaz-tur''
+* '''Translate small text (James & Mary story)'''
+** Source: https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora/jam
+** James&Mary story was downloaded from the source.
+** apertium-kaz-uzb/texts/
+** Updated as Azamat & Oygul story
+* Translating a text file:
+    cat texts/kaz.txt | apertium -d . -f line kaz-uzb
+* Translation of a sentence:
+    echo 'Сәлем Әлем' | apertium -d . kaz-uzb
+* Bootstrapping a new language pair apertium-kaz-uzb
+** Installed Apertium-init from pip
+*** Had a problem, solved it (thanks to @popcorndude).
+** Downloaded apertium-init.py
+*** Did not work
+    python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2
+    cd apertium-kaz-uzb
+    ./autogen.sh --with-lang1=../apertium-kaz --with-lang2=../apertium-uzb && make
+** I had to convert it to apertium-recursive, so further flags to be added:
+*** ''-t rtx''
+* Final initialization command:
+    python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2 -t rtx
+* '''Forked(&Installed) necessary repos on GitHub:'''
+** '''''Apertium-kaz'''''
+*** ''git clone git@github.com:kamush901/apertium-kaz.git''
+*** Works well
+** '''''Apertium-uzb'''''
+*** ''git clone git@github.com:kamush901/apertium-uzb.git''
+*** Works well
+* '''Installed Apertium and necessary tools:'''
+** Installed Apertium core using packaging
+    wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
+    sudo apt-get -f install apertium-all-dev