Difference between revisions of "User:Kamush/GSoC2021ProgresReport"

Latest revision as of 11:51, 31 August 2021

Progress Report

Time Period	Goal	Bidix (kaz-uzb)	Coverage (kaz-uzb)	WER / PER (kaz-uzb)	WER / PER (uzb-kaz)	Details/Comments
Community Bonding Period May 17-June 5	Install Apertium; initialize the kaz-uzb pair; collect data in both languages	426 (+426)	43.80 %	-	-	Installed Apertium and necessary tools; cloned apertium-kaz and apertium-uzb; initialized the kaz-uzb pair; translated a small sample text; extracted Uzbek and Kazakh wiki corpora; collected a Kazakh-Uzbek dictionary and parallel corpora.
Week 1 June 6-12	Make Uzbek better	2220 (+1794)	52.11 %	-	-	Went through all Uzbek and Kazakh stems; initialized the pair with apertium-recursive; collected dictionaries from other pairs for crossdic; obtained crossdic results in two ways.
Week 2 June 13-19	Expand bilingual dictionary	5262 (+3042)	77.03 %	74.77% / 67.57%	64.23% / 54.37%	Started adding bilingual dictionary elements.
Week 3 June 20-26	More on .dix and .lrx	8543 (+3281)	81.55 %	74.77% / 67.57%	64.23% / 54.37%	Expanded bilingual dictionary; started sample lexical selection rules.
Week 4 June 27-July 3	Focus on transfer rules	9432 (+889)	81.85 %	74.77% / 67.57%	64.23% / 54.37%	Expanded bilingual dictionary further.
Week 5 July 4-10	Test translator and expand more	11008 (+1576)	82.81 %	74.77% / 67.57%	64.23% / 54.37%	Expanded bilingual dictionary; collected texts for lexical selection rules and tried a small script; translated a big Kazakh text into Uzbek for better WER/PER calculation.
Week 6 July 11-17	Focus more on transfer rules	+191	84.46 %	71.83% / 63.52%	67.24% / 60.31%	Added more stems to the bidix to reach 85% coverage; added 1500+ lexical selection rules to kaz-uzb.lrx for ambiguous translations; added 1200+ lexical selection rules to uzb-kaz.lrx for ambiguous translations; prepared 200 sentences of parallel corpora.
Week 7 July 18-24	Test the kaz-uzb translator	+219	84.58 %	69.21% / 60.69%	67.10% / 59.89%	Started the transfer rules; some more bidix; some lexical selection rules.
Week 8 July 25-31	Focus on transfer rules	+500	85.17 %	69.21% / 60.69%	67.10% / 59.89%	Some additions to apertium-kaz & apertium-uzb; some transfer rules; some more lexical selection rules & bidix.
Week 9 August 1-7	Focus on testvoc	+1000	86.02 %	71.37% / 64.81%	61.93% / 56.05%	More frequent words added to the dictionary.
Week 10 August 8-14	Finalize work	+400	86.02 %	71.37% / 64.81%	61.93% / 56.05%	More post-editing done.

TODO

More transfer rules
Testvoc

ONGOING

Transfer rules
Some lexical selection rules
Collecting more bidix
- Adding only the most frequent words from the Kazakh wiki hitparade until 85% coverage is reached

DONE (+Notes & Comments)

Made a script to calculate WER for each line of the kaz-uzb.txt file, to see which sentences cause the most problems.
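The original script is not shown on this page, so the following is only a minimal sketch of how such a per-line WER check could look. It assumes whitespace tokenization and a reference file aligned line by line with the translator output; the function names `edit_distance` and `wer_per_line` and the toy data are hypothetical.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_per_line(ref_lines, hyp_lines):
    """Return (wer, line_number, reference, hypothesis) tuples, worst first."""
    rows = []
    for n, (ref, hyp) in enumerate(zip(ref_lines, hyp_lines), start=1):
        ref_tok, hyp_tok = ref.split(), hyp.split()
        if not ref_tok:
            continue  # skip empty reference lines to avoid dividing by zero
        wer = edit_distance(ref_tok, hyp_tok) / len(ref_tok)
        rows.append((wer, n, ref.strip(), hyp.strip()))
    return sorted(rows, key=lambda row: row[0], reverse=True)


if __name__ == "__main__":
    # Toy placeholder data; in practice the two lists would be read from
    # the reference file and from kaz-uzb.txt.
    reference = ["a b c d", "e f g"]
    hypothesis = ["a b c d", "e x"]
    for wer, n, ref, hyp in wer_per_line(reference, hypothesis):
        print(f"{wer:.2%}\tline {n}\t{hyp}")
```

Sorting by per-line WER puts the sentences that need attention first, which is the purpose stated above.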
Translating a big Kazakh text into Uzbek
- In addition to the small texts obtained from parallel corpora
- For better WER/PER calculation
- For checking transfer rules
- Started with the readily available small JaM story
  - The Azamat and Oygul story in our case
  - Added 47 sentences from that.
- Chose the Nur-Sultan (capital city) article of Kazakh Wiki.
  - Split the article into sentences and translated them into Uzbek manually.
  - Added 112 sentences from the Nur-Sultan text.
- Used the QED parallel corpus from the internet.
  - Manually selected sentences and corrected their translations.
  - Added 90 sentences from this.
- Made about 250 sentences in total (47 + 112 + 90 = 249); stopping here.
Tried finding parallel corpora:
- https://opus.nlpl.eu/
  - KDE4: Not good: https://opus.nlpl.eu/KDE4/v2/kk-uz_sample.html
  - MozillaI10: Not usable
  - QED:
    - kk-uz.tmx file
    - Good; will use this, but it needs manual selection and correction of the translations.
  - TED: the same file as QED.
  - The rest are also not good, mostly just one-word translations.
Lexical selection rules for kaz-uzb
- Created a script to analyse bidix and find ambiguous stems.
- Created a script that generates lexical selection rules from manually chosen stems
- Rules added in kaz-uzb.lrx for 1565 words.
- Rules added in uzb-kaz.lrx for 1295 words.
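The two scripts themselves are not included in this report. As a rough sketch of the idea, the code below finds left-side stems with more than one translation in a bidix and prints one default rule per stem. It assumes the usual `<e><p><l>…</l><r>…</r></p></e>` bidix entry layout and a `<rule><match …><select …/></match></rule>` rule shape; the sample entries are placeholders, the helper names are hypothetical, and picking the first translation only stands in for the manual choice described above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from xml.sax.saxutils import quoteattr

# Placeholder entries, not real Kazakh or Uzbek stems.
SAMPLE_BIDIX = """<dictionary>
  <section id="main" type="standard">
    <e><p><l>A<s n="n"/></l><r>X<s n="n"/></r></p></e>
    <e><p><l>A<s n="n"/></l><r>Y<s n="n"/></r></p></e>
    <e><p><l>B<s n="v"/></l><r>Z<s n="v"/></r></p></e>
  </section>
</dictionary>"""


def side(elem):
    """Return (lemma, dotted tags) for an <l> or <r> element."""
    parts, tags = [elem.text or ""], []
    for child in elem:
        if child.tag == "s":
            tags.append(child.get("n", ""))
        elif child.tag == "b":
            parts.append(" ")  # <b/> marks a space inside a multiword lemma
        parts.append(child.tail or "")
    return "".join(parts).strip(), ".".join(tags)


def ambiguous_stems(bidix_xml):
    """Map (lemma, tags) to its target lemmas, keeping only ambiguous ones."""
    translations = defaultdict(list)
    for pair in ET.fromstring(bidix_xml).iter("p"):
        left, right = pair.find("l"), pair.find("r")
        if left is None or right is None:
            continue
        key, target = side(left), side(right)[0]
        if target not in translations[key]:
            translations[key].append(target)
    return {k: v for k, v in translations.items() if len(v) > 1}


def default_rule(lemma, tags, chosen):
    """Build one rule that selects `chosen` for `lemma` (assumed rule shape)."""
    pos = tags.split(".")[0]
    pattern = f"{pos}.*" if pos else "*"
    return (f"<rule><match lemma={quoteattr(lemma)} tags={quoteattr(pattern)}>"
            f"<select lemma={quoteattr(chosen)} tags={quoteattr(pattern)}/>"
            "</match></rule>")


if __name__ == "__main__":
    for (lemma, tags), targets in ambiguous_stems(SAMPLE_BIDIX).items():
        # The first translation is only a stand-in for a manual decision.
        print(default_rule(lemma, tags, targets[0]))
```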
Apertium-dixtools:
- https://wiki.apertium.org/wiki/Apertium-dixtools
- Fixing bidix (deduplicating, removing empty lines):

   apertium-dixtools fix apertium-kaz-uzb.kaz-uzb.dix apertium-kaz-uzb.kaz-uzb.dix.fixed

- Sorting bidix (aligning too):

   apertium-dixtools sort -alignBidix -ignorecase apertium-kaz-uzb.kaz-uzb.dix apertium-kaz-uzb.kaz-uzb.dix.sorted

Sorting+Deduplicating the bidix for future purposes:
- Bidix count before deduplication: 11008
- Fixing the bidix:
  - Removed 1501 entries in section main
- Bidix count after deduplication: 9507
- Sorted+Regrouped the dix.
Made a script to calculate DixCount, Coverage, WER/PER at once
Calculating WER/PER:
- Apertium-eval-translator:
  - https://github.com/apertium/apertium-eval-translator

   $ cat kaz-big.txt | apertium -d ../ kaz-uzb > kaz-uzb.txt
   $ cat uzb-big.txt | apertium -d ../ uzb-kaz > uzb-kaz.txt
   $ apertium-eval-translator -ref uzb-big.txt -test kaz-uzb.txt
   $ apertium-eval-translator -ref kaz-big.txt -test uzb-kaz.txt

- Parallel text:
  - JaM Story:
    - “Azamat va Oygul” in our case;
    - kaz-uzb/texts/[kaz|uzb].txt
  - Astana article from Kazakh Wiki
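For readers unfamiliar with PER (position-independent error rate), the sketch below shows one common bag-of-words formulation: the longer of the two lengths minus the number of matching words, divided by the reference length. This is an assumption for illustration; apertium-eval-translator may compute its figure differently, and the function name `per` is hypothetical.

```python
from collections import Counter


def per(ref_tokens, hyp_tokens):
    """Position-independent error rate under a bag-of-words formulation.

    Word order is ignored: only the multiset overlap between the
    reference and the hypothesis counts as correct.
    """
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    matches = sum((Counter(ref_tokens) & Counter(hyp_tokens)).values())
    return (max(len(ref_tokens), len(hyp_tokens)) - matches) / len(ref_tokens)


if __name__ == "__main__":
    # Toy placeholder sentences: same words in a different order.
    print(per("a b c".split(), "c b a".split()))
```

Because reordering is not penalized, PER is never higher than WER for the same sentence pair, which is consistent with the paired figures in the table above.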
Calculated dix Coverage:
- (kaz-uzb/texts): bash ../../coverage-ltproc-new.sh ../docs/kaz-wiki.txt ../kaz-uzb.automorf.bin
- coverage: 26752095 / 32305875 (~0.82808761564266561423)
- remaining unknown forms: 5553780
- kaz-wiki.txt Sun Jul 11 11:55:25 CEST 2021
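The coverage figure is the share of corpus tokens that receive at least one analysis. The sketch below is not the coverage-ltproc-new.sh script; it only illustrates the calculation, assuming analyser output in the Apertium stream format, where each token looks like `^surface/analysis$` and unknown tokens have an analysis starting with `*`. The sample string uses placeholder tokens, and escaped special characters are not handled.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return (known, total) token counts for analysed text."""
    known = total = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        if not analyses.startswith("*"):
            known += 1
    return known, total


if __name__ == "__main__":
    sample = "^aaa/aaa<n>$ ^bbb/*bbb$ ^ccc/ccc<v>/ccc<n>$"
    known, total = coverage(sample)
    print(f"coverage: {known} / {total} (~{known / total:.4f})")

    # The figures reported above are consistent with each other.
    known, total = 26752095, 32305875
    print(f"reported: ~{known / total:.4f}, unknown forms: {total - known}")
```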
Counting Dix elements:
- Apertium-Eval: dixcounter.py:

   python3 ../dixcounter.py apertium-kaz-uzb.kaz-uzb.dix

- July 09: 11008 dix elements before deduplication.
Translating kaz-uig.dix into kaz-uzb.dix
Translating kaz-kaa.dix into kaz-uzb.dix
- Removing entries that were already covered by crossdic
- Replacing each Karakalpak translation with an Uzbek one by looking at both the Kazakh and the Karakalpak word
Translating kaz-tur.dix into kaz-uzb.dix
- Removing entries that were already covered by crossdic
- Replacing each Turkish translation with an Uzbek one by looking at both the Kazakh and the Turkish word
- Added 3200 more words from this.
Extract Kazakh wikipedia:
- Kazakh wiki date: 01.05.2021
  - https://dumps.wikimedia.org/kkwiki/20210501/
- Apertium-tools: WikiExtractor:
  - Wiki: https://wiki.apertium.org/wiki/WikiExtractor
  - Code: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py

   python3 WikiExtractor.py --infn kkwiki-20210501-pages-articles-multistream.xml.bz2

- Saved resulting corpus in docs/kaz-wiki.txt, ~24M tokens
Big Crossdix action:
- Crossing dictionaries: kaz-kaa and kaa-uzb
- Crossing dictionaries: kaz-tur and tur-uzb
- Merge the crossdix outputs obtained from the two crossings
- Sort the merged file, removing duplicates
- Align the result for better visibility
- Manually check every translation, remove if bad, correct/add if necessary
- Added 5000+ words from this.
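The crossing step can be pictured as composing two dictionaries through a shared pivot language, then merging and deduplicating the results before the manual check. The sketch below is not the crossdic tool; it works on plain word pairs rather than .dix entries, and the function names and placeholder words are hypothetical.

```python
from collections import defaultdict


def cross(src_to_pivot, pivot_to_tgt):
    """Compose (src, pivot) and (pivot, tgt) pairs into a set of (src, tgt)."""
    by_pivot = defaultdict(set)
    for pivot, tgt in pivot_to_tgt:
        by_pivot[pivot].add(tgt)
    return {(src, tgt)
            for src, pivot in src_to_pivot
            for tgt in by_pivot.get(pivot, ())}


def merge_sorted(*crossed):
    """Merge several crossing results, dropping duplicates, in sorted order."""
    return sorted(set().union(*crossed))


if __name__ == "__main__":
    # Placeholder words standing in for kaz, kaa, tur and uzb entries.
    via_kaa = cross([("kaz1", "kaa1"), ("kaz2", "kaa2")],
                    [("kaa1", "uzb1"), ("kaa2", "uzb2")])
    via_tur = cross([("kaz1", "tur1"), ("kaz3", "tur3")],
                    [("tur1", "uzb1"), ("tur3", "uzb3")])
    for src, tgt in merge_sorted(via_kaa, via_tur):
        print(f"{src}\t{tgt}")
```

Pairs produced this way are only candidates, since a pivot word can be ambiguous, which is why every translation was still checked by hand.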
Start collecting the kaz-uzb bilingual dictionary
Convert the pair to apertium-recursive
- Done, but I had to remove the entire bilingual repo and recreate it from scratch.
Write some lexical selection rules in kaz-uzb.
- Wiki: https://wiki.apertium.org/wiki/How_to_get_started_with_lexical_selection_rules
- Definitely needed samples:
  - The closest sample is apertium-kaz-tur
  - https://github.com/apertium/apertium-kaz-tur
Translate small text (James & Mary story)
- Source: https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora/jam
- James&Mary story was downloaded from the source.
- apertium-kaz-uzb/texts/
- Updated as Azamat & Oygul story
Translating a text file:

   cat texts/kaz.txt | apertium -d . -f line kaz-uzb

Translation of a sentence:

   echo 'Сәлем Әлем' | apertium -d . kaz-uzb

Bootstrapping a new language pair apertium-kaz-uzb
- Installed Apertium-init from pip
  - Had a problem, solved it (thanks to @popcorndude).
- Downloaded apertium-init.py
  - Did not work

   python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2
   cd apertium-kaz-uzb
   ./autogen.sh --with-lang1=../apertium-kaz --with-lang2=../apertium-uzb && make

- I had to convert it to apertium-recursive, so one more flag had to be added:
  - -t rtx
Final initialization command:

   python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2 -t rtx

Forked (and installed) the necessary repos on GitHub:
- Apertium-kaz
  - git clone git@github.com:kamush901/apertium-kaz.git
  - Works well
- Apertium-uzb
  - git clone git@github.com:kamush901/apertium-uzb.git
  - Works well
Installed Apertium and necessary tools:
- Installed Apertium core using packaging

   wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
   sudo apt-get -f install apertium-all-dev

