Difference between revisions of "User:Kamush/GSoC2021ProgresReport"
Latest revision as of 11:51, 31 August 2021
Progress Report
| Time Period | Goal | Bidix (kaz-uzb) | Coverage (kaz-uzb) | WER / PER (kaz-uzb) | WER / PER (uzb-kaz) | Details/Comments |
|---|---|---|---|---|---|---|
| Community Bonding Period (May 17-June 5) | | 426 (+426) | 43.80% | - | - | |
| Week 1 (June 6-12) | Make Uzbek better | 2220 (+1794) | 52.11% | - | - | |
| Week 2 (June 13-19) | Expand bilingual dictionary | 5262 (+3042) | 77.03% | 74.77% / 67.57% | 64.23% / 54.37% | |
| Week 3 (June 20-26) | More on .dix and .lrx | 8543 (+3281) | 81.55% | 74.77% / 67.57% | 64.23% / 54.37% | |
| Week 4 (June 27-July 3) | Focus on transfer rules | 9432 (+889) | 81.85% | 74.77% / 67.57% | 64.23% / 54.37% | |
| Week 5 (July 4-10) | Test translator and expand more | 11008 (+1576) | 82.81% | 74.77% / 67.57% | 64.23% / 54.37% | |
| Week 6 (July 11-17) | Focus more on transfer rules | +191 | 84.46% | 71.83% / 63.52% | 67.24% / 60.31% | Added more stems to the bidix to reach 85% coverage; added 1500+ lexical selection rules to kaz-uzb.lrx for ambiguous translations; added 1200+ lexical selection rules to uzb-kaz.lrx for ambiguous translations; prepared 200 sentences of parallel corpora |
| Week 7 (July 18-24) | Test the kaz-uzb translator | +219 | 84.58% | 69.21% / 60.69% | 67.10% / 59.89% | Started the transfer rules; some more bidix; some lexical selection rules |
| Week 8 (July 25-31) | Focus on transfer rules | +500 | 85.17% | 69.21% / 60.69% | 67.10% / 59.89% | Some additions to apertium-kaz & apertium-uzb; some transfer rules; some more lexical selection rules & bidix |
| Week 9 (August 1-7) | Focus on testvoc | +1000 | 86.02% | 71.37% / 64.81% | 61.93% / 56.05% | More frequent words added to the dictionary |
| Week 10 (August 8-14) | Finalize work | +400 | 86.02% | 71.37% / 64.81% | 61.93% / 56.05% | More post-editing done |

TODO
- More transfer rules
- Testvoc

ONGOING
- Transfer rules
- Some lexical selection rules
- Collecting more bidix
  - Adding only the most frequent words from the Kazakh wiki hitparade until 85% coverage is reached

DONE (+Notes & Comments)
- Made a script to calculate WER for each line of the kaz-uzb.txt file to see what sentences are causing the most problems.
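
The per-line WER script itself is not reproduced on this page. A minimal sketch of the idea, assuming WER is the word-level edit distance divided by the reference length, might look like the following. The function names (`edit_distance`, `line_wer`, `rank_lines`) are hypothetical, and the numbers will not necessarily match those of apertium-eval-translator.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def line_wer(ref_line, hyp_line):
    """WER of one translated line against its reference line."""
    ref = ref_line.split()
    hyp = hyp_line.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def rank_lines(ref_lines, hyp_lines):
    """Return (wer, line_number, reference, hypothesis), worst lines first."""
    scored = [
        (line_wer(r, h), n, r, h)
        for n, (r, h) in enumerate(zip(ref_lines, hyp_lines), start=1)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Possible usage (file names taken from the commands on this page):
#   ref = open("uzb-big.txt", encoding="utf-8").read().splitlines()
#   hyp = open("kaz-uzb.txt", encoding="utf-8").read().splitlines()
#   for wer, n, r, h in rank_lines(ref, hyp)[:10]:
#       print(f"{n}\t{wer:.2f}\t{h}")
```

Sorting by per-line WER puts the sentences that hurt the overall score most at the top, which is what makes them easy to inspect first.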
- Translating big Kaz text into Uzb
  - Aside from those small texts obtained from parallel corpora
  - For better WER/PER calculation
  - For checking transfer rules
  - Started with the readily available small JaM Story
    - Azamat and Oygul story in our case
    - Added 47 sentences from that.
  - Chose the Nur-Sultan (capital city) article of Kazakh Wiki for that.
    - Split the article into sentences, translated into Uzbek manually.
    - Added 112 sentences out of the Nur-Sultan text.
  - Used the QED parallel corpus from the internet.
    - Manually selected sentences and corrected their translations.
    - Added 90 sentences from this.
  - Made 250 sentences in total, stopping this here.

- Tried finding parallel corpora:
  - https://opus.nlpl.eu/
    - KDE4: Not good: https://opus.nlpl.eu/KDE4/v2/kk-uz_sample.html
    - Mozilla-I10n: Nope
    - QED:
      - kk-uz.tmx file
      - Good, going to use this, but it needs manual selection and corrections to the translations.
    - TED: the same file as QED.
    - The rest are also not good, mostly just one-word translations.
- Lexical selection rules for kaz-uzb
  - Created a script to analyse the bidix and find ambiguous stems.
  - Created a script that generates lexical selection rules from manually chosen stems.
  - Rules added in kaz-uzb.lrx for 1565 words.
  - Rules added in uzb-kaz.lrx for 1295 words.
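
The two scripts are not shown on this page. As a rough sketch of what they could do, the code below groups bidix entries by source lemma and part of speech, reports the ones with more than one translation, and prints a default `<rule>` for a manually chosen translation. The sample entries, the function names and the choice to match on `pos.*` are illustrative assumptions, not the author's actual scripts.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from xml.sax.saxutils import quoteattr

# Illustrative bidix fragment (assumed entries, not taken from the real dictionary).
SAMPLE_BIDIX = """<dictionary>
  <section id="main" type="standard">
    <e><p><l>бет<s n="n"/></l><r>yuz<s n="n"/></r></p></e>
    <e><p><l>бет<s n="n"/></l><r>sahifa<s n="n"/></r></p></e>
    <e><p><l>кітап<s n="n"/></l><r>kitob<s n="n"/></r></p></e>
  </section>
</dictionary>"""


def find_ambiguous(bidix_xml):
    """Map (source lemma, first tag) to its translations, keeping only ambiguous ones."""
    root = ET.fromstring(bidix_xml)
    translations = defaultdict(list)
    for entry in root.iter("e"):
        left = entry.find("p/l")
        right = entry.find("p/r")
        if left is None or right is None:
            continue  # skip entries that are not plain <p><l/><r/></p> pairs
        first_tag = left.find("s")
        pos = first_tag.get("n") if first_tag is not None else ""
        key = ((left.text or "").strip(), pos)
        target = (right.text or "").strip()
        if target not in translations[key]:
            translations[key].append(target)
    return {key: targets for key, targets in translations.items() if len(targets) > 1}


def default_rule(lemma, pos, chosen):
    """Build one .lrx rule that selects the manually chosen translation."""
    return (
        f"<rule><match lemma={quoteattr(lemma)} tags={quoteattr(pos + '.*')}>"
        f"<select lemma={quoteattr(chosen)}/></match></rule>"
    )


for (lemma, pos), targets in find_ambiguous(SAMPLE_BIDIX).items():
    print(lemma, pos, targets)
    print(default_rule(lemma, pos, targets[0]))
```

Choosing the first translation here is only a placeholder; in the workflow described above the preferred stem was picked by hand.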
 
- Apertium-dixtools:
  - https://wiki.apertium.org/wiki/Apertium-dixtools
  - Fixing the bidix (deduplicating, removing empty lines):

    apertium-dixtools fix apertium-kaz-uzb.kaz-uzb.dix apertium-kaz-uzb.kaz-uzb.dix.fixed

  - Sorting the bidix (aligning too):

    apertium-dixtools sort -alignBidix -ignorecase apertium-kaz-uzb.kaz-uzb.dix apertium-kaz-uzb.kaz-uzb.dix.sorted

- Sorting+Deduplicating the bidix for future purposes:
  - Bidix count before deduplication: 11008
  - Fixing the bidix:
    - Removed 1501 entries in section main
  - Bidix count after deduplication: 9507
  - Sorted+Regrouped the dix.
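
The deduplication above was done with apertium-dixtools. Purely to illustrate the bookkeeping (11008 entries before, 1501 removed, 9507 after), here is a small sketch that drops byte-identical `<e>` entries within each section. It is an assumption about what counts as a duplicate, not a description of how apertium-dixtools works, and the sample entries are made up for the example.

```python
import xml.etree.ElementTree as ET

# Illustrative bidix fragment with one exact duplicate (assumed entries).
SAMPLE = """<dictionary><section id="main" type="standard">
<e><p><l>кітап<s n="n"/></l><r>kitob<s n="n"/></r></p></e>
<e><p><l>кітап<s n="n"/></l><r>kitob<s n="n"/></r></p></e>
<e><p><l>қала<s n="n"/></l><r>shahar<s n="n"/></r></p></e>
</section></dictionary>"""


def dedupe_entries(bidix_xml):
    """Remove exact duplicate <e> entries per section; return (root, before, removed)."""
    root = ET.fromstring(bidix_xml)
    before = removed = 0
    for section in root.iter("section"):
        seen = set()
        for entry in list(section):  # copy, since entries are removed while looping
            if entry.tag != "e":
                continue
            before += 1
            # Serialise the entry and strip trailing whitespace so layout does not matter.
            key = ET.tostring(entry, encoding="unicode").strip()
            if key in seen:
                section.remove(entry)
                removed += 1
            else:
                seen.add(key)
    return root, before, removed


root, before, removed = dedupe_entries(SAMPLE)
print(f"before: {before}, removed: {removed}, after: {before - removed}")

# The counts reported above are consistent with each other:
assert 11008 - 1501 == 9507
```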
 
- Made a script to calculate DixCount, Coverage, WER/PER at once
- Calculating WER/PER:
  - Apertium-eval-translator:
    - https://github.com/apertium/apertium-eval-translator

    $ cat kaz-big.txt | apertium -d ../ kaz-uzb > kaz-uzb.txt
    $ cat uzb-big.txt | apertium -d ../ uzb-kaz > uzb-kaz.txt
    $ apertium-eval-translator -ref uzb-big.txt -test kaz-uzb.txt
    $ apertium-eval-translator -ref kaz-big.txt -test uzb-kaz.txt

  - Parallel text:
    - JaM Story:
    - “Azamat va Oygul” in our case;
    - kaz-uzb/texts/[kaz|uzb].txt
  - Astana article from Kazakh Wiki
- Calculated dix coverage:
  - (kaz-uzb/texts): bash ../../coverage-ltproc-new.sh ../docs/kaz-wiki.txt ../kaz-uzb.automorf.bin
  - coverage: 26752095 / 32305875 (~0.82808761564266561423)
  - remaining unknown forms: 5553780
  - kaz-wiki.txt Sun Jul 11 11:55:25 CEST 2021
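
The coverage script is not shown here, so the following is only a sketch of how such a figure can be derived, assuming coverage is the share of lexical units in the analyser output that are not marked unknown (in the Apertium stream format an unknown form is written as `^form/*form$`). The regular expression is a simplification that ignores escaped characters, and the sample analyses are made up for the example.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2...$
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return (known, total) counts of lexical units in analyser output."""
    known = total = 0
    for match in LEXICAL_UNIT.finditer(analysed_text):
        total += 1
        if not match.group(2).startswith("*"):
            known += 1
    return known, total


sample = "^сәлем/сәлем<ij>$ ^әлем/әлем<n><nom>$ ^foo/*foo$"
known, total = coverage(sample)
print(f"coverage: {known} / {total} (~{known / total:.4f})")

# The figures reported above fit together the same way:
known_forms, total_forms = 26752095, 32305875
print(total_forms - known_forms)              # remaining unknown forms
print(round(known_forms / total_forms, 4))    # coverage ratio
```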
 
- Counting dix elements:
  - Apertium-Eval: dixcounter.py:

    python3 ../dixcounter.py apertium-kaz-uzb.kaz-uzb.dix

  - July 09: 11008 dix elements before deduplication.
 
- Translating kaz-uig.dix into kaz-uzb.dix
- Translating kaz-kaa.dix into kaz-uzb.dix
  - Removing the entries that were already covered by crossdics
  - Replacing each Karakalpak translation with an Uzbek one by looking at both the Kazakh and the Karakalpak words
- Translating kaz-tur.dix into kaz-uzb.dix
  - Removing the entries that were already covered by crossdics
  - Replacing each Turkish translation with an Uzbek one by looking at both the Kazakh and the Turkish words
  - Added 3200 more words from this.
 
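- Extract Kazakh Wikipedia:
  - Kazakh wiki date: 01.05.2021
    - https://dumps.wikimedia.org/kkwiki/20210501/
  - Apertium-tools: WikiExtractor:
    - Wiki: https://wiki.apertium.org/wiki/WikiExtractor
    - Code: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py

    python3 WikiExtractor.py --infn kkwiki-20210501-pages-articles-multistream.xml.bz2

  - Saved resulting corpus in docs/kaz-wiki.txt, ~24M tokens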
 
- Big Crossdix action:
  - Crossing dictionaries: kaz-kaa and kaa-uzb
  - Crossing dictionaries: kaz-tur and tur-uzb
  - Merge the obtained crossdix outputs from the two
  - Sort the merged file, removing duplicates
  - Align the result for better visibility
  - Manually check every translation, remove if bad, correct/add if necessary
  - Added 5000+ words from this.
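
The crossing itself was done with the Apertium tooling. To make the idea concrete, here is a toy sketch of crossing two bilingual word lists through a pivot language and merging the results. The plain-dictionary representation, the function names and the sample words are assumptions for illustration, and the sketch ignores tags and restrictions that a real bidix carries.

```python
def cross(src_to_pivot, pivot_to_tgt):
    """Compose source->pivot and pivot->target mappings into source->target."""
    crossed = {}
    for src, pivots in src_to_pivot.items():
        targets = []
        for pivot in pivots:
            for tgt in pivot_to_tgt.get(pivot, []):
                if tgt not in targets:
                    targets.append(tgt)
        if targets:
            crossed[src] = targets
    return crossed


def merge(*dictionaries):
    """Merge crossed dictionaries, dropping duplicate translations and sorting by source."""
    merged = {}
    for dictionary in dictionaries:
        for src, targets in dictionary.items():
            bucket = merged.setdefault(src, [])
            for tgt in targets:
                if tgt not in bucket:
                    bucket.append(tgt)
    return dict(sorted(merged.items()))


# Toy data: Kazakh -> pivot (Karakalpak or Turkish) -> Uzbek.
kaz_kaa = {"кітап": ["kitap"]}
kaa_uzb = {"kitap": ["kitob"]}
kaz_tur = {"кітап": ["kitap"], "қала": ["şehir"]}
tur_uzb = {"kitap": ["kitob"], "şehir": ["shahar"]}

kaz_uzb = merge(cross(kaz_kaa, kaa_uzb), cross(kaz_tur, tur_uzb))
for src, targets in kaz_uzb.items():
    print(src, "->", ", ".join(targets))
```

The output of such a crossing is only a candidate list, which is why every translation was then checked by hand.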
 
- Start collecting the kaz-uzb bilingual dictionary
- Convert the pair to apertium-recursive
  - Done, but I had to remove the entire bilingual repo and recreate it from scratch.
- Write some lexical selection rules in kaz-uzb.
  - Wiki: https://wiki.apertium.org/wiki/How_to_get_started_with_lexical_selection_rules
  - Definitely needed samples:
    - The closest sample is apertium-kaz-tur
    - https://github.com/apertium/apertium-kaz-tur
 
 
- Translate small text (James & Mary story)
  - Source: https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora/jam
  - The James & Mary story was downloaded from the source.
  - apertium-kaz-uzb/texts/
  - Updated as the Azamat & Oygul story
- Translating a text file:

    cat texts/kaz.txt | apertium -d . -f line kaz-uzb

- Translation of a sentence:

    echo 'Сәлем Әлем' | apertium -d . kaz-uzb

- Bootstrapping a new language pair apertium-kaz-uzb
  - Installed Apertium-init from pip
    - Had a problem, solved it (thanks to @popcorndude).
  - Downloaded apertium-init.py
    - Did not work

    python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2
    cd apertium-kaz-uzb
    ./autogen.sh --with-lang1=../apertium-kaz --with-lang2=../apertium-uzb && make

  - I had to convert it to apertium-recursive, so a further flag had to be added:
    - -t rtx
- Final initialization command:

    python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2 -t rtx

- Forked (& installed) necessary repos on GitHub:
  - Apertium-kaz
    - git clone git@github.com:kamush901/apertium-kaz.git
    - Works well
  - Apertium-uzb
    - git clone git@github.com:kamush901/apertium-uzb.git
    - Works well
- Installed Apertium and necessary tools:
  - Installed Apertium core using packaging

    wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
    sudo apt-get -f install apertium-all-dev