User:Kamush/GSoC2021ProgresReport
Latest revision as of 11:51, 31 August 2021
Progress Report
Time Period | Goal | Bidix (kaz-uzb) | Coverage (kaz-uzb) | WER / PER (kaz-uzb) | WER / PER (uzb-kaz) | Details/Comments
---|---|---|---|---|---|---
Community Bonding Period (May 17-June 5) | | 426 (+426) | 43.80% | - | - |
Week 1 (June 6-12) | Make Uzbek better | 2220 (+1794) | 52.11% | - | - |
Week 2 (June 13-19) | Expand bilingual dictionary | 5262 (+3042) | 77.03% | 74.77% / 67.57% | 64.23% / 54.37% |
Week 3 (June 20-26) | More on .dix and .lrx | 8543 (+3281) | 81.55% | 74.77% / 67.57% | 64.23% / 54.37% |
Week 4 (June 27-July 3) | Focus on transfer rules | 9432 (+889) | 81.85% | 74.77% / 67.57% | 64.23% / 54.37% |
Week 5 (July 4-10) | Test translator and expand more | 11008 (+1576) | 82.81% | 74.77% / 67.57% | 64.23% / 54.37% |
Week 6 (July 11-17) | Focus more on transfer rules | +191 | 84.46% | 71.83% / 63.52% | 67.24% / 60.31% | Added more stems to the bidix to reach 85% coverage; added 1500+ lexical selection rules to kaz-uzb.lrx and 1200+ to uzb-kaz.lrx for ambiguous translations; prepared 200 sentences of parallel corpora
Week 7 (July 18-24) | Test the kaz-uzb translator | +219 | 84.58% | 69.21% / 60.69% | 67.10% / 59.89% | Started the transfer rules; some more bidix; some lexical selection rules
Week 8 (July 25-31) | Focus on transfer rules | +500 | 85.17% | 69.21% / 60.69% | 67.10% / 59.89% | Some additions to apertium-kaz & apertium-uzb; some transfer rules; some more lexical selection rules & bidix
Week 9 (August 1-7) | Focus on testvoc | +1000 | 86.02% | 71.37% / 64.81% | 61.93% / 56.05% | More frequent words added to the dictionary
Week 10 (August 8-14) | Finalize work | +400 | 86.02% | 71.37% / 64.81% | 61.93% / 56.05% | More post-editing done
TODO
- More transfer rules
- Testvoc
ONGOING
- Transfer rules
- Some lexical selection rules
- Collecting more bidix
  - Adding only the most frequent words from the Kazakh wiki hitparade until 85% coverage is reached
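
The "add the most frequent words until 85% coverage" idea can be sketched in Python. This is only an illustration, not the script used in the project: it treats coverage at the level of surface tokens rather than running the morphological analyser, and the function name `words_needed_for_coverage` and its parameters are assumptions made for this example.

```python
from collections import Counter


def words_needed_for_coverage(tokens, known, target=0.85):
    """Return the most frequent unknown words to add, in order,
    until the share of covered tokens reaches `target`.

    Simplification: a token counts as covered when its surface form
    is in `known`; the real pipeline decides this with lt-proc.
    """
    freq = Counter(tokens)
    total = sum(freq.values())
    if total == 0:
        return []
    covered = sum(count for word, count in freq.items() if word in known)
    to_add = []
    # most_common() walks the frequency list ("hitparade") from the top.
    for word, count in freq.most_common():
        if covered / total >= target:
            break
        if word in known:
            continue
        to_add.append(word)
        covered += count
    return to_add
```

Working from the top of the frequency list gives the largest coverage gain per added entry, which is why a few thousand frequent stems can move coverage by several points.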
DONE (+Notes & Comments)
- Made a script to calculate WER for each line of the kaz-uzb.txt file to see what sentences are causing the most problems.
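
A per-line WER script of this kind could look roughly like the sketch below. It is not the script from the project; the function names are made up for illustration, and WER is taken here as word-level edit distance divided by the number of reference words.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def line_wer(reference, hypothesis):
    """WER of one translated line against its reference line."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def worst_lines(ref_lines, hyp_lines, top=10):
    """Return (wer, line_number, hypothesis) tuples, worst first."""
    scored = [(line_wer(ref, hyp), number, hyp.strip())
              for number, (ref, hyp)
              in enumerate(zip(ref_lines, hyp_lines), start=1)]
    return sorted(scored, reverse=True)[:top]
```

To use it on the files mentioned in this report, read uzb-big.txt as the reference lines and kaz-uzb.txt as the hypothesis lines and pass both lists to `worst_lines`.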
- Translating a big Kaz text into Uzb
  - In addition to the small texts obtained from parallel corpora
  - For better WER/PER calculation
  - For checking transfer rules
  - Started with the readily available small JaM Story
    - Azamat and Oygul story in our case
    - Added 47 sentences from that.
  - Chose the Nur-Sultan (capital city) article of Kazakh Wiki for that.
    - Split the article into sentences, translated them into uzb manually.
    - Added 112 sentences from the Nur-Sultan text.
  - Used the QED parallel corpus from the internet.
    - Manually selected sentences and corrected their translations.
    - Added 90 sentences from this.
  - Made 250 sentences in total, stopping this here.
- Tried finding parallel corpora:
  - https://opus.nlpl.eu/
    - KDE4: Not good: https://opus.nlpl.eu/KDE4/v2/kk-uz_sample.html
    - Mozilla-I10n: Not usable
    - QED:
      - kk-uz.tmx file
      - Good, going to use this, but it needs manual selection and corrections to the translations.
    - TED: the same file as QED.
    - The rest are also not good, mostly just one-word translations.
- Lexical selection rules for kaz-uzb
  - Created a script to analyse the bidix and find ambiguous stems.
  - Created a script that generates lexical selection rules from manually chosen stems.
  - Rules added in kaz-uzb.lrx for 1565 words.
  - Rules added in uzb-kaz.lrx for 1295 words.
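
Finding ambiguous stems in a bidix can be sketched as follows. This is a minimal illustration, not the script mentioned above: it assumes plain `<e><p><l>…</l><r>…</r></p></e>` entries, ignores `<i>` identity entries and restrictions such as `r="LR"`, and the sample entries in `SAMPLE_DIX` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical bidix fragment used only to demonstrate the function.
SAMPLE_DIX = """<dictionary><section id="main" type="standard">
<e><p><l>бет<s n="n"/></l><r>yuz<s n="n"/></r></p></e>
<e><p><l>бет<s n="n"/></l><r>sahifa<s n="n"/></r></p></e>
<e><p><l>үй<s n="n"/></l><r>uy<s n="n"/></r></p></e>
</section></dictionary>"""


def side_to_key(side):
    """Turn an <l> or <r> element into a (lemma, tags) tuple."""
    lemma = side.text or ""
    tags = []
    for child in side:
        if child.tag == "b":      # <b/> marks a space in multiword lemmas
            lemma += " "
        elif child.tag == "s":    # <s n="..."/> carries one tag
            tags.append(child.get("n", ""))
        lemma += child.tail or ""
    return lemma, ".".join(tags)


def ambiguous_stems(dix_xml):
    """Map each left-side stem to its translations, keeping only
    stems that have more than one distinct translation."""
    root = ET.fromstring(dix_xml)
    translations = defaultdict(list)
    for pair in root.iter("p"):
        left, right = pair.find("l"), pair.find("r")
        if left is None or right is None:
            continue
        source, target = side_to_key(left), side_to_key(right)
        if target not in translations[source]:
            translations[source].append(target)
    return {src: tgts for src, tgts in translations.items() if len(tgts) > 1}
```

Each stem returned this way is a candidate for a manually chosen default translation in the .lrx file.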
- Apertium-dixtools:
  - https://wiki.apertium.org/wiki/Apertium-dixtools
  - Fixing the bidix (deduplicating, removing empty lines):
      apertium-dixtools fix apertium-kaz-uzb.kaz-uzb.dix apertium-kaz-uzb.kaz-uzb.dix.fixed
  - Sorting the bidix (aligning too):
      apertium-dixtools sort -alignBidix -ignorecase apertium-kaz-uzb.kaz-uzb.dix apertium-kaz-uzb.kaz-uzb.dix.sorted
- Sorting+Deduplicating the bidix for future purposes:
  - Bidix count before deduplication: 11008
  - Fixing the bidix:
    - Removed 1501 entries in section main
  - Bidix count after deduplication: 9507
  - Sorted+Regrouped the dix.
- Made a script to calculate DixCount, Coverage, WER/PER at once
- Calculating WER/PER:
  - Apertium-eval-translator:
    - https://github.com/apertium/apertium-eval-translator
      $ cat kaz-big.txt | apertium -d ../ kaz-uzb > kaz-uzb.txt
      $ cat uzb-big.txt | apertium -d ../ uzb-kaz > uzb-kaz.txt
      $ apertium-eval-translator -ref uzb-big.txt -test kaz-uzb.txt
      $ apertium-eval-translator -ref kaz-big.txt -test uzb-kaz.txt
  - Parallel text:
    - JaM Story:
    - “Azamat va Oygul” in our case;
    - kaz-uzb/texts/[kaz|uzb].txt
  - Astana article from Kazakh Wiki
- Calculated dix coverage:
  - Run from kaz-uzb/texts:
      bash ../../coverage-ltproc-new.sh ../docs/kaz-wiki.txt ../kaz-uzb.automorf.bin
  - coverage: 26752095 / 32305875 (~0.82808761564266561423)
  - remaining unknown forms: 5553780
  - kaz-wiki.txt Sun Jul 11 11:55:25 CEST 2021
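
The coverage figure is the number of known forms divided by the number of all forms. The sketch below shows the arithmetic on analyser output in the Apertium stream format, where an unknown form appears as `^form/*form$`. It is an assumption-laden illustration and not the coverage-ltproc-new.sh script: the function name is invented, and escaped characters inside lexical units are not handled.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return (known, total, ratio) for analyser output text."""
    known = total = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # An analysis starting with '*' means the form is unknown.
        if not analyses.startswith("*"):
            known += 1
    ratio = known / total if total else 0.0
    return known, total, ratio


# Consistency check on the numbers reported above.
KNOWN_FORMS, ALL_FORMS = 26752095, 32305875
REMAINING_UNKNOWN = ALL_FORMS - KNOWN_FORMS
REPORTED_RATIO = KNOWN_FORMS / ALL_FORMS
```

With the reported counts, `REMAINING_UNKNOWN` equals 5553780 and `REPORTED_RATIO` rounds to 0.8281, matching the 82.81% listed for Week 5.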
- Counting dix elements:
  - Apertium-Eval: dixcounter.py:
      python3 ../dixcounter.py apertium-kaz-uzb.kaz-uzb.dix
  - July 09: 11008 dix elements before deduplication.
- Translating kaz-uig.dix into kaz-uzb.dix
- Translating kaz-kaa.dix into kaz-uzb.dix
  - Removing entries that were already covered by the crossdix step
  - Replacing each Karakalpak translation with an Uzbek one by looking at both the Kazakh and the Karakalpak word
- Translating kaz-tur.dix into kaz-uzb.dix
  - Removing entries that were already covered by the crossdix step
  - Replacing each Turkish translation with an Uzbek one by looking at both the Kazakh and the Turkish word
  - Added 3200 more words from this.
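- Extract Kazakh Wikipedia:
  - Kazakh wiki date: 01.05.2021
    - https://dumps.wikimedia.org/kkwiki/20210501/
  - Apertium-tools: WikiExtractor:
    - Wiki: https://wiki.apertium.org/wiki/WikiExtractor
    - Code: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py
      python3 WikiExtractor.py --infn kkwiki-20210501-pages-articles-multistream.xml.bz2
  - Saved the resulting corpus in docs/kaz-wiki.txt, ~24M tokens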
- Big Crossdix action:
  - Crossing dictionaries: kaz-kaa and kaa-uzb
  - Crossing dictionaries: kaz-tur and tur-uzb
  - Merge the crossdix outputs from the two
  - Sort the merged file, removing duplicates
  - Align the result for better visibility
  - Manually check every translation, remove if bad, correct/add if necessary
  - Added 5000+ words from this.
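
The crossing, merging and deduplication steps above can be pictured with a small Python sketch. It only illustrates the idea of pivoting through a third language on plain (source, target) word pairs; the actual work was done with crossdix on .dix files, and the function names and sample words here are assumptions for illustration.

```python
from collections import defaultdict


def cross(src_pivot, pivot_tgt):
    """Cross two pair lists through their shared pivot language.

    src_pivot: iterable of (source_word, pivot_word)
    pivot_tgt: iterable of (pivot_word, target_word)
    Returns a set of (source_word, target_word) candidates.
    """
    by_pivot = defaultdict(set)
    for pivot, tgt in pivot_tgt:
        by_pivot[pivot].add(tgt)
    crossed = set()
    for src, pivot in src_pivot:
        for tgt in by_pivot.get(pivot, ()):
            crossed.add((src, tgt))
    return crossed


def merge_sorted(*crossed_sets):
    """Merge several crossed sets, drop duplicates, and sort."""
    return sorted(set().union(*crossed_sets))


# Illustrative entries only (not taken from the real dictionaries).
kaz_kaa = [("су", "suw")]
kaa_uzb = [("suw", "suv")]
kaz_tur = [("су", "su"), ("тау", "dağ")]
tur_uzb = [("su", "suv"), ("dağ", "tog'")]

candidates = merge_sorted(cross(kaz_kaa, kaa_uzb), cross(kaz_tur, tur_uzb))
```

Crossing through a pivot over-generates when the pivot word is itself ambiguous, which is why every candidate still needs the manual check listed above.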
- Start collecting the kaz-uzb bilingual dictionary
- Convert the pair to apertium-recursive
  - Done, but I had to remove the entire bilingual repo and recreate it from scratch.
- Write some lexical selection rules in kaz-uzb.
  - Wiki: https://wiki.apertium.org/wiki/How_to_get_started_with_lexical_selection_rules
  - Definitely needed samples:
    - The closest sample is apertium-kaz-tur
    - https://github.com/apertium/apertium-kaz-tur
- Translate small text (James & Mary story)
  - Source: https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora/jam
  - The James & Mary story was downloaded from the source.
  - apertium-kaz-uzb/texts/
  - Updated as the Azamat & Oygul story
- Translating a text file:
cat texts/kaz.txt | apertium -d . -f line kaz-uzb
- Translation of a sentence:
echo 'Сәлем Әлем' | apertium -d . kaz-uzb
- Bootstrapping a new language pair apertium-kaz-uzb
  - Installed Apertium-init from pip
    - Had a problem, solved it (thanks to @popcorndude).
  - Downloaded apertium-init.py
    - Did not work
      python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2
      cd apertium-kaz-uzb
      ./autogen.sh --with-lang1=../apertium-kaz --with-lang2=../apertium-uzb && make
  - I had to convert it to apertium-recursive, so a further flag had to be added:
    - -t rtx
- Final initialization command:
      python3 apertium-init.py kaz-uzb --analyser=hfst --no-prob1 --no-prob2 -t rtx
- Forked (& installed) the necessary repos on GitHub:
  - Apertium-kaz
    - git clone git@github.com:kamush901/apertium-kaz.git
    - Works well
  - Apertium-uzb
    - git clone git@github.com:kamush901/apertium-uzb.git
    - Works well
- Installed Apertium and necessary tools:
  - Installed Apertium core using packaging
      wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
      sudo apt-get -f install apertium-all-dev