User:Elmurod1202/GSoC2020Progress
The Project Proposal can be seen [https://wiki.apertium.org/wiki/User:Elmurod1202/GSoC2020_Proposal here].

The Final Report can be seen [https://wiki.apertium.org/wiki/User:Elmurod1202/GSoC2020_Final_Report here].

= Status table =
{| class="wikitable"
!colspan="2"|Week
!colspan="2"|Stems
!colspan="2"|Tur-Uzb
!colspan="2"|Naïve Coverage
!colspan="2"|Progress
|-
! №
! Dates
! uzb
! tur-uzb
! WER
! PER
! uzb
! tur-uzb
! Evaluation
! Notes
|-
| 0
| May 4 - May 31
| 34375
| 2412
| 90.80 %
| 81.60 %
| 89.57 %
| 72.14 %
| Initial evaluation
| As of the end of May
|-
| 5
| June 29 - July 5
| 34373
| 2445
| 84.45 %
| 76.80 %
| 90.23 %
| 72.14 %
| First Evaluation
| End of June - ~July 3
|-
| 9
| July 27 - Aug 2
| 34424
| 4191
| 78.70 %
| 68.34 %
| 90.23 %
| 72.74 %
| Second Evaluation
| As of July 31 - Aug 1
|-
| 10
| Aug 3 - Aug 9
| 35621
| 5639
| 78.70 %
| 68.64 %
| 90.28 %
| 80.14 %
| Weekly evaluation
| Week #10
|-
| 11
| Aug 10 - Aug 16
| 37649
| 8154
| 78.70 %
| 68.64 %
| 90.46 %
| 83.08 %
| Weekly evaluation
| Week #11
|-
| 12
| Aug 17 - Aug 23
| 57406
| 13023
| 78.70 %
| 68.64 %
| 90.91 %
| 86.02 %
| Weekly evaluation
| Week #12
|-
| 13
| Aug 24 - Aug 30
| 58757
| 12861
| 78.70 %
| 68.64 %
| 90.94 %
| 86.03 %
| Final evaluation
| As of Aug 31
|}
= Apertium Notes =

== TODO ==
* TESTVOC
* Writing a script to automatically make lexc rules (for entries in bidix)

== ONGOING ==
* Insert entries from the Word Frequency List
* Checking and correcting the entire bidix
* Work on Lexical Selection rules
* Review: Nouns
* Review: Proper nouns
* Review: Postpositions
* Review: Pronouns
* Review: Verbs
* Review: Punctuation
* Review: Numerals
== NOTES ==

* On this day, August 8:
** 1876: Thomas Edison invents autographic printing;
** 2020: I barely pass the 80% barrier on trimmed coverage :D

* Creating a word frequency list from a corpus (aka "hitparade"):
 cat corpus.tr.wiki.20200601.txt | apertium-destxt | lt-proc -w ../apertium-tur-uzb/tur-uzb.automorf.bin | apertium-retxt | sed 's/\$\s*/\$\n/g' | grep '\*' | sort | uniq -c | sort -rn > tur-uzb.parade.txt

* Review: Interjections - DONE!
* Review: Determinatives - DONE!
* Review: Conjunctions - DONE!

* TurkicCorpora:
** https://gitlab.com/unhammer/turkiccorpora
** SETimes Turkish corpus: tur.SETimes.en-tr.txt.bz2

* Long-lagging stuff:
** Fix trimmed coverage - DONE!

* Review: Adverbs - DONE!
* Add section "Regexp" to bidix - DONE!
* Add section "Unchecked" to bidix - DONE!
* Review: Adjectives - DONE!
* Review: Abbreviations - DONE!
* Combining checked and unchecked sections of bidix - DONE!

* Fixing the bidix, deduplication (24.07.2020):
** Before the fix: 4418 entries
** After the fix: 4000 entries

* Firespeaker:
** elmurod1202: you already have a trimmed transducer
** elmurod1202: just use it to analyse a corpus and generate a frequency list of unanalysed forms
** elmurod1202: the trimmed transducer is the monolingual transducer limited to the words in bidix; it's tur-uzb.automorf.bin and uzb-tur.automorf.bin
** there are scripts all over that do this already
** it's a few lines of bash
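The "few lines of bash" above can be sketched as follows. The analysis step itself needs a compiled pair, so this toy version fakes the analyser output: `analysed.txt` and the sample words are made up, standing in for lt-proc output split into one token per line, with unanalysed forms carrying a leading '*'.

```shell
# Sketch: from a stream of analysed tokens to a descending frequency list of
# unanalysed forms. analysed.txt is a toy stand-in for real analyser output,
# one token per line, unanalysed forms marked with a leading '*'.
printf '*qalam\nuy\n*qalam\n*daftar\nsuv\n' > analysed.txt
grep '^\*' analysed.txt | sort | uniq -c | sort -rn > hitparade.txt
cat hitparade.txt   # most frequent unanalysed form first: 2 *qalam, 1 *daftar
```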
* Install apertium-dixtools - DONE
* Tur-Uzb translation example:
 cat texts/tur.txt | apertium -d ../../apertium-tur-uzb/ tur-uzb
* Correct names in a story - DONE!
** Anthroponyms
* Lexicon for Turkic languages:
** https://wiki.apertium.org/wiki/Turkic_lexicon
* Firespeaker: entries that have a number script on them - their categories were guessed by https://github.com/IlnarSelimcan/dot/blob/master/lexikograf.py
* Calculate coverage for tur-uzb - got an error there - SOLVED!
* Crossing en-tr and en-uz to get a tr-uz dictionary:
 awk 'FNR==NR{a[$1]=$2;next} ($1 in a) {print $1,a[$1],$2}' en-tr.sorted en-uz.sorted > tr-uz.txt
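A toy run of the crossing one-liner, with made-up word pairs, shows the join behaviour: only English headwords present in both files survive, and columns 2-3 of the output are the resulting tr-uz pairs.

```shell
# Toy demonstration of the dictionary-crossing one-liner above.
# The word pairs are made up for illustration.
printf 'book kitap\nwater su\nsun gunes\n' > en-tr.sorted
printf 'book kitob\nwater suv\nmoon oy\n' > en-uz.sorted
# First pass (FNR==NR) loads en->tr into a[]; second pass prints
# "en tr uz" for every English word seen in both files.
awk 'FNR==NR{a[$1]=$2;next} ($1 in a) {print $1,a[$1],$2}' en-tr.sorted en-uz.sorted > tr-uz.txt
cat tr-uz.txt
# → book kitap kitob
# → water su suv
```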
* Update JaM story - DONE!

* ''Remember:'' '''"Every time a set of modifications is made to any of the dictionaries, the modules have to be recompiled. Type make in the directory where the linguistic data are saved."'''
** There is an easy way to do so: just run make langs in the apertium-tur-uzb directory
** Even better, use: make -j3 langs

* ''Helpful quotes ( :-) ):''
** "The best script is maybe (often) no script" (@Firespeaker)
** "There is no stupid question, only stupid students" (@Spectei)

== EVALUATION ==
* Calculating trimmed coverage:
** https://gitlab.com/unhammer/turkiccorpora/-/blob/master/dev/lt-covtest
 sh lt-covtest.sh tur-uzb ../apertium-tur-uzb ../corpora/tur.SETimes.en-tr.txt.bz2
** As of Aug 1: 72.74 %
* Calculating coverage:
** apertium-quality
 git clone https://github.com/apertium/apertium-quality
*** It works \o/, phew.
 aq-covtest texts/corpus.wiki.20200520.txt uzb_guesser.automorf.bin - 94.14%
 aq-covtest corpus.wiki.20200520.txt ../apertium-uzb/uzb.automorf.bin - 89.36%
** hfst-covtest
*** https://gist.github.com/IlnarSelimcan/87bef975e919836e90865e44935a6bd7#file-hfst-covtest
 bash hfst-covtest.sh uzb ../apertium-uzb/ ../corpora/corpus.wiki.20200520.txt
*** Coverage: 12128223 / 13539389 (89.57 %)
*** Remaining unknown forms: 1411166
*** Turkish mono coverage: 64272995 / 73673100 (~87.24 %)
** The way @piraye calculates coverage:
*** https://github.com/sevilaybayatli/Coverage-evaluation/blob/master/Coverage.txt
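For reference, naive coverage is just the share of tokens the analyser recognises. Below is a minimal sketch under the same convention as the hitparade pipeline above (one token per line, unknown forms marked with a leading '*'); the file name and sample words are made up.

```shell
# Minimal naive-coverage sketch: percentage of tokens that got an analysis.
# Assumes one token per line, unknown forms marked with a leading '*'.
coverage() {
    total=$(grep -c . "$1")        # non-empty lines = tokens
    unknown=$(grep -c '^\*' "$1")  # unanalysed forms
    awk -v t="$total" -v u="$unknown" 'BEGIN{printf "%.2f %%\n", 100*(t-u)/t}'
}
printf 'uy\n*qalam\nsuv\ndaryo\n' > toy.txt   # 4 tokens, 1 unknown
coverage toy.txt
# → 75.00 %
```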
* Calculating WER/PER:
** apertium-eval-translator
 perl apertium-eval-translator-line.pl -ref ./texts/uzb.txt -test ./texts/tur-uzb.txt
** The parallel corpus was taken from the James and Martin story:
*** https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora/jam

* Counting stems:
** Counting dix stems
*** apertium-tools - dixcounter.py
 python3 dixcounter.py ../apertium-tur-uzb/apertium-tur-uzb.tur-uzb.dix
*** Stems:
**** Beginning: 2412
**** Before fix: 2468
**** After deduplication: 2065
** Counting lexc stems
 python3 lexccounter.py ../apertium-uzb/apertium-uzb.uzb.lexc
*** apertium-uzb.uzb.lexc: 34375
*** apertium-tur.tur.lexc: 21634
*** apertium-tur-uzb.uzb.lexc: 3922
*** apertium-tur-uzb.tur.lexc: 11206
== SETTING UP INSTRUCTIONS ==
* apertium-dixtools:
** Fixing the bidix (deduplicating, removing empty lines):
 apertium-dixtools fix apertium-tur-uzb.tur-uzb.dix apertium-tur-uzb.tur-uzb.dix.fixed
** Sorting the bidix (aligning too):
 apertium-dixtools sort -alignBidix -ignorecase apertium-tur-uzb.tur-uzb.dix apertium-tur-uzb.tur-uzb.dix.sorted
* Extracting a Wikipedia corpus:
 wget https://dumps.wikimedia.org/uzwiki/20200520/uzwiki-20200520-pages-articles.xml.bz2
** apertium-quality's aq-wikiextract
*** Note: did not work for me; error "NameError: name 'strip_comments' is not defined" - must be a library conflict problem
** GitHub: https://github.com/attardi/wikiextractor
*** Did work, but produced too much: the result included all history, so a single text was repeated many times; could not work it out.
** GitHub: https://github.com/bwbaugh/wikipedia-extractor
*** Did work. \o/ 149K articles (as of June 2020)
*** To clean <doc...></doc> tags: sed -e 's/<[^>]*>//g' file.html
*** To delete empty lines: sed -i '/^$/d' uzwiki.clean.txt
*** To remove lines of 22 characters or fewer (they are useless): sed -r '/^.{,22}$/d' uzwiki.clean.txt > uzwiki.clean2.txt
*** Result: an Uzbek wiki corpus with >9M words
*** Note: this extraction was not used in the end, though.
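The three sed steps above can be chained into a single pass; here is a sketch on a toy file (file names made up, and the interval written as {0,22} with an explicit lower bound for portability):

```shell
# One-pass version of the cleanup steps above, on a toy input:
# strip tags, drop empty lines, drop lines of 22 characters or fewer.
printf '<doc id="1">\nThis line is long enough to survive the filter.\n\nshort\n</doc>\n' > raw.txt
sed -e 's/<[^>]*>//g' raw.txt \
  | sed '/^$/d' \
  | sed -E '/^.{0,22}$/d' \
  > corpus.clean.txt
cat corpus.clean.txt   # only the long line survives
```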
** apertium-tools WikiExtractor:
*** Wiki: https://wiki.apertium.org/wiki/WikiExtractor
*** Code: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py
*** Works; the output is just a clean corpus, simple and neat
 python3 apertium-WikiExtractor.py --infn uzwiki.xml.bz2
*** Wiki dump: https://dumps.wikimedia.org/uzwiki/20200520/uzwiki-20200520-pages-articles.xml.bz2
*** Will be using this corpus for evaluations.
* Translation of a sentence:
 echo 'Salom dunyo' | apertium -d . uzb-tur
* Translation of a text file:
 cat texts/tur.txt | apertium -d . tur-uzb > ./texts/tur-uzb.txt
 cat texts/uzb.txt | apertium -d . uzb-tur > ./texts/uzb-tur.txt

* Apertium installation:
** Installed the Apertium core using packaging:
 wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
 sudo apt-get -f install apertium-all-dev
** Installed language data by compiling:
 git clone https://github.com/elmurod1202/apertium-tur-uzb.git
 git clone https://github.com/elmurod1202/apertium-uzb.git
 git clone https://github.com/elmurod1202/apertium-tur.git
** Configure, build:
*** Note: a language pair is structured as three different folders containing their own data, in my case:
**** the apertium-uzb folder
**** the apertium-tur folder
**** the apertium-tur-uzb folder
*** cd apertium-tur-uzb, then:
 ./autogen.sh --with-lang1=../apertium-tur --with-lang2=../apertium-uzb
 make
*** For a monolingual package, it's just:
 ./autogen.sh
 make
*** Optional: apertium-viewer
**** Just to see the steps of translation in a user-friendly way
**** https://wiki.apertium.org/wiki/Apertium-viewer
**** java -Xmx500m -jar apertium-viewer.jar
+ | |||
+ | * Forked(&Installed) necessary repos on GitHub: |
||
+ | ** Apertium-uzb |
||
+ | ***Monolingual package for Uzbek |
||
+ | ** Apertium-tur |
||
+ | ***Monolingual package for Turkish |
||
+ | ** Apertium-tur-uzb |
||
+ | ***Bilingual package for Turkish-Uzbek translation |
||
+ | ** apertiumpp |
||
+ | ***For parallel corpora used in evaluation |
||
+ | ** Apertium-quality |
||
+ | ***Name says it, to evaluate |
||
+ | **** aq-covtest - checks coverage |
||
+ | ***Also, for extracting wiki, but it didn’t work for me |
||
+ | **** aq-wikiextract - extracts wiki corpora from wiki dump |
||
+ | ** apertium-eval-translator |
||
+ | ***To evaluate WER/PER |
||
+ | ** apertium-viewer |
||
+ | ***To have a nice GUI, but you basically don’t need that |
||
+ | ** Apertium-dixtools |
||
+ | ***To sort/fix dictionaries |
||
+ | |||
== GitHub Stuff ==
* Remote naming convention:
** upstream - a remote pointing at the original repo you forked from
** origin - a remote pointing at your own fork (your local clone tracks it)
** master - the main branch in either of them
* To sync with all remotes:
** git fetch --all
* To see the remote links:
** git remote -v
* To be able to sync with the original repo:
** Adding an upstream remote:
*** git remote add upstream https://github.com/apertium/apertium-uzb.git
** With fetch & merge:
*** Fetching the upstream into your project:
**** git fetch upstream
*** Merging changes:
**** git merge upstream/master
** Or just do:
*** git pull upstream master
** Apparently, "git pull" is just:
*** git fetch
*** git merge
* To reset the repo to a previous version:
** git reset --hard <commit hash>
** !!! You lose all commits made after it
** To discard all local changes to all files permanently:
*** git reset --hard
* To merge (squash) the last two commits:
** git rebase --interactive HEAD~2
** In the editor, change the last commit's pick to squash; save and exit
** Give the commit message you want
** Done.
== Terminal stuff ==
* screen (working with multiple screens):
** Advantage: the ability to access a session from both SSH and the local machine
** Installation:
*** sudo apt-get install screen
** Listing the active screens:
*** screen -ls
** Starting a new session:
*** screen -S screenName
** Exiting the session:
*** exit
** Detaching the current screen:
*** Ctrl+A, then press D
** Returning to a background (detached) screen:
*** screen -r screenName
== Useful Bash commands ==
* Reading a file line by line and printing each line in another format:
 while IFS= read -r line; do
     echo "Text read from file: $line"
 done < my_filename.txt
* Printing a specific line (given its line number) of a text file:
** sed -n 5p file
** Returns the 5th line
* Printing a specific column of a text file:
** awk -F":" '{print $1}' file.txt
** -F":" specifies the separator.
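The two one-liners above can be checked on a small generated file (contents made up for illustration):

```shell
# Toy demonstration of the sed and awk recipes above.
printf 'a:1\nb:2\nc:3\nd:4\ne:5\n' > demo.txt
sed -n 5p demo.txt               # prints the 5th line
# → e:5
awk -F":" '{print $1}' demo.txt  # prints the first column, one item per line
# → a b c d e
```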
== NOTE FOR THE READER ==
'''Here it started: notes are added from bottom to top, so the most recent action comes first.'''
''Latest revision as of 15:20, 5 September 2020.''