Difference between revisions of "User:Elmurod1202/GSoC2020Progress"

The Project Proposal can be seen [https://wiki.apertium.org/wiki/User:Elmurod1202/GSoC2020_Proposal here]

The Final Report can be seen [https://wiki.apertium.org/wiki/User:Elmurod1202/GSoC2020_Final_Report here]
 
= Status table =

{|
!colspan="2"|Week
!colspan="2"|Stems
!colspan="2"|Tur-Uzb
!colspan="2"|Naïve Coverage
!colspan="2"|Progress
|-
! No.
! Dates
! uzb
! tur-uzb
! WER
! PER
! uzb
! tur-uzb
! Evaluation
! Notes
|-
| 0 || May 4 - May 31 || 34375 || 2412 || 90.80 % || 81.60 % || 89.57 % || 72.14 % || Initial evaluation || As of the end of May
|-
| 5 || June 29 - July 5 || 34373 || 2445 || 84.45 % || 76.80 % || 90.23 % || 72.14 % || First Evaluation || End of June - ~July 3
|-
| 9 || July 27 - Aug 2 || 34424 || 4191 || 78.70 % || 68.34 % || 90.23 % || 72.74 % || Second Evaluation || As of July 31 - Aug 1
|-
| 10 || Aug 3 - Aug 9 || 35621 || 5639 || 78.70 % || 68.64 % || 90.28 % || 80.14 % || Weekly evaluation || Week #10
|-
| 11 || Aug 10 - Aug 16 || 37649 || 8154 || 78.70 % || 68.64 % || 90.46 % || 83.08 % || Weekly evaluation || Week #11
|-
| 12 || Aug 17 - Aug 23 || 57406 || 13023 || 78.70 % || 68.64 % || 90.91 % || 86.02 % || Weekly evaluation || Week #12
|-
| 13 || Aug 24 - Aug 30 || 58757 || 12861 || 78.70 % || 68.64 % || 90.94 % || 86.03 % || Final evaluation || As of Aug 31
|}

  +
= To Do =
Week 1-4
*Introducing apertium-separable to the tur-uzb pair
*Adding more stems to bilingual dictionary;
*Transfer rules refactoring;
*Increasing WER coverage;
*Running tests
*Updating documentation
*Preparing for the first evaluation

=Apertium Notes=

== TODO ==
* TESTVOC
* Writing script to automatically make Lexc rules(for entries in bidix)

== ONGOING ==
* Insert entries from the Word Frequency List
* Checking and correcting the entire bidix
* Work on Lexical Selection rules.
* Review: Nouns
* Review: Proper nouns
* Review: Postposition
* Review: Pronouns
* Review: Verbs
* Review: Punctuation
* Review: Numerals

== NOTES ==
  +
  +
* On this day of August 8:
  +
**1876: Thomas Edison invents Autographic Printing;
  +
**2020: I barely pass the 80% barrier on Trimmed Coverage :D
  +
  +
* Creating a Word Frequency List from Corpus:
  +
** Aka: Hitparade;
  +
cat corpus.tr.wiki.20200601.txt | apertium-destxt | lt-proc -w ../apertium-tur-uzb/tur-uzb.automorf.bin | apertium-retxt | sed 's/\$\s*/\$\n/g' | grep '\*' | sort | uniq -c | sort -rn > tur-uzb.parade.txt
  +
* Review: Interjections - DONE!
  +
* Review: Determinatives - DONE!
  +
* Review: Conjunctions - DONE!
  +
  +
* TurkicCorpora:
  +
** https://gitlab.com/unhammer/turkiccorpora
  +
** SETimes Turkish Corpus: tur.SETimes.en-tr.txt.bz2
  +
  +
* Long lagging stuff:
  +
** Fix Trimmed coverage - DONE!.
  +
  +
* Review: Adverbs - DONE!
  +
* Add Section “Regexp” to bidix - DONE!
  +
* Add Section “Unchecked” to bidix - DONE!
  +
* Review: Adjectives - DONE!
  +
* Review: Abbreviations - DONE!
  +
* Combining checked and unchecked sections of bidix : DONE
  +
  +
* Fixing the bidix, deduplication(24.07.2020):
  +
** Before the fix: Entries:4418
  +
** After the fix: Entries: 4000
  +
  +
* Firespeaker:
  +
** elmurod1202: you already have a trimmed transducer
  +
** elmurod1202: just use it to analyse a corpus and generate a frequency list of unanalysed forms
  +
** elmurod1202: the trimmed transducer is the monolingual transducer limited to the words in bidix, it's tur-uzb.automorf.bin and uzb-tur.automorf.bin
  +
** there's scripts all over that do this already
  +
** it's a few lines of bash
  +
  +
* Install Apertium-dixtools -DONE
  +
* Tur-Uzb Translation example:
  +
cat texts/tur.txt | apertium -d ../../apertium-tur-uzb/ tur-uzb
  +
* Correct names in a story -- Done!
  +
** Anthroponyms
  +
* Lexicon for Turkic languages:
  +
** https://wiki.apertium.org/wiki/Turkic_lexicon
  +
* Firespeaker: Entries that have a number script on them - their categories were guessed by https://github.com/IlnarSelimcan/dot/blob/master/lexikograf.py
  +
* Calculate Coverage for tur-uzb - Got an error there -SOLVED!.
  +
* Crossing en-tr and en-uz to get tr-uz dictionary:
  +
awk 'FNR==NR{a[$1]=$2;next} ($1 in a) {print $1,a[$1],$2}' en-tr.sorted en-uz.sorted > tr-uz.txt
  +
* Update JaM Story - DONE!.
  +
  +
* ''Remember:'' '''“Every time a set of modifications is made to any of the dictionaries, the modules have to be recompiled. Type make in the directory where the linguistic data are saved”'''
  +
** To do so, there is an easy way:
  +
** Just run this in apertium-tur-uzb directory: make langs
  +
** Even better?, Use: make -j3 langs
  +
  +
* ''Helpful Quotes( :-) ):''
  +
** “The best script is maybe(often) no script”(@Firespeaker)
  +
** “There is no stupid question, only stupid students”(@Spectei)
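
The dictionary-crossing awk one-liner from the notes above can be wrapped in a small function. This is only a sketch: the function name cross_dictionaries is made up for the example, and it assumes both input files hold one whitespace-separated pair per line with the English word first. If an English word occurs more than once in the first file, only its last translation is kept.

```shell
# Join an English-Turkish list with an English-Uzbek list on the English word.
# $1: en-tr file, $2: en-uz file; output lines are "english turkish uzbek".
cross_dictionaries() {
    awk '
        # First file: remember the Turkish word for each English key.
        FNR == NR { tr[$1] = $2; next }
        # Second file: print a triple whenever the English key is shared.
        ($1 in tr) { print $1, tr[$1], $2 }
    ' "$1" "$2"
}
```

Usage would look like: cross_dictionaries en-tr.sorted en-uz.sorted > tr-uz.txt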
  +
  +
== EVALUATION ==
  +
* Calculating Trimmed Coverage:
  +
** https://gitlab.com/unhammer/turkiccorpora/-/blob/master/dev/lt-covtest
  +
**
  +
sh lt-covtest.sh tur-uzb ../apertium-tur-uzb ../corpora/tur.SETimes.en-tr.txt.bz2
  +
** As of Aug 1: 72.74 %
  +
* Calculating Coverage:
  +
** apertium-quality
  +
git clone https://github.com/apertium/apertium-quality
  +
***It works \o/, phew.
  +
aq-covtest texts/corpus.wiki.20200520.txt uzb_guesser.automorf.bin - 94.14%
  +
aq-covtest corpus.wiki.20200520.txt ../apertium-uzb/uzb.automorf.bin - 89.36%
  +
** Hfst-covtest
  +
***https://gist.github.com/IlnarSelimcan/87bef975e919836e90865e44935a6bd7#file-hfst-covtest
  +
bash hfst-covtest.sh uzb ../apertium-uzb/ ../corpora/corpus.wiki.20200520.txt
  +
***coverage: 12128223 / 13539389 (89.57 %)
  +
***remaining unknown forms: 1411166
  +
***Turkish mono coverage:
  +
**** 64272995 / 73673100 (~0.8724)
  +
** The way @piraye calculates coverage:
  +
***https://github.com/sevilaybayatli/Coverage-evaluation/blob/master/Coverage.txt
  +
* Calculating WER/PER:
  +
** apertium-eval-translator
  +
perl apertium-eval-translator-line.pl -ref ./texts/uzb.txt -test ./texts/tur-uzb.txt
  +
** The parallel corpus was taken from the James and Martin (JaM) story
  +
***https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora/jam
  +
  +
* Counting Stems:
  +
** Counting dix stems
  +
***Apertium-tools - dixcounter.py
  +
python3 dixcounter.py ../apertium-tur-uzb/apertium-tur-uzb.tur-uzb.dix
  +
***Stems:
  +
**** Beginning: 2412
  +
**** Before fix: 2468
  +
**** After deduplication: 2065
  +
** Counting Lexc Stems
  +
python3 lexccounter.py ../apertium-uzb/apertium-uzb.uzb.lexc
  +
***Apertium-uzb.uzb.lexc: 34375
  +
***Apertium-tur.tur.lexc : 21634
  +
***Apertium-tur-uzb.uzb.lexc: 3922
  +
***apertium-tur-uzb.tur.lexc : 11206
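
The coverage figures above are the share of corpus tokens that get at least one analysis. A rough sketch of that calculation, assuming lt-proc style output on stdin where unknown tokens look like ^form/*form$; the function name coverage is made up here, and the real lt-covtest and hfst-covtest scripts may count tokens differently.

```shell
# Naive coverage: known tokens / all tokens, read from analyser output on stdin.
coverage() {
    # Put every lexical unit (^...$) on its own line, then count.
    grep -o '\^[^$]*\$' | awk '
        { total++ }
        # A unit without "/*" has at least one analysis.
        index($0, "/*") == 0 { known++ }
        END {
            if (total == 0) { print "no tokens"; exit 1 }
            printf "coverage: %d / %d (%.2f %%)\n", known, total, 100 * known / total
        }'
}
```

It could be fed the same way as the frequency list: cat corpus.txt | apertium-destxt | lt-proc -w tur-uzb.automorf.bin | apertium-retxt | coverage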
  +
  +
  +
  +
== SETTING UP INSTRUCTIONS ==
  +
* Apertium-dixtools:
  +
** Fixing bidix (deduplicating, removing empty lines):
  +
apertium-dixtools fix apertium-tur-uzb.tur-uzb.dix apertium-tur-uzb.tur-uzb.dix.fixed
  +
** Sorting bidix(aligning too):
  +
apertium-dixtools sort -alignBidix -ignorecase apertium-tur-uzb.tur-uzb.dix apertium-tur-uzb.tur-uzb.dix.sorted
  +
* Extracting Wikipedia corpus:
  +
wget https://dumps.wikimedia.org/uzwiki/20200520/uzwiki-20200520-pages-articles.xml.bz2
  +
** Apertium-quality project aq-wikiextract
***Note: Did not work for me; it failed with “NameError: name 'strip_comments' is not defined”, probably a library conflict.
** Github https://github.com/attardi/wikiextractor
***It worked, but produced too much: the result included the full page history, so each text was repeated many times, and I could not work around that.
  +
** Github https://github.com/bwbaugh/wikipedia-extractor
  +
***Did work. \o/ . 149K articles (as of June 2020)
  +
***To clean <doc...></doc> tags: sed -e 's/<[^>]*>//g' file.html
  +
***To delete empty lines: sed -i '/^$/d' uzwiki.clean.txt
  +
***To remove lines with 22 characters or fewer (they are useless): sed -r '/^.{,22}$/d' uzwiki.clean.txt > uzwiki.clean2.txt
***Result: Uzbek wiki corpus with >9M words
***Note: This extraction was not used in the end.
  +
** Apertium-tools: WikiExtractor:
  +
***Wiki: https://wiki.apertium.org/wiki/WikiExtractor
  +
***Code: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py
  +
***Works, output is just a clean corpus, simple and neat
  +
python3 apertium-WikiExtractor.py --infn uzwiki.xml.bz2
  +
***Wiki dump: https://dumps.wikimedia.org/uzwiki/20200520/uzwiki-20200520-pages-articles.xml.bz2
  +
***Will be using this corpus for evaluations.
  +
* Translation of a sentence:
  +
echo 'Salom dunyo' | apertium -d . uzb-tur
  +
* Translation of a text file:
  +
cat texts/tur.txt |apertium -d . tur-uzb > ./texts/tur-uzb.txt
  +
cat texts/uzb.txt |apertium -d . uzb-tur > ./texts/uzb-tur.txt
  +
  +
* Apertium Installation:
  +
** Installed Apertium core using packaging
  +
***wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
  +
***sudo apt-get -f install apertium-all-dev
  +
** Installed language data by compiling
  +
***git clone https://github.com/elmurod1202/apertium-tur-uzb.git
  +
***git clone https://github.com/elmurod1202/apertium-uzb.git
  +
***git clone https://github.com/elmurod1202/apertium-tur.git
  +
** Configure, build:
  +
***Note: The structure of a language pair is three different folders containing their own data, in my case:
  +
**** apertium-uzb folder
  +
**** apertium-tur folder
  +
**** apertium-tur-uzb folder
  +
***cd apertium-tur-uzb
  +
./autogen.sh --with-lang1=../apertium-tur --with-lang2=../apertium-uzb
  +
make
  +
***For a mono, it’s just:
  +
./autogen.sh
  +
make
  +
***Optional: Apertium-viewer
  +
**** Just to see steps of translation in a user-friendly way
  +
**** https://wiki.apertium.org/wiki/Apertium-viewer
  +
**** java -Xmx500m -jar apertium-viewer.jar
  +
  +
* Forked(&Installed) necessary repos on GitHub:
  +
** Apertium-uzb
  +
***Monolingual package for Uzbek
  +
** Apertium-tur
  +
***Monolingual package for Turkish
  +
** Apertium-tur-uzb
  +
***Bilingual package for Turkish-Uzbek translation
  +
** apertiumpp
  +
***For parallel corpora used in evaluation
  +
** Apertium-quality
  +
***Name says it, to evaluate
  +
**** aq-covtest - checks coverage
  +
***Also, for extracting wiki, but it didn’t work for me
  +
**** aq-wikiextract - extracts wiki corpora from wiki dump
  +
** apertium-eval-translator
  +
***To evaluate WER/PER
  +
** apertium-viewer
  +
***To have a nice GUI, but you basically don’t need that
  +
** Apertium-dixtools
  +
***To sort/fix dictionaries
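
The three sed clean-up steps for the extracted Wikipedia text (strip tags, drop empty lines, drop very short lines) can be combined into one pass. A minimal sketch, assuming extractor output on stdin; the function name clean_corpus is made up, and it uses the portable -E and {0,22} spellings instead of the GNU-only -r and {,22}. Empty lines are covered by the length rule, so they need no separate step.

```shell
# Clean extractor output: remove <doc ...> / </doc> tags, then delete every
# line that is 22 characters or shorter (this includes empty lines).
clean_corpus() {
    sed -E -e 's/<[^>]*>//g' -e '/^.{0,22}$/d'
}
```

Usage would look like: clean_corpus < uzwiki.extracted.txt > uzwiki.clean.txt (both file names are placeholders).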
  +
  +
== GitHub Stuff ==
  +
* Remote and branch naming convention:
** upstream - the remote pointing to the original repo you forked from
** origin - the remote pointing to your own fork on GitHub (the repo you cloned)
** master - the default branch, which exists in upstream, in your fork and in your local clone
* To sync with all remotes:
** git fetch --all
  +
* To see the remote links:
  +
** git remote -v
  +
* To be able to sync with the original repo:
  +
** Adding an upstream remote:
  +
***git remote add upstream https://github.com/apertium/apertium-uzb.git
  +
** With fetch & merge:
  +
***Fetching the upstream to your project:
  +
**** git fetch upstream
  +
***Merging changes:
  +
**** git merge upstream/master
  +
** Or, just do:
  +
***git pull upstream master
  +
** Apparently, “git pull” contains:
  +
***git fetch
  +
***git merge
  +
* To reset the repo to previous version:
  +
** git reset --hard <commit hash>
  +
** !!! You lose all commits made after that commit
  +
** Discard all local changes to all files permanently:
  +
***git reset --hard
  +
* To merge(squash) last two commits:
  +
** git rebase --interactive HEAD~2
  +
** Then, on the line of the last commit, change pick to squash, save, exit
  +
** Give the commit message you want
  +
** Done.
  +
  +
== Terminal stuff ==
  +
* Screen(Working with multiple screens):
  +
** Advantage: Ability to access from both SSH and local machine
  +
** Installation:
  +
***sudo apt-get install screen
  +
** Listing the active screens:
  +
***screen -ls
  +
** Starting a new session:
  +
***screen -S screenName
  +
** Exiting the session:
  +
***exit
  +
** Deactivating current screen:
  +
***Ctrl+A, press D
  +
** Returning to the background (deactivated) screen:
  +
***screen -r screenName
   
  +
== Useful Bash commands ==
* Reading file line by line and printing line with other format:
 while IFS= read -r line; do
     echo "Text read from file: $line"
 done < my_filename.txt
* Printing specific line (given line number) of a text:
** $ sed -n 5p file
** Returns 5th line
* Printing a specific column of a text file:
** awk -F":" '{print $1}' file.txt
** -F":" specifies the separator.

= Ongoing =
* Calculating initial naive coverage of monolingual apertium-uzb;
* Calculating initial naive coverage of bilingual apertium-tur-uzb;
   
= Done =
 
=== Community bonding period (May 4 - June 1): ===
 
*Getting closer with Apertium tools and community
 
*Finding out the current state of Uzbek language
 
*Finding out the availability of Uzbek resources
 
*Learning more about the HFST
 
*Doing coding challenge
 
*Begin interacting with Apertium's core system
 
   
  +
= Notes =

== NOTE FOR THE READER: ==
''''The notes start here. They are written from bottom to top, so the most recent action comes first.''''

Latest revision as of 15:20, 5 September 2020

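The Project Proposal can be seen here: https://wiki.apertium.org/wiki/User:Elmurod1202/GSoC2020_Proposal

The Final Report can be seen here: https://wiki.apertium.org/wiki/User:Elmurod1202/GSoC2020_Final_Report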

Status table

Week                    Stems           Tur-Uzb           Naïve Coverage    Progress
No.   Dates             uzb    tur-uzb  WER      PER      uzb      tur-uzb  Evaluation          Notes
0     May 4 - May 31    34375  2412     90.80 %  81.60 %  89.57 %  72.14 %  Initial evaluation  As of the end of May
5     June 29 - July 5  34373  2445     84.45 %  76.80 %  90.23 %  72.14 %  First Evaluation    End of June - ~July 3
9     July 27 - Aug 2   34424  4191     78.70 %  68.34 %  90.23 %  72.74 %  Second Evaluation   As of July 31 - Aug 1
10    Aug 3 - Aug 9     35621  5639     78.70 %  68.64 %  90.28 %  80.14 %  Weekly evaluation   Week #10
11    Aug 10 - Aug 16   37649  8154     78.70 %  68.64 %  90.46 %  83.08 %  Weekly evaluation   Week #11
12    Aug 17 - Aug 23   57406  13023    78.70 %  68.64 %  90.91 %  86.02 %  Weekly evaluation   Week #12
13    Aug 24 - Aug 30   58757  12861    78.70 %  68.64 %  90.94 %  86.03 %  Final evaluation    As of Aug 31

Apertium Notes

TODO

  • TESTVOC
  • Writing script to automatically make Lexc rules(for entries in bidix)

ONGOING

  • Insert entries from the Word Frequency List
  • Checking and correcting the entire bidix
  • Work on Lexical Selection rules.
  • Review: Nouns
  • Review: Proper nouns
  • Review: Postposition
  • Review: Pronouns
  • Review: Verbs
  • Review: Punctuation
  • Review: Numerals

NOTES

  • On this day of August 8:
    • 1876: Thomas Edison invents Autographic Printing;
    • 2020: I barely pass the 80% barrier on Trimmed Coverage :D
  • Creating a Word Frequency List from Corpus:
    • Aka: Hitparade;
  cat corpus.tr.wiki.20200601.txt | apertium-destxt | lt-proc -w ../apertium-tur-uzb/tur-uzb.automorf.bin | apertium-retxt | sed 's/\$\s*/\$\n/g' | grep '\*' | sort | uniq -c | sort -rn > tur-uzb.parade.txt
  • Review: Interjections - DONE!
  • Review: Determinatives - DONE!
  • Review: Conjunctions - DONE!
  • Long lagging stuff:
    • Fix Trimmed coverage - DONE!.
  • Review: Adverbs - DONE!
  • Add Section “Regexp” to bidix - DONE!
  • Add Section “Unchecked” to bidix - DONE!
  • Review: Adjectives - DONE!
  • Review: Abbreviations - DONE!
  • Combining checked and unchecked sections of bidix : DONE
  • Fixing the bidix, deduplication(24.07.2020):
    • Before the fix: Entries:4418
    • After the fix: Entries: 4000
  • Firespeaker:
    • elmurod1202: you already have a trimmed transducer
    • elmurod1202: just use it to analyse a corpus and generate a frequency list of unanalysed forms
    • elmurod1202: the trimmed transducer is the monolingual transducer limited to the words in bidix, it's tur-uzb.automorf.bin and uzb-tur.automorf.bin
    • there's scripts all over that do this already
    • it's a few lines of bash
  • Install Apertium-dixtools -DONE
  • Tur-Uzb Translation example:
  cat texts/tur.txt | apertium -d ../../apertium-tur-uzb/ tur-uzb
  • Crossing en-tr and en-uz to get tr-uz dictionary:
  awk 'FNR==NR{a[$1]=$2;next} ($1 in a) {print $1,a[$1],$2}' en-tr.sorted en-uz.sorted > tr-uz.txt
  • Update JaM Story - DONE!.
  • Remember: “Every time a set of modifications is made to any of the dictionaries, the modules have to be recompiled. Type make in the directory where the linguistic data are saved”
    • To do so, there is an easy way:
    • Just run this in apertium-tur-uzb directory: make langs
    • Even better?, Use: make -j3 langs
  • Helpful Quotes( :-) ):
    • “The best script is maybe(often) no script”(@Firespeaker)
    • “There is no stupid question, only stupid students”(@Spectei)
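
The hitparade pipeline from these notes can be kept as a small reusable function. This is a sketch under a few assumptions: it reads lt-proc output on stdin, unknown forms are marked as ^form/*form$, and the function name hitparade is made up for the example. It uses grep -o instead of the sed split from the notes, so any text between lexical units is dropped rather than kept on the line.

```shell
# Frequency list ("hitparade") of unknown forms from analyser output on stdin.
hitparade() {
    # 1. keep only lexical units whose analysis starts with "*" (unknown),
    #    one per line;
    # 2. count identical units;
    # 3. sort by descending frequency.
    grep -o '\^[^$]*/\*[^$]*\$' | sort | uniq -c | sort -rn
}
```

Usage would look like: cat corpus.tr.wiki.20200601.txt | apertium-destxt | lt-proc -w ../apertium-tur-uzb/tur-uzb.automorf.bin | apertium-retxt | hitparade > tur-uzb.parade.txt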

EVALUATION

  • Calculating Trimmed Coverage:
    • https://gitlab.com/unhammer/turkiccorpora/-/blob/master/dev/lt-covtest
  sh lt-covtest.sh tur-uzb ../apertium-tur-uzb ../corpora/tur.SETimes.en-tr.txt.bz2
    • As of Aug 1: 72.74 %
  • Calculating Coverage:
    • apertium-quality
  git clone https://github.com/apertium/apertium-quality
      • It works \o/, phew.
  aq-covtest texts/corpus.wiki.20200520.txt uzb_guesser.automorf.bin - 94.14%
  aq-covtest corpus.wiki.20200520.txt ../apertium-uzb/uzb.automorf.bin  - 89.36%
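    • Hfst-covtest
  bash hfst-covtest.sh uzb ../apertium-uzb/ ../corpora/corpus.wiki.20200520.txt
      • coverage: 12128223 / 13539389 (89.57 %)
      • remaining unknown forms: 1411166
  • Calculating WER/PER:
    • apertium-eval-translator
  perl apertium-eval-translator-line.pl -ref ./texts/uzb.txt -test ./texts/tur-uzb.txt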
  • Counting Stems:
    • Counting dix stems
      • Apertium-tools - dixcounter.py
  python3 dixcounter.py ../apertium-tur-uzb/apertium-tur-uzb.tur-uzb.dix 
      • Stems:
        • Beginning: 2412
        • Before fix: 2468
        • After deduplication: 2065
    • Counting Lexc Stems
  python3 lexccounter.py ../apertium-uzb/apertium-uzb.uzb.lexc 
      • Apertium-uzb.uzb.lexc: 34375
      • Apertium-tur.tur.lexc : 21634
      • Apertium-tur-uzb.uzb.lexc: 3922
      • apertium-tur-uzb.tur.lexc : 11206
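
For reference, WER in its usual definition is the word-level edit distance (substitutions + deletions + insertions) between the system output and the reference, divided by the number of words in the reference. A minimal sketch of that formula for a single sentence pair; the function name wer is made up, and apertium-eval-translator may differ in details such as normalisation and the handling of unknown-word marks.

```shell
# Word error rate (in percent) of a test sentence against a reference sentence.
# $1: reference sentence, $2: test (system output) sentence.
wer() {
    awk -v ref="$1" -v hyp="$2" 'BEGIN {
        n = split(ref, r, " ")   # reference words
        m = split(hyp, h, " ")   # test words
        if (n == 0) { print "empty reference"; exit 1 }
        # d[i, j] = edit distance between the first i reference words
        # and the first j test words (classic Levenshtein table).
        for (i = 0; i <= n; i++) d[i, 0] = i
        for (j = 0; j <= m; j++) d[0, j] = j
        for (i = 1; i <= n; i++)
            for (j = 1; j <= m; j++) {
                # Force string comparison so that "1" and "1.0" differ.
                cost = ((r[i] "") == (h[j] "")) ? 0 : 1
                best = d[i-1, j-1] + cost                        # match or substitution
                if (d[i-1, j] + 1 < best) best = d[i-1, j] + 1   # deletion
                if (d[i, j-1] + 1 < best) best = d[i, j-1] + 1   # insertion
                d[i, j] = best
            }
        printf "%.2f\n", 100 * d[n, m] / n
    }'
}
```

For example, wer "the cat sat on the mat" "the cat sit on mat" needs one substitution and one deletion against six reference words.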


SETTING UP INSTRUCTIONS

  • Apertium-dixtools:
    • Fixing bidix (deduplicating, removing empty lines):
  apertium-dixtools fix apertium-tur-uzb.tur-uzb.dix apertium-tur-uzb.tur-uzb.dix.fixed
    • Sorting bidix(aligning too):
  apertium-dixtools sort -alignBidix -ignorecase apertium-tur-uzb.tur-uzb.dix apertium-tur-uzb.tur-uzb.dix.sorted 
  • Extracting Wikipedia corpus:
  wget https://dumps.wikimedia.org/uzwiki/20200520/uzwiki-20200520-pages-articles.xml.bz2
    • Apertium-tools: WikiExtractor:
  python3 apertium-WikiExtractor.py --infn uzwiki.xml.bz2
  • Translation of a sentence:
  echo 'Salom dunyo' | apertium -d . uzb-tur
  • Translation of a text file:
  cat texts/tur.txt |apertium -d . tur-uzb > ./texts/tur-uzb.txt
  cat texts/uzb.txt |apertium -d . uzb-tur > ./texts/uzb-tur.txt

    • Configure, build:
  ./autogen.sh --with-lang1=../apertium-tur --with-lang2=../apertium-uzb
  make

      • For a mono, it’s just:
  ./autogen.sh
  make
  • Forked(&Installed) necessary repos on GitHub:
    • Apertium-uzb
      • Monolingual package for Uzbek
    • Apertium-tur
      • Monolingual package for Turkish
    • Apertium-tur-uzb
      • Bilingual package for Turkish-Uzbek translation
    • apertiumpp
      • For parallel corpora used in evaluation
    • Apertium-quality
      • Name says it, to evaluate
        • aq-covtest - checks coverage
      • Also, for extracting wiki, but it didn’t work for me
        • aq-wikiextract - extracts wiki corpora from wiki dump
    • apertium-eval-translator
      • To evaluate WER/PER
    • apertium-viewer
      • To have a nice GUI, but you basically don’t need that
    • Apertium-dixtools
      • To sort/fix dictionaries

GitHub Stuff

  • Remote and branch naming convention:
    • upstream - the remote pointing to the original repo you forked from
    • origin - the remote pointing to your own fork on GitHub (the repo you cloned)
    • master - the default branch, which exists in upstream, in your fork and in your local clone
  • To sync with all remotes:
    • git fetch --all
  • To see the remote links:
    • git remote -v
  • To be able to sync with the original repo:
    • Adding an upstream remote:
      • git remote add upstream https://github.com/apertium/apertium-uzb.git
    • With fetch & merge:
      • Fetching the upstream to your project:
        • git fetch upstream
      • Merging changes:
        • git merge upstream/master
    • Or, just do:
      • git pull upstream master
    • Apparently, “git pull” contains:
      • git fetch
      • git merge
  • To reset the repo to previous version:
    • git reset --hard <commit hash>
    • !!! You lose all commits made after that commit
    • Discard all local changes to all files permanently:
      • git reset --hard
  • To merge(squash) last two commits:
    • git rebase --interactive HEAD~2
    • Then, on the line of the last commit, change pick to squash, save, exit
    • Give the commit message you want
    • Done.
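
The fetch-and-merge steps above can be collected into one helper. A sketch only: the function name sync_with_upstream is made up, it has to be run inside the local clone, and it assumes the branch to merge is master unless another name is passed.

```shell
# Bring the current branch up to date with the original (upstream) repository.
# $1: URL or path of the upstream repo, $2: branch to merge (default: master).
sync_with_upstream() {
    # Register the "upstream" remote only if it does not exist yet.
    git remote get-url upstream >/dev/null 2>&1 || git remote add upstream "$1"
    # Download the new upstream commits, then merge them into the current branch.
    git fetch upstream
    git merge "upstream/${2:-master}"
}
```

Usage would look like: sync_with_upstream https://github.com/apertium/apertium-uzb.git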

Terminal stuff

  • Screen(Working with multiple screens):
    • Advantage: Ability to access from both SSH and local machine
    • Installation:
      • sudo apt-get install screen
    • Listing the active screens:
      • screen -ls
    • Starting a new session:
      • screen -S screenName
    • Exiting the session:
      • exit
    • Deactivating current screen:
      • Ctrl+A, press D
    • Returning to the background (deactivated) screen:
      • screen -r screenName

Useful Bash commands

  • Reading file line by line and printing line with other format:

while IFS= read -r line; do

   echo "Text read from file: $line"

done < my_filename.txt

  • Printing specific line (given line number) of a text:
    • $ sed -n 5p file
    • Returns 5th line
  • Printing a specific column of a text file:
    • awk -F":" '{print $1}' file.txt
    • -F":" specifies the separator.
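
The three snippets above can also be written as small functions with the file, line number and separator passed in as arguments. The function names are made up for this sketch.

```shell
# Print line number $1 of file $2.
print_line() {
    sed -n "${1}p" "$2"
}

# Print column $2 of file $3, using $1 as the field separator.
print_column() {
    awk -F"$1" -v col="$2" '{ print $col }' "$3"
}

# Read file $1 line by line and print every line with a prefix.
prefix_lines() {
    while IFS= read -r line; do
        printf 'Text read from file: %s\n' "$line"
    done < "$1"
}
```

Usage would look like: print_line 5 file, or print_column ":" 1 file.txt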


NOTE FOR THE READER:

'The notes start here. They are written from bottom to top, so the most recent action comes first.'