User:Elmurod1202/GSoC2020Progress
The Project Proposal can be seen [https://wiki.apertium.org/wiki/User:Elmurod1202/GSoC2020_Proposal here].

The Final Report can be seen [https://wiki.apertium.org/wiki/User:Elmurod1202/GSoC2020_Final_Report here].

= Status table =
{| class="wikitable"
!colspan="2"|Week
!colspan="2"|Stems
!colspan="2"|Tur-Uzb
!colspan="2"|Naïve Coverage
!colspan="2"|Progress
|-
! №
! Dates
! uzb
! tur-uzb
! WER
! PER
! uzb
! tur-uzb
! Evaluation
! Notes
|-
| 0
| May 4 - May 31
| 34375
| 2412
| 90.80 %
| 81.60 %
| 89.57 %
| 72.14 %
| Initial evaluation
| As of the end of May
|-
| 5
| June 29 - July 5
| 34373
| 2445
| 84.45 %
| 76.80 %
| 90.23 %
| 72.14 %
| First Evaluation
| End of June - ~July 3
|-
| 9
| July 27 - Aug 2
| 34424
| 4191
| 78.70 %
| 68.34 %
| 90.23 %
| 72.74 %
| Second Evaluation
| As of July 31 - Aug 1
|-
| 10
| Aug 3 - Aug 9
| 35621
| 5639
| 78.70 %
| 68.64 %
| 90.28 %
| 80.14 %
| Weekly evaluation
| Week #10
|-
| 11
| Aug 10 - Aug 16
| 37649
| 8154
| 78.70 %
| 68.64 %
| 90.46 %
| 83.08 %
| Weekly evaluation
| Week #11
|-
| 12
| Aug 17 - Aug 23
| 57406
| 13023
| 78.70 %
| 68.64 %
| 90.91 %
| 86.02 %
| Weekly evaluation
| Week #12
|-
| 13
| Aug 24 - Aug 30
| 58757
| 12861
| 78.70 %
| 68.64 %
| 90.94 %
| 86.03 %
| Final evaluation
| As of Aug 31
|}
= Apertium Notes =

== TODO ==
* TESTVOC
* Writing a script to automatically make lexc rules (for entries in bidix)

== ONGOING ==
* Insert entries from the Word Frequency List
* Checking and correcting the entire bidix
* Work on Lexical Selection rules
* Review: Nouns
* Review: Proper nouns
* Review: Postpositions
* Review: Pronouns
* Review: Verbs
* Review: Punctuation
* Review: Numerals
== NOTES ==

* On this day, August 8:
** 1876: Thomas Edison invents autographic printing;
** 2020: I barely pass the 80% barrier on trimmed coverage :D

* Creating a word frequency list from a corpus (aka "hitparade"):
 cat corpus.tr.wiki.20200601.txt | apertium-destxt | lt-proc -w ../apertium-tur-uzb/tur-uzb.automorf.bin | apertium-retxt | sed 's/\$\s*/\$\n/g' | grep '\*' | sort | uniq -c | sort -rn > tur-uzb.parade.txt

* Review: Interjections - DONE!
* Review: Determinatives - DONE!
* Review: Conjunctions - DONE!

* TurkicCorpora:
** https://gitlab.com/unhammer/turkiccorpora
** SETimes Turkish corpus: tur.SETimes.en-tr.txt.bz2

* Long-lagging stuff:
** Fix trimmed coverage - DONE!

* Review: Adverbs - DONE!
* Add section "Regexp" to bidix - DONE!
* Add section "Unchecked" to bidix - DONE!
* Review: Adjectives - DONE!
* Review: Abbreviations - DONE!
* Combining checked and unchecked sections of bidix - DONE!

* Fixing the bidix, deduplication (24.07.2020):
** Before the fix: 4418 entries
** After the fix: 4000 entries

* Firespeaker:
** elmurod1202: you already have a trimmed transducer
** elmurod1202: just use it to analyse a corpus and generate a frequency list of unanalysed forms
** elmurod1202: the trimmed transducer is the monolingual transducer limited to the words in bidix; it's tur-uzb.automorf.bin and uzb-tur.automorf.bin
** there are scripts all over that do this already
** it's a few lines of bash
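The "few lines of bash" above can be sketched as follows. The analysis step itself needs a compiled pair, so this toy version fakes the analyser output: `analysed.txt` and the sample words are made up, standing in for lt-proc output split into one token per line, with unanalysed forms carrying a leading '*'.

```shell
# Sketch: from a stream of analysed tokens to a descending frequency list of
# unanalysed forms. analysed.txt is a toy stand-in for real analyser output,
# one token per line, unanalysed forms marked with a leading '*'.
printf '*qalam\nuy\n*qalam\n*daftar\nsuv\n' > analysed.txt
grep '^\*' analysed.txt | sort | uniq -c | sort -rn > hitparade.txt
cat hitparade.txt   # most frequent unanalysed form first: 2 *qalam, 1 *daftar
```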
* Install apertium-dixtools - DONE
* Tur-Uzb translation example:
 cat texts/tur.txt | apertium -d ../../apertium-tur-uzb/ tur-uzb
* Correct names in a story - DONE!
** Anthroponyms
* Lexicon for Turkic languages:
** https://wiki.apertium.org/wiki/Turkic_lexicon
* Firespeaker: entries that have a number script on them - their categories were guessed by https://github.com/IlnarSelimcan/dot/blob/master/lexikograf.py
* Calculate coverage for tur-uzb - got an error there - SOLVED!
* Crossing en-tr and en-uz to get a tr-uz dictionary:
 awk 'FNR==NR{a[$1]=$2;next} ($1 in a) {print $1,a[$1],$2}' en-tr.sorted en-uz.sorted > tr-uz.txt
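A toy run of the crossing one-liner, with made-up word pairs, shows the join behaviour: only English headwords present in both files survive, and columns 2-3 of the output are the resulting tr-uz pairs.

```shell
# Toy demonstration of the dictionary-crossing one-liner above.
# The word pairs are made up for illustration.
printf 'book kitap\nwater su\nsun gunes\n' > en-tr.sorted
printf 'book kitob\nwater suv\nmoon oy\n' > en-uz.sorted
# First pass (FNR==NR) loads en->tr into a[]; second pass prints
# "en tr uz" for every English word seen in both files.
awk 'FNR==NR{a[$1]=$2;next} ($1 in a) {print $1,a[$1],$2}' en-tr.sorted en-uz.sorted > tr-uz.txt
cat tr-uz.txt
# → book kitap kitob
# → water su suv
```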
* Update JaM story - DONE!

* ''Remember:'' '''"Every time a set of modifications is made to any of the dictionaries, the modules have to be recompiled. Type make in the directory where the linguistic data are saved."'''
** There is an easy way to do so: just run make langs in the apertium-tur-uzb directory
** Even better, use: make -j3 langs

* ''Helpful quotes ( :-) ):''
** "The best script is maybe (often) no script" (@Firespeaker)
** "There is no stupid question, only stupid students" (@Spectei)

== EVALUATION ==
* Calculating trimmed coverage:
** https://gitlab.com/unhammer/turkiccorpora/-/blob/master/dev/lt-covtest
 sh lt-covtest.sh tur-uzb ../apertium-tur-uzb ../corpora/tur.SETimes.en-tr.txt.bz2
** As of Aug 1: 72.74 %
* Calculating coverage:
** apertium-quality
 git clone https://github.com/apertium/apertium-quality
*** It works \o/, phew.
 aq-covtest texts/corpus.wiki.20200520.txt uzb_guesser.automorf.bin - 94.14%
 aq-covtest corpus.wiki.20200520.txt ../apertium-uzb/uzb.automorf.bin - 89.36%
** hfst-covtest
*** https://gist.github.com/IlnarSelimcan/87bef975e919836e90865e44935a6bd7#file-hfst-covtest
 bash hfst-covtest.sh uzb ../apertium-uzb/ ../corpora/corpus.wiki.20200520.txt
*** Coverage: 12128223 / 13539389 (89.57 %)
*** Remaining unknown forms: 1411166
*** Turkish mono coverage: 64272995 / 73673100 (~87.24 %)
** The way @piraye calculates coverage:
*** https://github.com/sevilaybayatli/Coverage-evaluation/blob/master/Coverage.txt
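For reference, naive coverage is just the share of tokens the analyser recognises. Below is a minimal sketch under the same convention as the hitparade pipeline above (one token per line, unknown forms marked with a leading '*'); the file name and sample words are made up.

```shell
# Minimal naive-coverage sketch: percentage of tokens that got an analysis.
# Assumes one token per line, unknown forms marked with a leading '*'.
coverage() {
    total=$(grep -c . "$1")        # non-empty lines = tokens
    unknown=$(grep -c '^\*' "$1")  # unanalysed forms
    awk -v t="$total" -v u="$unknown" 'BEGIN{printf "%.2f %%\n", 100*(t-u)/t}'
}
printf 'uy\n*qalam\nsuv\ndaryo\n' > toy.txt   # 4 tokens, 1 unknown
coverage toy.txt
# → 75.00 %
```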
* Calculating WER/PER:
** apertium-eval-translator
 perl apertium-eval-translator-line.pl -ref ./texts/uzb.txt -test ./texts/tur-uzb.txt
** The parallel corpus was taken from the James and Martin story:
*** https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora/jam

* Counting stems:
** Counting dix stems
*** apertium-tools - dixcounter.py
 python3 dixcounter.py ../apertium-tur-uzb/apertium-tur-uzb.tur-uzb.dix
*** Stems:
**** Beginning: 2412
**** Before fix: 2468
**** After deduplication: 2065
** Counting lexc stems
 python3 lexccounter.py ../apertium-uzb/apertium-uzb.uzb.lexc
*** apertium-uzb.uzb.lexc: 34375
*** apertium-tur.tur.lexc: 21634
*** apertium-tur-uzb.uzb.lexc: 3922
*** apertium-tur-uzb.tur.lexc: 11206
== SETTING UP INSTRUCTIONS ==
* apertium-dixtools:
** Fixing the bidix (deduplicating, removing empty lines):
 apertium-dixtools fix apertium-tur-uzb.tur-uzb.dix apertium-tur-uzb.tur-uzb.dix.fixed
** Sorting the bidix (aligning too):
 apertium-dixtools sort -alignBidix -ignorecase apertium-tur-uzb.tur-uzb.dix apertium-tur-uzb.tur-uzb.dix.sorted
* Extracting a Wikipedia corpus:
 wget https://dumps.wikimedia.org/uzwiki/20200520/uzwiki-20200520-pages-articles.xml.bz2
** apertium-quality's aq-wikiextract
*** Note: did not work for me; error "NameError: name 'strip_comments' is not defined" - must be a library conflict problem
** GitHub: https://github.com/attardi/wikiextractor
*** Did work, but produced too much: the result included all history, so a single text was repeated many times; could not work it out.
** GitHub: https://github.com/bwbaugh/wikipedia-extractor
*** Did work. \o/ 149K articles (as of June 2020)
*** To clean <doc...></doc> tags: sed -e 's/<[^>]*>//g' file.html
*** To delete empty lines: sed -i '/^$/d' uzwiki.clean.txt
*** To remove lines of 22 characters or fewer (they are useless): sed -r '/^.{,22}$/d' uzwiki.clean.txt > uzwiki.clean2.txt
*** Result: an Uzbek wiki corpus with >9M words
*** Note: this extraction was not used in the end, though.
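The three sed steps above can be chained into a single pass; here is a sketch on a toy file (file names made up, and the interval written as {0,22} with an explicit lower bound for portability):

```shell
# One-pass version of the cleanup steps above, on a toy input:
# strip tags, drop empty lines, drop lines of 22 characters or fewer.
printf '<doc id="1">\nThis line is long enough to survive the filter.\n\nshort\n</doc>\n' > raw.txt
sed -e 's/<[^>]*>//g' raw.txt \
  | sed '/^$/d' \
  | sed -E '/^.{0,22}$/d' \
  > corpus.clean.txt
cat corpus.clean.txt   # only the long line survives
```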
** apertium-tools WikiExtractor:
*** Wiki: https://wiki.apertium.org/wiki/WikiExtractor
*** Code: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py
*** Works; the output is just a clean corpus, simple and neat
 python3 apertium-WikiExtractor.py --infn uzwiki.xml.bz2
*** Wiki dump: https://dumps.wikimedia.org/uzwiki/20200520/uzwiki-20200520-pages-articles.xml.bz2
*** Will be using this corpus for evaluations.
* Translation of a sentence:
 echo 'Salom dunyo' | apertium -d . uzb-tur
* Translation of a text file:
 cat texts/tur.txt | apertium -d . tur-uzb > ./texts/tur-uzb.txt
 cat texts/uzb.txt | apertium -d . uzb-tur > ./texts/uzb-tur.txt

* Apertium installation:
** Installed the Apertium core using packaging:
 wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
 sudo apt-get -f install apertium-all-dev
** Installed language data by compiling:
 git clone https://github.com/elmurod1202/apertium-tur-uzb.git
 git clone https://github.com/elmurod1202/apertium-uzb.git
 git clone https://github.com/elmurod1202/apertium-tur.git
** Configure, build:
*** Note: a language pair is structured as three different folders containing their own data, in my case:
**** the apertium-uzb folder
**** the apertium-tur folder
**** the apertium-tur-uzb folder
*** cd apertium-tur-uzb, then:
 ./autogen.sh --with-lang1=../apertium-tur --with-lang2=../apertium-uzb
 make
*** For a monolingual package, it's just:
 ./autogen.sh
 make
*** Optional: apertium-viewer
**** Just to see the steps of translation in a user-friendly way
**** https://wiki.apertium.org/wiki/Apertium-viewer
**** java -Xmx500m -jar apertium-viewer.jar
+ | |||
+ | * Forked(&Installed) necessary repos on GitHub: |
||
+ | ** Apertium-uzb |
||
+ | ***Monolingual package for Uzbek |
||
+ | ** Apertium-tur |
||
+ | ***Monolingual package for Turkish |
||
+ | ** Apertium-tur-uzb |
||
+ | ***Bilingual package for Turkish-Uzbek translation |
||
+ | ** apertiumpp |
||
+ | ***For parallel corpora used in evaluation |
||
+ | ** Apertium-quality |
||
+ | ***Name says it, to evaluate |
||
+ | **** aq-covtest - checks coverage |
||
+ | ***Also, for extracting wiki, but it didn’t work for me |
||
+ | **** aq-wikiextract - extracts wiki corpora from wiki dump |
||
+ | ** apertium-eval-translator |
||
+ | ***To evaluate WER/PER |
||
+ | ** apertium-viewer |
||
+ | ***To have a nice GUI, but you basically don’t need that |
||
+ | ** Apertium-dixtools |
||
+ | ***To sort/fix dictionaries |
||
+ | |||
== GitHub Stuff ==
* Remote naming convention:
** upstream - a remote pointing at the original repo you forked from
** origin - a remote pointing at your own fork (your local clone tracks it)
** master - the main branch in either of them
* To sync with all remotes:
** git fetch --all
* To see the remote links:
** git remote -v
* To be able to sync with the original repo:
** Adding an upstream remote:
*** git remote add upstream https://github.com/apertium/apertium-uzb.git
** With fetch & merge:
*** Fetching the upstream into your project:
**** git fetch upstream
*** Merging changes:
**** git merge upstream/master
** Or just do:
*** git pull upstream master
** Apparently, "git pull" is just:
*** git fetch
*** git merge
* To reset the repo to a previous version:
** git reset --hard <commit hash>
** !!! You lose all commits made after it
** To discard all local changes to all files permanently:
*** git reset --hard
* To merge (squash) the last two commits:
** git rebase --interactive HEAD~2
** In the editor, change the last commit's pick to squash; save and exit
** Give the commit message you want
** Done.
== Terminal stuff ==
* screen (working with multiple screens):
** Advantage: the ability to access a session from both SSH and the local machine
** Installation:
*** sudo apt-get install screen
** Listing the active screens:
*** screen -ls
** Starting a new session:
*** screen -S screenName
** Exiting the session:
*** exit
** Detaching the current screen:
*** Ctrl+A, then press D
** Returning to a background (detached) screen:
*** screen -r screenName
== Useful Bash commands ==
* Reading a file line by line and printing each line in another format:
 while IFS= read -r line; do
     echo "Text read from file: $line"
 done < my_filename.txt
* Printing a specific line (given its line number) of a text file:
** sed -n 5p file
** Returns the 5th line
* Printing a specific column of a text file:
** awk -F":" '{print $1}' file.txt
** -F":" specifies the separator.
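The two one-liners above can be checked on a small generated file (contents made up for illustration):

```shell
# Toy demonstration of the sed and awk recipes above.
printf 'a:1\nb:2\nc:3\nd:4\ne:5\n' > demo.txt
sed -n 5p demo.txt               # prints the 5th line
# → e:5
awk -F":" '{print $1}' demo.txt  # prints the first column, one item per line
# → a b c d e
```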
== NOTE FOR THE READER ==
'''Here it started: notes are added from bottom to top, so the most recent action comes first.'''
''Latest revision as of 15:20, 5 September 2020.''