The original proposal can be seen here

Status table

Week №	Dates	Stems (uzb)	Stems (tur-uzb)	Tur-Uzb WER	Tur-Uzb PER	Naïve coverage (uzb)	Naïve coverage (tur-uzb)	Evaluation	Notes
0	May 4 - May 31	34375	2412	90.80 %	81.60 %	89.57 %	72.14 %	Initial evaluation	As of the end of May
5	June 29 - July 5	34373	2445	84.45 %	76.80 %	90.23 %	72.14 %	First Evaluation	End of June - ~July 3
9	July 27 - Aug 2	34424	4191	78.70 %	68.34 %	90.23 %	72.74 %	Second Evaluation	As of July 31 - Aug 1
10	Aug 3 - Aug 9	35621	5639	78.70 %	68.64 %	90.28 %	80.14 %	Weekly evaluation	Week #10
11	Aug 10 - Aug 16	37649	8154	78.70 %	68.64 %	90.46 %	83.08 %	Weekly evaluation	Week #11
12	Aug 17 - Aug 23	42351	13023	78.70 %	68.64 %	90... %	86.02 %	Not yet finished	Current Week #12
13	Aug 24 - Aug 30

Apertium Notes

TODO

TESTVOC
Write a script to automatically generate lexc rules (for entries in the bidix)

ONGOING

Insert entries from the Word Frequency List
Checking and correcting the entire bidix
Work on Lexical Selection rules.
Review: Nouns
Review: Proper nouns
Review: Postposition
Review: Pronouns
Review: Verbs
Review: Punctuation
Review: Numerals

NOTES

On this day, August 8:
- 1876: Thomas Edison invents Autographic Printing;
- 2020: I barely pass the 80% barrier on trimmed coverage :D

Creating a word frequency list of unanalysed forms from a corpus:
- Also known as a hitparade.

  cat corpus.tr.wiki.20200601.txt | apertium-destxt | lt-proc -w ../apertium-tur-uzb/tur-uzb.automorf.bin | apertium-retxt | sed 's/\$\s*/\$\n/g' | grep '\*' | sort | uniq -c | sort -rn > tur-uzb.parade.txt
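
The grep/sort/uniq tail of the pipeline above can be sketched in Python. This is a minimal illustration, assuming the analyser output follows the usual Apertium stream format in which an unknown word looks like ^surface/*surface$. The function name unknown_form_counts and the sample string are made up for the example.

```python
import re
from collections import Counter

# Matches one unanalysed lexical unit, e.g. "^kitap/*kitap$".
# Assumption: unknown forms carry a "*" right after the slash.
UNKNOWN_UNIT = re.compile(r"\^([^/^$]+)/\*[^$]*\$")


def unknown_form_counts(analysed_text):
    """Count the surface forms the analyser could not analyse."""
    return Counter(m.group(1) for m in UNKNOWN_UNIT.finditer(analysed_text))


if __name__ == "__main__":
    # Hypothetical analyser output; the tags are illustrative only.
    sample = "^Bu/bu<det>$ ^kitap/*kitap$ ^ev/ev<n>$ ^kitap/*kitap$"
    # Most frequent unknown forms first, like `sort | uniq -c | sort -rn`.
    for form, count in unknown_form_counts(sample).most_common():
        print(count, form)
```

The most frequent unknown forms are the ones worth adding to the bidix first, which is the point of the hitparade.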

Review: Interjections - DONE!
Review: Determinatives - DONE!
Review: Conjunctions - DONE!

TurkicCorpora:
- https://gitlab.com/unhammer/turkiccorpora
- SETimes Turkish Corpus: tur.SETimes.en-tr.txt.bz2

Long lagging stuff:
- Fix Trimmed coverage - DONE!.

Review: Adverbs - DONE!
Add Section “Regexp” to bidix - DONE!
Add Section “Unchecked” to bidix - DONE!
Review: Adjectives - DONE!
Review: Abbreviations - DONE!
Combining checked and unchecked sections of bidix : DONE

Fixing the bidix, deduplication (24.07.2020):
- Before the fix: 4418 entries
- After the fix: 4000 entries

Firespeaker:
- elmurod1202: you already have a trimmed transducer
- elmurod1202: just use it to analyse a corpus and generate a frequency list of unanalysed forms
- elmurod1202: the trimmed transducer is the monolingual transducer limited to the words in bidix, it's tur-uzb.automorf.bin and uzb-tur.automorf.bin
- there's scripts all over that do this already
- it's a few lines of bash

Install Apertium-dixtools -DONE
Tur-Uzb Translation example:

  cat texts/tur.txt | apertium -d ../../apertium-tur-uzb/ tur-uzb

Correct names in a story -- Done!
- Anthroponyms
Lexicon for Turkic languages:
- https://wiki.apertium.org/wiki/Turkic_lexicon
Firespeaker: Entries that have a number script on them - their categories were guessed by https://github.com/IlnarSelimcan/dot/blob/master/lexikograf.py
Calculate Coverage for tur-uzb - Got an error there -SOLVED!.
Crossing en-tr and en-uz to get tr-uz dictionary:

  awk 'FNR==NR{a[$1]=$2;next} ($1 in a) {print $1,a[$1],$2}' en-tr.sorted en-uz.sorted > tr-uz.txt
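
What the awk one-liner does may be easier to see spelled out. The sketch below assumes both inputs are whitespace-separated "english translation" lines, as the awk fields $1 and $2 imply. The function name cross_dictionaries and the sample words are illustrative only.

```python
def cross_dictionaries(en_tr_lines, en_uz_lines):
    """Join en-tr and en-uz word lists on the shared English word."""
    pivot = {}
    for line in en_tr_lines:
        fields = line.split()
        if len(fields) >= 2:
            # As in the awk version, a later line overwrites an earlier
            # translation of the same English word.
            pivot[fields[0]] = fields[1]

    crossed = []
    for line in en_uz_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] in pivot:
            english, uzbek = fields[0], fields[1]
            crossed.append((english, pivot[english], uzbek))
    return crossed


if __name__ == "__main__":
    en_tr = ["house ev", "water su", "book kitap"]
    en_uz = ["house uy", "water suv", "sky osmon"]
    for english, turkish, uzbek in cross_dictionaries(en_tr, en_uz):
        print(english, turkish, uzbek)
```

Unlike the awk command, this sketch skips lines with fewer than two fields. Because only one Turkish translation is kept per English word, polysemous entries are lost, which is one reason the crossed dictionary still needs manual checking.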

Update JaM Story - DONE!.

Remember: “Every time a set of modifications is made to any of the dictionaries, the modules have to be recompiled. Type make in the directory where the linguistic data are saved”
- The easy way: run make langs in the apertium-tur-uzb directory.
- Faster: make -j3 langs (runs up to three build jobs in parallel).

Helpful quotes :-)
- “The best script is maybe (often) no script” (@Firespeaker)
- “There is no stupid question, only stupid students” (@Spectei)

EVALUATION

Calculating Trimmed Coverage:
- https://gitlab.com/unhammer/turkiccorpora/-/blob/master/dev/lt-covtest

  sh lt-covtest.sh tur-uzb ../apertium-tur-uzb ../corpora/tur.SETimes.en-tr.txt.bz2

- As of Aug 1: 72.74 %
Calculating Coverage:
- apertium-quality

  git clone https://github.com/apertium/apertium-quality

  - It works \o/, phew.

  aq-covtest texts/corpus.wiki.20200520.txt uzb_guesser.automorf.bin   # result: 94.14%
  aq-covtest corpus.wiki.20200520.txt ../apertium-uzb/uzb.automorf.bin   # result: 89.36%

- Hfst-covtest
  - https://gist.github.com/IlnarSelimcan/87bef975e919836e90865e44935a6bd7#file-hfst-covtest

  bash hfst-covtest.sh uzb ../apertium-uzb/ ../corpora/corpus.wiki.20200520.txt

  - coverage: 12128223 / 13539389 (89.57 %)
  - remaining unknown forms: 1411166
  - Turkish mono coverage:
    - 64272995 / 73673100 (~87.24 %)
- The way @piraye calculates coverage:
  - https://github.com/sevilaybayatli/Coverage-evaluation/blob/master/Coverage.txt
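
The coverage figures in this section are all the same ratio: tokens that received at least one analysis, divided by all tokens in the corpus. A small sketch of that arithmetic, using the counts reported above (the function names are made up for the example and are not part of any of the tools):

```python
def coverage(known_tokens, total_tokens):
    """Naive coverage: share of corpus tokens that got an analysis."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return known_tokens / total_tokens


def unknown_tokens(known_tokens, total_tokens):
    """Number of tokens that are still unanalysed."""
    return total_tokens - known_tokens


if __name__ == "__main__":
    # Counts copied from the hfst-covtest output quoted above.
    print("uzb:", coverage(12128223, 13539389), unknown_tokens(12128223, 13539389))
    print("tur:", coverage(64272995, 73673100))
```

Trimmed coverage is the same calculation, only run with the trimmed transducer (tur-uzb.automorf.bin) instead of the monolingual one.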
Calculating WER/PER:
- apertium-eval-translator

  perl apertium-eval-translator-line.pl -ref ./texts/uzb.txt -test ./texts/tur-uzb.txt

- The parallel corpus was taken from the James and Martin story
  - https://github.com/taruen/apertiumpp/tree/master/data4apertium/corpora/jam
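
For readers unfamiliar with the two metrics, here is a rough sketch of how they can be computed. It is not the code of apertium-eval-translator: WER is taken as the word-level edit distance divided by the reference length, and PER uses one common position-independent formulation based on bag-of-words overlap. Both definitions and all function names here are assumptions of this example.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: edits needed to reach the reference, per reference word."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def per(reference, hypothesis):
    """Position-independent error rate: like WER, but ignoring word order."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


if __name__ == "__main__":
    print(wer("a b c d", "a x c"), per("a b c d", "a x c"))
```

Because PER ignores word order, it is never higher than WER for the same pair of texts, which matches the pattern in the status table.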

Counting Stems:
- Counting dix stems
  - Apertium-tools - dixcounter.py

  python3 dixcounter.py ../apertium-tur-uzb/apertium-tur-uzb.tur-uzb.dix

  - Stems:
    - Beginning: 2412
    - Before fix: 2468
    - After deduplication: 2065
- Counting Lexc Stems

  python3 lexccounter.py ../apertium-uzb/apertium-uzb.uzb.lexc

  - apertium-uzb.uzb.lexc: 34375
  - apertium-tur.tur.lexc: 21634
  - apertium-tur-uzb.uzb.lexc: 3922
  - apertium-tur-uzb.tur.lexc: 11206
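
To see why the bidix stem count drops after deduplication, one can count entries and distinct left/right pairs directly. The sketch below is not how dixcounter.py works internally. It only assumes the usual bidix layout of e elements holding p/l and p/r sides with s tags; the function names and the sample entries are illustrative.

```python
import xml.etree.ElementTree as ET


def side_key(node):
    """Reduce one side of an entry to (lemma text, tuple of tag names)."""
    text = "".join(node.itertext())
    tags = tuple(s.get("n") for s in node.iter("s"))
    return text, tags


def count_bidix_entries(dix_xml):
    """Return (all l/r entries, distinct l/r entries) in a bidix string."""
    root = ET.fromstring(dix_xml)
    total = 0
    distinct = set()
    for entry in root.iter("e"):
        left, right = entry.find("p/l"), entry.find("p/r")
        if left is None or right is None:
            continue  # e.g. identity (<i>) or regexp entries are skipped here
        total += 1
        distinct.add((side_key(left), side_key(right)))
    return total, len(distinct)


if __name__ == "__main__":
    sample = """<dictionary><section id="main" type="standard">
    <e><p><l>ev<s n="n"/></l><r>uy<s n="n"/></r></p></e>
    <e><p><l>ev<s n="n"/></l><r>uy<s n="n"/></r></p></e>
    <e><p><l>su<s n="n"/></l><r>suv<s n="n"/></r></p></e>
    </section></dictionary>"""
    print(count_bidix_entries(sample))
```

The difference between the two numbers is the count of exact duplicates that a deduplication pass would remove.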

SETTING UP INSTRUCTIONS

Apertium-dixtools:
- Fixing the bidix (deduplicating, removing empty lines):

  apertium-dixtools fix apertium-tur-uzb.tur-uzb.dix apertium-tur-uzb.tur-uzb.dix.fixed

- Sorting the bidix (and aligning it):

  apertium-dixtools sort -alignBidix -ignorecase apertium-tur-uzb.tur-uzb.dix apertium-tur-uzb.tur-uzb.dix.sorted

Extracting Wikipedia corpus:

  wget https://dumps.wikimedia.org/uzwiki/20200520/uzwiki-20200520-pages-articles.xml.bz2

- apertium-quality project: aq-wikiextract
  - Note: did not work for me; it failed with “NameError: name 'strip_comments' is not defined”, probably a library conflict.
- GitHub: https://github.com/attardi/wikiextractor
  - Worked, but produced too much: the result included the full revision history, so each text was repeated many times, and I could not work around it.
- GitHub: https://github.com/bwbaugh/wikipedia-extractor
  - Worked \o/. 149K articles (as of June 2020).
  - To strip the <doc...></doc> tags: sed -e 's/<[^>]*>//g' file.html
  - To delete empty lines: sed -i '/^$/d' uzwiki.clean.txt
  - To remove lines of 22 characters or fewer (they are useless): sed -r '/^.{,22}$/d' uzwiki.clean.txt > uzwiki.clean2.txt
  - Result: an Uzbek wiki corpus with >9M words.
  - Note: this extraction was not used in the end.
- Apertium-tools: WikiExtractor:
  - Wiki: https://wiki.apertium.org/wiki/WikiExtractor
  - Code: https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/WikiExtractor.py
  - Works, output is just a clean corpus, simple and neat

  python3 apertium-WikiExtractor.py --infn uzwiki.xml.bz2

  - Wiki dump: https://dumps.wikimedia.org/uzwiki/20200520/uzwiki-20200520-pages-articles.xml.bz2
  - This corpus will be used for evaluations.
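
The three sed clean-up steps listed for the wikipedia-extractor output (strip tags, drop empty lines, drop very short lines) can also be done in one pass. A minimal sketch, with clean_corpus as a made-up name. The 22-character threshold is taken from the sed command, and Python counts characters per code point, which may differ slightly from sed depending on the locale.

```python
import re

# Same pattern as the sed tag-stripping command: anything between < and >.
TAG = re.compile(r"<[^>]*>")


def clean_corpus(lines, max_short_length=22):
    """Strip markup tags, then drop empty lines and very short lines."""
    cleaned = []
    for line in lines:
        text = TAG.sub("", line.rstrip("\n"))
        # Lines of max_short_length characters or fewer are dropped;
        # this also covers lines that became empty after tag removal.
        if len(text) <= max_short_length:
            continue
        cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    raw = [
        '<doc id="1" title="Toshkent">\n',
        "Toshkent\n",
        "Toshkent O'zbekistonning poytaxti va eng yirik shahri.\n",
        "\n",
        "</doc>\n",
    ]
    for text in clean_corpus(raw):
        print(text)
```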
Translation of a sentence:

  echo 'Salom dunyo' | apertium -d . uzb-tur

Translation of a text file:

  cat texts/tur.txt | apertium -d . tur-uzb > ./texts/tur-uzb.txt
  cat texts/uzb.txt | apertium -d . uzb-tur > ./texts/uzb-tur.txt

Apertium Installation:
- Installed Apertium core using packaging
  - wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
  - sudo apt-get -f install apertium-all-dev
- Installed language data by compiling
  - git clone https://github.com/elmurod1202/apertium-tur-uzb.git
  - git clone https://github.com/elmurod1202/apertium-uzb.git
  - git clone https://github.com/elmurod1202/apertium-tur.git
- Configure, build:
  - Note: The structure of a language pair is three different folders containing their own data, in my case:
    - apertium-uzb folder
    - apertium-tur folder
    - apertium-tur-uzb folder
  - cd apertium-tur-uzb

  ./autogen.sh --with-lang1=../apertium-tur --with-lang2=../apertium-uzb
  make

  - For a monolingual package, it’s just:

  ./autogen.sh
  make

  - Optional: Apertium-viewer
    - Just to see steps of translation in a user-friendly way
    - https://wiki.apertium.org/wiki/Apertium-viewer
    - java -Xmx500m -jar apertium-viewer.jar

Forked(&Installed) necessary repos on GitHub:
- Apertium-uzb
  - Monolingual package for Uzbek
- Apertium-tur
  - Monolingual package for Turkish
- Apertium-tur-uzb
  - Bilingual package for Turkish-Uzbek translation
- apertiumpp
  - For parallel corpora used in evaluation
- Apertium-quality
  - As the name says, for evaluation
    - aq-covtest - checks coverage
  - Also, for extracting wiki, but it didn’t work for me
    - aq-wikiextract - extracts wiki corpora from wiki dump
- apertium-eval-translator
  - To evaluate WER/PER
- apertium-viewer
  - To have a nice GUI, but you basically don’t need that
- Apertium-dixtools
  - To sort/fix dictionaries

GitHub Stuff

Remote and branch naming conventions:
- upstream - the conventional remote name for the original repo you forked from
- origin - the default remote name for the repo you cloned (here, your fork on GitHub)
- master - the default branch name, both in your local clone and on each remote
To sync with all remotes:
- git fetch --all
To see the remote links:
- git remote -v
To be able to sync with the original repo:
- Adding an upstream remote:
  - git remote add upstream https://github.com/apertium/apertium-uzb.git
- With fetch & merge:
  - Fetching the upstream to your project:
    - git fetch upstream
  - Merging changes:
    - git merge upstream/master
- Or, just do:
  - git pull upstream master
- “git pull” runs these two steps in one go:
  - git fetch
  - git merge
To reset the repo to a previous version:
- git reset --hard <commit hash>
- !!! You lose all commits made after that one
- Discard all local changes to tracked files permanently:
  - git reset --hard
To merge (squash) the last two commits:
- git rebase --interactive HEAD~2
- Then, on the line for the last commit, change pick to squash, save, and exit
- Give the commit message you want
- Done.

Terminal stuff

Screen(Working with multiple screens):
- Advantage: Ability to access from both SSH and local machine
- Installation:
  - sudo apt-get install screen
- Listing the active screens:
  - screen -ls
- Starting a new session:
  - screen -S screenName
- Exiting the session:
  - exit
- Detaching from the current screen:
  - Ctrl+A, then D
- Returning to a detached (background) screen:
  - screen -r screenName

Useful Bash commands

Reading a file line by line and printing each line in another format:

while IFS= read -r line; do
   echo "Text read from file: $line"
done < my_filename.txt

Printing a specific line (given its line number) of a file:
- sed -n 5p file
- Returns the 5th line
Printing a specific column of a text file:
- awk -F":" '{print $1}' file.txt
- -F":" specifies the field separator.

NOTE FOR THE READER:

The notes start here: they were written from bottom to top, so the most recent action comes first.

User:Elmurod1202/GSoC2020Progress
