Instruction

Intro

This tool enriches bilingual dictionaries using a graph built from existing bilingual dictionaries. For example, suppose you want to translate église from French to Russian but your French-Russian dictionary has no such entry. You do have: FRA_église - CAT_església, FRA_église - SPA_iglesia, CAT_església - ENG_church, SPA_iglesia - ENG_church, ENG_church - RUS_церковь

Connecting these edges gives two paths: FRA_église - CAT_església - ENG_church - RUS_церковь and FRA_église - SPA_iglesia - ENG_church - RUS_церковь.
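The path search in this example can be sketched in plain Python. This is only an illustration: the tool itself relies on NetworkX, and the helper names `build_graph` and `simple_paths` are made up for this sketch.

```python
from collections import defaultdict

# Edges from the example above; the graph is treated as undirected.
EDGES = [
    ("FRA_église", "CAT_església"),
    ("FRA_église", "SPA_iglesia"),
    ("CAT_església", "ENG_church"),
    ("SPA_iglesia", "ENG_church"),
    ("ENG_church", "RUS_церковь"),
]


def build_graph(edges):
    """Return an adjacency mapping: node -> set of neighbouring nodes."""
    graph = defaultdict(set)
    for left, right in edges:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def simple_paths(graph, source, target, cutoff=4):
    """Yield every path without repeated nodes that has at most `cutoff` edges."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        if len(path) > cutoff:  # the path already contains `cutoff` edges
            continue
        for neighbour in sorted(graph[node]):
            if neighbour not in path:
                stack.append((neighbour, path + [neighbour]))


graph = build_graph(EDGES)
paths = sorted(simple_paths(graph, "FRA_église", "RUS_церковь"))
for path in paths:
    print(" - ".join(path))
```

Both printed paths end in RUS_церковь, so it becomes the candidate translation for église.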

Basic steps:

- 0: installing : cloning the repository, installing the required libraries
- 1-2: downloading : updating the list of dictionaries, downloading them
- 3: choosing languages : choose the languages you want to use in translation
- 4: preprocessing : dictionary data needs some changes before it can be used in a graph; this step prepares it for further use
- 5: configuration file : this file recommends which languages will be the most efficient for enriching a particular language pair
- 6: loading file : this file contains all edges of the graph and is used to load the graph faster
- 7: preview : this file shows the result of enrichment, i.e. all entries that can be added, together with the likelihood coefficients of these variants
- 8: conversion : this step creates an XML section that can be inserted into the original dictionary

Additional steps:

- Merging dialects : if you chose to specify dialects (so that they are treated separately), this function merges the sections into one file with tags that name the dialects.
- Evaluation : this function calculates precision, recall and f1-score for a particular language pair. Method: leave-one-out.
- Addition : calculates how many entries can be added.
- Example lemma search : if you want to see how the tool works (variants with coefficients), this is a more human-readable version of translation.
- Choosing parameters : if you want to tune parameters, this function tries all combinations of the parameter values you want to check, so that you can achieve more accurate results.

Step 0 : installing

Mind the Python version: this tool requires Python 3.

Libraries:

Requests : downloading dictionaries and working with Github
PyGithub : working with Github (downloading)
tqdm : progress bars
NetworkX : graph library

   pip install requests
   pip install networkx
   pip install pygithub
   pip install tqdm

You can work in a virtual environment:

   git clone https://github.com/dkbrz/GSoC_2018_final
   virtualenv -p python3 GSoC_2018_final/
   cd GSoC_2018_final/
   source bin/activate
   pip install requests
   pip install networkx
   pip install pygithub
   pip install tqdm

After cloning, the repository contains these files:

   /tool/
   |--- __init__.py
   |--- data.py   = some additional data  (language codes, wrong filenames)
   |--- docs.md   = markdown file with function descriptions
   |--- func.py   = main file with functions
   
   .gitignore     = gitignore
   README.md      = readme
   download.txt   = list of files if one doesn't have Github account for updating
   graph.py       = command line interface

To use this tool from the command line, run graph.py.

This tool works with Apertium bilingual dictionaries. It is recommended to download all dictionaries, but you can also use those that you have locally.

Step 1-2 : downloading

Step 1

Time : 2 min
Changed files : download.txt
New files : -

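This step updates download.txt, which holds dictionary URLs like these:

   https://raw.githubusercontent.com/apertium/apertium-afr-nld/master/apertium-afr-nld.afr-nld.dix
   https://raw.githubusercontent.com/apertium/apertium-ara-heb/master/apertium-ara-heb.ara-heb.dix

Option 1: you have a Github account and want a fresh list of dictionaries. The update does the following:

- get list of Apertium repositories
- get list of contents for each repository

After running you'll see this: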

   $ python graph.py update
   Username:   YourUsername
   Password: 
   4%|██▌                          | 21/490 [00:04<01:44,  4.48it/s]

Option 2: you don't have a Github account or you are fine with the current list of dictionaries. Skip updating and go to the next step.

Step 2

Time : 3 min
Changed files : -
New files : 'dictionaries' folder with bilingual dictionaries

Now you have the list of dictionaries that will be downloaded. Just run the command below. It takes the URLs from download.txt and saves the dictionaries in a new folder ('dictionaries').

If you don't want to download all dictionaries (you only want a certain set), edit download.txt so that it keeps only the ones you want, and then run the command.

   python graph.py download
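If you prefer not to trim download.txt by hand, the filtering can be scripted. The sketch below is not part of the tool: `keep_urls` is a hypothetical helper, and it assumes that every URL ends in a file name of the form apertium-xxx-yyy.xxx-yyy.dix, as in the two examples shown for Step 1.

```python
import re

# Matches file names such as apertium-afr-nld.afr-nld.dix at the end of a URL.
URL_PATTERN = re.compile(r"apertium-(\w+)-(\w+)\.\1-\2\.dix$")


def keep_urls(urls, languages):
    """Keep only the URLs whose two language codes are both in `languages`."""
    wanted = set(languages)
    kept = []
    for url in urls:
        url = url.strip()
        match = URL_PATTERN.search(url)
        if match and set(match.groups()) <= wanted:
            kept.append(url)
    return kept


urls = [
    "https://raw.githubusercontent.com/apertium/apertium-afr-nld/master/apertium-afr-nld.afr-nld.dix",
    "https://raw.githubusercontent.com/apertium/apertium-ara-heb/master/apertium-ara-heb.ara-heb.dix",
]
selected = keep_urls(urls, ["afr", "nld", "eng"])
print(selected)
```

To apply it, read the lines of download.txt into `urls`, write `selected` back to the file one URL per line, and then run the download command.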

After running this you'll see:

   $ python graph.py download
   6%|███                           | 18/293 [00:09<02:26,  1.87it/s]

This will take several minutes. Meanwhile you can open the working directory and find the new 'dictionaries' folder, where the files appear (about 290 files).

   $ python graph.py download
   100%|███████████████████████████| 293/293 [02:38<00:00,  1.85it/s]

Now all files are downloaded.

Step 3 : choose languages

Time : several seconds (if default), longer if user directory is used
Changed files : -
New files : filelist.txt

Create list of dictionaries that will be used for translation.

Option 1: if you downloaded them:

   python graph.py list

Option 2: search for dictionaries in a local folder by passing the folder name:

   python graph.py list --path C:/

Option 3: manually create 'filelist.txt' and write the absolute paths to the dictionaries in it.

If you want to specify dialects (por, nor, cat, eng), add '--dialects True' to this command. This splits files by dialect, e.g. spa-cat -> spa-cat + spa-val

Result example (filelist.txt):

   C:/Users/Username/Documents/GSoC_2018/dictionaries/apertium-afr-nld.afr-nld.dix
   C:/Users/Username/Documents/GSoC_2018/dictionaries/apertium-ara-heb.ara-heb.dix
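For Option 3, the list of absolute paths can also be collected with a few lines of Python. This is a sketch under assumptions, not the tool's own code: `find_bidix` is a hypothetical helper, and it treats a file as a bilingual dictionary only if its name looks like apertium-xxx-yyy.xxx-yyy.dix.

```python
import re
import tempfile
from pathlib import Path

# File names of bilingual dictionaries, e.g. apertium-afr-nld.afr-nld.dix
BIDIX_NAME = re.compile(r"^apertium-(\w+)-(\w+)\.\1-\2\.dix$")


def find_bidix(folder):
    """Return sorted absolute paths of bilingual .dix files under `folder`."""
    candidates = Path(folder).rglob("*.dix")
    return sorted(str(p.resolve()) for p in candidates if BIDIX_NAME.match(p.name))


# Small demonstration in a temporary folder with made-up files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("apertium-afr-nld.afr-nld.dix", "apertium-afr.afr.dix", "notes.txt"):
        Path(folder, name).touch()
    found = find_bidix(folder)
    print([Path(p).name for p in found])
```

Writing the returned paths to filelist.txt, one per line, gives a file in the same format as the result example above.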

Step 4 : preprocessing

Time : 15 min
Changed files : -
New files : 'parsed' folder with changed bilingual dictionaries and 'monodix' folder with monolingual dictionaries containing all words of this language

This step is needed because the data is complex, and using bilingual dictionaries directly is slow (files are parsed every time) and inaccurate (the same word can have different tag variants, e.g. n vs n-m). To speed up the later functions, the files are preprocessed into a format that allows faster and more accurate work.

If you know for sure which set of dictionaries you want to use, edit filelist.txt and delete those that are not relevant. This shortens the preprocessing time. If you are not sure, using all files is recommended.

   python graph.py preprocessing

Normal work looks like this:

   $ python graph.py preprocessing
   2018-08-04 21:39:18,174 | INFO : Started monolingual dictionaries
   100%|███████████████████████████████████████████████| 145/145 [03:14<00:00,  1.34s/it]
   2018-08-04 21:42:32,682 | INFO : Finished monolingual dictionaries
   2018-08-04 21:42:32,683 | INFO : Started bilingual dictionaries
   6%|██▉                                             | 18/289 [00:30<07:45,  1.72s/it]

Step 5-8 : working with language pair

Step 5

Time : several seconds
Changed files : -
New files : <lang1>-<lang2>-config file

Configuration file (list of relevant languages). It is better to give the two languages in the same order as in the existing dictionary name.

The coefficient of a dictionary is:

$$ x = \frac{1}{\log_{10}(10 + DictionaryLength)} $$

where DictionaryLength = BothSides + 0.5 * LR + 0.5 * RL (BothSides, LR and RL count the entries that work in both directions, left-to-right only and right-to-left only).

These coefficients are used as edge lengths. The languages in the result file are the nodes on the 300 best (shortest) paths between the two languages. The coefficient of a language is the length of the shortest path that passes through it, and languages are sorted by this coefficient. The file is used in 'auto' mode, when you use the recommended languages instead of selecting them manually.

   python graph.py config <lang1> <lang2>

Example:

   python graph.py config eng spa

Result (eng-spa-config):

   0.22082497988083025	eng	:	eng spa
   0.22082497988083025	spa	:	eng spa
   0.424779904070416	cat	:	eng cat spa
   0.44444829199452307	epo	:	eng epo spa
   0.4520428069728663	glg	:	eng glg spa
   0.4675262097740345	ita	:	eng ita spa
   0.4679562192970136	fra	:	eng fra spa
   ...
   0.7466386679162069	rus	:	eng rus epo spa
   0.7557074698418498	lat	:	eng lat spa
   ...
   1.0451794809559942	cos	:	eng ita cos spa
   1.051048358042505	dan	:	eng nor dan deu spa
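The edge length formula from this step can be checked with a short sketch. The function name `edge_length` and the entry counts below are made up for illustration; they are not taken from real dictionaries.

```python
import math


def edge_length(both_sides, lr=0, rl=0):
    """Edge length of a dictionary: 1 / log10(10 + DictionaryLength)."""
    dictionary_length = both_sides + 0.5 * lr + 0.5 * rl
    return 1 / math.log10(10 + dictionary_length)


# Hypothetical entry counts: a large and a small dictionary.
large = edge_length(30000, lr=5000, rl=2000)
small = edge_length(300)
print(large, small)

# A path through an intermediate language is as long as the sum of its edges.
path_length = edge_length(20000) + edge_length(8000, lr=1000)
print(path_length)
```

Larger dictionaries give shorter edges, so paths through well-covered languages end up near the top of the configuration file.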

Step 6

Time : several seconds
Changed files : -
New files : <lang1>-<lang2> file

Loading file (contains the edges of the graph used in translation). The parameter n is the number of top languages from the configuration file to use.

   python graph.py load_file <lang1> <lang2> <n=10>

Example:

   python graph.py load_file eng spa
   python graph.py load_file eng spa --n 10

eng-spa loading file:

   ...
   	eng	dispensary	n$n-sg	spa	ambulatorio	n$n-m
   LR	eng	dispensation	n	spa	administración	n-f$n$n-f-sg
   ...
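If you want to inspect a loading file yourself, a line can be split as in the sketch below. The layout is an assumption read off the two sample lines: seven tab-separated fields, where the first is empty or a direction mark such as LR, and '$' separates tag variants. `Edge` and `parse_edge` are hypothetical names.

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    direction: str      # "" when there is no restriction, otherwise e.g. "LR"
    lang1: str
    lemma1: str
    tags1: List[str]    # tag variants, e.g. ["n", "n-sg"]
    lang2: str
    lemma2: str
    tags2: List[str]


def parse_edge(line):
    """Parse one line of a loading file (assumed to have 7 tab-separated fields)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError("expected 7 tab-separated fields, got %d" % len(fields))
    direction, lang1, lemma1, tags1, lang2, lemma2, tags2 = fields
    return Edge(direction.strip(), lang1, lemma1, tags1.split("$"),
                lang2, lemma2, tags2.split("$"))


first = parse_edge("\teng\tdispensary\tn$n-sg\tspa\tambulatorio\tn$n-m")
second = parse_edge("LR\teng\tdispensation\tn\tspa\tadministración\tn-f$n$n-f-sg")
print(first)
print(second)
```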

Step 7

Time : ~ 5 min
Changed files : -
New files : <lang1>-<lang2>-preview file

Creates a preview file with translation coefficients. The coefficient of a candidate translation is:

$$ \sum_{i=1}^{N} \exp(-len(path_i)) $$

where N is the number of simple paths between the two words.

   python graph.py preview eng spa
   python graph.py preview eng spa --topn 10 --cutoff 4

topn : number of best translations kept for a word.
cutoff : maximum path length.
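The coefficient formula is easy to try out by hand. In the sketch below, `coefficient` is a hypothetical helper that takes the lengths of the simple paths found for one candidate; how the tool measures path length internally is not shown here.

```python
import math


def coefficient(path_lengths):
    """Sum of exp(-length) over all simple paths of one candidate."""
    return sum(math.exp(-length) for length in path_lengths)


single = coefficient([3])          # one path of length 3
several = coefficient([2, 3, 3])   # more and shorter paths score higher
print(single, several)
```

A single path of length 3 gives exp(-3), about 0.0498, which coincides with the value 0.049787068367863944 that appears in several outputs in this instruction.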

Default 'auto' mode (from docstrings)

   "auto"
   
   If there are 10+ candidates returns those that have coefficient
   more than average. Usually there are top variants and other
   variants have very low coefficient. So it filters relevant
   candidates based on particular case coefficients
   
   If there are less than 10 candidates, adds coefficients with
   minimal coefficient to get more reliable data. And then it returns
   same top candidates.

Normal work:

   $ python graph.py preview eng spa
   2018-07-28 17:34:06,891 | INFO : Initialization (~1 min)
   100%|███████████████████████| 241407/241407 [01:45<00:00, 2296.07it/s]
   4%|█▌                     | 6065/136182 [00:03<01:12, 1806.89it/s]

Here you can see a large difference in the number of words between the two languages. It may be explained by more variation in tags in English, or by English having more words because it appears in many dictionaries. Well-covered languages also contain many names and other proper nouns.

Preview file:

   New Delhi	np	Nueva Delhi	np	0.049787068367863944	0
   oblivion	n	olvido	n	0.4074898724158772	0.4142278194149627

The first line means that there is a path from 'New Delhi' to 'Nueva Delhi', but only in this direction. The second line means that there are paths in both directions with good coefficients. You can edit the tags where there are several variants (n$n-f).
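A preview line can be read programmatically as well. The sketch assumes six tab-separated fields (lemma, tags, lemma, tags, forward coefficient, backward coefficient), as in the two sample lines; the function names and the labels "LR", "RL" and "both" are hypothetical.

```python
def parse_preview_line(line):
    """Split one preview line into a dict (assumes 6 tab-separated fields)."""
    lemma1, tags1, lemma2, tags2, forward, backward = line.rstrip("\n").split("\t")
    return {
        "lemma1": lemma1, "tags1": tags1,
        "lemma2": lemma2, "tags2": tags2,
        "forward": float(forward), "backward": float(backward),
    }


def direction(entry):
    """Tell in which direction(s) a path was found."""
    if entry["forward"] > 0 and entry["backward"] > 0:
        return "both"
    return "LR" if entry["forward"] > 0 else "RL"


delhi = parse_preview_line("New Delhi\tnp\tNueva Delhi\tnp\t0.049787068367863944\t0")
oblivion = parse_preview_line("oblivion\tn\tolvido\tn\t0.4074898724158772\t0.4142278194149627")
print(direction(delhi), direction(oblivion))
```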

Step 8

Time : ~ several seconds
Changed files : -
New files : <lang1>-<lang2>-new file

Convert preview file into dix format (one section):

   python graph.py convert <lang1> <lang2>
   python graph.py convert eng spa

This creates the result section for the bilingual dictionary from the preview entries (such as the two words above).
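The tool's actual output is not reproduced here. As a rough idea of what a dix section for the two preview words could look like, here is a sketch built with Python's standard xml.etree.ElementTree. It assumes that tags such as n-f map to one <s n="..."/> element per part and that a one-direction entry gets r="LR"; the helper names and the section attributes are illustrative only.

```python
import xml.etree.ElementTree as ET


def side(tag_name, lemma, tags):
    """Build an <l> or <r> element: lemma text followed by <s n="..."/> tags."""
    element = ET.Element(tag_name)
    words = lemma.split(" ")
    element.text = words[0]
    for word in words[1:]:
        blank = ET.SubElement(element, "b")  # <b/> stands for a space in a multiword lemma
        blank.tail = word
    for tag in tags.split("-"):
        ET.SubElement(element, "s", n=tag)
    return element


def entry(lemma1, tags1, lemma2, tags2, restriction=None):
    """Build one <e> entry, optionally restricted to one direction."""
    e = ET.Element("e")
    if restriction:
        e.set("r", restriction)
    pair = ET.SubElement(e, "p")
    pair.append(side("l", lemma1, tags1))
    pair.append(side("r", lemma2, tags2))
    return e


section = ET.Element("section", id="main", type="standard")
section.append(entry("oblivion", "n", "olvido", "n"))
section.append(entry("New Delhi", "np", "Nueva Delhi", "np", restriction="LR"))
xml_text = ET.tostring(section, encoding="unicode")
print(xml_text)
```

Compare the printed section with the file produced by the convert command before inserting anything into a real dictionary.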

Additional functions:

Merging dialects

Time : ~ several seconds
Changed files : -
New files : <LANG1>-<LANG2> file

If you used dialect splitting you can merge these languages:

   python graph.py merge --lang1 spa --lang2 cat cat_val_uni

This creates a file with several sections, each marked with the specified dialect tag.

Evaluation

Time : 4 min for each iteration
Changed files : config file (new), loading file (new)
New files : -

Evaluation (precision, recall, f1):

   python graph.py eval <lang1> <lang2> <n=10> <cutoff=4> <n_iter=3> <topn=None (auto)>

n : number of best languages to use
cutoff : maximum length of the paths used (4 recommended)
n_iter : number of evaluation iterations (words are sampled randomly, so results differ slightly between iterations)
topn : how many top results for each word count as correct (default=None: everything returned in auto mode; 1: only the best match counts, and the other variants that could be shown do not)

Examples (with defaults, then with set parameters):

   python graph.py eval eng spa
   python graph.py eval eng spa --n 10 --cutoff 4 --n_iter  10 --topn 1

Normal work:

   $ python graph.py eval eng spa
   2018-07-28 20:09:08,797 | INFO : Start ~ 20 s
   2018-07-28 20:09:28,174 | INFO : Initialization 1 ~ 1 min
   100%|██████████████████████████████████████| 1000/1000 [02:49<00:00,  5.91it/s]
   N=1000
   Precision : 0.9811912225705329, recall : 0.939, f1-score : 0.9596320899335717
   ...
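The three scores in this output are related in the usual way. The sketch below recomputes them from counts; note that 939 correct answers and 957 answered words are back-calculated from the printed precision and recall for N=1000, they are not printed by the tool, and `scores` is a hypothetical helper.

```python
def scores(n_correct, n_answered, n_total):
    """Precision, recall and f1-score from plain counts."""
    precision = n_correct / n_answered
    recall = n_correct / n_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts inferred from the sample output above (an assumption, see the text).
precision, recall, f1 = scores(n_correct=939, n_answered=957, n_total=1000)
print(precision, recall, f1)
```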

Addition

Time : 4 min for each iteration
Changed files : config file (new), loading file (new)
New files : -

Check how many entries can be added (counted separately for each direction of the language pair):

   python graph.py add <lang1> <lang2> <n=10> <cutoff=4>

   python graph.py add eng spa
   python graph.py add eng spa --n 10 --cutoff 4

Normal work:

   $ python graph.py add eng spa
   2018-07-29 10:22:13,878 | INFO : Initialization ~ 1 min
   100%|████████████████████████████████| 241407/241407 [00:24<00:00, 9662.90it/s]
   eng->spa    Exist: 29483, failed: 19083, NEW: 15077 +51.0%, NA: 177764
   100%|████████████████████████████████| 136182/136182 [00:29<00:00, 4682.65it/s]
   spa->eng    Exist: 30342, failed: 63086, NEW: 19197 +63.0%, NA: 23557
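The summary lines can be parsed if you want to compare runs. In the sketch below, the regular expression follows the sample output, and `parse_summary` is a hypothetical helper. In this sample the four counts add up to the number of words in the progress bar, and the printed percentage is consistent with NEW relative to Exist, rounded to a whole number; both observations come from the sample only.

```python
import re

SUMMARY = re.compile(
    r"(?P<pair>\S+)\s+Exist: (?P<exist>\d+), failed: (?P<failed>\d+), "
    r"NEW: (?P<new>\d+) \+(?P<percent>[\d.]+)%, NA: (?P<na>\d+)"
)


def parse_summary(line):
    """Turn one 'Exist / failed / NEW / NA' line into a dict of numbers."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError("not a summary line: %r" % line)
    row = {key: int(match[key]) for key in ("exist", "failed", "new", "na")}
    row["pair"] = match["pair"]
    row["percent"] = float(match["percent"])
    return row


row = parse_summary("eng->spa    Exist: 29483, failed: 19083, NEW: 15077 +51.0%, NA: 177764")
total = row["exist"] + row["failed"] + row["new"] + row["na"]
growth = 100 * row["new"] / row["exist"]
print(total, round(growth))
```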

Example lemma search

Time : 1 min + some time (seconds) for translation
Changed files : -
New files : some output file or no files (if stdout)

Arguments:

lang1, lang2 - languages
--config - start from creating configuration file
--load - start from creating loading file with existing configuration file
--cutoff - cutoff
--topn - topn parameter (int, or None by default, which means 'auto' mode)
--n - how many top languages to use if the loading file is created
--input - file with words, one per line or separated by spaces
--output - output file, default=sys.stdout

Example:

   python graph.py example eng spa --lang spa --input input.txt --output output.txt

input.txt

   casa
   sangre
   frutilla

Command line:

   $ python graph.py example eng spa --lang spa --input input.txt --output output.txt
   2018-07-30 11:36:13,198 | INFO : Initialization ~1 min
   2018-07-30 11:37:12,922 | INFO : Translating
   100%|████████████████████████████████████████████| 3/3 [00:00<00:00,  7.08it/s]

output.txt

   Lemma: casa
       spa$casa$[n-f-ND]
   eng$home$[n_n-sg]	0.08379690677106663
   eng$house$[n-ND]	0.049787068367863944
   eng$house$[n_n-sg]	0.0404276819945128
   eng$publisher$[n_n-ND]	0.03852947988599058
   
       spa$casa$[n-f_n_n-f-sg]
   eng$house$[n_n-sg]	1.438577234023077
   eng$home$[n_n-sg]	1.330046844771966
   
   ---------------------------------------------
   Lemma: sangre
       spa$sangre$[n-f_n_n-f-sg]
   eng$blood$[n_n-unc]	1.109995929768636
   
       spa$sangre$[n-f-ND]
   eng$blood$[n_n-unc]	0.052005373884161515
   eng$blood$[n-ND]	0.049787068367863944
   
   ---------------------------------------------
   Lemma: frutilla
   ---------------------------------------------

If only the input file is specified, the content shown above for the output file is printed to stdout.
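The candidate lines of the output can be split into parts for further processing. The sketch assumes the layout seen in the sample: language, lemma and bracketed tags joined by '$', then a tab and the coefficient, with '_' separating tag variants inside the brackets. `parse_candidate` is a hypothetical helper.

```python
def parse_candidate(line):
    """Split a line like 'eng$home$[n_n-sg]<TAB>0.08' into its parts."""
    word, score = line.strip().split("\t")
    lang, lemma, tags = word.split("$", 2)
    return lang, lemma, tags.strip("[]").split("_"), float(score)


candidate = parse_candidate("eng$home$[n_n-sg]\t0.08379690677106663")
print(candidate)
```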

Choosing parameters

If you want to choose parameters like cutoff, number of languages and top-N number you can run grid search with all combinations of parameters. It shows how many entries you can add, precision, recall and f1-score.

Time : from several minutes (depending on how many combinations of parameters)
Changed or new files : config file (new), loading file (new)

Evaluation (precision, recall, f1) + addition:

   python graph.py grid <lang1> <lang2> <n=10> <cutoff=4> <topn=None (auto)>

n : numbers of best languages to try
cutoff : maximum path lengths to try
topn : how many top results for each word count as correct (default=None: everything returned in auto mode; 1: only the best match counts, and the other variants that could be shown do not). 'auto' mode is always added.

Example (with set parameters):

   python graph.py grid eng spa --n 3 5 7 9 12 --cutoff 2 3 4 5 --topn 1 2 5

Normal work:

   $ python graph.py grid eng spa --n 3 5 7 9 12 --cutoff 2 3 4 5 --topn 1 2 5
   n: 3    cutoff: 2
   eng->spa    Exist: 29422, failed: 14818, NEW: 4497 +15.0%, NA: 197440
   spa->eng    Exist: 30331, failed: 26858, NEW: 4187 +14.0%, NA: 74806
   topn: 1  N items: 1000          Precision : 0.8942505133470225, recall : 0.871, f1-score : 0.8824721377912866
   topn: 2  N items: 1000          Precision : 0.9565656565656566, recall : 0.947, f1-score : 0.9517587939698493
   topn: 5  N items: 1000          Precision : 0.9622833843017329, recall : 0.944, f1-score : 0.9530540131246845
   topn: None       N items: 1000          Precision : 0.10256410256410256, recall : 0.004, f1-score : 0.0076997112608277185
   ===============================================================
   ... some combinations...
   ===============================================================
   n: 9    cutoff: 4
   eng->spa    Exist: 29422, failed: 19166, NEW: 15030 +51.0%, NA: 182559
   spa->eng    Exist: 30331, failed: 62037, NEW: 18644 +61.0%, NA: 25170
   topn: 1  N items: 1000          Precision : 0.8862660944206009, recall : 0.826, f1-score : 0.855072463768116
   topn: 2  N items: 1000          Precision : 0.9614583333333333, recall : 0.923, f1-score : 0.9418367346938776
   topn: 5  N items: 1000          Precision : 0.9905759162303664, recall : 0.946, f1-score : 0.9677749360613811
   topn: None       N items: 1000          Precision : 0.9728317659352143, recall : 0.931, f1-score : 0.9514563106796118
   ===============================================================
   ... etc...
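The number of combinations grows quickly, so it helps to count them before starting a long run. The sketch below only enumerates the grid in the way the sample output is organised (one block per n and cutoff, with every topn value plus 'auto' inside); `grid` is a hypothetical helper and does not run any evaluation.

```python
from itertools import product


def grid(n_values, cutoff_values, topn_values):
    """Yield (n, cutoff, topn list) for every combination of n and cutoff."""
    topn_all = list(topn_values) + [None]  # None stands for 'auto' mode
    for n, cutoff in product(n_values, cutoff_values):
        yield n, cutoff, topn_all


combos = list(grid([3, 5, 7, 9, 12], [2, 3, 4, 5], [1, 2, 5]))
print(len(combos), "blocks,", len(combos) * len(combos[0][2]), "evaluations")
```

For the example command above this gives 20 blocks with 4 evaluations each.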

Bilingual dictionary enrichment via graph completion
