Bilingual dictionary enrichment via graph completion
Instruction
Intro
This tool enriches a bilingual dictionary using a graph built from other bilingual dictionaries. For example, suppose you want to translate église from French to Russian, but you don't have this entry. You do have: FRA_église - CAT_església, FRA_église - SPA_iglesia, CAT_església - ENG_church, SPA_iglesia - ENG_church, ENG_church - RUS_церковь
Connecting these edges gives two paths: FRA_église - CAT_església - ENG_church - RUS_церковь and FRA_église - SPA_iglesia - ENG_church - RUS_церковь.
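The path search in this example can be sketched in a few lines. This is only an illustration under assumptions: the tool itself lists NetworkX as its graph library, while the sketch below uses plain Python to stay dependency-free, and the helper names `build_graph` and `simple_paths` are hypothetical.

```python
from collections import defaultdict

# Dictionary entries from the example above, treated as undirected edges.
EDGES = [
    ("FRA_église", "CAT_església"),
    ("FRA_église", "SPA_iglesia"),
    ("CAT_església", "ENG_church"),
    ("SPA_iglesia", "ENG_church"),
    ("ENG_church", "RUS_церковь"),
]


def build_graph(edges):
    """Return an adjacency map: word -> set of directly connected words."""
    graph = defaultdict(set)
    for left, right in edges:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def simple_paths(graph, source, target, cutoff=4):
    """Yield every path without repeated nodes that uses at most `cutoff` edges."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        if len(path) - 1 >= cutoff:
            continue  # path is already as long as allowed
        for neighbour in sorted(graph[node]):
            if neighbour not in path:
                stack.append((neighbour, path + [neighbour]))


paths = sorted(simple_paths(build_graph(EDGES), "FRA_église", "RUS_церковь"))
for path in paths:
    print(" - ".join(path))
```

Both paths end in RUS_церковь, so it becomes the candidate translation for église.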
Basic steps:
- - 0: installing : clone the repository and install the required libraries
- - 1-2: downloading : update the list of dictionaries, then download them
- - 3: choosing languages : choose the languages you want to use in translation
- - 4: preprocessing : dictionary data must be transformed before it can be used in a graph; this step prepares it for further use
- - 5: configuration file : this file recommends which languages are most useful for enriching a particular language pair
- - 6: loading file : this file contains all edges of the graph and lets the graph be loaded quickly
- - 7: preview : this file shows the result of enrichment, i.e. all entries that can be added, with a likelihood coefficient for each variant
- - 8: conversion : this step creates an XML section that can be inserted into the original dictionary
Additional steps:
- - Merging dialects : if you chose to specify dialects (so that they are treated separately), this function merges the sections into one file with tags that name the dialects.
- - Evaluation : calculates precision, recall and F1-score for a particular language pair using the leave-one-out method.
- - Addition : calculates how many entries can be added.
- - Example lemma search : a more human-readable view of translation (variants with coefficients), useful for seeing how the tool works.
- - Choosing parameters : tries all combinations of the parameter values you specify, so you can tune the parameters for more accurate results.
Step 0 : installing
Mind the Python version: this tool requires Python 3.
Libraries:
- Requests : downloading dictionaries and working with GitHub (downloading)
- PyGithub : working with GitHub (downloading)
- tqdm : progress bars
- NetworkX : graph library
pip install requests
pip install networkx
pip install pygithub
pip install tqdm
You can also work in a virtual environment:
git clone https://github.com/dkbrz/GSoC_2018_final
virtualenv -p python3 GSoC_2018_final/
cd GSoC_2018_final/
source bin/activate
pip install requests
pip install networkx
pip install pygithub
pip install tqdm
After cloning, the repository contains these files:
/tool/
|--- __init__.py
|--- data.py = some additional data (language codes, wrong filenames)
|--- docs.md = markdown file with function descriptions
|--- func.py = main file with functions
.gitignore = gitignore
README.md = readme
download.txt = list of files, for those who don't have a GitHub account for updating
graph.py = command line interface
To use this tool from the command line, run graph.py.
This tool works with Apertium bilingual dictionaries. Downloading all dictionaries is recommended, but you can also use only those you have locally.
Step 1-2 : downloading
Step 1
- Time : 2 min
- Changed files : download.txt
- New files : -
The download.txt file contains a list of URLs of bilingual dictionaries from the Apertium GitHub:
https://raw.githubusercontent.com/apertium/apertium-afr-nld/master/apertium-afr-nld.afr-nld.dix
https://raw.githubusercontent.com/apertium/apertium-ara-heb/master/apertium-ara-heb.ara-heb.dix
...
Option 1: you have a GitHub account and want a fully up-to-date list of dictionaries. In this case you need to update the list. The GitHub API allows walking through repositories, but the limit for unauthenticated queries is very small (60), and about 300 folders must be checked to find the bilingual dictionaries, so authentication is required.
python graph.py update
The command asks for a username and password. The password is hidden as you type, so no one can see it.
- get list of Apertium repositories
- get list of contents for each repository
After running the command you'll see this:
$ python graph.py update
Username: YourUsername
Password:
4%|██▌ | 21/490 [00:04<01:44, 4.48it/s]
Option 2: you don't have a GitHub account, or you're fine with the current list of dictionaries. Skip updating and go to the next step.
Step 2
- Time : 3 min
- Changed files : -
- New files : 'dictionaries' folder with bilingual dictionaries
You now have the list of dictionaries to download. Run the command below: it takes the URLs from download.txt and saves the dictionaries in a new folder ('dictionaries').
If you only want a specific set of dictionaries, edit download.txt to keep just those, and then run the command.
python graph.py download
After running this you'll see:
$ python graph.py download
6%|███ | 18/293 [00:09<02:26, 1.87it/s]
This will take several minutes. Meanwhile, you can open the working directory and find the new 'dictionaries' folder, where the files appear (about 290 files).
$ python graph.py download
100%|███████████████████████████| 293/293 [02:38<00:00, 1.85it/s]
Now all files are downloaded.
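As a rough illustration of what this step does, the sketch below reads URLs from download.txt and saves each file under its own name in the 'dictionaries' folder. It is not the tool's code: the tool lists Requests as its download library, whereas this sketch uses only the standard library, and the function names `target_name` and `download_all` are hypothetical.

```python
import os
from pathlib import PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlretrieve


def target_name(url):
    """Return the file name a dictionary URL is saved under (last path component)."""
    return PurePosixPath(urlparse(url).path).name


def download_all(list_file="download.txt", folder="dictionaries"):
    """Download every URL listed in `list_file` into `folder` (needs network access)."""
    os.makedirs(folder, exist_ok=True)
    with open(list_file, encoding="utf-8") as handle:
        urls = [line.strip() for line in handle if line.strip()]
    for url in urls:
        urlretrieve(url, os.path.join(folder, target_name(url)))


example_url = (
    "https://raw.githubusercontent.com/apertium/apertium-afr-nld/"
    "master/apertium-afr-nld.afr-nld.dix"
)
print(target_name(example_url))
```

Calling `download_all()` would perform the actual downloads; the example only prints the derived file name, so it runs without a network connection.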
Step 3 : choose languages
- Time : several seconds (if default), longer if user directory is used
- Changed files : -
- New files : filelist.txt
Create list of dictionaries that will be used for translation.
Option 1: if you downloaded them:
python graph.py list
Option 2: search for dictionaries in a local folder by giving its path:
python graph.py list --path C:/
Option 3: manually create 'filelist.txt' and write the absolute paths to the dictionaries in it.
To treat dialects separately (por, nor, cat, eng), add '--dialects True' to this command. This splits files by dialect, e.g. spa-cat -> spa-cat + spa-val
Result example (filelist.txt):
C:/Users/Username/Documents/GSoC_2018/dictionaries/apertium-afr-nld.afr-nld.dix
C:/Users/Username/Documents/GSoC_2018/dictionaries/apertium-ara-heb.ara-heb.dix
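For Option 3, a file list like the one above can also be produced with a short script. The sketch below is an assumption-laden illustration, not the tool's own `list` command: the file-name pattern is inferred from the two example names, and `language_pair` and `find_dictionaries` are hypothetical helpers.

```python
import re
from pathlib import Path

# Pattern inferred from names such as apertium-afr-nld.afr-nld.dix (an assumption).
DIX_NAME = re.compile(r"^apertium-(\w+)-(\w+)\.(\w+)-(\w+)\.dix$")


def language_pair(filename):
    """Return (lang1, lang2) for a bilingual dictionary file name, else None."""
    match = DIX_NAME.match(filename)
    if match is None:
        return None
    return match.group(3), match.group(4)


def find_dictionaries(root):
    """Return sorted absolute paths of bilingual dictionaries found under `root`."""
    return sorted(
        str(path.resolve())
        for path in Path(root).rglob("*.dix")
        if language_pair(path.name) is not None
    )


print(language_pair("apertium-afr-nld.afr-nld.dix"))
```

Writing the result of `find_dictionaries("dictionaries")` to filelist.txt, one path per line, gives a file in the same shape as the example.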
Step 4 : preprocessing
- Time : 15 min
- Changed files : -
- New files : 'parsed' folder with changed bilingual dictionaries and 'monodix' folder with monolingual dictionaries containing all words of this language
This step is needed because the data is complex, and using bilingual dictionaries directly is slow (files are parsed every time) and inaccurate (different tag variants, e.g. n vs n-m, for the same word). Preprocessing converts the files to a format that allows faster and more accurate work in the later steps.
If you know exactly which dictionaries you want to use, edit filelist.txt and delete those that are not relevant. This shortens preprocessing. If you are not sure, using all files is recommended.
python graph.py preprocessing
Normal work looks like this:
$ python graph.py preprocessing
2018-08-04 21:39:18,174 | INFO : Started monolingual dictionaries
100%|███████████████████████████████████████████████| 145/145 [03:14<00:00, 1.34s/it]
2018-08-04 21:42:32,682 | INFO : Finished monolingual dictionaries
2018-08-04 21:42:32,683 | INFO : Started bilingual dictionaries
6%|██▉ | 18/289 [00:30<07:45, 1.72s/it]
Step 5-8 : working with language pair
Step 5
- Time : several seconds
- Changed files : -
- New files : <lang1>-<lang2>-config file
Creates the configuration file (a list of relevant languages). It is better to give the languages in the same order as in the existing dictionary name.
Each dictionary is given a coefficient
$$ x = \frac{1}{\log_{10}(10 + \mathrm{DictionaryLength})} $$
where DictionaryLength = BothSides + 0.5 * LR + 0.5 * RL.
These coefficients are used as edge lengths. The languages in the result file are the nodes on the 300 best (shortest) paths between the two languages. The coefficient of each language is the length of the shortest path that passes through it, and the languages are sorted by this coefficient. The list is used in 'auto' mode, when you do not select languages manually and rely on the recommended ones.
python graph.py config <lang1> <lang2>
Example:
python graph.py config eng spa
Result (eng-spa-config):
0.22082497988083025 eng : eng spa
0.22082497988083025 spa : eng spa
0.424779904070416 cat : eng cat spa
0.44444829199452307 epo : eng epo spa
0.4520428069728663 glg : eng glg spa
0.4675262097740345 ita : eng ita spa
0.4679562192970136 fra : eng fra spa
...
0.7466386679162069 rus : eng rus epo spa
0.7557074698418498 lat : eng lat spa
...
1.0451794809559942 cos : eng ita cos spa
1.051048358042505 dan : eng nor dan deu spa
Step 6
- Time : several seconds
- Changed files : -
- New files : <lang1>-<lang2> file
Loading file (contains edges in graph). This file contains information for a graph used in translation.
python graph.py load_file <lang1> <lang2> <n=10>
Example:
python graph.py load_file eng spa
python graph.py load_file eng spa --n 10
eng-spa loading file:
...
eng dispensary n$n-sg spa ambulatorio n$n-m LR
eng dispensation n spa administración n-f$n$n-f-sg
...
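To show how such lines could become graph edges, here is a hedged sketch. It assumes that fields are tab-separated, that a line has six fields plus an optional direction, and that the direction marker follows the usual Apertium meaning (LR: left to right only, RL: right to left only). None of this is confirmed by the loading file shown above, and `parse_edge` and `add_edge` are hypothetical names.

```python
from collections import defaultdict


def parse_edge(line):
    """Split one loading-file line into (left node, right node, direction)."""
    fields = line.rstrip("\n").split("\t")
    lang1, lemma1, tags1, lang2, lemma2, tags2 = fields[:6]
    direction = fields[6] if len(fields) > 6 else None  # None means both ways
    return (lang1, lemma1, tags1), (lang2, lemma2, tags2), direction


def add_edge(graph, left, right, direction):
    """Add the edge in the direction(s) allowed by the marker."""
    if direction != "RL":
        graph[left].add(right)
    if direction != "LR":
        graph[right].add(left)


LINES = [
    "eng\tdispensary\tn$n-sg\tspa\tambulatorio\tn$n-m\tLR",
    "eng\tdispensation\tn\tspa\tadministración\tn-f$n$n-f-sg",
]

graph = defaultdict(set)
for line in LINES:
    add_edge(graph, *parse_edge(line))

for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```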
Step 7
- Time : ~ 5 min
- Changed files : -
- New files : <lang1>-<lang2>-preview file
Creates a preview file with translation coefficients. The coefficient of a candidate translation is
$$\sum_{i=1}^{N} e^{-\mathrm{len}(\mathrm{path}_i)}$$
where N is the number of simple paths.
python graph.py preview eng spa
python graph.py preview eng spa --topn 10 --cutoff 4
topn : number of best translations kept for a word.
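The coefficient formula above is easy to reproduce. In the sketch below a path is a list of nodes and len(path) counts those nodes; that reading is an assumption, chosen because a single three-node path then gives e^-3 ≈ 0.0498, the value that appears in the preview file for 'New Delhi'. The intermediate node name is a placeholder.

```python
import math


def coefficient(paths):
    """Sum of exp(-len(path)) over all simple paths to one candidate."""
    return sum(math.exp(-len(path)) for path in paths)


# One path through a single (placeholder) intermediate word.
one_path = [["eng: New Delhi", "intermediate word", "spa: Nueva Delhi"]]
print(coefficient(one_path))

# Two paths to the same candidate add up, so the candidate scores higher.
two_paths = one_path + [["eng: New Delhi", "another word", "spa: Nueva Delhi"]]
print(coefficient(two_paths))
```

Short paths dominate the sum, and each additional independent path raises the score, so candidates confirmed by several languages rank above those reached only once.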
Default 'auto' mode (from docstrings):
"auto"
If there are 10+ candidates, returns those whose coefficient is above average. Usually there are a few top variants and the other variants have a very low coefficient, so this filters relevant candidates based on the coefficients of the particular case.
If there are fewer than 10 candidates, adds coefficients with the minimal coefficient to get more reliable data, and then returns the same top candidates.
Normal work:
$ python graph.py preview eng spa
2018-07-28 17:34:06,891 | INFO : Initialization (~1 min)
100%|███████████████████████| 241407/241407 [01:45<00:00, 2296.07it/s]
4%|█▌ | 6065/136182 [00:03<01:12, 1806.89it/s]
Here you can see a large difference in the number of words between the two languages (241407 vs 136182). This may be due to more variation in tags in English, or to English simply having more words because it appears in many dictionaries. Languages with good coverage also contain many names and other proper nouns.
Preview file:
New Delhi np Nueva Delhi np 0.049787068367863944 0
oblivion n olvido n 0.4074898724158772 0.4142278194149627
The first line means that there is a path from 'New Delhi' to 'Nueva Delhi', but only in this direction (the reverse coefficient is 0). The second line has paths in both directions with good coefficients. You can edit the tags where there are several variants (n$n-f).
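Reading a preview line programmatically might look like the sketch below. It assumes the six columns are tab-separated and ordered as lemma, tags, lemma, tags, forward coefficient, backward coefficient; the separator is a guess, and `parse_preview` is a hypothetical helper.

```python
def parse_preview(line):
    """Return (lemma1, tags1, lemma2, tags2, forward, backward, direction)."""
    lemma1, tags1, lemma2, tags2, forward, backward = line.rstrip("\n").split("\t")
    forward, backward = float(forward), float(backward)
    if forward > 0 and backward > 0:
        direction = None  # usable in both directions
    elif forward > 0:
        direction = "LR"  # only lang1 -> lang2
    else:
        direction = "RL"  # only lang2 -> lang1
    return lemma1, tags1, lemma2, tags2, forward, backward, direction


rows = [
    parse_preview("New Delhi\tnp\tNueva Delhi\tnp\t0.049787068367863944\t0"),
    parse_preview("oblivion\tn\tolvido\tn\t0.4074898724158772\t0.4142278194149627"),
]
for row in rows:
    print(row[0], "->", row[2], "direction:", row[6])
```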
Step 8
- Time : ~ several seconds
- Changed files : -
- New files : <lang1>-<lang2>-new file
Convert preview file into dix format (one section):
python graph.py convert <lang1> <lang2>
python graph.py convert eng spa
This creates the result section for the bilingual dictionary (example with the 2 words above):
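A minimal sketch of how the two preview entries could be written as dix entries is given below. It follows the general Apertium .dix conventions (`<e>`, `<p>`, `<l>`, `<r>`, `<s n="..."/>`, `<b/>` for a space inside a lemma, `r="LR"` for a one-directional entry), but it is not the tool's converter: the real output may differ, splitting tags on "-" is an assumption, and `side` and `entry` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def side(tag, lemma, tags):
    """Build <l> or <r>: lemma text, <b/> for spaces, one <s> per tag part."""
    element = ET.Element(tag)
    words = lemma.split(" ")
    element.text = words[0]
    for word in words[1:]:
        blank = ET.SubElement(element, "b")
        blank.tail = word
    for symbol in tags.split("-"):  # assumption: tags are joined with "-"
        ET.SubElement(element, "s", n=symbol)
    return element


def entry(lemma1, tags1, lemma2, tags2, direction=None):
    """Return one <e> element as a string; `direction` is "LR", "RL" or None."""
    element = ET.Element("e")
    if direction:
        element.set("r", direction)
    pair = ET.SubElement(element, "p")
    pair.append(side("l", lemma1, tags1))
    pair.append(side("r", lemma2, tags2))
    return ET.tostring(element, encoding="unicode")


one_way = entry("New Delhi", "np", "Nueva Delhi", "np", direction="LR")
both_ways = entry("oblivion", "n", "olvido", "n")
print(one_way)
print(both_ways)
```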
Additional functions:
Merging dialects
- Time : ~ several seconds
- Changed files : -
- New files : <LANG1>-<LANG2> file
If you used dialect splitting you can merge these languages:
python graph.py merge --lang1 spa --lang2 cat cat_val_uni
This creates file with several sections with specified dialect tags.
Evaluation
- Time : 4 min for each iteration
- Changed files : config file (new), loading file (new)
- New files : -
Evaluation (precision, recall, f1):
python graph.py eval <lang1> <lang2> <n=10> <cutoff=4> <n_iter=3> <topn=None (auto)>
- n : number of best languages to use
- cutoff : maximum path length (4 recommended)
- n_iter : number of evaluation iterations (the test words are chosen randomly, so results differ slightly between iterations)
- topn : how many top results for each word count as correct (default=None: everything returned in auto mode; 1: only the best match counts as correct, the other variants shown do not)
Examples (with defaults, then with set parameters):
python graph.py eval eng spa
python graph.py eval eng spa --n 10 --cutoff 4 --n_iter 10 --topn 1
Normal work:
$ python graph.py eval eng spa
2018-07-28 20:09:08,797 | INFO : Start ~ 20 s
2018-07-28 20:09:28,174 | INFO : Initialization 1 ~ 1 min
100%|██████████████████████████████████████| 1000/1000 [02:49<00:00, 5.91it/s]
N=1000
Precision : 0.9811912225705329, recall : 0.939, f1-score : 0.9596320899335717
...
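The three scores in this output are related by simple arithmetic. The sketch below assumes the common definitions: precision is correct answers divided by words that received any answer, and recall is correct answers divided by all tested words. The counts 939 and 957 are not printed by the tool; they are back-calculated here because they reproduce the reported figures for N=1000.

```python
def scores(correct, answered, total):
    """Return (precision, recall, f1) from raw counts."""
    precision = correct / answered  # share of returned answers that are right
    recall = correct / total        # share of tested words answered correctly
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Back-calculated counts (an assumption): 957 words answered, 939 correctly.
precision, recall, f1 = scores(correct=939, answered=957, total=1000)
print(f"Precision : {precision}, recall : {recall}, f1-score : {f1}")
```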
Addition
- Time : 4 min for each iteration
- Changed files : config file (new), loading file (new)
- New files : -
Checks how many entries can be added (separately for each direction of the language pair):
python graph.py add <lang1> <lang2> <n=10> <cutoff=4>
python graph.py add eng spa
python graph.py add eng spa --n 10 --cutoff 4
Normal work:
$ python graph.py add eng spa
2018-07-29 10:22:13,878 | INFO : Initialization ~ 1 min
100%|████████████████████████████████| 241407/241407 [00:24<00:00, 9662.90it/s]
eng->spa Exist: 29483, failed: 19083, NEW: 15077 +51.0%, NA: 177764
100%|████████████████████████████████| 136182/136182 [00:29<00:00, 4682.65it/s]
spa->eng Exist: 30342, failed: 63086, NEW: 19197 +63.0%, NA: 23557
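The percentage after NEW appears to be the number of new entries relative to the existing ones, rounded to a whole percent. This reading is inferred from the numbers in the output, not from the tool's code, and `growth_percent` is a hypothetical name.

```python
def growth_percent(exist, new):
    """New entries as a whole-number percentage of existing entries."""
    return round(100 * new / exist)


# Counts copied from the output above.
print("eng->spa", growth_percent(exist=29483, new=15077))
print("spa->eng", growth_percent(exist=30342, new=19197))
```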
Example lemma search
- Time : 1 min + some time (seconds) for translation
- Changed files : -
- New files : some output file or no files (if stdout)
Arguments:
- lang1, lang2 - languages
- --config - start by creating the configuration file
- --load - start by creating the loading file from an existing configuration file
- --cutoff - maximum path length
- --topn - topn parameter (int; None by default, which means 'auto' mode)
- --n - number of top languages to use when creating the loading file
- --input - file with words, one per line or separated by spaces
- --output - output file, default=sys.stdout
Example:
python graph.py example eng spa --lang spa --input input.txt --output output.txt
input.txt
casa sangre frutilla
Command line:
$ python graph.py example eng spa --lang spa --input input.txt --output output.txt
2018-07-30 11:36:13,198 | INFO : Initialization ~1 min
2018-07-30 11:37:12,922 | INFO : Translating
100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 7.08it/s]
output.txt
Lemma: casa
spa$casa$[n-f-ND]
    eng$home$[n_n-sg] 0.08379690677106663
    eng$house$[n-ND] 0.049787068367863944
    eng$house$[n_n-sg] 0.0404276819945128
    eng$publisher$[n_n-ND] 0.03852947988599058
spa$casa$[n-f_n_n-f-sg]
    eng$house$[n_n-sg] 1.438577234023077
    eng$home$[n_n-sg] 1.330046844771966
---------------------------------------------
Lemma: sangre
spa$sangre$[n-f_n_n-f-sg]
    eng$blood$[n_n-unc] 1.109995929768636
spa$sangre$[n-f-ND]
    eng$blood$[n_n-unc] 0.052005373884161515
    eng$blood$[n-ND] 0.049787068367863944
---------------------------------------------
Lemma: frutilla
---------------------------------------------
If only the input file is specified, the same content is printed to stdout.
Choosing parameters
If you want to choose parameters like cutoff, number of languages and top-N number you can run grid search with all combinations of parameters. It shows how many entries you can add, precision, recall and f1-score.
- Time : from several minutes (depending on how many combinations of parameters)
- Changed or new files : config file (new), loading file (new)
Evaluation (precision, recall, f1) + addition:
python graph.py grid <lang1> <lang2> <n=10> <cutoff=4> <topn=None (auto)>
- n : number of best languages to use
- cutoff : maximum path length
- topn : how many top results for each word count as correct (default=None: everything returned in auto mode; 1: only the best match counts as correct, the other variants shown do not). 'auto' mode is always added.
Example:
python graph.py grid eng spa --n 3 5 7 9 12 --cutoff 2 3 4 5 --topn 1 2 5
Normal work:
$ python graph.py grid eng spa --n 3 5 7 9 12 --cutoff 2 3 4 5 --topn 1 2 5
n: 3 cutoff: 2
eng->spa Exist: 29422, failed: 14818, NEW: 4497 +15.0%, NA: 197440
spa->eng Exist: 30331, failed: 26858, NEW: 4187 +14.0%, NA: 74806
topn: 1
N items: 1000 Precision : 0.8942505133470225, recall : 0.871, f1-score : 0.8824721377912866
topn: 2
N items: 1000 Precision : 0.9565656565656566, recall : 0.947, f1-score : 0.9517587939698493
topn: 5
N items: 1000 Precision : 0.9622833843017329, recall : 0.944, f1-score : 0.9530540131246845
topn: None
N items: 1000 Precision : 0.10256410256410256, recall : 0.004, f1-score : 0.0076997112608277185
===============================================================
... some combinations...
===============================================================
n: 9 cutoff: 4
eng->spa Exist: 29422, failed: 19166, NEW: 15030 +51.0%, NA: 182559
spa->eng Exist: 30331, failed: 62037, NEW: 18644 +61.0%, NA: 25170
topn: 1
N items: 1000 Precision : 0.8862660944206009, recall : 0.826, f1-score : 0.855072463768116
topn: 2
N items: 1000 Precision : 0.9614583333333333, recall : 0.923, f1-score : 0.9418367346938776
topn: 5
N items: 1000 Precision : 0.9905759162303664, recall : 0.946, f1-score : 0.9677749360613811
topn: None
N items: 1000 Precision : 0.9728317659352143, recall : 0.931, f1-score : 0.9514563106796118
===============================================================
... etc...
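The shape of this search can be sketched with itertools.product. The sketch only enumerates the combinations in the order the output suggests (every n with every cutoff, then each topn value plus the always-added 'auto' mode, shown as None); it does not run any evaluation, and the function name `grid` is illustrative.

```python
from itertools import product


def grid(ns, cutoffs, topns):
    """Yield (n, cutoff, topn values) for every n/cutoff combination."""
    topn_values = list(topns) + [None]  # None stands for 'auto' mode
    for n, cutoff in product(ns, cutoffs):
        yield n, cutoff, topn_values


combos = list(grid(ns=[3, 5, 7, 9, 12], cutoffs=[2, 3, 4, 5], topns=[1, 2, 5]))
print(len(combos), "combinations of n and cutoff")
for n, cutoff, topn_values in combos[:2]:
    print("n:", n, "cutoff:", cutoff, "topn:", topn_values)
```

Since the running time grows with the number of combinations, it can help to start with a coarse grid and refine around the best values.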