Bilingual dictionary enrichment via graph completion

From Apertium

Instruction

Intro

This tool enriches bilingual dictionaries using a graph built from existing bilingual dictionaries. For example, suppose you want to translate église from French to Russian but you don't have this entry. You do have: FRA_église - CAT_església, FRA_église - SPA_iglesia, CAT_església - ENG_church, SPA_iglesia - ENG_church, ENG_church - RUS_церковь

Connecting these edges gives two paths: FRA_église - CAT_església - ENG_church - RUS_церковь and FRA_église - SPA_iglesia - ENG_church - RUS_церковь.
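The path search in this example can be reproduced with a few lines of Python. This is only a minimal sketch using the standard library (the tool itself relies on NetworkX); the function names `build_graph` and `simple_paths` and the default `cutoff=4` are choices made for this illustration.

```python
from collections import defaultdict

# Edges copied from the example above; each pair is one bilingual dictionary entry.
EDGES = [
    ("FRA_église", "CAT_església"),
    ("FRA_église", "SPA_iglesia"),
    ("CAT_església", "ENG_church"),
    ("SPA_iglesia", "ENG_church"),
    ("ENG_church", "RUS_церковь"),
]


def build_graph(edges):
    """Return an undirected adjacency map: node -> set of neighbours."""
    graph = defaultdict(set)
    for left, right in edges:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def simple_paths(graph, source, target, cutoff=4):
    """Yield every path without repeated nodes that uses at most `cutoff` edges."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        if len(path) > cutoff:  # the path already contains `cutoff` edges
            continue
        for neighbour in sorted(graph[node]):
            if neighbour not in path:
                stack.append((neighbour, path + [neighbour]))


graph = build_graph(EDGES)
paths = sorted(simple_paths(graph, "FRA_église", "RUS_церковь"))
for path in paths:
    print(" - ".join(path))
```

Both printed paths end in RUS_церковь, so this word becomes the translation candidate for église.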


Basic steps:

  • - 0: installing : cloning the repository and installing the required libraries
  • - 1-2: downloading : updating the list of dictionaries and downloading them
  • - 3: choosing languages : choose the languages you want to use in translation
  • - 4: preprocessing : dictionary data needs some changes before it can be used in a graph; this step prepares it for further use
  • - 5: configuration file : this file recommends which languages will be the most useful for enriching a particular language pair
  • - 6: loading file : this file contains all edges of the graph and lets the graph be loaded quickly
  • - 7: preview : this file shows the result of enrichment, that is, all entries that can be added, together with the likelihood coefficients of these variants
  • - 8: conversion : this step creates an XML section that can be inserted into the original dictionary

Additional steps:

  • - Merging dialects : if you chose to specify dialects (so that they are treated separately), this function merges the sections into one file with tags that name the dialects.
  • - Evaluation : this function calculates precision, recall and F1-score for a particular language pair. Method: leave-one-out.
  • - Addition : calculates how many entries can be added
  • - Example lemma search : a more human-readable view of translation (variants with coefficients), useful if you want to see how it works.
  • - Choosing parameters : if you want to tune parameters, this function tries all combinations of the values you want to check, so you can find the most accurate settings.


Step 0 : installing

Please mind the Python version: this tool requires Python 3.

Libraries:

  • Requests : downloading dictionaries from GitHub
  • PyGithub : working with the GitHub API (updating the list of dictionaries)
  • tqdm : progress bars
  • NetworkX : graph library


   pip install requests
   pip install networkx
   pip install pygithub
   pip install tqdm

You can work in a virtual environment:


   git clone https://github.com/dkbrz/GSoC_2018_final
   virtualenv -p python3 GSoC_2018_final/
   cd GSoC_2018_final/
   source bin/activate
   pip install requests
   pip install networkx
   pip install pygithub
   pip install tqdm

Files in the cloned repository:

   /tool/
   |--- __init__.py
   |--- data.py   = some additional data  (language codes, wrong filenames)
   |--- docs.md   = markdown file with function descriptions
   |--- func.py   = main file with functions
   
   .gitignore     = gitignore
   README.md      = readme
   download.txt   = list of files if one doesn't have Github account for updating
   graph.py       = command line interface

To use this tool from the command line, run graph.py.

This tool works with Apertium bilingual dictionaries. Downloading all dictionaries is recommended, but you can also use those you already have locally.

Step 1-2 : downloading

Step 1

  • Time : 2 min
  • Changed files : download.txt
  • New files : -
  • download.txt contains a list of URLs of bilingual dictionaries from the Apertium GitHub:

       https://raw.githubusercontent.com/apertium/apertium-afr-nld/master/apertium-afr-nld.afr-nld.dix
       https://raw.githubusercontent.com/apertium/apertium-ara-heb/master/apertium-ara-heb.ara-heb.dix
       ...

    Option 1: you have a GitHub account and want an absolutely up-to-date list of dictionaries. In this case you need to update the list. GitHub has an API that allows going through repositories. The number of unauthenticated requests is very small (60), which is not enough, because about 300 folders must be checked to find the bilingual dictionaries in them, so authentication is required.

       python graph.py update

    The command asks for your username and password. The password is hidden as you type, so no one can see it. The update then does the following:
    • get list of Apertium repositories
    • get list of contents for each repository

    After running you'll see this:

       $ python graph.py update
       Username:   YourUsername
       Password: 
       4%|██▌                          | 21/490 [00:04<01:44,  4.48it/s]
    

    Option 2: you don't have Github account or you're ok with current list of dictionaries. You skip updating and go to the next step.

    Step 2

    • Time : 3 min
    • Changed files : -
    • New files : 'dictionaries' folder with bilingual dictionaries

    Now you have list of dictionaries that will be downloaded. Just run the code below. It takes urls from download.txt and saves dictionaries in a new folder ('dictionaries').

    If you don't want to download all dictionaries (you want to use certain list of dictionaries) you can edit download.txt and save only those you want to use and then run the code.


       python graph.py download
    


    After running this you'll see:

       $ python graph.py download
       6%|███                           | 18/293 [00:09<02:26,  1.87it/s]
    

    This will take several minutes. Meanwhile you can open this directory and find the new 'dictionaries' folder where all files appear (about 290 files).


       $ python graph.py download
       100%|███████████████████████████| 293/293 [02:38<00:00,  1.85it/s]
    

    Now all files are downloaded.

    Step 3 : choose languages

    • Time : several seconds (if default), longer if user directory is used
    • Changed files : -
    • New files : filelist.txt

    Create list of dictionaries that will be used for translation.

    Option 1: if you downloaded them:

       python graph.py list
    

    Option 2: search for dictionaries in some local folder. Give some folder name:

       python graph.py list --path C:/
    

    Option 3: manually create 'filelist.txt' and write the absolute paths to the dictionaries in it.

    If you want to specify dialects (por, nor, cat, eng), add '--dialects True' to this command. This splits files by dialect, e.g. spa-cat -> spa-cat + spa-val

    Result example (filelist.txt):

       C:/Users/Username/Documents/GSoC_2018/dictionaries/apertium-afr-nld.afr-nld.dix
       C:/Users/Username/Documents/GSoC_2018/dictionaries/apertium-ara-heb.ara-heb.dix
    


    Step 4 : preprocessing

    • Time : 15 min
    • Changed files : -
    • New files : 'parsed' folder with changed bilingual dictionaries and 'monodix' folder with monolingual dictionaries containing all words of this language

    This step is needed because the data is complex, and using bilingual dictionaries directly is slow (files are parsed every time) and inaccurate (different tag variants, such as n vs n-m, for the same word). Preprocessing converts the files to a format that makes the later steps faster and more accurate.

    If you know for sure what set of dictionaries you want to use, edit filelist.txt and delete those that are not relevant. This will shorten time of preprocessing. In case you are not sure, using all files is recommended.


       python graph.py preprocessing
    

    Normal work looks like this:

       $ python graph.py preprocessing
       2018-08-04 21:39:18,174 | INFO : Started monolingual dictionaries
       100%|███████████████████████████████████████████████| 145/145 [03:14<00:00,  1.34s/it]
       2018-08-04 21:42:32,682 | INFO : Finished monolingual dictionaries
       2018-08-04 21:42:32,683 | INFO : Started bilingual dictionaries
       6%|██▉                                             | 18/289 [00:30<07:45,  1.72s/it]
    

    Step 5-8 : working with language pair

    Step 5

    • Time : several seconds
    • Changed files : -
    • New files : <lang1>-<lang2>-config file

    Configuration file (list of relevant languages). It is best to give the languages in the same order as in the name of the existing dictionary.

    For each dictionary a coefficient is computed:

    $$ x = \frac{1}{\log_{10}(10 + \mathrm{DictionaryLength})} $$

    where DictionaryLength = BothSides + 0.5 * LR + 0.5 * RL

    The tool uses these coefficients as edge lengths, so larger dictionaries give shorter edges. The languages in the result file are the nodes on the top 300 best (shortest) paths between the two languages. Each language is listed with the length of the shortest path that passes through it, and the languages are sorted by this value. The file is used in 'auto' mode, when you rely on the recommended languages instead of selecting them manually.
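The edge length formula can be checked with a short Python sketch. The function names and the entry counts in the loop are hypothetical and only illustrate the scale; the formula itself is the one given above.

```python
import math


def dictionary_length(both_sides, lr=0, rl=0):
    """Weighted entry count: one-directional entries (LR, RL) count as half."""
    return both_sides + 0.5 * lr + 0.5 * rl


def edge_length(both_sides, lr=0, rl=0):
    """Edge length x = 1 / log10(10 + DictionaryLength)."""
    return 1.0 / math.log10(10 + dictionary_length(both_sides, lr, rl))


# Hypothetical dictionary sizes, chosen only to show how the length shrinks.
for size in (0, 90, 9990, 99990):
    print(size, round(edge_length(size), 3))
```

The constant 10 inside the logarithm keeps the length finite: an empty dictionary gets length 1, and the length decreases slowly as the dictionary grows.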

       python graph.py config <lang1> <lang2>
    

    Example:

       python graph.py config eng spa
    

    Result (eng-spa-config):

       0.22082497988083025	eng	:	eng spa
       0.22082497988083025	spa	:	eng spa
       0.424779904070416	cat	:	eng cat spa
       0.44444829199452307	epo	:	eng epo spa
       0.4520428069728663	glg	:	eng glg spa
       0.4675262097740345	ita	:	eng ita spa
       0.4679562192970136	fra	:	eng fra spa
       ...
       0.7466386679162069	rus	:	eng rus epo spa
       0.7557074698418498	lat	:	eng lat spa
       ...
       1.0451794809559942	cos	:	eng ita cos spa
       1.051048358042505	dan	:	eng nor dan deu spa
    

    Step 6

    • Time : several seconds
    • Changed files : -
    • New files : <lang1>-<lang2> file

    Loading file (contains edges in graph). This file contains information for a graph used in translation.

       python graph.py load_file <lang1> <lang2> <n=10>
    

    Example:

       python graph.py load_file eng spa
       python graph.py load_file eng spa --n 10
    

    eng-spa loading file:

       ...
       	eng	dispensary	n$n-sg	spa	ambulatorio	n$n-m
       LR	eng	dispensation	n	spa	administración	n-f$n$n-f-sg
       ...
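A loading file can be read back with a few lines of Python. The sketch below assumes that the fields are separated by tabs, that an empty first field means the entry is valid in both directions, and that `$` separates tag variants; the names `Edge` and `parse_edge` are made up for this illustration.

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    direction: str      # "" (assumed: both directions), "LR" or "RL"
    lang1: str
    lemma1: str
    tags1: List[str]    # tag variants, split on "$"
    lang2: str
    lemma2: str
    tags2: List[str]


def parse_edge(line):
    """Split one tab-separated loading-file line into an Edge."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 tab-separated fields, got {len(fields)}")
    direction, lang1, lemma1, tags1, lang2, lemma2, tags2 = fields
    return Edge(direction, lang1, lemma1, tags1.split("$"),
                lang2, lemma2, tags2.split("$"))


# The two sample lines shown above.
sample = [
    "\teng\tdispensary\tn$n-sg\tspa\tambulatorio\tn$n-m",
    "LR\teng\tdispensation\tn\tspa\tadministración\tn-f$n$n-f-sg",
]
edges = [parse_edge(line) for line in sample]
for edge in edges:
    print(edge.direction or "both", edge.lemma1, edge.tags1, edge.lemma2, edge.tags2)
```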
    

    Step 7

    • Time : ~ 5 min
    • Changed files : -
    • New files : <lang1>-<lang2>-preview file

    Creates a preview file with translation coefficients. The coefficient of a candidate translation is:

    $$\sum_{i=1}^{N} e^{-\mathrm{len}(path_i)}$$

    where N is the number of simple paths.
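A minimal Python sketch of this coefficient, assuming that len(path) is the number of edges in the path (the source does not say whether edges are weighted here). The function name is hypothetical.

```python
import math


def translation_coefficient(path_lengths):
    """Sum of exp(-len(path)) over all simple paths found for one candidate."""
    return sum(math.exp(-length) for length in path_lengths)


# One path of three edges, as in the église example (FRA - CAT - ENG - RUS).
single = translation_coefficient([3])
# Two separate paths of three edges reinforce each other.
double = translation_coefficient([3, 3])
print(single)
print(double)
```

Under this assumption a candidate reached by a single three-edge path scores about 0.0498, which is consistent with the value 0.049787068367863944 that appears in the preview file.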


       python graph.py preview eng spa
       python graph.py preview eng spa --topn 10 --cutoff 4
    


    Topn - number of best translations for a word.

    Default 'auto' mode (from docstrings)

       "auto"
       
       If there are 10+ candidates returns those that have coefficient
       more than average. Usually there are top variants and other
       variants have very low coefficient. So it filters relevant
       candidates based on particular case coefficients
       
       If there are less than 10 candidates, adds coefficients with
       minimal coefficient to get more reliable data. And then it returns
       same top candidates.
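One possible reading of the 'auto' rule is sketched below. It is an interpretation of the docstring, not the tool's code: the docstring does not say which value is used for padding, so `pad_value` is a hypothetical parameter, and exp(-4) is used purely as an illustrative choice. The coefficients are copied from the 'casa' example in the lemma search output.

```python
import math


def auto_filter(candidates, pad_value, min_count=10):
    """Keep candidates whose coefficient is above the average.

    `candidates` maps a translation to its coefficient. With fewer than
    `min_count` candidates, the score list is padded with `pad_value`
    before averaging (assumed behaviour).
    """
    if not candidates:
        return {}
    scores = list(candidates.values())
    if len(scores) < min_count:
        scores += [pad_value] * (min_count - len(scores))
    average = sum(scores) / len(scores)
    return {word: score for word, score in candidates.items() if score > average}


casa = {
    "home [n_n-sg]": 0.08379690677106663,
    "house [n-ND]": 0.049787068367863944,
    "house [n_n-sg]": 0.0404276819945128,
    "publisher [n_n-ND]": 0.03852947988599058,
}
kept = auto_filter(casa, pad_value=math.exp(-4))
print(sorted(kept))

# With a long tail of weak candidates only the clear winner stays.
long_tail = {f"w{i}": 0.01 for i in range(10)}
long_tail["best"] = 1.0
print(list(auto_filter(long_tail, pad_value=0.0)))
```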
    


    Normal work:

       $ python graph.py preview eng spa
       2018-07-28 17:34:06,891 | INFO : Initialization (~1 min)
       100%|███████████████████████| 241407/241407 [01:45<00:00, 2296.07it/s]
       4%|█▌                     | 6065/136182 [00:03<01:12, 1806.89it/s]
    


    The two progress bars show a large difference in the number of words for the two languages (241407 vs 136182). This may be because English has more variance in tags, or because it has more words since it appears in many dictionaries. Languages with many dictionaries also contain a lot of names and other proper nouns.

    Preview file:

       New Delhi	np	Nueva Delhi	np	0.049787068367863944	0
       oblivion	n	olvido	n	0.4074898724158772	0.4142278194149627
    

    The first line means that there is a path from 'New Delhi' to 'Nueva Delhi', but only in this direction (the second coefficient is 0). The second line means that there are paths in both directions with good coefficients. You can edit the tags where there are several variants (n$n-f).
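To filter the preview file before conversion, its lines can be parsed as follows. This is a sketch that assumes six tab-separated fields per line (lemma, tags, lemma, tags, forward coefficient, backward coefficient); the function names are hypothetical.

```python
def parse_preview_line(line):
    """Return (lemma1, tags1, lemma2, tags2, forward, backward) from one line."""
    lemma1, tags1, lemma2, tags2, forward, backward = line.rstrip("\n").split("\t")
    return lemma1, tags1, lemma2, tags2, float(forward), float(backward)


def is_mutual(entry):
    """True when both coefficients are non-zero, i.e. paths exist in both directions."""
    return entry[4] > 0 and entry[5] > 0


# The two sample lines shown above.
lines = [
    "New Delhi\tnp\tNueva Delhi\tnp\t0.049787068367863944\t0",
    "oblivion\tn\tolvido\tn\t0.4074898724158772\t0.4142278194149627",
]
entries = [parse_preview_line(line) for line in lines]
for entry in entries:
    print(entry[0], "<->" if is_mutual(entry) else "->", entry[2])
```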

    Step 8

    • Time : ~ several seconds
    • Changed files : -
    • New files : <lang1>-<lang2>-new file

    Convert preview file into dix format (one section):

       python graph.py convert <lang1> <lang2>
       python graph.py convert eng spa
    


    This creates the resulting section for the bilingual dictionary (example with the 2 words above):




    Additional functions:

    Merging dialects

    • Time : ~ several seconds
    • Changed files : -
    • New files : <LANG1>-<LANG2> file

    If you used dialect splitting you can merge these languages:

       python graph.py merge --lang1 spa --lang2 cat cat_val_uni
    


    This creates file with several sections with specified dialect tags.


    Evaluation

    • Time : 4 min for each iteration
    • Changed files : config file (new), loading file (new)
    • New files : -

    Evaluation (precision, recall, f1):

       python graph.py eval <lang1> <lang2> <n=10> <cutoff=4> <n_iter=3> <topn=None (auto)>
    
    • n : number of best languages to use
    • cutoff : maximum length of the paths used (4 recommended)
    • n_iter : number of evaluation iterations (words are sampled at random, so results differ slightly between iterations)
    • topn : how many top results for each word count as correct (default=None: everything returned in auto mode; 1: only the best match counts as correct, other variants that may be shown do not)

    Examples (with defaults, then with explicit parameters):

       python graph.py eval eng spa
       python graph.py eval eng spa --n 10 --cutoff 4 --n_iter  10 --topn 1
    


    Normal work:

       $ python graph.py eval eng spa
       2018-07-28 20:09:08,797 | INFO : Start ~ 20 s
       2018-07-28 20:09:28,174 | INFO : Initialization 1 ~ 1 min
       100%|██████████████████████████████████████| 1000/1000 [02:49<00:00,  5.91it/s]
       N=1000
       Precision : 0.9811912225705329, recall : 0.939, f1-score : 0.9596320899335717
       ...
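The F1-score in this output is the usual harmonic mean of precision and recall, which can be verified with the reported numbers. The function name is chosen for this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation output above.
precision, recall = 0.9811912225705329, 0.939
print(f1_score(precision, recall))
```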
    


    Addition

    • Time : 4 min for each iteration
    • Changed files : config file (new), loading file (new)
    • New files : -

    Check how many entries can be added (counted separately for each direction of the language pair):

       python graph.py add <lang1> <lang2> <n=10> <cutoff=4>
    


       python graph.py add eng spa
       python graph.py add eng spa --n 10 --cutoff 4
    


    Normal work:

       $ python graph.py add eng spa
       2018-07-29 10:22:13,878 | INFO : Initialization ~ 1 min
       100%|████████████████████████████████| 241407/241407 [00:24<00:00, 9662.90it/s]
       eng->spa    Exist: 29483, failed: 19083, NEW: 15077 +51.0%, NA: 177764
       100%|████████████████████████████████| 136182/136182 [00:29<00:00, 4682.65it/s]
       spa->eng    Exist: 30342, failed: 63086, NEW: 19197 +63.0%, NA: 23557
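The summary lines can be parsed to compare runs. The sketch below only assumes the line layout shown above. In this output the four counts add up to the number of items in the progress bar, and the percentage appears to be NEW relative to Exist. The names `SUMMARY` and `parse_summary` are hypothetical.

```python
import re

SUMMARY = re.compile(
    r"(?P<src>\w+)->(?P<dst>\w+)\s+Exist: (?P<exist>\d+), failed: (?P<failed>\d+), "
    r"NEW: (?P<new>\d+) \+(?P<gain>[\d.]+)%, NA: (?P<na>\d+)"
)


def parse_summary(line):
    """Return the counts of one summary line plus their total."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError(f"not a summary line: {line!r}")
    counts = {key: int(match[key]) for key in ("exist", "failed", "new", "na")}
    counts["total"] = sum(counts.values())
    return counts


# Lines copied from the output above.
eng_spa = parse_summary("eng->spa    Exist: 29483, failed: 19083, NEW: 15077 +51.0%, NA: 177764")
spa_eng = parse_summary("spa->eng    Exist: 30342, failed: 63086, NEW: 19197 +63.0%, NA: 23557")
print(eng_spa["total"], round(100 * eng_spa["new"] / eng_spa["exist"]))
print(spa_eng["total"], round(100 * spa_eng["new"] / spa_eng["exist"]))
```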
    


    Example lemma search

    • Time : 1 min + some time (seconds) for translation
    • Changed files : -
    • New files : some output file or no files (if stdout)

    Arguments:

    • lang1, lang2 - languages
    • --config - start from creating configuration file
    • --load - start from creating loading file with existing configuration file
    • --cutoff - cutoff
    • --topn - topn parameter (int or None by default 'auto' mode)
    • --n - if create loading file, how many top languages we use
    • --input - file with words one per line or with spaces
    • --output - output file, default=sys.stdout

    Example:

       python graph.py example eng spa --lang spa --input input.txt --output output.txt
    


    input.txt

       casa
       sangre
       frutilla
    


    Command line:

       $ python graph.py example eng spa --lang spa --input input.txt --output output.txt
       2018-07-30 11:36:13,198 | INFO : Initialization ~1 min
       2018-07-30 11:37:12,922 | INFO : Translating
       100%|████████████████████████████████████████████| 3/3 [00:00<00:00,  7.08it/s]
    


    output.txt

       Lemma: casa
           spa$casa$[n-f-ND]
       eng$home$[n_n-sg]	0.08379690677106663
       eng$house$[n-ND]	0.049787068367863944
       eng$house$[n_n-sg]	0.0404276819945128
       eng$publisher$[n_n-ND]	0.03852947988599058
       
           spa$casa$[n-f_n_n-f-sg]
       eng$house$[n_n-sg]	1.438577234023077
       eng$home$[n_n-sg]	1.330046844771966
       
       ---------------------------------------------
       Lemma: sangre
           spa$sangre$[n-f_n_n-f-sg]
       eng$blood$[n_n-unc]	1.109995929768636
       
           spa$sangre$[n-f-ND]
       eng$blood$[n_n-unc]	0.052005373884161515
       eng$blood$[n-ND]	0.049787068367863944
       
       ---------------------------------------------
       Lemma: frutilla
       ---------------------------------------------
    


    If only an input file is specified, this content is printed to stdout.

    Choosing parameters

    If you want to choose parameters like cutoff, number of languages and top-N number you can run grid search with all combinations of parameters. It shows how many entries you can add, precision, recall and f1-score.
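The size of such a grid is easy to estimate beforehand. The sketch below uses the parameter values from the example command in this section and assumes, as the output layout suggests, that each (n, cutoff) pair is evaluated once for every topn value plus 'auto' mode. The variable names are illustrative.

```python
from itertools import product

n_values = [3, 5, 7, 9, 12]
cutoff_values = [2, 3, 4, 5]
topn_values = [1, 2, 5, None]  # None stands for 'auto' mode, which is always added

combinations = list(product(n_values, cutoff_values))
print(len(combinations), "combinations of n and cutoff")
print(len(combinations) * len(topn_values), "evaluations in total")

# Preview of the first two combinations in the order they would be visited.
for n, cutoff in combinations[:2]:
    print(f"n: {n}    cutoff: {cutoff}")
    for topn in topn_values:
        print(f"  topn: {topn}")
```

Since the running time grows with the number of combinations, it may help to start with a small grid and widen it around the best values.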

    • Time : from several minutes (depending on how many combinations of parameters)
    • Changed or new files : config file (new), loading file (new)

    Evaluation (precision, recall, f1) + addition:

       python graph.py grid <lang1> <lang2> <n=10> <cutoff=4> <topn=None (auto)>
    
    • n : number of best languages to use
    • cutoff : maximum length of the paths used
    • topn : how many top results for each word count as correct (default=None: everything returned in auto mode; 1: only the best match counts as correct, other variants that may be shown do not) ('auto' mode is always added)

    Example (with explicit parameters):


       python graph.py grid eng spa --n 3 5 7 9 12 --cutoff 2 3 4 5 --topn 1 2 5
    


    Normal work:

       $ python graph.py grid eng spa --n 3 5 7 9 12 --cutoff 2 3 4 5 --topn 1 2 5
       n: 3    cutoff: 2
       eng->spa    Exist: 29422, failed: 14818, NEW: 4497 +15.0%, NA: 197440
       spa->eng    Exist: 30331, failed: 26858, NEW: 4187 +14.0%, NA: 74806
       topn: 1  N items: 1000          Precision : 0.8942505133470225, recall : 0.871, f1-score : 0.8824721377912866
       topn: 2  N items: 1000          Precision : 0.9565656565656566, recall : 0.947, f1-score : 0.9517587939698493
       topn: 5  N items: 1000          Precision : 0.9622833843017329, recall : 0.944, f1-score : 0.9530540131246845
       topn: None       N items: 1000          Precision : 0.10256410256410256, recall : 0.004, f1-score : 0.0076997112608277185
       ===============================================================
       ... some combinations...
       ===============================================================
       n: 9    cutoff: 4
       eng->spa    Exist: 29422, failed: 19166, NEW: 15030 +51.0%, NA: 182559
       spa->eng    Exist: 30331, failed: 62037, NEW: 18644 +61.0%, NA: 25170
       topn: 1  N items: 1000          Precision : 0.8862660944206009, recall : 0.826, f1-score : 0.855072463768116
       topn: 2  N items: 1000          Precision : 0.9614583333333333, recall : 0.923, f1-score : 0.9418367346938776
       topn: 5  N items: 1000          Precision : 0.9905759162303664, recall : 0.946, f1-score : 0.9677749360613811
       topn: None       N items: 1000          Precision : 0.9728317659352143, recall : 0.931, f1-score : 0.9514563106796118
       ===============================================================
       ... etc...