Bilingual dictionary enrichment via graph completion



Step 0 : installing

Mind the Python version: this tool requires Python 3.


  • Requests : downloading dictionaries
  • PyGithub : working with the Github API (updating the list of dictionaries)
  • tqdm : progress bars
  • NetworkX : graph library

   pip install requests
   pip install networkx
   pip install pygithub
   pip install tqdm

Clone the repository.


   |--- …             = some additional data (language codes, wrong filenames)
   |--- …             = markdown file with function descriptions
   |--- …             = main file with functions
   |--- .gitignore    = gitignore
   |--- …             = readme
   |--- download.txt  = list of files, usable as-is if one doesn't have a Github account for updating
   |--- …             = command line interface

To use this tool from the command line, use the command line interface file.

This tool works with Apertium bilingual dictionaries. It is recommended to download all of them, but you can also use only those you have locally.

Step 1-2 : downloading

Step 1

  • Time : 2 min
  • Changed files : download.txt
  • New files : -
  • download.txt contains the list of URLs of bilingual dictionaries from the Apertium Github: ...

    Option 1: you have a Github account and you want an absolutely up-to-date list of dictionaries; then you need to update this list. Github has an API that allows going through repositories, but the number of unauthorized queries is very small (60), which is not enough. So the update command takes your credentials:

       python update <username> <password>

    Your username and password are not stored; they are used only once. Actions:
    • get the list of Apertium repositories
    • get the list of contents for each repository

    After running you'll see this:

       $ python update <your username> <your password>
       4%|██▌                          | 21/490 [00:04<01:44,  4.48it/s]
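
    For reference, the update step can be approximated with PyGithub roughly like this (a minimal sketch assuming the Apertium organization layout; the real script may differ):

       from github import Github
       # Unauthenticated clients get only 60 requests per hour, which is
       # not enough to walk ~500 repositories, hence the credentials.
       g = Github("<username>", "<password>")
       urls = []
       for repo in g.get_organization("apertium").get_repos():
           # Language-pair repositories keep their .dix files at top level.
           for item in repo.get_contents(""):
               if item.name.endswith(".dix"):
                   urls.append(item.download_url)
       with open("download.txt", "w", encoding="utf-8") as f:
           f.write("\n".join(urls))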

    Option 2: you don't have a Github account, or you're fine with the current list of dictionaries. Skip updating and go to the next step.

    Step 2

    • Time : 3 min
    • Changed files : -
    • New files : 'dictionaries' folder with bilingual dictionaries

    Now you have the list of dictionaries that will be downloaded. Just run the command below. It takes URLs from download.txt and saves the dictionaries in a new folder ('dictionaries').

    If you don't want to download all dictionaries (you want to use only a certain set), edit download.txt, keep only those you want, and then run the command.

       python download

    After running this you'll see:

       $ python download
       6%|███                           | 18/293 [00:09<02:26,  1.87it/s]

    This will take several minutes. Meanwhile you can open this directory and find the new 'dictionaries' folder where all the files appear (about 290 files).

       $ python download
       100%|███████████████████████████| 293/293 [02:38<00:00,  1.85it/s]

    Now all files are downloaded.
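
    Internally the download step is roughly equivalent to this sketch (requests + tqdm; the actual script may handle errors and filenames differently):

       import os
       import requests
       from tqdm import tqdm
       # Read the URL list and fetch each dictionary into 'dictionaries'.
       os.makedirs("dictionaries", exist_ok=True)
       with open("download.txt", encoding="utf-8") as f:
           urls = [line.strip() for line in f if line.strip()]
       for url in tqdm(urls):
           response = requests.get(url)
           response.raise_for_status()
           name = url.rsplit("/", 1)[-1]
           with open(os.path.join("dictionaries", name), "wb") as out:
               out.write(response.content)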

    Step 3 : choose languages

    • Time : several seconds (with defaults), longer if a user directory is scanned
    • Changed files : -
    • New files : filelist.txt

    Create the list of dictionaries that will be used for translation.

    Option 1: you downloaded the dictionaries in Step 2:

       python list

    Option 2: search for dictionaries in some local folder, giving its path:

       python list --path C:/

    Option 3: manually create 'filelist.txt' and write the absolute paths to these dictionaries.

    If you want to split out dialects (por, nor, cat, eng), add '--dialects True' to this command. This will divide files, e.g. spa-cat -> spa-cat + spa-val.
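
    What 'python list --path <dir>' does can be sketched like this (an illustration only; the real command also handles the default 'dictionaries' folder and the dialect splitting):

       import os
       # Collect absolute paths of all .dix files under a root folder
       # and write them to filelist.txt, one per line.
       root = "C:/"  # hypothetical search root
       with open("filelist.txt", "w", encoding="utf-8") as out:
           for dirpath, _, filenames in os.walk(root):
               for name in filenames:
                   if name.endswith(".dix"):
                       out.write(os.path.abspath(os.path.join(dirpath, name)) + "\n")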

    Result example (filelist.txt):


    Step 4 : preprocessing

    • Time : 2 h
    • Changed files : -
    • New files : 'parsed' folder with converted bilingual dictionaries and 'monodix' folder with monolingual dictionaries containing all words of each language

    If you know for sure which set of dictionaries you want to use, edit filelist.txt and delete the irrelevant ones; this will shorten the preprocessing time. If you are not sure, using all files is recommended.

       python preprocessing

    Normal work looks like this:

       C:\Users\Glaz\Documents\GitHub\GSoC_2018>python preprocessing
       2018-07-27 14:05:38,581 | INFO : Started monolingual dictionaries
       2018-07-27 14:10:21,849 | INFO : Finished monolingual dictionaries
       2018-07-27 14:10:21,883 | INFO : Started bilingual dictionaries
       15%|█████▊                                 | 43/288 [11:41<1:06:34, 16.31s/it]
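
    Preprocessing parses the .dix XML files. Extracting the entries of one bilingual dictionary can be sketched with the standard library (a simplified illustration; the real step also normalizes tags and builds the monolingual word lists; the filename is just an example):

       import xml.etree.ElementTree as ET
       def parse_bidix(path):
           """Yield (left lemma, right lemma, restriction) per entry."""
           for entry in ET.parse(path).iter("e"):
               pair = entry.find("p")
               if pair is None:  # skip identity entries without <p>
                   continue
               left, right = pair.find("l"), pair.find("r")
               # The r attribute is 'LR', 'RL', or absent (both directions).
               yield ("".join(left.itertext()),
                      "".join(right.itertext()),
                      entry.get("r"))
       for l, r, restriction in parse_bidix("dictionaries/apertium-eng-spa.eng-spa.dix"):
           print(restriction or "", l, r)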

    Step 5-8 : working with a language pair

    Step 5

    • Time : several seconds
    • Changed files : -
    • New files : <lang1>-<lang2>-config file

    Creates the configuration file (a list of relevant languages). It's best to give the two languages in the same order as in the existing dictionary name. Each dictionary gets a weight

    $$ x = \frac{1}{\log_{10}(10 + DictionaryLength)} $$

    where DictionaryLength = BothSides + 0.5 * LR + 0.5 * RL (the counts of bidirectional, left-to-right only and right-to-left only entries in that dictionary).

    These coefficients are used as edge lengths. The languages in the result file are the nodes of the top-300 best (shortest) paths between the two languages. The length of the shortest path through a language is that language's coefficient, and the languages are sorted by it. This is used in 'auto' mode, when you don't select languages manually and rely on the recommended ones.

       python config <lang1> <lang2>


       python config eng spa

    Result (eng-spa-config):

       0.22082497988083025	eng	:	eng spa
       0.22082497988083025	spa	:	eng spa
       0.424779904070416	cat	:	eng cat spa
       0.44444829199452307	epo	:	eng epo spa
       0.4520428069728663	glg	:	eng glg spa
       0.4675262097740345	ita	:	eng ita spa
       0.4679562192970136	fra	:	eng fra spa
       0.7466386679162069	rus	:	eng rus epo spa
       0.7557074698418498	lat	:	eng lat spa
       1.0451794809559942	cos	:	eng ita cos spa
       1.051048358042505	dan	:	eng nor dan deu spa
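
    The ranking can be reproduced in miniature with NetworkX (the dictionary sizes below are made up for illustration):

       import math
       import networkx as nx
       def edge_length(both_sides, lr, rl):
           # The formula above: bigger dictionaries give shorter edges.
           return 1 / math.log10(10 + both_sides + 0.5 * lr + 0.5 * rl)
       G = nx.Graph()
       G.add_edge("eng", "spa", weight=edge_length(20000, 1000, 1200))
       G.add_edge("eng", "cat", weight=edge_length(8000, 300, 250))
       G.add_edge("cat", "spa", weight=edge_length(30000, 500, 400))
       # Enumerate paths between the two languages, shortest first; the
       # best path through a language gives that language's coefficient.
       for path in nx.shortest_simple_paths(G, "eng", "spa", weight="weight"):
           print(nx.path_weight(G, path, weight="weight"), path)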

    Step 6

    • Time : several seconds
    • Changed files : -
    • New files : <lang1>-<lang2> file

    Creates the loading file, which contains the edges of the graph used in translation.

       python load_file <lang1> <lang2> <n=10>


       python load_file eng spa
       python load_file eng spa --n 10

    eng-spa loading file:

       	eng	dispensary	n$n-sg	spa	ambulatorio	n$n-m
       LR	eng	dispensation	n	spa	administración	n-f$n$n-f-sg
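
    Judging by the example above, each line holds an optional direction mark and two (language, lemma, tags) triples. A hedged sketch of reading such a file into a graph (the field semantics are inferred, not taken from the source):

       import networkx as nx
       G = nx.DiGraph()
       with open("eng-spa", encoding="utf-8") as f:
           for line in f:
               direction, l1, w1, t1, l2, w2, t2 = line.rstrip("\n").split("\t")
               u, v = (l1, w1, t1), (l2, w2, t2)
               if direction in ("", "LR"):  # empty mark: both directions
                   G.add_edge(u, v)
               if direction in ("", "RL"):
                   G.add_edge(v, u)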

    Step 7

    • Time : ~ 5 min
    • Changed files : -
    • New files : <lang1>-<lang2>-preview file

    Creates the preview file with translation coefficients:

    $$ \sum_{i=1}^{N_{\text{simple paths}}} e^{-\mathrm{len}(path_i)} $$

       python preview eng spa
       python preview eng spa --topn 10 --cutoff 4

    topn is the number of best translations kept for a word.
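
    The coefficient can be illustrated on a toy graph (not the tool's actual code; nodes are simplified to (language, word) pairs):

       import math
       import networkx as nx
       G = nx.Graph()
       G.add_edge(("eng", "oblivion"), ("spa", "olvido"))
       G.add_edge(("eng", "oblivion"), ("epo", "forgeso"))
       G.add_edge(("epo", "forgeso"), ("spa", "olvido"))
       def coefficient(G, source, target, cutoff=4):
           # Sum exp(-len(path)) over all simple paths up to the cutoff.
           return sum(math.exp(-(len(path) - 1))  # edges = nodes - 1
                      for path in nx.all_simple_paths(G, source, target, cutoff=cutoff))
       print(coefficient(G, ("eng", "oblivion"), ("spa", "olvido")))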

    Default 'auto' mode (from the docstrings):

       If there are 10+ candidates, returns those whose coefficient is
       above average. Usually there are a few top variants and the other
       variants have very low coefficients, so relevant candidates are
       filtered based on the coefficients of the particular case.
       If there are fewer than 10 candidates, pads the list with the
       minimal coefficient to get more reliable data, and then returns
       the same top candidates.
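
    One possible reading of that docstring in code (a hedged sketch, not the tool's actual implementation):

       def filter_auto(candidates):
           """candidates: (translation, coefficient) pairs, best first."""
           if not candidates:
               return []
           coeffs = [c for _, c in candidates]
           if len(coeffs) < 10:
               # Pad with the minimal coefficient so the average
               # threshold is more reliable for short candidate lists.
               coeffs += [min(coeffs)] * (10 - len(coeffs))
           average = sum(coeffs) / len(coeffs)
           return [(t, c) for t, c in candidates if c > average]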

    Normal work:

       $ python preview eng spa
       2018-07-28 17:34:06,891 | INFO : Initialization (~1 min)
       100%|███████████████████████| 241407/241407 [01:45<00:00, 2296.07it/s]
       4%|█▌                     | 6065/136182 [00:03<01:12, 1806.89it/s]

    Here you can see a large difference in the number of words on the two sides. It can be explained by greater tag variance in English, or simply by English appearing in more dictionaries and therefore having more words. Well-covered languages also contain a lot of names and other proper nouns.

    Preview file:

       New Delhi	np	Nueva Delhi	np	0.049787068367863944	0
       oblivion	n	olvido	n	0.4074898724158772	0.4142278194149627

    The first line means that there is a path from 'New Delhi' to 'Nueva Delhi', but only in that direction. The second line means there are mutual paths with a good coefficient. You can edit the tags where there are several variants (n$n-f).

    Step 8

    • Time : ~ several seconds
    • Changed files : -
    • New files : <lang1>-<lang2>-new file

    Converts the preview file into dix format (one section):

       python convert <lang1> <lang2>
       python convert eng spa

    This creates the result section for the bilingual dictionary (example with the 2 words above):
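
    A hedged sketch of what one entry looks like in dix format (tag handling simplified; multiword spaces become <b/> as usual in Apertium):

       def to_dix_entry(lemma1, tags1, lemma2, tags2, one_way=False):
           s1 = "".join('<s n="%s"/>' % t for t in tags1.split("_"))
           s2 = "".join('<s n="%s"/>' % t for t in tags2.split("_"))
           restriction = ' r="LR"' if one_way else ""
           left = lemma1.replace(" ", "<b/>") + s1
           right = lemma2.replace(" ", "<b/>") + s2
           return "<e%s><p><l>%s</l><r>%s</r></p></e>" % (restriction, left, right)
       print(to_dix_entry("New Delhi", "np", "Nueva Delhi", "np", one_way=True))
       # <e r="LR"><p><l>New<b/>Delhi<s n="np"/></l><r>Nueva<b/>Delhi<s n="np"/></r></p></e>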

    Additional functions:

    Merging dialects

    • Time : ~ several seconds
    • Changed files : -
    • New files : <LANG1>-<LANG2> file

    If you used dialect splitting, you can merge these languages back:

       python merge --lang1 spa --lang2 cat cat_val_uni

    This creates a file with several sections with the specified dialect tags.

    Evaluation

    • Time : 4 min for each iteration
    • Changed files : config file (new), loading file (new)
    • New files : -

    Evaluation (precision, recall, f1):

       python eval <lang1> <lang2> <n=10> <cutoff=4> <n_iter=3> <topn=None (auto)>
    • n : number of best languages to use
    • cutoff : maximum path length to use (4 recommended)
    • n_iter : how many iterations of evaluation to run (words are sampled at random, so results differ slightly between runs)
    • topn : how many top results per word count as correct (default=None : everything returned in 'auto' mode counts; 1 : only the best match counts and the other shown variants don't)

    Examples (with defaults, and with parameters set explicitly):

       python eval eng spa
       python eval eng spa --n 10 --cutoff 4 --n_iter  10 --topn 1

    Normal work:

       $ python eval eng spa
       2018-07-28 20:09:08,797 | INFO : Start ~ 20 s
       2018-07-28 20:09:28,174 | INFO : Initialization 1 ~ 1 min
       100%|██████████████████████████████████████| 1000/1000 [02:49<00:00,  5.91it/s]
       Precision : 0.9811912225705329, recall : 0.939, f1-score : 0.9596320899335717
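
    The reported f1-score is the usual harmonic mean of precision and recall, e.g. for the run above:

       def f1(precision, recall):
           return 2 * precision * recall / (precision + recall)
       print(f1(0.9811912225705329, 0.939))  # ~0.9596, as in the log above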

    Adding entries

    • Time : 4 min for each iteration
    • Changed files : config file (new), loading file (new)
    • New files : -

    Checks how many entries can be added (one-directional, for both languages):

       python add <lang1> <lang2> <n=10> <cutoff=4>

       python add eng spa
       python add eng spa --n 10 --cutoff 4

    Normal work:

       $ python add eng spa
       2018-07-29 10:22:13,878 | INFO : Initialization ~ 1 min
       100%|████████████████████████████████| 241407/241407 [00:24<00:00, 9662.90it/s]
       eng->spa    Exist: 29483, failed: 19083, NEW: 15077 +51.0%, NA: 177764
       100%|████████████████████████████████| 136182/136182 [00:29<00:00, 4682.65it/s]
       spa->eng    Exist: 30342, failed: 63086, NEW: 19197 +63.0%, NA: 23557
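
    The percentage appears to be NEW relative to the already existing entries (an inference from the numbers above):

       print(round(100 * 15077 / 29483))  # 51 -> "+51.0%" for eng->spa
       print(round(100 * 19197 / 30342))  # 63 -> "+63.0%" for spa->eng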

    Example lemma search

    • Time : 1 min + a few seconds for the translation itself
    • Changed files : -
    • New files : an output file, or none (if stdout)

    Parameters:
    • lang1, lang2 - languages
    • --config - start from creating the configuration file
    • --load - start from creating the loading file, with an existing configuration file
    • --cutoff - cutoff (maximum path length)
    • --topn - topn parameter (int, or None for the default 'auto' mode)
    • --n - when creating the loading file, how many top languages to use
    • --input - file with words, one per line or separated by spaces
    • --output - output file, default=sys.stdout


       python example eng spa --lang spa --input input.txt --output output.txt



    Command line:

       $ python example eng spa --lang spa --input input.txt --output output.txt
       2018-07-30 11:36:13,198 | INFO : Initialization ~1 min
       2018-07-30 11:37:12,922 | INFO : Translating
       100%|████████████████████████████████████████████| 3/3 [00:00<00:00,  7.08it/s]


       Lemma: casa
       eng$home$[n_n-sg]	0.08379690677106663
       eng$house$[n-ND]	0.049787068367863944
       eng$house$[n_n-sg]	0.0404276819945128
       eng$publisher$[n_n-ND]	0.03852947988599058
       eng$house$[n_n-sg]	1.438577234023077
       eng$home$[n_n-sg]	1.330046844771966
       Lemma: sangre
       eng$blood$[n_n-unc]	1.109995929768636
       eng$blood$[n_n-unc]	0.052005373884161515
       eng$blood$[n-ND]	0.049787068367863944
       Lemma: frutilla

    If only the input file is specified (no output file), the output is printed to stdout.