Bilingual dictionary enrichment via graph completion

From Apertium

Instruction

Step 0 : installing

Mind the Python version: this tool requires Python 3.

Libraries:

  • Requests : downloading dictionaries
  • PyGithub (imported as github) : working with the GitHub API to list repositories and their contents
  • tqdm : progress bars
  • NetworkX : graph library


   pip install requests
   pip install networkx
   pip install pygithub
   pip install tqdm
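
Before going further, it can help to confirm that the four libraries are importable. The sketch below is not part of the tool: `missing_modules` is a hypothetical helper, and the import names are the ones these packages normally install (PyGithub is imported as `github`).

```python
import importlib.util

# Import names of the libraries listed above (PyGithub is imported as "github").
REQUIRED = ["requests", "networkx", "github", "tqdm"]


def missing_modules(names=REQUIRED):
    """Return the names from `names` that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing:", ", ".join(missing) if missing else "none")
```

Any name reported as missing can be installed with the corresponding pip command above.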



Clone the repository.

Files:

   /tool/
   |--- __init__.py
   |--- data.py   = some additional data  (language codes, wrong filenames)
   |--- docs.md   = markdown file with function descriptions
   |--- func.py   = main file with functions
   
   .gitignore     = gitignore
   README.md      = readme
   download.txt   = list of dictionary URLs, for use when you have no GitHub account to update it
   graph.py       = command line interface

To use this tool from the command line, run graph.py.

The tool works with Apertium bilingual dictionaries. Downloading all dictionaries is recommended, but you can also use only those you already have locally.

Step 1-2 : downloading

Step 1

  • Time : 2 min
  • Changed files : download.txt
  • New files : -
  • download.txt contains a list of URLs of bilingual dictionaries from the Apertium GitHub organization:

       https://raw.githubusercontent.com/apertium/apertium-afr-nld/master/apertium-afr-nld.afr-nld.dix
       https://raw.githubusercontent.com/apertium/apertium-ara-heb/master/apertium-ara-heb.ara-heb.dix
       ...

    Option 1: you have a GitHub account and want a fully up-to-date list of dictionaries. In this case you need to update the list. GitHub has an API for walking through repositories, but the limit on unauthenticated requests is very small (60 per hour), which is not enough for this task, so authentication is required:

       python graph.py update <username> <password>

    Your username and password are not stored; they are used only once. Actions:
    • get list of Apertium repositories
    • get list of contents for each repository

    After running you'll see this:

       $ python graph.py update <your username> <your password>
       4%|██▌                          | 21/490 [00:04<01:44,  4.48it/s]
    

    Option 2: you don't have a GitHub account, or you are happy with the current list of dictionaries. Skip the update and go to the next step.
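
The two URLs shown above follow one naming pattern, which can be written down as a small function. This is only a sketch: `raw_dix_url` is a hypothetical helper, the `master` branch is taken from the examples, and not every repository follows the pattern (data.py exists partly to handle wrong filenames), so download.txt remains the authoritative list.

```python
def raw_dix_url(lang1, lang2, branch="master"):
    """Build the raw GitHub URL of a bilingual dictionary, assuming the usual naming."""
    pair = f"{lang1}-{lang2}"
    repo = f"apertium-{pair}"
    return (
        "https://raw.githubusercontent.com/apertium/"
        f"{repo}/{branch}/{repo}.{pair}.dix"
    )


if __name__ == "__main__":
    print(raw_dix_url("afr", "nld"))
```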

    Step 2

    • Time : 3 min
    • Changed files : -
    • New files : 'dictionaries' folder with bilingual dictionaries

    You now have the list of dictionaries to download. Run the command below: it takes the URLs from download.txt and saves the dictionaries in a new folder ('dictionaries').

    If you only want a certain set of dictionaries, edit download.txt so that it keeps only the URLs you need, and then run the command.


       python graph.py download
    


    After running this you'll see:

       $ python graph.py download
       6%|███                           | 18/293 [00:09<02:26,  1.87it/s]
    

    This will take several minutes. Meanwhile, you can open the directory and find the new 'dictionaries' folder, where the files appear as they are downloaded (about 290 files).


       $ python graph.py download
       100%|███████████████████████████| 293/293 [02:38<00:00,  1.85it/s]
    

    Now all files are downloaded.
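
For orientation, the download step amounts to roughly the following. This is a minimal sketch rather than the code in func.py: `filename_from_url` and `download_all` are hypothetical names, and the `fetch` parameter is an assumption added so the logic can be tried without network access.

```python
import os


def filename_from_url(url):
    """Use the last path component of the URL as the local file name."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def download_all(urls, folder="dictionaries", fetch=None):
    """Save every URL into `folder` and return the list of saved paths."""
    if fetch is None:
        import requests  # only needed for real downloads

        def fetch(url):
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.content

    os.makedirs(folder, exist_ok=True)
    saved = []
    for url in urls:
        path = os.path.join(folder, filename_from_url(url))
        with open(path, "wb") as handle:
            handle.write(fetch(url))
        saved.append(path)
    return saved
```

With the default `fetch`, passing the non-empty lines of download.txt as `urls` would reproduce the behaviour described above, minus the progress bar.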

    Step 3 : choose languages

    • Time : several seconds (if default), longer if user directory is used
    • Changed files : -
    • New files : filelist.txt

    Create the list of dictionaries that will be used for translation.

    Option 1: use the dictionaries you downloaded:

       python graph.py list
    

    Option 2: search for dictionaries in a local folder by giving its path:

       python graph.py list --path C:/
    

    Option 3: manually create 'filelist.txt' and write the absolute paths to the dictionaries in it.

    To distinguish dialects (por, nor, cat, eng), add '--dialects True' to this command. This splits dictionary files by dialect, e.g. spa-cat -> spa-cat + spa-val.

    Result example (filelist.txt):

       C:/Users/Username/Documents/GSoC_2018/dictionaries/apertium-afr-nld.afr-nld.dix
       C:/Users/Username/Documents/GSoC_2018/dictionaries/apertium-ara-heb.ara-heb.dix
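
If you build filelist.txt by hand (Option 3), a short script can collect the paths. The sketch below assumes that bilingual dictionaries are named like apertium-afr-nld.afr-nld.dix; `find_bidix` and `write_filelist` are hypothetical helpers, and the tool's own search may apply different rules.

```python
import os
import re

# Assumed naming: apertium-<l1>-<l2>.<l1>-<l2>.dix (monolingual files do not match).
BIDIX_NAME = re.compile(r"^apertium-(\w+)-(\w+)\.\1-\2\.dix$")


def find_bidix(folder):
    """Return sorted absolute paths of bilingual dictionaries under `folder`."""
    found = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if BIDIX_NAME.match(name):
                found.append(os.path.abspath(os.path.join(root, name)))
    return sorted(found)


def write_filelist(folder, target="filelist.txt"):
    """Write one absolute path per line to `target` and return the paths."""
    paths = find_bidix(folder)
    with open(target, "w", encoding="utf-8") as handle:
        handle.write("\n".join(paths) + "\n")
    return paths
```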
    


    Step 4 : preprocessing

    • Time : 2 h
    • Changed files : -
    • New files : 'parsed' folder with the processed bilingual dictionaries, and 'monodix' folder with monolingual dictionaries containing all words of each language

    If you know exactly which dictionaries you want to use, edit filelist.txt and delete the ones that are not relevant. This shortens preprocessing. If you are not sure, using all files is recommended.


       python graph.py preprocessing
    

    Normal work looks like this:

       C:\Users\Glaz\Documents\GitHub\GSoC_2018>python graph.py preprocessing
       2018-07-27 14:05:38,581 | INFO : Started monolingual dictionaries
       2018-07-27 14:10:21,849 | INFO : Finished monolingual dictionaries
       2018-07-27 14:10:21,883 | INFO : Started bilingual dictionaries
       15%|█████▊                                 | 43/288 [11:41<1:06:34, 16.31s/it]
    


    Step 5-8 : working with language pair

    Step 5

    • Time : several seconds
    • Changed files : -
    • New files : <lang1>-<lang2>-config file

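    This step creates a configuration file: a list of languages relevant to the pair. It is best to give the two languages in the same order as in the name of the existing dictionary (e.g. eng spa).

    Each bilingual dictionary gets a coefficient:

    $$ x = \frac{1}{\log_{10}(10 + \mathrm{DictionaryLength})} $$

    where DictionaryLength = BothSides + 0.5 * LR + 0.5 * RL, i.e. entries valid in both directions count fully and one-directional (LR or RL) entries count as half.

    These coefficients are used as edge lengths in a graph of languages. The languages in the result file are the nodes on the top 300 best (shortest) paths between the two languages. The coefficient of a language is the length of the shortest path that passes through it, and the languages are sorted by this coefficient. This ranking is used in 'auto' mode, when you use the recommended languages instead of selecting them manually.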

       python graph.py config <lang1> <lang2>
    

    Example:

       python graph.py config eng spa
    

    Result (eng-spa-config):

       0.22082497988083025	eng	:	eng spa
       0.22082497988083025	spa	:	eng spa
       0.424779904070416	cat	:	eng cat spa
       0.44444829199452307	epo	:	eng epo spa
       0.4520428069728663	glg	:	eng glg spa
       0.4675262097740345	ita	:	eng ita spa
       0.4679562192970136	fra	:	eng fra spa
       ...
       0.7466386679162069	rus	:	eng rus epo spa
       0.7557074698418498	lat	:	eng lat spa
       ...
       1.0451794809559942	cos	:	eng ita cos spa
       1.051048358042505	dan	:	eng nor dan deu spa
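
The coefficient in the first column can be reproduced from the edge-length formula. The sketch below is an illustration only: `edge_length` and `path_length` are hypothetical names, and the dictionary sizes in the usage part are invented, not taken from real dictionaries.

```python
import math


def edge_length(both_sides, lr, rl):
    """Edge length of one dictionary: 1 / log10(10 + DictionaryLength)."""
    dictionary_length = both_sides + 0.5 * lr + 0.5 * rl
    return 1 / math.log10(10 + dictionary_length)


def path_length(path, sizes):
    """Sum the edge lengths along a path of language codes.

    `sizes` maps frozenset({lang_a, lang_b}) to (both_sides, lr, rl).
    """
    total = 0.0
    for a, b in zip(path, path[1:]):
        both_sides, lr, rl = sizes[frozenset((a, b))]
        total += edge_length(both_sides, lr, rl)
    return total


if __name__ == "__main__":
    # Invented sizes, for illustration only.
    sizes = {
        frozenset(("eng", "cat")): (30000, 2000, 2000),
        frozenset(("cat", "spa")): (40000, 1000, 1000),
    }
    print(path_length(["eng", "cat", "spa"], sizes))
```

An empty dictionary gets length 1, and larger dictionaries get shorter edges, so paths through well-populated dictionaries rank first. Inverting the formula, the eng-spa coefficient of about 0.2208 corresponds to a DictionaryLength of about 34,000.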
    

    Step 6

    • Time : several seconds
    • Changed files : -
    • New files : <lang1>-<lang2> file

    This step creates the loading file, which contains the edges of the graph used for translation.

       python graph.py load_file <lang1> <lang2> <n=10>
    

    Example:

       python graph.py load_file eng spa
       python graph.py load_file eng spa --n 10
    

    eng-spa loading file:

       ...
       	eng	dispensary	n$n-sg	spa	ambulatorio	n$n-m
       LR	eng	dispensation	n	spa	administración	n-f$n$n-f-sg
       ...
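
Each line of the loading file describes one edge. A minimal parsing sketch follows, assuming seven tab-separated fields (direction, language, lemma and tags for each side), an empty direction field for entries valid in both directions, and `$` separating alternative tag sets. `Edge` and `parse_loading_line` are hypothetical names; the tool may read the file differently.

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    direction: str  # "" (both directions, assumed), "LR" or "RL"
    lang1: str
    lemma1: str
    tags1: List[str]
    lang2: str
    lemma2: str
    tags2: List[str]


def parse_loading_line(line):
    """Split one tab-separated line of the loading file into an Edge."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    direction, lang1, lemma1, tags1, lang2, lemma2, tags2 = fields
    return Edge(direction, lang1, lemma1, tags1.split("$"),
                lang2, lemma2, tags2.split("$"))


if __name__ == "__main__":
    print(parse_loading_line("LR\teng\tdispensation\tn\tspa\tadministración\tn-f$n$n-f-sg"))
```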
    

    Step 7

    • Time : ~ 5 min
    • Changed files : -
    • New files : <lang1>-<lang2>-preview file

    This step creates a preview file with translation coefficients. For each pair of words the coefficient is

    $$\sum_{i=1}^{N} e^{-\mathrm{len}(path_i)}$$

    where N is the number of simple paths between the two words.


       python graph.py preview eng spa
       python graph.py preview eng spa --topn 10 --cutoff 4
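
The coefficient can be sketched in a few lines of plain Python. Two assumptions are made here and are not confirmed by the source: the length of a path is counted in nodes, and the cutoff limits the number of edges. The graph and the intermediate Catalan node are invented for illustration, and `simple_paths` and `coefficient` are hypothetical names.

```python
import math


def simple_paths(graph, source, target, cutoff):
    """Yield simple paths (lists of nodes) from source to target with at most `cutoff` edges."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        for neighbour in graph.get(node, ()):
            if neighbour in path:
                continue  # a simple path never revisits a node
            new_path = path + [neighbour]
            if neighbour == target:
                yield new_path
            elif len(new_path) <= cutoff:
                stack.append((neighbour, new_path))


def coefficient(graph, source, target, cutoff=4):
    """Sum exp(-len(path)) over all simple paths, with len counted in nodes (assumption)."""
    return sum(math.exp(-len(p)) for p in simple_paths(graph, source, target, cutoff))


if __name__ == "__main__":
    # Invented directed graph: one path through a hypothetical Catalan entry.
    graph = {
        "eng:New Delhi": ["cat:Nova Delhi"],
        "cat:Nova Delhi": ["spa:Nueva Delhi"],
    }
    print(coefficient(graph, "eng:New Delhi", "spa:Nueva Delhi"))
    print(coefficient(graph, "spa:Nueva Delhi", "eng:New Delhi"))
```

Under these assumptions a single path of three nodes contributes e^-3, and a pair with no path in one direction gets 0 for that direction.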
    


    topn is the number of best translations kept for each word; cutoff is the maximum path length.

    Default 'auto' mode (from docstrings)

       "auto"
       
       If there are 10+ candidates returns those that have coefficient
       more than average. Usually there are top variants and other
       variants have very low coefficient. So it filters relevant
       candidates based on particular case coefficients
       
       If there are less than 10 candidates, adds coefficients with
       minimal coefficient to get more reliable data. And then it returns
       same top candidates.
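
One possible reading of this docstring is sketched below. It is an interpretation, not the code from func.py: `auto_filter` is a hypothetical name, "adds coefficients with minimal coefficient" is read as padding the score list with copies of the lowest score up to 10 entries, and ties with the average are dropped.

```python
def auto_filter(candidates, min_count=10):
    """Keep candidates whose score is above the (possibly padded) average.

    `candidates` is a list of (word, score) pairs.
    """
    if not candidates:
        return []
    scores = [score for _word, score in candidates]
    if len(scores) < min_count:
        # Pad with the lowest score so that a short list still has a low baseline.
        scores = scores + [min(scores)] * (min_count - len(scores))
    average = sum(scores) / len(scores)
    kept = [(word, score) for word, score in candidates if score > average]
    return sorted(kept, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print(auto_filter([("house", 1.4), ("home", 1.3), ("publisher", 0.04)]))
```

With these three candidates, the padding pulls the average down far enough that the two strong candidates pass and the weak one is filtered out.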
    


    Normal work:

       $ python graph.py preview eng spa
       2018-07-28 17:34:06,891 | INFO : Initialization (~1 min)
       100%|███████████████████████| 241407/241407 [01:45<00:00, 2296.07it/s]
       4%|█▌                     | 6065/136182 [00:03<01:12, 1806.89it/s]
    


    Here you can see a large difference in the number of words between the two languages (241407 vs. 136182). Possible explanations are greater variance in tags for English, or simply more words, because English appears in many dictionaries. Well-resourced languages also contain many names and other proper nouns.

    Preview file:

       New Delhi	np	Nueva Delhi	np	0.049787068367863944	0
       oblivion	n	olvido	n	0.4074898724158772	0.4142278194149627
    

    The first line means that there is a path from 'New Delhi' to 'Nueva Delhi', but only in this direction (the reverse coefficient is 0). The second line shows paths in both directions with good coefficients. You can edit the tags where several variants are listed (e.g. n$n-f).
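
If you want to filter or sort the preview before converting it, each line can be read as six tab-separated fields. This is a sketch under that assumption; `parse_preview_line` and the direction labels it returns are hypothetical and only used for illustration.

```python
def parse_preview_line(line):
    """Parse 'lemma1, tags1, lemma2, tags2, forward, backward' (tab-separated)."""
    lemma1, tags1, lemma2, tags2, forward, backward = line.rstrip("\n").split("\t")
    forward, backward = float(forward), float(backward)
    if forward > 0 and backward > 0:
        direction = "both"
    elif forward > 0:
        direction = "LR"  # only lang1 -> lang2
    elif backward > 0:
        direction = "RL"  # only lang2 -> lang1
    else:
        direction = "none"
    return {
        "lemma1": lemma1, "tags1": tags1,
        "lemma2": lemma2, "tags2": tags2,
        "forward": forward, "backward": backward,
        "direction": direction,
    }


if __name__ == "__main__":
    print(parse_preview_line("oblivion\tn\tolvido\tn\t0.4074898724158772\t0.4142278194149627"))
```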

    Step 8

    • Time : ~ several seconds
    • Changed files : -
    • New files : <lang1>-<lang2>-new file

    Convert preview file into dix format (one section):

       python graph.py convert <lang1> <lang2>
       python graph.py convert eng spa
    


    This creates the resulting section for the bilingual dictionary (example with the 2 words above):
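
As a rough illustration of the target format, and not a reproduction of the tool's actual output, a preview line could be turned into a dix entry along the following lines. The sketch uses the usual Apertium bilingual dictionary elements (e, p, l, r, s, b) and assumes that a tag string such as n-f maps to one s element per hyphen-separated part; `make_entry` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def _fill_side(element, lemma, tags):
    """Write the lemma (spaces become <b/>) and its tags (<s n="..."/>) into one side."""
    words = lemma.split(" ")
    element.text = words[0]
    for word in words[1:]:
        blank = ET.SubElement(element, "b")
        blank.tail = word
    for tag in tags.split("-"):
        ET.SubElement(element, "s", {"n": tag})


def make_entry(lemma1, tags1, lemma2, tags2, restriction=None):
    """Build an <e> element; `restriction` may be "LR" or "RL" for one-directional entries."""
    entry = ET.Element("e")
    if restriction:
        entry.set("r", restriction)
    pair = ET.SubElement(entry, "p")
    _fill_side(ET.SubElement(pair, "l"), lemma1, tags1)
    _fill_side(ET.SubElement(pair, "r"), lemma2, tags2)
    return entry


if __name__ == "__main__":
    print(ET.tostring(make_entry("New Delhi", "np", "Nueva Delhi", "np", "LR"), encoding="unicode"))
    print(ET.tostring(make_entry("oblivion", "n", "olvido", "n"), encoding="unicode"))
```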




    Additional functions:

    Merging dialects

    • Time : ~ several seconds
    • Changed files : -
    • New files : <LANG1>-<LANG2> file

    If you used dialect splitting you can merge these languages:

       python graph.py merge --lang1 spa --lang2 cat cat_val_uni
    


    This creates a file with several sections that carry the specified dialect tags.


    Evaluation

    • Time : 4 min for each iteration
    • Changed files : config file (new), loading file (new)
    • New files : -

    Evaluation (precision, recall, f1):

       python graph.py eval <lang1> <lang2> <n=10> <cutoff=4> <n_iter=3> <topn=None (auto)>
    
    • n : number of best languages to use
    • cutoff : maximum path length (4 recommended)
    • n_iter : number of evaluation iterations (words are sampled randomly, so results differ slightly between runs)
    • topn : how many top results for each word count as correct (default=None: all results returned in auto mode; 1: only the best match counts as correct, other suggested variants do not)

    Examples (with default parameters, then with explicit parameters):

       python graph.py eval eng spa
       python graph.py eval eng spa --n 10 --cutoff 4 --n_iter  10 --topn 1
    


    Normal work:

       $ python graph.py eval eng spa
       2018-07-28 20:09:08,797 | INFO : Start ~ 20 s
       2018-07-28 20:09:28,174 | INFO : Initialization 1 ~ 1 min
       100%|██████████████████████████████████████| 1000/1000 [02:49<00:00,  5.91it/s]
       N=1000
       Precision : 0.9811912225705329, recall : 0.939, f1-score : 0.9596320899335717
       ...
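
The three figures in the log are related by the standard definitions. The sketch below assumes precision is computed over the words for which a translation was returned and recall over all N test words; the counts 939 and 957 are inferred from the reported numbers and are not printed by the tool, and `precision_recall_f1` is a hypothetical name.

```python
def precision_recall_f1(correct, answered, total):
    """Return (precision, recall, f1) from raw counts."""
    precision = correct / answered if answered else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Counts inferred from the log above: N=1000, recall 0.939, precision ~0.9812.
    print(precision_recall_f1(correct=939, answered=957, total=1000))
```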
    


    Addition

    • Time : 4 min for each iteration
    • Changed files : config file (new), loading file (new)
    • New files : -

    Check how many new entries can be added (counted separately for each direction of the language pair):

       python graph.py add <lang1> <lang2> <n=10> <cutoff=4>
    


       python graph.py add eng spa
       python graph.py add eng spa --n 10 --cutoff 4
    


    Normal work:

       $ python graph.py add eng spa
       2018-07-29 10:22:13,878 | INFO : Initialization ~ 1 min
       100%|████████████████████████████████| 241407/241407 [00:24<00:00, 9662.90it/s]
       eng->spa    Exist: 29483, failed: 19083, NEW: 15077 +51.0%, NA: 177764
       100%|████████████████████████████████| 136182/136182 [00:29<00:00, 4682.65it/s]
       spa->eng    Exist: 30342, failed: 63086, NEW: 19197 +63.0%, NA: 23557
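
The numbers in this log can be cross-checked: in both directions the four categories add up to the total shown in the progress bar, and the percentage matches NEW divided by Exist. The sketch below only verifies that arithmetic. `summarize` is a hypothetical helper, and the interpretation of the percentage is inferred from the figures rather than documented.

```python
def summarize(exist, failed, new, na):
    """Return (total number of words, gain in percent relative to existing entries)."""
    total = exist + failed + new + na
    gain = round(100 * new / exist) if exist else 0
    return total, gain


if __name__ == "__main__":
    print(summarize(29483, 19083, 15077, 177764))  # eng->spa figures from the log
    print(summarize(30342, 63086, 19197, 23557))   # spa->eng figures from the log
```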
    


    Example lemma search

    • Time : 1 min + some time (seconds) for translation
    • Changed files : -
    • New files : some output file or no files (if stdout)

    Arguments:

    • lang1, lang2 - languages
    • --config - start by creating the configuration file
    • --load - start by creating the loading file from an existing configuration file
    • --cutoff - maximum path length
    • --topn - topn parameter (int, or None for 'auto' mode, which is the default)
    • --n - number of top languages to use when creating the loading file
    • --input - file with words, one per line or separated by spaces
    • --output - output file, default=sys.stdout

    Example:

       python graph.py example eng spa --lang spa --input input.txt --output output.txt
    


    input.txt

       casa
       sangre
       frutilla
    


    Command line:

       $ python graph.py example eng spa --lang spa --input input.txt --output output.txt
       2018-07-30 11:36:13,198 | INFO : Initialization ~1 min
       2018-07-30 11:37:12,922 | INFO : Translating
       100%|████████████████████████████████████████████| 3/3 [00:00<00:00,  7.08it/s]
    


    output.txt

       Lemma: casa
           spa$casa$[n-f-ND]
       eng$home$[n_n-sg]	0.08379690677106663
       eng$house$[n-ND]	0.049787068367863944
       eng$house$[n_n-sg]	0.0404276819945128
       eng$publisher$[n_n-ND]	0.03852947988599058
       
           spa$casa$[n-f_n_n-f-sg]
       eng$house$[n_n-sg]	1.438577234023077
       eng$home$[n_n-sg]	1.330046844771966
       
       ---------------------------------------------
       Lemma: sangre
           spa$sangre$[n-f_n_n-f-sg]
       eng$blood$[n_n-unc]	1.109995929768636
       
           spa$sangre$[n-f-ND]
       eng$blood$[n_n-unc]	0.052005373884161515
       eng$blood$[n-ND]	0.049787068367863944
       
       ---------------------------------------------
       Lemma: frutilla
       ---------------------------------------------
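
To post-process output.txt, the candidate lines can be split into their parts. This sketch assumes the format lang$lemma$[tags], optionally followed by a tab and a coefficient, with `_` separating alternative tag sets inside the brackets (by analogy with `$` in the loading file). `parse_candidate` is a hypothetical helper.

```python
def parse_candidate(line):
    """Parse 'lang$lemma$[tags]' with an optional tab-separated coefficient."""
    node, _sep, score = line.strip().partition("\t")
    lang, lemma, tags = node.split("$", 2)
    variants = tags.strip("[]").split("_")
    return lang, lemma, variants, float(score) if score else None


if __name__ == "__main__":
    print(parse_candidate("eng$home$[n_n-sg]\t0.08379690677106663"))
    print(parse_candidate("    spa$casa$[n-f-ND]"))
```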
    


    If only the input file is specified, the content shown above is printed to stdout instead of being written to an output file.