User:Shash42/GSoC 2020 Proposal: Bilingual Dictionary Discovery

From Apertium
Jump to navigation Jump to search

Google Summer of Code 2020: Proposal

Bilingual Dictionary Generation via Graph Exploration

Contact Details

About Me

I am an undergraduate research student pursuing Computer Science at the International Institute of Information Technology, Hyderabad. I have been involved in CS and Linguistics through olympiads for the past 4 years.


I am interested in Natural Language Processing, Deep Learning, and Applied Math. I am fascinated by computational solutions where they're least expected. My hobbies are competitive programming, math (problem-solving of any sort really), writing, playing chess and keeping in touch with developments in space exploration.

Why am I interested in Apertium?

Growing up in India, I realized that the very linguistic and cultural diversity that leads to unique perspectives acts as a barrier to inclusion. Technological solutions do not attract large sections of society as they just can’t be communicated to them. Seamless translation systems and the inclusion of regional languages is, therefore, the need of the hour. However, popular deep learning based solutions for machine translation cannot fulfill the task of preserving linguistic diversity, simply because endangered languages don’t offer sufficient data. Apertium is much more suited for this goal. It is inherently targeted at low-resource, similar yet mutually unintelligible language-pairs. It is a dream organization for me because I get to contribute to a cause that is close to my heart.

Besides this, the project that I wish to work on is ideally suited for my interests. It allows me to work on an open research problem, having immense scope for creativity, optimization, and experimentation. It suitably lies at the intersection of two of my favorite topics, graph theory, and semantics. It goes without saying that I find it a perfect fit!


Which task am I interested in? What do I plan to do?

I am interested in implementing Bilingual Dictionary Discovery via Graph Completion for Apertium. At the time of writing this proposal, Apertium has 54 language pairs in the Trunk. These spread across 43 languages. This clearly shows that among these 43 languages, Apertium cannot translate between most randomly chosen language-pairs.

A crucial step in developing a language pair is writing its bilingual dictionary, which in essence maps a lemma X in language A to a lemma Y in language B if X and Y have the same meaning and accompanying lexical analysis. It preserves morphological transfer rules between surface forms of the lemma.

Apertium’s existing dataset has appreciable amounts of data, which can be visualized in a graph-like structure, previously done in the Apertium RDF project (Jorge Gracia et al., 2016). In essence, each <lexical entry, language> pair is visualized as a ‘vertex’ on a graph, with an ‘edge’ between vertices that exist as a relation in some bilingual dictionary on Apertium. By exploring this graph, translations can be inferred that do not already exist on the bilingual dictionaries in Apertium. In particular, bilingual dictionaries can be generated for any two languages that do not have an existing language pair but exist on Apertium, albeit paired with other languages.

It is notable that polysemy diminishes the use of a simple transitive relation. This means that if vertex A-B and B-C are connected, it does not mean that A can be translated to C in all cases. Apertium’s existing “Crossdics” tool follows this model. For example, consider the English word ‘wrist’. It translates to the Spanish word ‘muñeca’, which in turn translates into French ‘poupée’. It turns out, muñeca is polysemous, or that it takes two semantic meanings, the English words ‘wrist’ and ‘doll’. Poupée does actually mean ‘doll’, but a transitive relation would map it with English ‘wrist’. Moreover, Crossdics is unable to infer translations from a chain of length greater than 3, i.e. if A-B, B-C, and C-D exist as pairs in Apertium, Crossdics will be unable to make predictions about A-D directly. It is possible to first generate the dictionary of pair A-C using B as an intermediate, and then proceeding to use C as an intermediate to A-D. However, the loss is naturally compounded in this case. Crossdics does allow users to write more complex rules on how to cross words from one bilingual dictionary to another, but this exhaustive manual listing can be avoided. In fact, language pair developers tend to prefer using other heuristics over crossdics to form a baseline dictionary.

My proposal will build upon the cycle density algorithm (M. Villegas et al., 2016) that defines confidence metrics in inferred edges based on cycle density. The underlying concept is that while linear chains might not be transitive, a cycle in the graph increases the confidence that vertices on the cycle are valid translation pairs. Currently, experimental results are available at ApertiumRDF-Github for the En-Es pair and an experiment was also conducted with French.

I will be implementing ideas discussed in this paper for the Apertium platform. My goal is to build a general algorithm capable of generating bilingual dictionaries for any user-specified pair. This would significantly scale the current existing work on this problem. I also plan to build upon the ideas proposed in the paper, by performing experiments of my own. A practical problem is that there can be wrong entries in the Apertium data, and the algorithm must be robust to handle these. In particular, as cycle density, simply the ratio between the number of edges and vertices on the cycle goes up, the confidence score also increases. By keeping a certain threshold confidence score, translations with a higher score than the threshold have to be picked. Computing cycle density (confidence scores) can be time-heavy and optimized methods have to be designed and implemented. The values of these metrics can only be evaluated after in-depth experimentation with different language pairs on Apertium.

Further layers need to be added to the procedure proposed in the paper.

  • The input layer has to be finalized. RDF graphs have not been updated regularly in the past and it will be useful to directly use bilingual dictionaries from the Trunk. RDF graphs also do not have the lemma analysis that bilingual dictionaries have.
  • For a given language pair, my code must decide which subgraph of the Apertium data is relevant. The paper uses the “context” as translations at a distance of up to 3. However, this is an experimental bound and will probably be different for Apertium data.
  • It then needs to compute cycles in this chosen graph, pruning cycles with specific properties. The pruning properties will include ones mentioned in the paper, such as “Cycles starting and ending in word W of language L may not contain other words from L.” as well as ones that improve performance during my experiments.
  • Since the chosen graphs will often be large, significant optimization will be necessary to ensure computation doesn’t take too long. There would also have to be a limitation on the length of the cycles (number of vertices). There is a scope for optimizing the cycle selection, based on precomputation techniques such as biconnected components. Biconnected components are maximal subgraphs that have no cut vertices. Cut vertices are those which when removed, disconnect the rest of the graph. It is notable that all vertices in a cycle belong to the same biconnected component, and thus each such subgraph can be used independently. It is also notable that the algorithm chooses the cycle with the highest density. It is possible that this property in itself creates scope for optimization by rejecting certain cycles without exploring them. I will analyze this theory problem and design an algorithm that gives the best practical results for the Apertium data.
  • Experiments will have to be done for generalizing the metrics suggested in the paper. Different amounts of data are available for different languages. The threshold confidence score used for the En-Es pair in the paper will not necessarily hold for other pairs. Based on the graph structure, a general formula may need to be arrived upon that gives good results.
  • The final output has to be generated for the different use cases in their respective required formats.

This will complete a new tool fully integrated with Apertium. Currently, no one has undertaken this as a GSoC project in the past. I will take special care to document every step of my work, including the required research, experiments, and code, so that future contributors find it easier to improve upon it.

Work Plan

Due to the delay in the GSoC timeline, the new semester in college will begin during the last few weeks. Hence I plan to invest 40+ hours during my summer break (till August 1) and complete the project requirements by then. I will also start work in the Community Bonding period itself to cover up. In August, I will work 20+ hours a week to explore extensions and features, apart from working on bug fixes and feedback.

I will be working using an inverted methodology, writing the harder parts of the project in the first month itself. This will ensure that any carry-overs can comfortably be finished and I have time to work out any significant problems. The first month will, therefore, focus on the tool itself. The second month will then be focused on making it specific to Apertium, and smoothening the process for language pair developers. At the end of 8 weeks, I plan to have my project ready for use. In the last month, I will look at deeper details and explore other opportunities that the Bilingual Dictionary graph provides.

Every 4th week also has some extra time allotted for unforeseen complications, to ensure the timeline can be followed without fail.

Phase 1

Community Bonding Period

(From May 4 - To May 31)

  • Understand better how bilingual dictionaries are written
  • Understand how bilingual dictionaries can be parsed by the program
  • Read more papers on optimized cycle finding algorithms
  • Design tests, experiments and evaluation procedures
  • Work with the community to understand practical nuances of compiling bilingual dictionaries and their properties
  • Write pseudocode/blueprint of my algorithm
Week 1

(From June 1 - To June 7)

  • Implement an unoptimized version of the algorithm
  • Perform the same tests as the reference research paper for a sanity check
  • Analyze data and classify problems
  • List ways to improve accuracy
  • Finalize list of performance (speed) optimizations
Week 2

(From June 8 - To June 14)

  • Start implementing accuracy and performance optimizations
  • Experiment with the same data to compare results
  • Analyze and classify points of failure, Experiment, Change, Repeat
Week 3

(From June 15 - To June 21)

  • Complete all optimizations to the code
  • Experiment on metrics such as confidence score threshold and decide accordingly
  • Compare performance across different datasets
  • Fix all points of failure
Week 4

(From June 22 - To June 28)

  • Tests for robustness:
    • Use incomplete/unchecked dictionaries in the Apertium Nursery/Staging
    • Feeding data with self-planted errors
    • Analyze these tests and perform required fixes
  • Compile Results for Deliverable 1
    • Read background information of Phase 2 of the Project
    • Extra time for any unexpected complications in the process

Deliverable #1

A practical bilingual dictionary prediction system that uses RDF Graph Input and gives word mapping as output

Phase 2

Week 5

(From June 29 - To July 4)

  • Shift from RDF Graphs to Apertium Bilingual Dictionaries
  • Understand how to preserve morphological information
  • Make the requisite changes in the code
  • Test by generating a new pair
  • Analyze points of failures
Week 6

(From July 5 - To July 11)

  • Perform tests for robustness as mentioned in Week 4
  • Fix all the points of failures
  • Incorporate different scripts and generalize to any language pair that could end up on the Apertium dataset
Week 7

(From July 12 - To July 18)

  • Create a smooth input/output experience for users (language pair developers)
  • Get it tested from different language pair developers
  • Incorporate feedback, fix bugs, improve on any practical problems that arise
Week 8

(From July 19 - To July 25)

  • Integrate with Apertium
  • Go through documentation and ensure it’s readable, write additional resources for future maintainers
  • Extra time for handling any unexpected complications in the process

Deliverable #2

A fully functioning bilingual dictionary generation via graph exploration tool is available for use for language pair developers

Phase 3

Week 9

(From July 26 - To August 1)

[Extension 1]:

  • If past robustness tests show sufficient success, explore if staging/nursery bidix can be used as intermediates.
  • If so, classify which.
  • Explore allowing the tool to automatically select intermediates given only the Source Language and Target Language.
Week 10

(From August 2 - To August 8)

[Extension 2]:

  • A major benefit of Bilingual Dictionaries over tools like Crossdics is the possibility that changes to a few entries do not necessitate re-generation of the whole dictionary. Utilize this feature by providing a different mode of input, where only a few additions have to be made to an existing bilingual dictionary, and predictions are made for other pairs accordingly.
Week 11

(From August 9 - To August 15)

  • Work on the bug fixes, feedback and further performance improvements of the above 2 features.
  • More interesting feature ideas will come up during the summer, implement them.
Week 12

(From August 16 - To August 22)

  • Extra time for bug fixes and feedback for extensions.
  • Complete documentation of the extensions.
  • Document future opportunities that can be explored.
  • Extra time for unforeseen complications.
Week 13

(From August 23 - To August 29)

  • Write the Evaluation Report
  • Extra time for unforeseen complications.

Deliverable #3

An advanced Bilingual Dictionary Generation tool extended for different use cases. Extensive Documentation and Evaluation Report. Project Completed

How and who will benefit from it in society?

Use Cases:

  • Often, contributors to Apertium set out to create a language pair X-Y such that both X and Y already exist on Apertium, albeit paired with different languages. In such cases, monolingual data exists and bilingual dictionaries and transfer rules are what need to be added. Using the recommended translations suggested by my system as a baseline, the contributor can then validate the relations to check for precision. Relations that were missed will have to be added by the contributor to improve recall. This is of course much easier than generating the bilingual dictionary from scratch.
  • The Apertium bilingual dictionaries are constantly changing, and as vocabulary gets added to one language pair, more information can be inferred about other language pairs. New words are coined in the modern age, and once they occur in one language pair. Interpreting Apertium Bilingual Dictionaries as a graph structure can make it easier to automatically propagate these changes to other language pairs. This will require the algorithm to be run periodically so that new suggestions across the different language pairs can be made and validated.
  • Dictionaries are useful for not just Apertium, but for academic linguists and users in general. The ability to use linked data to make such predictions is an important research problem that has a tangible benefit.
  • By making language pair development easier, it directly helps in Apertium’s mission to enable translation between more and more low-resource languages. This is of special use in countries like India where hundreds of low-resource languages are used actively. Many organizations are building mobile/web apps targeted at this audience, and often the interfaces of such apps require one-word buttons. Just making dictionaries is enough to assist initiatives like these.

Why should Apertium and Google sponsor it?

I feel the project has a wide scope, helping across languages and making work easier for all Apertium developers. While apertium currently relies on bilingual dictionaries, the multilingual graph incorporated structures the data to open a lot of future possibilities.

The sponsorship will enable me to work full-time on this problem and put my best effort.


A general overview of my skills can be found in my CV.

I have been engaged in linguistics and programming since High School. In 2016, I became one of the youngest students to represent India at the International Olympiad of Linguistics (IOL). I subsequently won a silver medal at the national level and an Honorable Mention at IOL 2019 in Yongin, South Korea. During this journey, I have self-learned a lot about linguistics. My algorithm/data structures journey too revolved around an olympiad, i.e. the International Olympiad of Informatics. In 2019, I ranked 6th in India at the national team selections.

I have performed well at the ACM ICPC Regionals this year. I also enjoyed Google HashCode, an optimization challenge, much like the one I plan to take up in the summer. At some point, I decided to combine my background and explore NLP. I have done some projects, including a research project for which I was selected to represent India at ISEF 2019.

The problem I have chosen requires a strong background in algorithmic thinking, problem solving and linguistics. I believe my experience equips me to achieve my goals for the summer.

Work done on this problem

I have gone through the research papers provided to me by my mentors and compiled notes. I have discussed my observations with them. Some of these are already mentioned in the Proposal section. I have thought about the scope for optimization and read some papers for the same. Most of this has been included in the above proposal.

I have completed the coding challenge prescribed to me by my mentor Mikel Forcada (mlforcada). It is available at

In the coding challenge, I have used formatted RDF Graph inputs from ApertiumRDF-Github. The input consists of “<word><POS><language>” representing each lemma and each line representing a translation between two lemmas, which is an edge. The current algorithm is quite rudimentary, and just a proof of concept. However, it can be seen that it performs well on the example discussed above, wherein it does not translate doll-en to poupée-fr, unlike Apertium Crossdics. Moreover, the ability to infer using data from multiple languages is shown. The data contains files with up to 13 languages at once.

Currently, the finer details such as node-degree based correction of polysemy from the Marta Villegas paper mentioned above have not been implemented. However, some of the code optimizations that I have already done are:

  • Using a stack and backtracking to optimize finding cycles in the Depth First Search.
  • Including the accuracy optimization that no 2 nodes should be from the same language.
  • Using C++ Sets and Maps instead of adding another loop layer like the Python code published by the authors.

The output file displays results for each word. Newly inferred translations and the ‘context’ of the word are presented in the output, along with corresponding confidence scores. It can be seen in files that new translations to multiple languages for each word have been found, for which the language pair did not exist.

The coding challenge prototype performs poorly at files with high edge-node ratio (boy) or low amounts of data (veterinarian/veterinario). For boy.txt, the precision is too low. Due to the high density of the graph, the density threshold is too low. A generalized density threshold will have to be experimented upon during GSoC. Similarly, the threshold is too high for veterinarian.txt. Note that Crossdics would not have performed any better at such translations, even if it could be somehow extended to handle multiple language data at once. Such an extension would come at a natural loss of accuracy too.

Completely out of interest, I have also started reading about different techniques for the same problem like ‘CQC - Cycles and Quasi Cycles’ and some based on Markov Models. This might lead to some inspiration that can be incorporated in the project. In a much broader attempt, I am also reading about linear algebraic views of graphs such as cycle-space theory. I plan to cover prerequisites on these before the GSoC period so that if there is scope to utilize my learnings, I can do it from Day 1 itself!


If I am selected, it will be my only commitment for the summer. As mentioned in the work plan, my college has a summer break from May 1 - July 31, so I will complete most of my work in this period. I will devote 40+ hours a week during this time. In August, I will be devoting 20+ hours a week as my college classes would have begun. There might be slight changes to this due to COVID-19, but I will cover up any time lost at the earliest.


Jorge Gracia, Marta Villegas, Asunción Gómez-Pérez and Núria Bel. The Apertium Bilingual Dictionaries on the Web of Data. Semantic Web-Journal, 2016.

Marta Villegas, Maite Melero, Nuria Bel, and Jorge Gracia. Leveraging RDF Graphs for Crossing Multiple Bilingual Dictionaries. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, 2016. European Language Resources Association (ELRA).

T. Flati and R. Navigli. "The CQC algorithm: Cycling in graphs to semantically enrich and enhance a bilingual dictionary." Proceedings of the Twenty-Third international joint conference on Artificial Intelligence. AAAI Press, 2013.