User:Shash42/GSoC 2020 Proposal: Bilingual Dictionary Discovery

From Apertium
Jump to navigation Jump to search

Google Summer of Code 2020: Proposal

Bilingual Dictionary Generation via Graph Exploration

Contact Details

About Me

I am an undergraduate research student pursuing Computer Science at the International Institute of Information Technology, Hyderabad. I have been involved in CS and Linguistics through olympiads for the past 4 years.


I am interested in Natural Language Processing, Deep Learning, and Applied Math. I am fascinated by computational solutions where they're least expected. My hobbies are competitive programming, math (problem-solving of any sort really), writing, playing chess and keeping in touch with developments in space exploration.

Why am I interested in Apertium?

Growing up in India, I realized that the very linguistic and cultural diversity that leads to unique perspectives acts as a barrier to inclusion. Technological solutions do not attract large sections of society as they just can’t be communicated to them. Seamless translation systems and the inclusion of regional languages is, therefore, the need of the hour. However, popular deep learning based solutions for machine translation cannot fulfill the task of preserving linguistic diversity, simply because endangered languages don’t offer sufficient data. Apertium is much more suited for this goal. It is inherently targeted at low-resource, similar yet mutually unintelligible language-pairs. It is a dream organization for me because I get to contribute to a cause that is close to my heart.

Besides this, the project that I wish to work on is ideally suited for my interests. It allows me to work on an open research problem, having immense scope for creativity, optimization, and experimentation. It suitably lies at the intersection of two of my favorite topics, graph theory, and semantics. It goes without saying that I find it a perfect fit!


Which task am I interested in? What do I plan to do?

I am interested in implementing Bilingual Dictionary Discovery via Graph Completion for Apertium. At the time of writing this proposal, Apertium has 54 language pairs in the Trunk. These spread across 43 languages. This clearly shows that among these 43 languages, Apertium cannot translate between most randomly chosen language-pairs.

A crucial step in developing a language pair is writing its bilingual dictionary, which in essence maps a lemma X in language A to a lemma Y in language B if X and Y have the same meaning and accompanying lexical analysis. It preserves morphological transfer rules between surface forms of the lemma.

Apertium’s existing dataset has appreciable amounts of data, which can be visualized in a graph-like structure, previously done in the Apertium RDF project (Jorge Gracia et al., 2016). In essence, each <lexical entry, language> pair is visualized as a ‘vertex’ on a graph, with an ‘edge’ between vertices that exist as a relation in some bilingual dictionary on Apertium. By exploring this graph, translations can be inferred that do not already exist on the bilingual dictionaries in Apertium. In particular, bilingual dictionaries can be generated for any two languages that do not have an existing language pair but exist on Apertium, albeit paired with other languages.

It is notable that polysemy diminishes the use of a simple transitive relation. This means that if vertex A-B and B-C are connected, it does not mean that A can be translated to C in all cases. Apertium’s existing “Crossdics” tool follows this model. For example, consider the English word ‘wrist’. It translates to the Spanish word ‘muñeca’, which in turn translates into French ‘poupée’. It turns out, muñeca is polysemous, or that it takes two semantic meanings, the English words ‘wrist’ and ‘doll’. Poupée does actually mean ‘doll’, but a transitive relation would map it with English ‘wrist’. Moreover, Crossdics is unable to infer translations from a chain of length greater than 3, i.e. if A-B, B-C, and C-D exist as pairs in Apertium, Crossdics will be unable to make predictions about A-D directly. It is possible to first generate the dictionary of pair A-C using B as an intermediate, and then proceeding to use C as an intermediate to A-D. However, the loss is naturally compounded in this case. Crossdics does allow users to write more complex rules on how to cross words from one bilingual dictionary to another, but this exhaustive manual listing can be avoided. In fact, language pair developers tend to prefer using other heuristics over crossdics to form a baseline dictionary.

My proposal will build upon the cycle density algorithm (M. Villegas et al., 2016) that defines confidence metrics in inferred edges based on cycle density. The underlying concept is that while linear chains might not be transitive, a cycle in the graph increases the confidence that vertices on the cycle are valid translation pairs. Currently, experimental results are available at ApertiumRDF-Github for the En-Es pair and an experiment was also conducted with French.

I will be implementing ideas discussed in this paper for the Apertium platform. My goal is to build a general algorithm capable of generating bilingual dictionaries for any user-specified pair. This would significantly scale the current existing work on this problem. I also plan to build upon the ideas proposed in the paper, by performing experiments of my own. A practical problem is that there can be wrong entries in the Apertium data, and the algorithm must be robust to handle these. In particular, as cycle density, simply the ratio between the number of edges and vertices on the cycle goes up, the confidence score also increases. By keeping a certain threshold confidence score, translations with a higher score than the threshold have to be picked. Computing cycle density (confidence scores) can be time-heavy and optimized methods have to be designed and implemented. The values of these metrics can only be evaluated after in-depth experimentation with different language pairs on Apertium.

Further layers need to be added to the procedure proposed in the paper.

  • The input layer has to be finalized. RDF graphs have not been updated regularly in the past and it will be useful to directly use bilingual dictionaries from the Trunk. RDF graphs also do not have the lemma analysis that bilingual dictionaries have.
  • For a given language pair, my code must decide which subgraph of the Apertium data is relevant. The paper uses the “context” as translations at a distance of up to 3. However, this is an experimental bound and will probably be different for Apertium data.
  • It then needs to compute cycles in this chosen graph, pruning cycles with specific properties. The pruning properties will include ones mentioned in the paper, such as “Cycles starting and ending in word W of language L may not contain other words from L.” as well as ones that improve performance during my experiments.
  • Since the chosen graphs will often be large, significant optimization will be necessary to ensure computation doesn’t take too long. There would also have to be a limitation on the length of the cycles (number of vertices). There is a scope for optimizing the cycle selection, based on precomputation techniques such as biconnected components. Biconnected components are maximal subgraphs that have no cut vertices. Cut vertices are those which when removed, disconnect the rest of the graph. It is notable that all vertices in a cycle belong to the same biconnected component, and thus each such subgraph can be used independently. It is also notable that the algorithm chooses the cycle with the highest density. It is possible that this property in itself creates scope for optimization by rejecting certain cycles without exploring them. I will analyze this theory problem and design an algorithm that gives the best practical results for the Apertium data.
  • Experiments will have to be done for generalizing the metrics suggested in the paper. Different amounts of data are available for different languages. The threshold confidence score used for the En-Es pair in the paper will not necessarily hold for other pairs. Based on the graph structure, a general formula may need to be arrived upon that gives good results.
  • The final output has to be generated for the different use cases in their respective required formats.

This will complete a new tool fully integrated with Apertium. Currently, no one has undertaken this as a GSoC project in the past. I will take special care to document every step of my work, including the required research, experiments, and code, so that future contributors find it easier to improve upon it.

Work Plan

Due to the delay in the GSoC timeline, the new semester in college will begin during the last few weeks. Hence I plan to invest 40+ hours during my summer break (till August 1) and complete the project requirements by then. I will also start work in the Community Bonding period itself to cover up. In August, I will work 20+ hours a week to explore extensions and features, apart from working on bug fixes and feedback.

I will be working using an inverted methodology, writing the harder parts of the project in the first month itself. This will ensure that any carry-overs can comfortably be finished and I have time to work out any significant problems. The first month will, therefore, focus on the tool itself. The second month will then be focused on making it specific to Apertium, and smoothening the process for language pair developers. At the end of 8 weeks, I plan to have my project ready for use. In the last month, I will look at deeper details and explore other opportunities that the Bilingual Dictionary graph provides.

Every 4th week also has some extra time allotted for unforeseen complications, to ensure the timeline can be followed without fail.

Phase 1

Community Bonding Period

(From May 4 - To May 31)

  • Understand better how bilingual dictionaries are written
  • Understand how bilingual dictionaries can be parsed by the program
  • Read more papers on optimized cycle finding algorithms
  • Design tests, experiments and evaluation procedures
  • Work with the community to understand practical nuances of compiling bilingual dictionaries and their properties
  • Write pseudocode/blueprint of my algorithm
Week 1

(From July 1 - To July 7)

  • Implement an unoptimized version of the algorithm
  • Perform the same tests as the reference research paper for a sanity check
  • Analyze data and classify problems
  • List ways to improve accuracy
  • Finalize list of performance (speed) optimizations
Week 2

(From July 8 - To July 14)

  • Start implementing accuracy and performance optimizations
  • Experiment with the same data to compare results
  • Analyze and classify points of failure, Experiment, Change, Repeat
Week 3

(From July 15 - To July 21)

  • Complete all optimizations to the code
  • Experiment on metrics such as confidence score threshold and decide accordingly
  • Compare performance across different datasets
  • Fix all points of failure
Week 4

(From July 22 - To July 28)

  • Tests for robustness:
    • Use incomplete/unchecked dictionaries in the Apertium Nursery/Staging
    • Feeding data with self-planted errors
    • Analyze these tests and perform required fixes
  • Compile Results for Deliverable 1
    • Read background information of Phase 2 of the Project
    • Extra time for any unexpected complications in the process

Deliverable #1Italic text

A practical bilingual dictionary prediction system that uses RDF Graph Input and gives word mapping as output