User:Shash42/GSoC 2020 Proposal: Bilingual Dictionary Discovery

From Apertium
Jump to navigation Jump to search

Google Summer of Code 2020: Proposal

Bilingual Dictionary Generation via Graph Exploration

Contact Details

About Me

I am an undergraduate research student pursuing Computer Science at the International Institute of Information Technology, Hyderabad. I have been involved in CS and Linguistics through olympiads for the past 4 years.


I am interested in Natural Language Processing, Deep Learning, and Applied Math. I am fascinated by computational solutions where they're least expected. My hobbies are competitive programming, math (problem-solving of any sort really), writing, playing chess and keeping in touch with developments in space exploration.

Why am I interested in Apertium?

Growing up in India, I realized that the very linguistic and cultural diversity that leads to unique perspectives acts as a barrier to inclusion. Technological solutions do not attract large sections of society as they just can’t be communicated to them. Seamless translation systems and the inclusion of regional languages is, therefore, the need of the hour. However, popular deep learning based solutions for machine translation cannot fulfill the task of preserving linguistic diversity, simply because endangered languages don’t offer sufficient data. Apertium is much more suited for this goal. It is inherently targeted at low-resource, similar yet mutually unintelligible language-pairs. It is a dream organization for me because I get to contribute to a cause that is close to my heart.

Besides this, the project that I wish to work on is ideally suited for my interests. It allows me to work on an open research problem, having immense scope for creativity, optimization, and experimentation. It suitably lies at the intersection of two of my favorite topics, graph theory, and semantics. It goes without saying that I find it a perfect fit!


Which task am I interested in? What do I plan to do?

I am interested in implementing Bilingual Dictionary Discovery via Graph Completion for Apertium. At the time of writing this proposal, Apertium has 54 language pairs in the Trunk. These spread across 43 languages. This clearly shows that among these 43 languages, Apertium cannot translate between most randomly chosen language-pairs.

A crucial step in developing a language pair is writing its bilingual dictionary, which in essence maps a lemma X in language A to a lemma Y in language B if X and Y have the same meaning and accompanying lexical analysis. It preserves morphological transfer rules between surface forms of the lemma.

Apertium’s existing dataset has appreciable amounts of data, which can be visualized in a graph-like structure, previously done in the Apertium RDF project (Jorge Gracia et al., 2016). In essence, each <lexical entry, language> pair is visualized as a ‘vertex’ on a graph, with an ‘edge’ between vertices that exist as a relation in some bilingual dictionary on Apertium. By exploring this graph, translations can be inferred that do not already exist on the bilingual dictionaries in Apertium. In particular, bilingual dictionaries can be generated for any two languages that do not have an existing language pair but exist on Apertium, albeit paired with other languages.

It is notable that polysemy diminishes the use of a simple transitive relation. This means that if vertex A-B and B-C are connected, it does not mean that A can be translated to C in all cases. Apertium’s existing “Crossdics” tool follows this model. For example, consider the English word ‘wrist’. It translates to the Spanish word ‘muñeca’, which in turn translates into French ‘poupée’. It turns out, muñeca is polysemous, or that it takes two semantic meanings, the English words ‘wrist’ and ‘doll’. Poupée does actually mean ‘doll’, but a transitive relation would map it with English ‘wrist’. Moreover, Crossdics is unable to infer translations from a chain of length greater than 3, i.e. if A-B, B-C, and C-D exist as pairs in Apertium, Crossdics will be unable to make predictions about A-D directly. It is possible to first generate the dictionary of pair A-C using B as an intermediate, and then proceeding to use C as an intermediate to A-D. However, the loss is naturally compounded in this case. Crossdics does allow users to write more complex rules on how to cross words from one bilingual dictionary to another, but this exhaustive manual listing can be avoided. In fact, language pair developers tend to prefer using other heuristics over crossdics to form a baseline dictionary.

My proposal will build upon the cycle density algorithm (M. Villegas et al., 2016) that defines confidence metrics in inferred edges based on cycle density. The underlying concept is that while linear chains might not be transitive, a cycle in the graph increases the confidence that vertices on the cycle are valid translation pairs. Currently, experimental results are available at ApertiumRDF-Github for the En-Es pair and an experiment was also conducted with French.

I will be implementing ideas discussed in this paper for the Apertium platform. My goal is to build a general algorithm capable of generating bilingual dictionaries for any user-specified pair. This would significantly scale the current existing work on this problem. I also plan to build upon the ideas proposed in the paper, by performing experiments of my own. A practical problem is that there can be wrong entries in the Apertium data, and the algorithm must be robust to handle these. In particular, as cycle density, simply the ratio between the number of edges and vertices on the cycle goes up, the confidence score also increases. By keeping a certain threshold confidence score, translations with a higher score than the threshold have to be picked. Computing cycle density (confidence scores) can be time-heavy and optimized methods have to be designed and implemented. The values of these metrics can only be evaluated after in-depth experimentation with different language pairs on Apertium.

Further layers need to be added to the procedure proposed in the paper.

  • The input layer has to be finalized. RDF graphs have not been updated regularly in the past and it will be useful to directly use bilingual dictionaries from the Trunk. RDF graphs also do not have the lemma analysis that bilingual dictionaries have.
  • For a given language pair, my code must decide which subgraph of the Apertium data is relevant. The paper uses the “context” as translations at a distance of up to 3. However, this is an experimental bound and will probably be different for Apertium data.
  • It then needs to compute cycles in this chosen graph, pruning cycles with specific properties. The pruning properties will include ones mentioned in the paper, such as “Cycles starting and ending in word W of language L may not contain other words from L.” as well as ones that improve performance during my experiments.
  • Since the chosen graphs will often be large, significant optimization will be necessary to ensure computation doesn’t take too long. There would also have to be a limitation on the length of the cycles (number of vertices). There is a scope for optimizing the cycle selection, based on precomputation techniques such as biconnected components. Biconnected components are maximal subgraphs that have no cut vertices. Cut vertices are those which when removed, disconnect the rest of the graph. It is notable that all vertices in a cycle belong to the same biconnected component, and thus each such subgraph can be used independently. It is also notable that the algorithm chooses the cycle with the highest density. It is possible that this property in itself creates scope for optimization by rejecting certain cycles without exploring them. I will analyze this theory problem and design an algorithm that gives the best practical results for the Apertium data.
  • Experiments will have to be done for generalizing the metrics suggested in the paper. Different amounts of data are available for different languages. The threshold confidence score used for the En-Es pair in the paper will not necessarily hold for other pairs. Based on the graph structure, a general formula may need to be arrived upon that gives good results.
  • The final output has to be generated for the different use cases in their respective required formats.