Ideas for Google Summer of Code/Bilingual dictionary enrichment via graph completion

From Apertium

Apertium bilingual dictionaries establish correspondences between lexical forms in a number of language pairs. Treated together as a graph, these correspondences may be used to infer new entries for existing or new language pairs.

Some of this work may be done by exploiting the RDF version of the Apertium bilingual dictionaries [1,2].

In a nutshell, RDF (based on the subject-predicate-object pattern) is the core representation mechanism of the so-called linked data cloud. This cloud can be seen as the counterpart of the traditional Web, in which documents (e.g. web pages) are connected by hyperlinks and are typically navigated by humans. In linked data, entities are connected to one another in a graph, and relevant information is stated about them; the cloud is intended to be navigated primarily by software agents. In the case of Apertium RDF, the graph nodes are the lexical entries, lexical senses, and translations, among other entities, coming from all the bilingual dictionaries. Notably, each of these entities has been assigned a unique identifier (URI) at Web scale (e.g., [3]), which can serve as the entry point for a software agent to navigate the graph. Another way to exploit the graph is to query it through a SPARQL endpoint [4].
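To make the subject-predicate-object pattern concrete, the following Python sketch stores a handful of triples in a plain list and walks from a lexical entry to its translations. The `ex:` identifiers and predicate names are invented for illustration and are not the vocabulary actually used by Apertium RDF [1,2]; a real client would dereference the entity URIs or query the SPARQL endpoint [4] instead.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical identifiers: the real Apertium RDF URIs and predicates differ.
TRIPLES: List[Triple] = [
    ("ex:sense/house-en", "ex:isSenseOf", "ex:entry/house-n-en"),
    ("ex:sense/casa-es", "ex:isSenseOf", "ex:entry/casa-n-es"),
    ("ex:trans/house-casa", "ex:source", "ex:sense/house-en"),
    ("ex:trans/house-casa", "ex:target", "ex:sense/casa-es"),
]


def match(triples: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple that agrees with the given (possibly open) pattern."""
    for triple in triples:
        if ((s is None or triple[0] == s)
                and (p is None or triple[1] == p)
                and (o is None or triple[2] == o)):
            yield triple


def translations_of(triples: List[Triple], entry: str) -> List[str]:
    """Follow entry -> sense -> translation -> target sense -> target entry."""
    found = []
    for sense, _, _ in match(triples, p="ex:isSenseOf", o=entry):
        for trans, _, _ in match(triples, p="ex:source", o=sense):
            for _, _, target_sense in match(triples, s=trans, p="ex:target"):
                for _, _, target_entry in match(
                        triples, s=target_sense, p="ex:isSenseOf"):
                    found.append(target_entry)
    return found


if __name__ == "__main__":
    print(translations_of(TRIPLES, "ex:entry/house-n-en"))
```

Each nested loop corresponds to one triple pattern, which is roughly what a SPARQL query with four patterns would express declaratively.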

The project will:

  • compare dictionary enrichment performed directly on bilingual dictionaries in the .dix format with enrichment via the existing RDF graphs, using the ideas in a paper under public review and ideas that had already been proposed in Apertium (e.g. removing a dictionary and checking how much of it can be recovered).
  • if the solution via RDF graphs works, establish a .dix to RDF conversion using XSLT or other XML processors, and a backward conversion from RDF to .dix, studying how much is lost in the roundtrip and, if anything is lost, defining a mitigation strategy.
  • if the direct route is used instead, start from the dixtools code and implement the graph completion strategies.
  • study the possible use of data linked to the RDF to enrich the dictionaries, when the license of the linked data allows for republication under the GPL license of the dictionaries.
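The "remove a dictionary and check how much of it can be recovered" idea mentioned in the list can be sketched in Python as follows. This is a minimal illustration under strong assumptions: words are plain strings, the only completion strategy is a two-hop path through one pivot language, and the toy dictionaries are invented. It is not the method of the paper under review, and the function names are hypothetical.

```python
from collections import defaultdict


def build_graph(dictionaries):
    """Build an undirected graph whose nodes are (language, word) pairs.

    dictionaries maps (lang_a, lang_b) to a list of (word_a, word_b) pairs.
    """
    graph = defaultdict(set)
    for (lang_a, lang_b), pairs in dictionaries.items():
        for word_a, word_b in pairs:
            a, b = (lang_a, word_a), (lang_b, word_b)
            graph[a].add(b)
            graph[b].add(a)
    return graph


def infer_pairs(graph, src, tgt):
    """Propose src-tgt pairs connected through one word of a third language."""
    inferred = set()
    for node, neighbours in list(graph.items()):
        if node[0] != src:
            continue
        for pivot in neighbours:
            if pivot[0] in (src, tgt):
                continue  # the pivot must belong to a third language
            for candidate in graph[pivot]:
                if candidate[0] == tgt:
                    inferred.add((node[1], candidate[1]))
    return inferred


def leave_one_out(dictionaries, held_out):
    """Hide one dictionary, re-infer it from the rest, return (recall, precision)."""
    rest = {k: v for k, v in dictionaries.items() if k != held_out}
    gold = set(dictionaries[held_out])
    predicted = infer_pairs(build_graph(rest), *held_out)
    recovered = gold & predicted
    recall = len(recovered) / len(gold) if gold else 0.0
    precision = len(recovered) / len(predicted) if predicted else 0.0
    return recall, precision


# Invented toy data, for illustration only.
DICTIONARIES = {
    ("en", "es"): [("house", "casa"), ("dog", "perro")],
    ("es", "ca"): [("casa", "casa"), ("perro", "gos")],
    ("en", "ca"): [("house", "casa"), ("dog", "gos"), ("cat", "gat")],
}

if __name__ == "__main__":
    print(leave_one_out(DICTIONARIES, ("en", "ca")))
```

In this toy setting the en-ca entry for "cat" cannot be recovered because no other dictionary mentions it, which is the kind of loss such an evaluation is meant to expose. Polysemous pivot words would also produce wrong candidates, so a real strategy would need a stronger criterion than a single path.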

Coding challenge: a toy task related to the main task in this idea. Use XSLT stylesheets or light (e.g. shell-scripted) XML processors to extract a number of "easy" dictionary entries from a set of dictionaries, convert them to a format suitable for a graph, and obtain a number of "easy" new bilingual correspondences from the graph.
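As a rough starting point for the challenge, the sketch below uses Python's standard `xml.etree.ElementTree` in place of XSLT or shell tools. The definition of "easy" is an assumption made here: an entry with no direction restriction (`r` attribute), a plain `<p>` pair, no multiword blanks or other markup inside `<l>` and `<r>`, and a noun tag first on both sides. The two dictionary fragments are invented, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented fragments in the bilingual .dix layout, for illustration only.
EN_ES = """
<dictionary><section id="main" type="standard">
  <e><p><l>house<s n="n"/></l><r>casa<s n="n"/><s n="f"/></r></p></e>
  <e><p><l>dog<s n="n"/></l><r>perro<s n="n"/><s n="m"/></r></p></e>
  <e r="LR"><p><l>home<s n="n"/></l><r>casa<s n="n"/><s n="f"/></r></p></e>
  <e><p><l>swimming<b/>pool<s n="n"/></l><r>piscina<s n="n"/><s n="f"/></r></p></e>
  <e><p><l>eat<s n="vblex"/></l><r>comer<s n="vblex"/></r></p></e>
</section></dictionary>
"""

ES_CA = """
<dictionary><section id="main" type="standard">
  <e><p><l>casa<s n="n"/><s n="f"/></l><r>casa<s n="n"/><s n="f"/></r></p></e>
  <e><p><l>perro<s n="n"/><s n="m"/></l><r>gos<s n="n"/><s n="m"/></r></p></e>
  <e><p><l>piscina<s n="n"/><s n="f"/></l><r>piscina<s n="n"/><s n="f"/></r></p></e>
</section></dictionary>
"""


def is_simple(side, pos):
    """True if the side is a single-word lemma whose first tag is `pos`."""
    children = list(side)
    if not side.text or not children:
        return False
    if any(child.tag != "s" for child in children):
        return False  # <b/>, <g> and similar markup mean "not easy"
    return children[0].get("n") == pos


def easy_entries(dix_xml, pos="n"):
    """Return (left lemma, right lemma) pairs for the "easy" entries."""
    pairs = []
    for entry in ET.fromstring(dix_xml).iter("e"):
        if entry.get("r") is not None:
            continue  # skip entries restricted to one direction
        pair = entry.find("p")
        if pair is None:
            continue
        left, right = pair.find("l"), pair.find("r")
        if left is None or right is None:
            continue
        if is_simple(left, pos) and is_simple(right, pos):
            pairs.append((left.text, right.text))
    return pairs


def compose(ab, bc):
    """Join two edge lists on their shared middle word."""
    by_b = {}
    for b, c in bc:
        by_b.setdefault(b, set()).add(c)
    return sorted({(a, c) for a, b in ab for c in by_b.get(b, ())})


if __name__ == "__main__":
    print(compose(easy_entries(EN_ES), easy_entries(ES_CA)))
```

The edge lists returned by `easy_entries` could equally be written out as tab-separated lines and loaded into any graph tool; the `compose` step is only the simplest way of reading new correspondences off the graph.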