Ideas for Google Summer of Code/Bilingual dictionary enrichment via graph completion

Apertium bilingual dictionaries, which establish correspondences between lexical forms in a number of language pairs, may be used to infer new entries for existing or new language pairs using graphs.

Some of this work may be done via the RDF graphs derived from them [1,2].

The project will:

  • compare dictionary enrichment performed directly on bilingual dictionaries in the .dix format with enrichment via the existing RDF graphs, using the ideas in a paper under public review and ideas that have already been proposed in Apertium (e.g. removing a dictionary and checking how much of it can be recovered).
  • if the solution via RDF graphs works, establish a .dix to RDF conversion using XSLT or other XML processors, and a backward conversion from RDF to .dix, studying how much is lost in the round trip and defining a strategy to mitigate any loss.
  • if the direct route is used instead, start from the dixtools code and implement the graph completion strategies.
  • study the possible use of data linked to the RDF to enrich the dictionaries, when the license of the linked data allows republication under the GPL, the license of the dictionaries.
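
The "remove a dictionary and check how much of it can be recovered" idea in the first item can be sketched as a leave-one-out experiment. The Python below is only an illustration under stated assumptions: the dictionaries are invented toy data, the inference rule is the simplest possible one (two words are proposed as a pair when they share a neighbour in a third language), and the names `build_graph`, `infer_via_pivot` and `leave_one_out` are hypothetical rather than part of dixtools or of any Apertium tool.

```python
from collections import defaultdict
from itertools import product

# Toy data, invented for illustration: each bilingual dictionary is a set of
# (left lemma, right lemma) pairs keyed by (left language, right language).
DICTS = {
    ("en", "es"): {("house", "casa"), ("dog", "perro"), ("cat", "gato")},
    ("es", "ca"): {("casa", "casa"), ("perro", "gos"), ("gato", "gat")},
    ("en", "ca"): {("house", "casa"), ("dog", "gos"), ("bird", "ocell")},
}


def build_graph(dicts):
    """Build an undirected graph whose nodes are (language, lemma) tuples."""
    graph = defaultdict(set)
    for (left_lang, right_lang), pairs in dicts.items():
        for left, right in pairs:
            a, b = (left_lang, left), (right_lang, right)
            graph[a].add(b)
            graph[b].add(a)
    return graph


def infer_via_pivot(graph, src_lang, tgt_lang):
    """Propose (src, tgt) pairs that share a neighbour in a third language."""
    inferred = set()
    for (lang, _lemma), neighbours in list(graph.items()):
        if lang in (src_lang, tgt_lang):
            continue  # the pivot must belong to a third language
        sources = [word for lg, word in neighbours if lg == src_lang]
        targets = [word for lg, word in neighbours if lg == tgt_lang]
        inferred.update(product(sources, targets))
    return inferred


def leave_one_out(dicts, held_out):
    """Remove one dictionary, infer it from the rest, and score the result."""
    remaining = {key: pairs for key, pairs in dicts.items() if key != held_out}
    inferred = infer_via_pivot(build_graph(remaining), *held_out)
    gold = dicts[held_out]
    hits = inferred & gold
    recall = len(hits) / len(gold) if gold else 0.0
    precision = len(hits) / len(inferred) if inferred else 0.0
    return inferred, precision, recall


inferred, precision, recall = leave_one_out(DICTS, ("en", "ca"))
print(sorted(inferred), precision, recall)
```

In this toy run the pair (cat, gat) is inferred but is absent from the held-out dictionary, so it lowers the measured precision even though it is a plausible new entry. Precision measured this way is therefore best read as a lower bound.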

Coding challenge: a toy task related to the main task in this idea: use XSLT stylesheets or light (e.g. shell-scripted) XML processors to extract a number of "easy" dictionary entries from a set of dictionaries, convert them to some suitable format for a graph, and obtain a number of "easy" new bilingual correspondences from the graph.
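
As a rough illustration of the extraction step of the challenge, the sketch below uses Python's standard `xml.etree.ElementTree` in place of XSLT. Several things here are assumptions made for the example: the sample entries are invented, "easy" is taken to mean a single-word entry with no attributes on `<e>` (so no direction restrictions) whose two sides share the same first symbol, and the `lang:lemma#pos` edge notation is a hypothetical format, not an Apertium or RDF convention.

```python
import xml.etree.ElementTree as ET

# Tiny inline sample in the bilingual .dix layout; the entries are invented.
SAMPLE_DIX = """<dictionary>
  <section id="main" type="standard">
    <e><p><l>house<s n="n"/></l><r>casa<s n="n"/><s n="f"/></r></p></e>
    <e><p><l>dog<s n="n"/></l><r>perro<s n="n"/><s n="m"/></r></p></e>
    <e><p><l>ice<b/>cream<s n="n"/></l><r>helado<s n="n"/><s n="m"/></r></p></e>
    <e r="LR"><p><l>puppy<s n="n"/></l><r>perro<s n="n"/><s n="m"/></r></p></e>
  </section>
</dictionary>"""


def side(element):
    """Return (lemma, first symbol) for an <l> or <r> element, or None if it
    holds anything other than plain text followed by <s> symbols."""
    if any(child.tag != "s" for child in element):
        return None  # multiwords (<b/>), groups (<g>) and similar
    symbols = [child.get("n") for child in element]
    lemma = (element.text or "").strip()
    if not lemma or not symbols:
        return None
    return lemma, symbols[0]


def easy_entries(dix_text):
    """Yield (left lemma, right lemma, first symbol) for 'easy' entries."""
    root = ET.fromstring(dix_text)
    for entry in root.iter("e"):
        if entry.attrib:  # skip direction restrictions and other attributes
            continue
        pair = entry.find("p")
        if pair is None:
            continue
        left_el, right_el = pair.find("l"), pair.find("r")
        if left_el is None or right_el is None:
            continue
        left, right = side(left_el), side(right_el)
        if left and right and left[1] == right[1]:
            yield left[0], right[0], left[1]


def to_edges(dix_text, left_lang, right_lang):
    """Format entries as tab-separated edges between language-tagged nodes."""
    return [
        f"{left_lang}:{left}#{pos}\t{right_lang}:{right}#{pos}"
        for left, right, pos in easy_entries(dix_text)
    ]


edges = to_edges(SAMPLE_DIX, "en", "es")
print("\n".join(edges))
```

For a dictionary on disk, the root element could be obtained with `ET.parse(path).getroot()` instead of `ET.fromstring`. Edge lists produced this way for several language pairs could then be merged into one graph, from which new correspondences are read off.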

[1] https://jogracia.wordpress.com/2015/06/24/the-apertium-dictionaries-on-the-web-of-data/

[2] http://www.semantic-web-journal.net/content/apertium-bilingual-dictionaries-web-data