Ideas for Google Summer of Code/Advanced Wikipedia translation

From Apertium

Translating Wikipedia, and Wikipedia's wiki syntax, presents a different sort of challenge from the formats Apertium usually handles: much of the formatting exists to convey meaning, and the translator of a Wikipedia article must take that meaning into account.

At the most basic level, links, categories, and templates cannot simply be copied from one Wikipedia to another, so it is not appropriate to reproduce them unchanged in the output. The target Wikipedia typically has equivalents that should be used instead; in other words, they should be 'translated'.

For example, the escaped version of an Apertium translation of a link labelled 'árbol' could look like this:

[\[\[árbol (teoría de grafos)|]tree[\]\]]

...which isn't as useful as it could be. Much more useful would be:

[\[\[tree (graph theory)|]tree[\]\]]
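The rewrite from the first form to the second can be sketched in a few lines of Python. This is only an illustration: `TITLE_MAP` is a hypothetical hard-coded lookup table, `translate_link_targets` is a name invented here, and the pattern assumes the escaped form shown above, where the link target sits inside the opening superblank.

```python
import re

# Hypothetical lookup table from source titles to target titles.
# It is hard-coded here purely for illustration.
TITLE_MAP = {"árbol (teoría de grafos)": "tree (graph theory)"}

# Matches the opening superblank of a piped link in escaped form:
# [\[\[TARGET|]
ESCAPED_LINK_OPEN = re.compile(r"\[\\\[\\\[([^|\]]+)\|\]")


def translate_link_targets(escaped_text, title_map):
    """Replace known link targets; leave unknown targets untouched."""
    def swap(match):
        target = match.group(1)
        new_target = title_map.get(target, target)
        return "[\\[\\[" + new_target + "|]"
    return ESCAPED_LINK_OPEN.sub(swap, escaped_text)


before = r"[\[\[árbol (teoría de grafos)|]tree[\]\]]"
after = translate_link_targets(before, TITLE_MAP)
print(after)
```

Only piped links are handled in this sketch; a link whose label is also its target would need separate treatment, since its target passes through the translator as ordinary text.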

Link mappings could be substituted using a database. Fortunately, that database already exists:

DBpedia

DBpedia is a database of data extracted from Wikipedia. It initially targeted the English Wikipedia and is currently being extended to other languages. For several Wikipedias, it is already possible to query DBpedia for the equivalent page of an item such as a link target or a category.
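As a rough sketch of what such a query might look like, the Python below builds a SPARQL query string and a request URL without sending anything. Several things here are assumptions rather than facts from this page: that interlanguage equivalents are exposed as `owl:sameAs` links between language-specific resources, that resource IRIs follow the `http://LANG.dbpedia.org/resource/Title` pattern, and that the chosen endpoint actually holds these triples. The function names are invented for illustration.

```python
from urllib.parse import urlencode


def equivalent_page_query(title, source_lang):
    """Build a SPARQL query for the English counterpart of a page.

    Assumption: equivalents are linked with owl:sameAs, and resources
    are named after page titles with spaces turned into underscores.
    """
    page = title[:1].upper() + title[1:]  # page titles start upper-case
    resource = "http://{}.dbpedia.org/resource/{}".format(
        source_lang, page.replace(" ", "_")
    )
    return (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        "SELECT ?target WHERE {\n"
        "  <" + resource + "> owl:sameAs ?target .\n"
        '  FILTER(STRSTARTS(STR(?target), "http://dbpedia.org/resource/"))\n'
        "}"
    )


def query_url(endpoint, query):
    """Encode the query as a GET request URL (SPARQL protocol 'query' parameter)."""
    return endpoint + "?" + urlencode({"query": query})


query = equivalent_page_query("árbol (teoría de grafos)", "es")
# The endpoint is an assumption; a language-specific endpoint may be needed.
url = query_url("https://dbpedia.org/sparql", query)
print(query)
```

The result of such a query would be a resource IRI, from which the target page title still has to be derived before it can be substituted into a link.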

In addition, the mapping templates that DBpedia uses to extract information from Wikipedia's infoboxes could be used to construct infoboxes for the target.

Take, for example, the mapping templates for the philosopher infobox on en.wikipedia and ca.wikipedia (English first, then Catalan):

{{TemplateMapping
| mapToClass = Philosopher
| mappings =
	{{PropertyMapping | templateProperty = name | ontologyProperty = foaf:name }}
	{{PropertyMapping | templateProperty = birth_date | ontologyProperty = birthDate }}
	{{PropertyMapping | templateProperty = birth_date | ontologyProperty = birthYear }}
	{{PropertyMapping | templateProperty = birth_place | ontologyProperty = birthPlace }}
	{{PropertyMapping | templateProperty = death_date | ontologyProperty = deathDate }}
	{{PropertyMapping | templateProperty = death_date | ontologyProperty = deathYear }}
	{{PropertyMapping | templateProperty = death_place | ontologyProperty = deathPlace }}
	{{PropertyMapping | templateProperty = region | ontologyProperty = region }}
	{{PropertyMapping | templateProperty = era | ontologyProperty = era }}
	{{PropertyMapping | templateProperty = school_tradition | ontologyProperty = philosophicalSchool }}
	{{PropertyMapping | templateProperty = main_interests | ontologyProperty = mainInterest }}
	{{PropertyMapping | templateProperty = notable_ideas  | ontologyProperty = notableIdea }}
	{{PropertyMapping | templateProperty = influences | ontologyProperty = influencedBy }}
	{{PropertyMapping | templateProperty = influenced | ontologyProperty = influenced }}
}}

{{TemplateMapping
| mapToClass = Philosopher
| mappings =
	{{PropertyMapping | templateProperty = nom | ontologyProperty = foaf:name }}
	{{PropertyMapping | templateProperty = naixement | ontologyProperty = birthDate }}
	{{PropertyMapping | templateProperty = mort | ontologyProperty = deathDate }}
	{{PropertyMapping | templateProperty = regio | ontologyProperty = region }}
	{{PropertyMapping | templateProperty = era | ontologyProperty = era }}
	{{PropertyMapping | templateProperty = escola_tradicio | ontologyProperty = philosophicalSchool }}
	{{PropertyMapping | templateProperty = interessos | ontologyProperty = mainInterest }}
	{{PropertyMapping | templateProperty = idees  | ontologyProperty = notableIdea }}
	{{PropertyMapping | templateProperty = influencies | ontologyProperty = influencedBy }}
	{{PropertyMapping | templateProperty = influencia | ontologyProperty = influenced }}
}}

Given a list of mappings between equivalent infoboxes, it would be possible to generate a target infobox by matching fields through their shared ontologyProperty values. This would be a generally useful feature to have in the DBpedia framework, as it would allow bots to update one Wikipedia's infobox automatically from another's. It might be possible to use the mapToClass property to infer which templates correspond, but as this may be a many-to-many mapping, it would be better not to infer it and to list the template pairs explicitly instead.
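A minimal sketch of that field matching, in Python, might look like the following. The function names and the sample field values are invented for illustration, the two mapping texts are shortened excerpts of the templates above, and the regular expression assumes each PropertyMapping is written in the exact two-parameter form shown on this page.

```python
import re

# One PropertyMapping, in the two-parameter form used on this page.
PROPERTY_MAPPING = re.compile(
    r"templateProperty\s*=\s*(?P<field>[^|}]+?)\s*\|"
    r"\s*ontologyProperty\s*=\s*(?P<prop>[^|}]+?)\s*}}"
)


def parse_mappings(text):
    """Return (templateProperty, ontologyProperty) pairs in document order."""
    return [(m.group("field"), m.group("prop"))
            for m in PROPERTY_MAPPING.finditer(text)]


def translate_infobox(fields, source_mapping, target_mapping):
    """Rename source infobox fields to their target equivalents.

    Fields with no counterpart in the target mapping are dropped;
    field values are copied as-is and still need translating.
    """
    to_target = {}
    for field, prop in parse_mappings(target_mapping):
        to_target.setdefault(prop, field)  # first mapping wins
    result = {}
    for field, prop in parse_mappings(source_mapping):
        if field in fields and prop in to_target:
            result.setdefault(to_target[prop], fields[field])
    return result


EN = """
{{PropertyMapping | templateProperty = name | ontologyProperty = foaf:name }}
{{PropertyMapping | templateProperty = birth_date | ontologyProperty = birthDate }}
{{PropertyMapping | templateProperty = birth_date | ontologyProperty = birthYear }}
{{PropertyMapping | templateProperty = birth_place | ontologyProperty = birthPlace }}
"""
CA = """
{{PropertyMapping | templateProperty = nom | ontologyProperty = foaf:name }}
{{PropertyMapping | templateProperty = naixement | ontologyProperty = birthDate }}
"""

# Placeholder values, not real data.
source_fields = {"name": "Example Philosopher",
                 "birth_date": "1900-01-01",
                 "birth_place": "Example City"}
print(translate_infobox(source_fields, EN, CA))
```

In this excerpt birth_place has no Catalan counterpart, so it is dropped; a real tool would probably want to report such fields rather than discard them silently.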