Ideas for Google Summer of Code/Advanced Wikipedia translation


Latest revision as of 17:09, 21 March 2011

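This idea has been withdrawn. Microsoft's WikiBhasha (http://www.wikibhasha.org/index.htm) implements most of the ideas presented here, as well as a translation interface. WikiBhasha is open source (http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/WikiBhasha/).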

Translating Wikipedia, and Wikipedia's wiki syntax, presents a different sort of challenge from the usual formats in Apertium, because much of the formatting exists to convey meaning, and the translator of a Wikipedia article must take this meaning into account.

At the most basic level, links, categories, and templates cannot simply be transferred from one Wikipedia to another, so they should not be copied verbatim into the output. There are, however, typically equivalents in the target Wikipedia that should be used instead: they should be 'translated'.

For example, the escaped version of an Apertium translation of 'árbol' could look like this:

[\[\[árbol (teoría de grafos)|]tree[\]\]]

...which isn't as useful as it could be. Much more useful would be:

[\[\[tree (graph theory)|]tree[\]\]]

Link mappings could be substituted using a database. Fortunately, that database already exists:

DBPedia

DBPedia is a database containing data extracted from Wikipedia. Initially targeting the English Wikipedia, it is currently being extended to other languages. At present, for several Wikipedias, it's possible to craft a query for DBPedia to get the equivalent page for a number of items - links and categories, for example.
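Once such equivalents are available, substituting link targets is mostly mechanical. The following is a minimal Python sketch of that step, not a definitive implementation. It works on raw wikitext rather than on Apertium's escaped format, and the dictionary `ES_TO_EN_TITLES` and the function `map_link_targets` are hypothetical names: the dictionary stands in for whatever a DBPedia query would return.

```python
import re

# Hypothetical lookup table standing in for the result of a DBPedia query:
# source (es) page title -> equivalent target (en) page title.
ES_TO_EN_TITLES = {
    "árbol (teoría de grafos)": "tree (graph theory)",
}

# Matches [[target]] and [[target|label]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(\|[^\[\]]*)?\]\]")


def map_link_targets(wikitext, titles):
    """Replace each link target that has a known equivalent; leave others alone."""

    def substitute(match):
        target = match.group(1).strip()
        rest = match.group(2) or ""  # the "|label" part, if any
        return "[[" + titles.get(target, target) + rest + "]]"

    return WIKILINK.sub(substitute, wikitext)


print(map_link_targets("un [[árbol (teoría de grafos)|árbol]]", ES_TO_EN_TITLES))
```

Only the link target is touched here; the label after the pipe is left for the translator proper, and links with no known equivalent pass through unchanged.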

In addition, the extraction templates used to extract information from Wikipedia's infoboxes could be used to construct infoboxes for the target.

Take, for example, the DBPedia mapping templates for the philosopher infobox on en.wikipedia and ca.wikipedia, in that order:

{{TemplateMapping
| mapToClass = Philosopher
| mappings =
	{{PropertyMapping | templateProperty = name | ontologyProperty = foaf:name }}
	{{PropertyMapping | templateProperty = birth_date | ontologyProperty = birthDate }}
	{{PropertyMapping | templateProperty = birth_date | ontologyProperty = birthYear }}
	{{PropertyMapping | templateProperty = birth_place | ontologyProperty = birthPlace }}
	{{PropertyMapping | templateProperty = death_date | ontologyProperty = deathDate }}
	{{PropertyMapping | templateProperty = death_date | ontologyProperty = deathYear }}
	{{PropertyMapping | templateProperty = death_place | ontologyProperty = deathPlace }}
	{{PropertyMapping | templateProperty = region | ontologyProperty = region }}
	{{PropertyMapping | templateProperty = era | ontologyProperty = era }}
	{{PropertyMapping | templateProperty = school_tradition | ontologyProperty = philosophicalSchool }}
	{{PropertyMapping | templateProperty = main_interests | ontologyProperty = mainInterest }}
	{{PropertyMapping | templateProperty = notable_ideas  | ontologyProperty = notableIdea }}
	{{PropertyMapping | templateProperty = influences | ontologyProperty = influencedBy }}
	{{PropertyMapping | templateProperty = influenced | ontologyProperty = influenced }}
}}

{{TemplateMapping
| mapToClass = Philosopher
| mappings =
	{{PropertyMapping | templateProperty = nom | ontologyProperty = foaf:name }}
	{{PropertyMapping | templateProperty = naixement | ontologyProperty = birthDate }}
	{{PropertyMapping | templateProperty = mort | ontologyProperty = deathDate }}
	{{PropertyMapping | templateProperty = regio | ontologyProperty = region }}
	{{PropertyMapping | templateProperty = era | ontologyProperty = era }}
	{{PropertyMapping | templateProperty = escola_tradicio | ontologyProperty = philosophicalSchool }}
	{{PropertyMapping | templateProperty = interessos | ontologyProperty = mainInterest }}
	{{PropertyMapping | templateProperty = idees  | ontologyProperty = notableIdea }}
	{{PropertyMapping | templateProperty = influencies | ontologyProperty = influencedBy }}
	{{PropertyMapping | templateProperty = influencia | ontologyProperty = influenced }}
}}
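To work with mappings like these programmatically, the templateProperty/ontologyProperty pairs first have to be read out of the template text. The sketch below assumes the simple two-parameter PropertyMapping form shown above; the names `PROPERTY_MAPPING` and `parse_property_mappings` are hypothetical, and any additional PropertyMapping parameters are not handled.

```python
import re

# Matches the two-parameter form used above:
# {{PropertyMapping | templateProperty = X | ontologyProperty = Y }}
PROPERTY_MAPPING = re.compile(
    r"\{\{\s*PropertyMapping\s*"
    r"\|\s*templateProperty\s*=\s*(?P<template>[^|{}]+?)\s*"
    r"\|\s*ontologyProperty\s*=\s*(?P<ontology>[^|{}]+?)\s*\}\}"
)


def parse_property_mappings(mapping_text):
    """Return (templateProperty, ontologyProperty) pairs in document order."""
    return [
        (m.group("template"), m.group("ontology"))
        for m in PROPERTY_MAPPING.finditer(mapping_text)
    ]


SAMPLE = """{{TemplateMapping
| mapToClass = Philosopher
| mappings =
	{{PropertyMapping | templateProperty = nom | ontologyProperty = foaf:name }}
	{{PropertyMapping | templateProperty = naixement | ontologyProperty = birthDate }}
}}"""

print(parse_property_mappings(SAMPLE))
# → [('nom', 'foaf:name'), ('naixement', 'birthDate')]
```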

Given a list of mappings between equivalent infoboxes, it would be possible to generate a target infobox based on the ontologyProperty mappings. This would be a generally useful feature to have in the DBPedia framework, as it would allow bots to automatically update information in one Wikipedia's infobox from another. It might be possible to use the mapToClass property to infer alignments between templates, but as this may be a many-to-many mapping, it would be better not to try to infer this.
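The generation step could be sketched roughly as follows, assuming the two templates have already been paired by hand and reduced to (templateProperty, ontologyProperty) lists. The pairs below are a subset of the mappings above; the function names and the sample field values are hypothetical and purely illustrative.

```python
# Subsets of the en and ca Philosopher mappings shown above.
EN_PHILOSOPHER = [
    ("name", "foaf:name"),
    ("birth_date", "birthDate"),
    ("birth_date", "birthYear"),
    ("death_date", "deathDate"),
    ("region", "region"),
    ("school_tradition", "philosophicalSchool"),
]
CA_PHILOSOPHER = [
    ("nom", "foaf:name"),
    ("naixement", "birthDate"),
    ("mort", "deathDate"),
    ("regio", "region"),
    ("escola_tradicio", "philosophicalSchool"),
]


def align_properties(source_mapping, target_mapping):
    """Pair source and target template properties sharing an ontologyProperty."""
    by_ontology = {}
    for template_prop, ontology_prop in target_mapping:
        by_ontology.setdefault(ontology_prop, template_prop)
    aligned = {}
    for template_prop, ontology_prop in source_mapping:
        # A source property may map to several ontology properties
        # (birth_date -> birthDate, birthYear); the first match wins.
        if ontology_prop in by_ontology and template_prop not in aligned:
            aligned[template_prop] = by_ontology[ontology_prop]
    return aligned


def convert_infobox(fields, aligned):
    """Rename known fields; return (converted, fields with no equivalent)."""
    converted, unmapped = {}, {}
    for key, value in fields.items():
        if key in aligned:
            converted[aligned[key]] = value
        else:
            unmapped[key] = value
    return converted, unmapped


aligned = align_properties(EN_PHILOSOPHER, CA_PHILOSOPHER)
# Illustrative placeholder values, not real infobox data.
source_fields = {"name": "Example Philosopher", "region": "Western philosophy", "era": "Ancient"}
print(convert_infobox(source_fields, aligned))
```

This only renames the parameters. The values would still need to be translated, and fields without an aligned equivalent are returned separately rather than silently dropped.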