Difference between revisions of "Reordering superblanks"

From Apertium

Revision as of 18:36, 25 May 2014

Currently there is a major problem with how formatting (superblanks) interacts with word/chunk reordering in Apertium.

If the input is

<a id="foobar" href="http://example.com">Foo <b>bar</b>.</a>

and the words need to be reordered, we currently move only the words and don't touch (or even look at) the blanks, since we don't want to break the HTML. The output then becomes

<a id="foobar" href="http://example.com">Бар <b>фоо</b>.</a>

but now the bold has shifted from the source word "bar" to the target word that corresponds to "foo" in the input.
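
The behaviour described above can be reproduced with a small model. This is only a sketch for illustration, not Apertium code: it assumes the stream is an alternating sequence of blanks (here, the HTML tags and whitespace) and words, and the names `interleave` and `reorder_words_only` are hypothetical.

```python
from itertools import zip_longest


def interleave(blanks, words):
    """Join blanks and words alternately, starting with a blank."""
    out = []
    for blank, word in zip_longest(blanks, words, fillvalue=""):
        out.append(blank)
        out.append(word)
    return "".join(out)


def reorder_words_only(blanks, words, order):
    """Permute the words by `order`; every blank keeps its original position."""
    return interleave(blanks, [words[i] for i in order])


# Blanks and words of the example input (this split is an assumption of the sketch).
blanks = ['<a id="foobar" href="http://example.com">', " <b>", "</b>", "</a>"]
source_words = ["Foo", "bar", "."]

# Word-for-word translations, cased as they appear in the example output.
target_words = ["фоо", "Бар", "."]

# The translation swaps the first two words; the blanks stay where they were.
shifted = reorder_words_only(blanks, target_words, [1, 0, 2])
print(shifted)
```

Because the `<b>` and `</b>` blanks stay in the second slot, whichever word ends up there is set in bold, regardless of which source word it came from.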

Ideally, the output should be

<a id="foobar" href="http://example.com"><b>Бар</b> фоо.</a>
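
To make the difference concrete, the ideal output corresponds to a model in which inline formatting belongs to a word and moves with it. The sketch below only illustrates that idea under simplifying assumptions (each word carries its own tuple of inline tag names, and words are separated by single spaces); it is not a description of how Apertium does or should implement this, and `wrap` and `reorder_with_tags` are hypothetical names.

```python
def wrap(text, tags):
    """Wrap a word in its own inline tags, outermost tag first."""
    opening = "".join(f"<{tag}>" for tag in tags)
    closing = "".join(f"</{tag}>" for tag in reversed(tags))
    return f"{opening}{text}{closing}"


def reorder_with_tags(words, order):
    """Permute (text, tags) pairs by `order`, so each word keeps its tags."""
    return " ".join(wrap(*words[i]) for i in order)


# Translated words in source order; "bar" was bold, so its translation is too.
words = [("фоо", ()), ("Бар", ("b",))]

body = reorder_with_tags(words, [1, 0])
ideal = '<a id="foobar" href="http://example.com">' + body + ".</a>"
print(ideal)
```

Here the enclosing `<a>` element and the final full stop are kept outside the reordered span, which is an assumption made to keep the example short.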

Problems

All language pairs do this kind of thing:

$ echo '<i>Perro</i> <b>blanco</b>' |apertium es-en -f html
<i>White</i> <b>dog</b>

And those that don't will, at some point, mess up whatever formatting they're given.


The problem is not only that we bold or italicise the wrong word; it also limits our ability to find out accurately which words were reordered during translation. This kind of reordering information is useful for systems like MediaWiki's Content Translation (https://www.mediawiki.org/wiki/Content_translation).

Possible solution