Difference between revisions of "Reordering superblanks"

Revision as of 18:54, 25 May 2014

Currently there is a major problem with how formatting (superblanks) interacts with word and chunk reordering in Apertium.

If the input is

<a id="foobar" href="http://example.com">Foo <b>bar</b>.</a>

and we want to reorder the words, we currently reorder only the words and don't touch (or even look at) the blanks, since we don't want to mess up the HTML. The output becomes

<a id="foobar" href="http://example.com">Бар <b>фоо</b>.</a>

but now the bold has shifted from the source word "bar" to the target word that was "foo" in the input.

Ideally, the output should be

<a id="foobar" href="http://example.com"><b>Бар</b> фоо.</a>
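
The effect described above can be reproduced with a toy model. This is only a sketch: it assumes a sentence is an alternating sequence of blanks and words, and the `interleave` helper and the hard-coded word lists are hypothetical, not part of Apertium.

```python
# Toy model (an assumption, not Apertium code): the deformatter yields
# blank[0] word[0] blank[1] word[1] ... blank[n], and transfer may only
# touch the words.
blanks = ['<a id="foobar" href="http://example.com">', ' <b>', '</b>.</a>']
source_words = ['Foo', 'bar']


def interleave(blanks, words):
    """Rebuild the text by slotting the words between the fixed blanks."""
    out = [blanks[0]]
    for word, blank in zip(words, blanks[1:]):
        out.append(word)
        out.append(blank)
    return ''.join(out)


original = interleave(blanks, source_words)

# After translation and reordering, "bar" comes first and "foo" second,
# but the blanks stay in their original slots.
target_words = ['Бар', 'фоо']
reordered = interleave(blanks, target_words)

print(original)   # <a id="foobar" href="http://example.com">Foo <b>bar</b>.</a>
print(reordered)  # <a id="foobar" href="http://example.com">Бар <b>фоо</b>.</a>
```

Because the `<b>` blank is tied to a position rather than to a word, it ends up around whichever word lands in the second slot.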

Problems

All language pairs do this kind of thing:

$ echo '<i>Perro</i> <b>blanco</b>' |apertium es-en -f html
<i>White</i> <b>dog</b>

And those that don't will at some point mess up whatever formatting they're given.

The problem is not only that we bold or italicise the wrong word, but also that it makes it hard to find out accurately which words were reordered during translation. This kind of reordering information would be useful for systems like MediaWiki's Content Translation (see discussion).

Possible solution

User:TinoDidriksen's post at http://comments.gmane.org/gmane.comp.nlp.apertium/3921 outlines a solution:

For each format, we need a list of inline tags; for HTML this would include <b>, <i> and so on.

Other tags, like <p>, are treated as before, but inline tags stick with their words:

Given input string "<p><b><i>My sister lives</i> <u>in Wales</u></b></p>" you turn that into

    My <b><i>
    sister <b><i>
    lives <b><i>
    in <b><u>
    Wales <b><u>

Now on outputting, we can just put the inline tags on each word – this might mean some tags are unnecessarily duplicated, but that should be fine.
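
The idea of inline tags sticking to their words can be sketched in Python with the standard `html.parser` module. The `INLINE_TAGS` set, the class name and the helper functions are assumptions made for illustration; the sketch also drops tag attributes and expects well-formed input.

```python
from html.parser import HTMLParser

# Assumed list of inline tags; a real list would be defined per format.
INLINE_TAGS = {"b", "i", "u"}


class InlineTagCollector(HTMLParser):
    """Record, for every word, the inline tags open at that point."""

    def __init__(self):
        super().__init__()
        self.open_inline = []  # stack of currently open inline tag names
        self.words = []        # list of (word, tuple_of_inline_tags)

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:
            self.open_inline.append(tag)  # attributes are ignored here

    def handle_endtag(self, tag):
        if tag in INLINE_TAGS and tag in self.open_inline:
            # Close the innermost open tag with this name.
            idx = len(self.open_inline) - 1 - self.open_inline[::-1].index(tag)
            del self.open_inline[idx]

    def handle_data(self, data):
        for word in data.split():
            self.words.append((word, tuple(self.open_inline)))


def words_with_tags(html):
    collector = InlineTagCollector()
    collector.feed(html)
    collector.close()
    return collector.words


def render(words):
    """Wrap each word in its own tags; duplicated tags are accepted."""
    parts = []
    for word, tags in words:
        opening = ''.join(f'<{t}>' for t in tags)
        closing = ''.join(f'</{t}>' for t in reversed(tags))
        parts.append(f'{opening}{word}{closing}')
    return ' '.join(parts)


tagged = words_with_tags('<p><b><i>My sister lives</i> <u>in Wales</u></b></p>')
for word, tags in tagged:
    print(word, ''.join(f'<{t}>' for t in tags))
```

Since every word carries its own tags, the list can be reordered freely before `render` is called, and each word keeps its formatting.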

To deal with this in Apertium, we need:

1. Lists of inline tags for deformatters
2. Deformatters to turn <p><b><i>foo</i> bar</b></p> into something like [<p>][{<b><i>}]foo [{<b>}]bar[</p>]
   - Could it be as simple as [{}] or does it have to be more complicated?
3. And vice versa for reformatters
4. Transfer modules to
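
As a rough sketch of the deformatter step in the list above, the following Python turns inline tags into a `[{...}]` blank in front of each word and every other tag into an ordinary `[...]` superblank. It is not the actual Apertium deformatter: the `INLINE_TAGS` set and the class name are assumptions, inline tags are expected to be properly nested, and the escaping that real superblanks need is left out.

```python
import re
from html.parser import HTMLParser

# Assumed list of inline tags; a real list would be defined per format.
INLINE_TAGS = {"b", "i", "u"}


class SketchDeformatter(HTMLParser):
    """Emit [..] for block tags and [{..}] in front of each word."""

    def __init__(self):
        super().__init__()
        self.open_inline = []  # raw text of the open inline start tags
        self.out = []

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()  # keeps attributes verbatim
        if tag in INLINE_TAGS:
            self.open_inline.append(raw)
        else:
            self.out.append(f'[{raw}]')

    def handle_endtag(self, tag):
        if tag in INLINE_TAGS:
            if self.open_inline:
                self.open_inline.pop()  # assumes proper nesting
        else:
            self.out.append(f'[</{tag}>]')

    def handle_data(self, data):
        # Split into words and whitespace runs, keeping the whitespace.
        for piece in re.split(r'(\s+)', data):
            if not piece:
                continue
            if piece.isspace() or not self.open_inline:
                self.out.append(piece)
            else:
                self.out.append('[{' + ''.join(self.open_inline) + '}]' + piece)


def deformat(html):
    parser = SketchDeformatter()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


print(deformat('<p><b><i>foo</i> bar</b></p>'))
# [<p>][{<b><i>}]foo [{<b>}]bar[</p>]
```

A reformatter would do the reverse: read each `[{...}]` blank and wrap the following word in those tags, closing them in the opposite order.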

