Reordering superblanks
Currently there is a major problem with how formatting / superblanks interact with word/chunk reordering in Apertium.
If the input is
<a id="foobar" href="http://example.com">Foo <b>bar</b>.</a>
and the translation needs to reorder the words, we currently reorder only the words and don't touch (or even look at) the blanks, since we don't want to mess up the HTML. The output becomes
<a id="foobar" href="http://example.com">Бар <b>фоо</b>.</a>
but now the bold has shifted from the source word "bar" to the translation of "foo".
Ideally, the output should be
<a id="foobar" href="http://example.com"><b>Бар</b> фоо.</a>
Problems
Nearly all language pairs do this kind of thing:
$ echo '<i>Perro</i> <b>blanco</b>' | apertium es-en -f html
<i>White</i> <b>dog</b>
Those that don't will at some point mess up whatever formatting they're given.
The problem is not only that we bold or italicise the wrong word; it also limits our ability to accurately find out which words were reordered during translation. This kind of reordering information would be useful for systems like MediaWiki's Content Translation (see discussion).
A more serious problem, noted by User:Mlforcada in this discussion, is that tags that are in a valid order in t1x can still be moved around inside chunks in t2x, e.g.
input: <i>foo</i> <b>bar fum</b> fie
after t1x: [<i>]^SN{^foo<adj>$[</i> <b>]^bar<n>$}$ ^SV{^fum<adv>$[</b> ]^fie<vblex>$}$
after t2x: [<i>]^SV{^fum<adv>$[</b> ]^fie<vblex>$}$ ^SN{^foo<adj>$[</i> <b>]^bar<n>$}$
Possible solution
User:Tino Didriksen's post at http://comments.gmane.org/gmane.comp.nlp.apertium/3921 outlines a solution:
For each format, we need a list of inline tags; for HTML this would include <b>, <i> and so on.
Other tags, like <p>, are treated as before, but inline tags stick with their words:
Given the input string
<p><b><i>My sister lives</i> <u>in Wales</u></b></p>
you turn that into
My <b><i> sister <b><i> lives <b><i> in <b><u> Wales <b><u>
On output, we can just put the inline tags back on each word; this might mean some tags are unnecessarily duplicated, but that should be fine.
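The idea of inline tags sticking with their words can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any Apertium deformatter is implemented: the `INLINE_TAGS` set, the `TOKEN` regex and the function name `attach_inline_tags` are all hypothetical, non-inline tags such as <p> are simply skipped here, and tag attributes containing ">" are not handled.

```python
import re

# Assumed list of inline tags for HTML; a real deformatter would need a fuller list.
INLINE_TAGS = {"b", "i", "u", "em", "strong"}

# Matches either a tag (capturing the optional "/" and the tag name) or a word.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|[^<\s]+")


def attach_inline_tags(html):
    """Return (word, inline_tags) pairs, one per whitespace-separated word."""
    open_tags = []  # stack of (name, tag text) for currently open inline tags
    pairs = []
    for m in TOKEN.finditer(html):
        closing, name = m.group(1), m.group(2)
        if name is None:
            # A plain word: record it with every inline tag open right now.
            pairs.append((m.group(0), "".join(text for _, text in open_tags)))
            continue
        name = name.lower()
        if name not in INLINE_TAGS:
            continue  # block-level tags are out of scope for this sketch
        if closing:
            # Drop the most recently opened tag with the same name, if any.
            for i in range(len(open_tags) - 1, -1, -1):
                if open_tags[i][0] == name:
                    del open_tags[i]
                    break
        else:
            open_tags.append((name, m.group(0)))
    return pairs


for word, tags in attach_inline_tags(
    "<p><b><i>My sister lives</i> <u>in Wales</u></b></p>"
):
    print(word, tags)
```

Because every word carries its own copy of the tags, the words can be reordered or deleted freely without leaving an unbalanced tag behind.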
What we need to support something like this in Apertium:
- Each deformatter needs a list of which tags need the inline treatment
- Deformatters have to turn
<p><b><i>foo</i> bar</b></p>
into something like
[<p>][{<b><i>}]foo [{<b>}]bar[</p>]
- Can it be as simple as [{}] or does it have to be more complicated? As it is, {} is escaped in regular superblanks, so an unescaped {} inside [] would have special meaning.
- Also, reformatters need to distribute the tags again; preferably merging consecutive tags, although that's probably not too important.
- Pretransfer will have to distribute the tags as well, so
[{<i>}]^foo<vblex>+bar<prn># fie$
turns into
[{<i>}]^foo# fie<vblex>$ [{<i>}]^bar<prn>$
- Transfer modules have to treat the inline-blanks differently from other superblanks
- They should not be in the <b pos="N"/> elements, but probably be part of the <clip>
- For example: <clip pos="2" part="blank"/> where "blank" is a special part (similar to lemh/lemq/whole/tags) and using "blank" as a def-attr leads to a compile-time error.
- Note that if a word is deleted, we should be fine; removing an inline blank will not mess up HTML etc.
- Note also that a one-pattern rule will have zero superblanks, but one inline-blank. A two-pattern rule will have one superblank and two inline-blanks.
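The pretransfer item in the list above (copying the inline blank onto each lexical unit produced by splitting on "+") can be sketched as follows. This is a rough illustration, not the actual pretransfer code: the [{...}] notation is only the proposal from this page, the names `LU`, `split_lu` and `distribute` are hypothetical, and the regex ignores escaped characters and ambiguous ("/"-separated) analyses.

```python
import re

# An optional inline blank "[{...}]" directly followed by one lexical unit "^...$".
LU = re.compile(r"(\[\{[^}]*\}\])?\^(.*?)\$")


def split_lu(match):
    """Split a joined lexical unit on '+', repeating the inline blank on each part."""
    inline_blank = match.group(1) or ""
    body = match.group(2)
    # Separate the lemma queue ("# fie") from the analyses.
    body, hash_mark, queue = body.partition("#")
    queue = hash_mark + queue
    parts = body.split("+")
    # The queue moves to the first part, between its lemma and its tags.
    lemma, lt, tags = parts[0].partition("<")
    parts[0] = lemma + queue + lt + tags
    return " ".join(f"{inline_blank}^{part}$" for part in parts)


def distribute(stream):
    """Apply split_lu to every lexical unit in the stream."""
    return LU.sub(split_lu, stream)


print(distribute("[{<i>}]^foo<vblex>+bar<prn># fie$"))
# [{<i>}]^foo# fie<vblex>$ [{<i>}]^bar<prn>$
```

Lexical units without an inline blank pass through with ordinary splitting only, so the sketch would leave existing streams unchanged.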
Ensuring transfer rules output all regular superblanks
A separate but related problem is that transfer rules sometimes forget to include all (regular) superblanks from the input. This can of course mess up HTML, and it is frustrating that the developer has to ensure all rules have the right number of <b pos="N"/>, e.g. for a three-lu pattern we need to output both <b pos="1"/> and <b pos="2"/>.
This could be done mechanically by transfer at runtime instead of by the rule writer. Any rule will match a certain number of lu's, with one (super)blank between each lu (currently available in the b elements), and the action part will output a certain number of lu's.
- For a 1-pattern rule, there can be no superblanks between patterns, so there are no superblanks to output. This is the simple case.
- For a 2-pattern rule, there is exactly one superblank between patterns. Now we have to run the rule, and look at the output before printing it.
- If output contains zero or one chunks, put the superblank after the output.
- If output contains two or more chunks, put the superblank after the first chunk.
- Generalising this, look at the output, and interleave chunks and superblanks, that is:
- Read the first chunk, print that chunk, print the first superblank
- Read the second chunk, print that chunk, print the second superblank
- Etc. until all chunks are read, print remaining superblanks.
This can be made backwards compatible with existing rule files, by simply ignoring any existing <b> elements that have the pos attribute.
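The interleaving procedure described above can be sketched in Python. This is a minimal sketch under assumptions, not existing transfer code: the function name `interleave` is hypothetical, and chunks and superblanks are represented as plain strings that the rule has already produced.

```python
from collections import deque


def interleave(chunks, superblanks):
    """Alternate output chunks with the superblanks matched by the rule.

    chunks: the chunks the rule's action part produced, in output order.
    superblanks: the blanks found between the matched lu's, in input order.
    """
    pending = deque(superblanks)
    out = []
    for chunk in chunks:
        out.append(chunk)
        if pending:
            # Print the next unused superblank right after this chunk.
            out.append(pending.popleft())
    # All chunks are read: print whatever superblanks remain.
    out.extend(pending)
    return out


# 2-pattern rule, two output chunks: the blank goes after the first chunk.
print(interleave(["^SV{...}$", "^SN{...}$"], ["[<br/>]"]))
# 3-pattern rule, one output chunk: both blanks follow the chunk.
print(interleave(["^SN{...}$"], ["[<b>]", "[</b>]"]))
```

Note that when the output has more chunks than there are superblanks, the later chunks end up adjacent with nothing between them; the description above does not say what should separate them, so the sketch leaves that open.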