Difference between revisions of "Reordering superblanks"

Revision as of 14:09, 3 August 2016

Currently there is a major problem with how formatting (superblanks) interacts with word/chunk reordering in Apertium.

If the input is

<a id="foobar" href="http://example.com">Foo <b>bar</b>.</a>

and we want to reorder the words, we currently reorder only the words and don't touch (or even look at) the blanks, since we don't want to mess up the HTML. The output becomes

<a id="foobar" href="http://example.com">Бар <b>фоо</b>.</a>

but now the bold has shifted from source word "bar" to the target word that was "foo" in the input.
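
The behaviour described above can be imitated with a minimal Python sketch. It is a toy model only, not Apertium code: the function name `reorder_words_only` and the tokenising regex are made up for illustration, and no translation takes place. Tags, spaces and punctuation keep their positions while the words are permuted, so the bold ends up on the wrong word.

```python
import re

# Hypothetical tokeniser: an HTML tag, a run of word characters,
# or a run of anything else (spaces, punctuation).
TOKEN = re.compile(r"<[^>]*>|\w+|[^\w<]+")

def reorder_words_only(text, order):
    """Permute the words of `text`, leaving every blank where it was."""
    tokens = TOKEN.findall(text)
    # Positions of the tokens that are plain words (not tags, not blanks).
    word_slots = [i for i, t in enumerate(tokens) if re.fullmatch(r"\w+", t)]
    words = [tokens[i] for i in word_slots]
    # Slot k receives the word that was originally at position order[k].
    for slot, src in zip(word_slots, order):
        tokens[slot] = words[src]
    return "".join(tokens)

# The bold tag stays in the second slot, so it now wraps "Foo".
print(reorder_words_only("Foo <b>bar</b>.", [1, 0]))
```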

Ideally, the output should be

<a id="foobar" href="http://example.com"><b>Бар</b> фоо.</a>

Problems

Almost all language pairs do this kind of thing:

$ echo '<i>Perro</i> <b>blanco</b>' |apertium es-en -f html
<i>White</i> <b>dog</b>

And those that don't will at some point mess up whatever formatting they're given.

The problem is not only that we bold or italicise the wrong word, but also that it limits any possibility of accurately finding out which words were reordered during translation. This kind of reordering information would be useful for systems like MediaWiki's Content Translation (see discussion).

A more serious problem, noted by User:Mlforcada and galaxyfeeder in this discussion, is that tags that are in a valid order in t1x can still be moved around inside chunks in t2x, e.g.

input:       <i>foo</i> <b>bar fum</b> fie
after t1x:  [<i>]^SN{^foo<adj>$[</i> <b>]^bar<n>$}$ ^SV{^fum<adv>$[</b> ]^fie<vblex>$}$
after t2x:  [<i>]^SV{^fum<adv>$[</b> ]^fie<vblex>$}$ ^SN{^foo<adj>$[</i> <b>]^bar<n>$}$

The t2x rules may have completely "correct" blank handling in that they output all input superblanks in the correct order, but they have no way of looking at the blanks that are inside the chunks, so they reorder them wrongly.

Possible solution

User:Tino Didriksen's post at http://comments.gmane.org/gmane.comp.nlp.apertium/3921 outlines a solution (a closely related solution, with some additional details, is described in User:Mlforcada's entry at Ideas_for_Google_Summer_of_Code/superblank_handling_algorithm).

For each format, we need a list of inline/wordbound tags; for HTML this would include <b>, <i>, <u> and so on.

Other tags, like <p>, are treated similarly to before, but inline tags stick with their words:

Given input string "<p><b><i>My sister lives</i> <u>in Wales</u></b></p>" you turn that into

    My <b><i>
    sister <b><i>
    lives <b><i>
    in <b><u>
    Wales <b><u>

Now on outputting, we can just put the inline tags on each word – this might mean some tags are unnecessarily duplicated, but that should be fine.
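
The distribution step above can be sketched in Python with the standard library's `html.parser`. This is only an illustration of the idea, not how an Apertium deformatter is written: the `INLINE_TAGS` set, the class name and the helper functions are assumptions made for the example, and tag attributes are ignored for brevity.

```python
from html.parser import HTMLParser

# Assumed list of inline/wordbound tags; a real list would be defined per format.
INLINE_TAGS = {"b", "i", "u"}

class InlineTagDistributor(HTMLParser):
    """Record, for each word, the inline tags open at that point."""

    def __init__(self):
        super().__init__()
        self.open_inline = []  # stack of currently open inline tags
        self.words = []        # list of (word, [inline tags]) pairs

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:
            self.open_inline.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_inline:
            # Drop the innermost open tag with this name.
            idx = len(self.open_inline) - 1 - self.open_inline[::-1].index(tag)
            del self.open_inline[idx]

    def handle_data(self, data):
        for word in data.split():
            self.words.append((word, list(self.open_inline)))

def distribute(html):
    parser = InlineTagDistributor()
    parser.feed(html)
    parser.close()
    return parser.words

def render(words):
    """Put the inline tags on each word, duplicating tags where needed."""
    out = []
    for word, tags in words:
        opening = "".join(f"<{t}>" for t in tags)
        closing = "".join(f"</{t}>" for t in reversed(tags))
        out.append(f"{opening}{word}{closing}")
    return " ".join(out)

for word, tags in distribute("<p><b><i>My sister lives</i> <u>in Wales</u></b></p>"):
    print(word, "".join(f"<{t}>" for t in tags))
```

Because every word carries its own tag list, the words can be reordered freely before `render` is called; the cost is that tags shared by neighbouring words are repeated.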

What we need to support something like this in Apertium:

Each deformatter needs a list of which tags need the inline treatment
Deformatters have to turn marked-up input such as foo bar into something like [][{}]foo[] [{}]bar[], where each {} holds the inline tags that apply to the word that follows
- As it is, {} is escaped in regular superblanks, so an unescaped {} inside [] would have this special inline-blank meaning.
- To avoid ambiguity with multiwords and inconditionals, an inline-blank is closed by the nearest following (possibly empty) []
- Also, reformatters need to close the tags again, turning [][{}]foo[] [{}]bar[] back into marked-up foo bar
Pretransfer will have to distribute the tags when splitting, so [{}]^foo<vblex>+bar<prn># fie$[] turns into [{}]^foo# fie<vblex>$ [{}]^bar<prn>$[]
Transfer modules have to treat the inline-blanks differently from other superblanks
- All regular superblanks are output before the rule-output
 - This means they cannot be reordered or deleted, solving the t2x/chunk-reordering issue mentioned above. This also deals with the issue mentioned by Sergio that transfer rule writers forget to output b elements, or output them in the wrong order.
- Regular unmarked blanks, freeblanks (spaces, etc. that are not inside []) which are immediately before a word are output whenever there's a <b/> in the rule; each such blank is output exactly once. If they're not used up by b-elements in the rule, the remaining freeblanks are output after the rule output. Thus we output all and only those unanalysed chars that were in the input, and in the same order. The pos="N" in <b pos="N"/> no longer has any significance and is ignored.
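
The output order proposed for transfer modules in the list above can be sketched as a small queue in Python. This is a hypothetical model, not the apertium-transfer implementation: the function name, the `B` marker standing in for a <b/> element, and the choice to output nothing when a <b/> finds no freeblank left are all assumptions of this example.

```python
from collections import deque

B = object()  # stands in for a <b/> element in a rule's output

def emit_rule_output(superblanks, freeblanks, rule_output):
    """Assemble the output of one rule application.

    superblanks: regular superblanks matched by the rule, in input order
    freeblanks:  unmarked blanks matched by the rule, in input order
    rule_output: words and B markers, in the order the rule emits them
    """
    out = list(superblanks)      # regular superblanks come first, untouched
    pending = deque(freeblanks)
    for item in rule_output:
        if item is B:
            if pending:          # each freeblank is output exactly once
                out.append(pending.popleft())
        else:
            out.append(item)
    out.extend(pending)          # leftovers follow the rule output
    return "".join(out)

print(emit_rule_output(["[<p>]"], [" "], ["^white$", B, "^dog$"]))
```

A rule that forgets its <b/> still cannot lose a blank here: the unused freeblanks are simply appended after the rule output.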


The reformatter also needs to know where to close the tags. We can't just close the tags at whitespace, since we can have inconditionals and multiwords. So the end of the pipeline needs to look something like foo [{<b>}]bar[]! so that we get foo <b>bar</b>! instead of foo <b>bar!</b> (or vice versa). Similarly, the tokenise-as-you-analyse step needs to know how to distribute inline blanks onto the correct tokens, <b>foo</b>? vs <b>foo?</b>: if lt-proc sees just the opening tag, it doesn't know whether the "?" should have a preceding [{<b>}] or not. See https://github.com/junaidiiith/Apertium_Code/issues/7 for more examples.
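
The closing step can be sketched with a regular expression in Python. This is a rough illustration under simplifying assumptions, not the code of any Apertium reformatter: the function name is made up, escaped brackets inside blanks are not handled, and the content of the closing [] is simply emitted after the closing tags.

```python
import re

# [{<tags>}] text [closing blank]: the inline-blank is closed by the
# nearest following (possibly empty) [].
INLINE_BLANK = re.compile(r"\[\{((?:<[^<>]*>)*)\}\](.*?)\[([^\[\]]*)\]", re.S)
TAG_NAME = re.compile(r"<\s*([A-Za-z][A-Za-z0-9]*)")

def close_inline_blanks(stream):
    """Replace each inline-blank span with its tags opened and closed."""
    def repl(match):
        opening, text, blank = match.groups()
        names = TAG_NAME.findall(opening)
        # Close in reverse order so that the tags nest properly.
        closing = "".join(f"</{name}>" for name in reversed(names))
        return f"{opening}{text}{closing}{blank}"
    return INLINE_BLANK.sub(repl, stream)

print(close_inline_blanks("foo [{<b>}]bar[]!"))
print(close_inline_blanks("foo [{<b>}]bar![]"))
```

The position of the closing [] alone decides whether the punctuation ends up inside or outside the tag, which is why the end of the pipeline has to carry it explicitly.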

Implementation(s)

https://github.com/junaidiiith/apertium/tree/blank-handling GSoC 2016 project
https://github.com/unhammer/apertium/tree/blank-handling older, unfinished implementation of the changes required in apertium-transfer, with notes at https://github.com/unhammer/apertium/blob/blank-handling/blank_notes.org#consequences-of-this-type-of-blank-handling

See also

Format handling
https://www.mediawiki.org/wiki/Content_translation/Markup#Annotation_mapping_using_translation_subsequence_approximation how MediaWiki bravely works around Apertium's limitations
https://github.com/wikimedia/mediawiki-services-cxserver/tree/master/mt code

