Difference between revisions of "Reordering superblanks"

Latest revision as of 18:32, 21 June 2020

Currently there is a major problem with how formatting / superblanks interact with word/chunk reordering in Apertium.

If the input is

<a id="foobar" href="http://example.com">Foo <b>bar</b>.</a>

and we want to reorder the words, we currently only reorder the words, and don't touch (or even look at) the blanks, since we don't want to mess up the html, so the output becomes

<a id="foobar" href="http://example.com">Бар <b>фоо</b>.</a>

but now the bold has shifted from source word "bar" to the target word that was "foo" in the input.

Ideally, the output should be

<a id="foobar" href="http://example.com"><b>Бар</b> фоо.</a>

Problems

All language pairs do this kind of thing:

$ echo '<i>Perro</i> <b>blanco</b>' |apertium es-en -f html
<i>White</i> <b>dog</b>

And those that don't, will at some point mess up whatever formatting they're given.

The problem is not only that we bold or italicise the wrong word, but also that it limits any possibility of accurately finding out which words were reordered during translation. This kind of reordering information would be useful for systems like Mediawiki's Content Translation (see discussion).

A more serious problem, noted by User:Mlforcada and galaxyfeeder in this discussion, is that tags that are in a valid order in t1x can still be moved around inside chunks in t2x, e.g.

input:       <i>foo</i> <b>bar fum</b> fie
after t1x:  [<i>]^SN{^foo<adj>$[</i> <b>]^bar<n>$}$ ^SV{^fum<adv>$[</b> ]^fie<vblex>$}$
after t2x:  [<i>]^SV{^fum<adv>$[</b> ]^fie<vblex>$}$ ^SN{^foo<adj>$[</i> <b>]^bar<n>$}$

The t2x rules may have completely "correct" blank handling in that they output all input superblanks in the correct order, but they have no way of looking at the blanks that are inside the chunks, so they reorder them wrongly.

This is important to people:

People in the real world don't care about a tiny increase in BLEU score; they want tags handled properly! #eamt2017
https://twitter.com/tarfandy/status/869195419494096897

Possible solution

User:Tino Didriksen's post at https://www.mail-archive.com/apertium-stuff@lists.sourceforge.net/msg04449.html outlines a solution (a closely related solution, with some additional details, is described in User:Mlforcada's entry at Ideas_for_Google_Summer_of_Code/superblank_handling_algorithm).

For each format, we need a list of inline/wordbound tags; for HTML this would include <b>, <i> and so on.

Other tags, like <p>, are treated similarly to before, but inline tags stick with their words:

Given input string "<p><b><i>My sister lives</i> <u>in Wales</u></b></p>" you turn that into

    My <b><i>
    sister <b><i>
    lives <b><i>
    in <b><u>
    Wales <b><u>

Now on outputting, we can just put the inline tags on each word – this might mean some tags are unnecessarily duplicated, but that should be fine.
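
As a rough illustration of the distribution step above, here is a minimal Python sketch. It is not Apertium code: the class name `InlineTagDistributor`, the function `distribute` and the contents of `INLINE_TAGS` are assumptions made for this example, and tag attributes are ignored for brevity.

```python
from html.parser import HTMLParser

# Assumed list of inline/wordbound tags; a real deformatter would
# carry its own per-format list.
INLINE_TAGS = {"b", "i", "u", "em", "strong"}


class InlineTagDistributor(HTMLParser):
    """Records, for every word, the inline tags open when it was read."""

    def __init__(self):
        super().__init__()
        self.open_inline = []  # stack of currently open inline tags
        self.words = []        # list of (word, [tag, ...]) pairs

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:
            self.open_inline.append(tag)

    def handle_endtag(self, tag):
        if tag in INLINE_TAGS and tag in self.open_inline:
            # Drop the innermost open tag with this name.
            idx = len(self.open_inline) - 1 - self.open_inline[::-1].index(tag)
            del self.open_inline[idx]

    def handle_data(self, data):
        for word in data.split():
            self.words.append((word, list(self.open_inline)))


def distribute(html):
    """Return (word, inline_tags) pairs for an HTML fragment."""
    parser = InlineTagDistributor()
    parser.feed(html)
    parser.close()
    return parser.words


for word, tags in distribute("<p><b><i>My sister lives</i> <u>in Wales</u></b></p>"):
    print(word, "".join(f"<{t}>" for t in tags))
```

Non-inline tags such as `<p>` are simply skipped here; in the proposal they would remain regular superblanks.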

What we need to support something like this in Apertium:

Each deformatter needs a list of which tags need the inline treatment
Deformatters have to turn <p><b><i>foo</i> bar</b></p> into something like [<p>][{<b><i>}]foo[] [{<b>}]bar[</p>]
- As it is, {} is escaped in regular superblanks, so an unescaped {} inside [] would have this special inline-blank meaning.
- To avoid ambiguity with multiwords and inconditionals, an inline-blank is closed by the nearest following (possibly empty) []
- Also, reformatters need to close the tags again, turning [<p>][{<b><i>}]foo[] [{<b>}]bar[</p>] into <p><b><i>foo</i></b> <b>bar</b></p>
Pretransfer will have to distribute the tags when splitting, so [{<i>}]^foo<vblex>+bar<prn># fie$[] turns into [{<i>}]^foo# fie<vblex>$ [{<i>}]^bar<prn>$[]
Transfer modules have to treat the inline-blanks differently from other superblanks
- All regular superblanks are output before the rule-output
 - This means they cannot be reordered or deleted, solving the t2x/chunk-reordering issue mentioned above. This also deals with the issue mentioned by Sergio that transfer rule writers forget to output b elements, or output them in the wrong order.
- Regular unmarked blanks, freeblanks (spaces etc. that are not inside []), which are immediately before a word are output whenever there's a <b/> in the rule; each such blank is output exactly once. If they're not used up by b-elements in the rule, the remaining freeblanks are output after the rule output. Thus we output all and only those unanalysed chars that were in the input, and in the same order.
 - The pos="N" in <b pos="N"/> no longer has any significance and is ignored.
- Inline format blanks are output before each <lu/>. Since they may be reordered, split or duplicated in the output, we look at the clips to find out which formatblank goes before which lu. So if e.g. the rule is about to output <clip pos="1" part="lemh"/>, we might use the inline-blank (if any) from the first word. We use a prioritised list, so prefer part="lemh" over part="tags", and the rule writer may override this by giving a position in the lu-attribute blankfrom.
 - Note that if a word is deleted, we should be fine; removing an inline blank will not mess up HTML etc.
The reformatter also needs to know where to close the tags. We can't just close the tags on whitespace, since we can have inconditionals and multiwords. So we need the end of the pipeline to be something like foo [{<b>}]bar[]! so we can get foo <b>bar</b>! instead of foo <b>bar!</b> (or vice versa). Similarly, tokenise-as-you-analyse needs to know how to distribute inline blanks on the correct tokens: <b>foo</b>? vs <b>foo?</b> (if lt-proc sees just the opening tag, it doesn't know if the "?" should have a preceding [{<b>}] or not); see #Closing inline-blanks for more examples.
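
The clip-based choice of inline blank described in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: `pick_inline_blank`, `PART_PRIORITY` and the data shapes (clips as `(pos, part)` pairs, inline blanks as a dict keyed by position) are invented for this example and do not reflect the actual apertium-transfer internals. Only the preference for part="lemh" over part="tags" and the blankfrom override come from the text.

```python
# Assumed priority order; the text only states that "lemh" beats "tags".
PART_PRIORITY = ["lemh", "tags"]


def part_rank(part):
    """Lower rank means higher priority; unknown parts come last."""
    if part in PART_PRIORITY:
        return PART_PRIORITY.index(part)
    return len(PART_PRIORITY)


def pick_inline_blank(clips, inline_blanks, blankfrom=None):
    """Choose the inline blank to output before one output lu.

    clips         -- (pos, part) pairs for the clips used inside the lu
    inline_blanks -- dict mapping input position to its inline blank
    blankfrom     -- optional position given explicitly by the rule writer
    """
    if blankfrom is not None:
        pos = blankfrom
    elif clips:
        # min() returns the first clip with the best (lowest) rank.
        pos = min(clips, key=lambda clip: part_rank(clip[1]))[0]
    else:
        return ""
    # A word without an inline blank simply contributes nothing.
    return inline_blanks.get(pos, "")


blanks = {1: "[{<b>}]", 2: "[{<i>}]"}
print(pick_inline_blank([(2, "tags"), (1, "lemh")], blanks))
print(pick_inline_blank([(2, "tags"), (1, "lemh")], blanks, blankfrom=2))
```

Because the blank is looked up per output lu, the same inline blank may be emitted more than once when a word is split or duplicated, which matches the observation above that duplicated inline tags are harmless.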

Closing inline-blanks

Consider the HTML input <i>de jour?</i>. This may be tokenised by lt-proc as "de jour" and "?" (with apertium-deshtml|lt-proc we get [<i>]^de jour/de jour<adv>$^?/?<sent>$[<\/i>]), but our deformatter doesn't (and can't) know where the token borders are.

If we turn this into [{<i>}]de [{<i>}]jour? then lt-proc will fail to notice the multiword expression (since there's more than just a simple space in between); it'll basically break all formatted multiwords.

But what's worse is the "?": if we just "split on spaces" and turn <i>foo?</i> Bar into [{<i>}]foo? Bar, how does lt-proc know that it's supposed to output [{<i>}]^foo/foo<ij>$[{<i>}]^?/?<sent>$ ^Bar/Bar<n>$ and not [{<i>}]^foo/foo<ij>$^?/?<sent>$ ^Bar/Bar<n>$? (How does it know that the ? is also in italics?)

This is also an issue in generation: if the end of the pipeline is foo [{<b>}]bar! we don't know if that's foo <b>bar</b>! or foo <b>bar!</b>.

The simplest solution is that we treat inline blanks as unclosed until the next [. So when we reach the end-tag </i>, we have to start on a superblank, even if the next character is a non-blank (if it is, we immediately close the superblank). So

<i>de jour?</i> Yes →deform→ [{<i>}]de jour?[] Yes →translate→ <i>dagens?</i> Ja
<i>de jour</i>? Yes →deform→ [{<i>}]de jour[]? Yes →translate→ <i>dagens</i>? Ja
<i>de</i> jour? Yes →deform→ [{<i>}]de[] jour? Yes →translate→ <i>av</i> dagen? Ja
<i>de jour?</i><div>Yes</div> →deform→ [{<i>}]de jour?[<div>]Yes[</div>] →translate→ <i>dagens?</i><div>Ja</div>

(Originally discussed at https://github.com/junaidiiith/Apertium_Code/issues/7 )

However, this should only matter when we're at the "edges" of the pipeline. When a word is tokenised, we know that an inline blank stops having effect on seeing the $. So e.g. pretransfer and transfer don't have to put a [] after every single token.

The rule: inline blanks have effect until the next superblank or the $ of a token.
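
At the output edge of the pipeline, where no `$` token boundaries remain, the rule reduces to "an inline blank is closed by the next superblank". A minimal Python sketch of a reformatter applying that, assuming no escaped brackets inside blanks (the names `reformat` and `closing_tags` are hypothetical and this is not the actual apertium-rehtml code):

```python
import re

# Simplifying assumption: blanks contain no escaped "[" or "]".
SUPERBLANK = re.compile(r"\[([^\]]*)\]")
TAG_NAME = re.compile(r"<\s*([A-Za-z][A-Za-z0-9]*)")


def closing_tags(open_tags):
    """Build closing tags, innermost first, for a run of opening tags."""
    names = TAG_NAME.findall(open_tags)
    return "".join(f"</{name}>" for name in reversed(names))


def reformat(stream):
    """Turn deformatted text back into markup.

    An inline blank [{...}] stays open until the next superblank,
    which may be empty ([]), or until the end of the input.
    """
    out = []
    pending_close = ""  # closing tags owed for the open inline blank
    last = 0
    for match in SUPERBLANK.finditer(stream):
        out.append(stream[last:match.start()])
        out.append(pending_close)  # any superblank closes the inline blank
        pending_close = ""
        content = match.group(1)
        if content.startswith("{") and content.endswith("}"):
            open_tags = content[1:-1]
            out.append(open_tags)
            pending_close = closing_tags(open_tags)
        else:
            out.append(content)  # regular superblank: emit verbatim
        last = match.end()
    out.append(stream[last:])
    out.append(pending_close)
    return "".join(out)


print(reformat("[{<i>}]de jour?[] Yes"))
print(reformat("[<p>][{<b><i>}]foo[] [{<b>}]bar[</p>]"))
```

Inside the pipeline the `$` of a token would also end the inline blank; that half of the rule is not modelled here.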

Implementation(s)

https://github.com/junaidiiith/apertium/tree/blank-handling GSoC 2016 project
https://github.com/unhammer/apertium/tree/blank-handling older, unfinished implementation of the changes required in apertium-transfer, with notes at https://github.com/unhammer/apertium/blob/blank-handling/blank_notes.org#consequences-of-this-type-of-blank-handling

@@ Line 27: / Line 27: @@
-A more serious problem, noted by [[User:Mlforcada]] in this [http://permalink.gmane.org/gmane.comp.nlp.apertium/3916 discussion], is that tags that are in a valid order in t1x can still be moved around inside chunks in t2x, e.g.
+A more serious problem, noted by [[User:Mlforcada]] and galaxyfeeder in this [http://permalink.gmane.org/gmane.comp.nlp.apertium/3916 discussion], is that tags that are in a valid order in t1x can still be moved around inside chunks in t2x, e.g.
 <pre>
 input:       <i>foo<i> <b>bar fum</b> fie
@@ Line 33: / Line 33: @@
 after t2x:  [<i>]^SV{^fum<adv>$[</b> ]^fie<vblex>$}$ ^SN{^foo<adj>$[<i> <b>]^bar<n>$}$
 </pre>
+The t2x rules may have completely "correct" blank handling in that they output all input superblanks in the correct order, but they have no way of looking at the blanks that are inside the chunks, so they reorder them wrongly.
+This is important to people: <blockquote cite="@tarfandy">People in the real world don't care about a tiny increase in BLEU score; they want tags handled properly! #eamt2017<br/>https://twitter.com/tarfandy/status/869195419494096897</blockquote>
 ==Possible solution==
-[[User:Tino Didriksen]]'s post at http://comments.gmane.org/gmane.comp.nlp.apertium/3921 outlines a solution:
+[[User:Tino Didriksen]]'s post at https://www.mail-archive.com/apertium-stuff@lists.sourceforge.net/msg04449.html outlines a solution (a closely-related solution, with some additional details, is described in [[User:Mlforcada]]'s entry on a [[Ideas_for_Google_Summer_of_Code/superblank_handling_algorithm]]).
-For each format, we need a list of '''inline tags'''; for HTML this would include &lt;b&gt;, &lt;i&gt; and so on.
+For each format, we need a list of '''inline/wordbound tags'''; for HTML this would include &lt;b&gt;, &lt;i&gt; and so on.
-Other tags, like &lt;p&gt; are treated as before, but inline tags stick with their words:
+Other tags, like &lt;p&gt; are treated similarly to before, but inline tags stick with their words:
 <pre>
 Given input string "<p><b><i>My sister lives</i> <u>in Wales</u></b></p>" you turn that into
@@ Line 56: / Line 60: @@
 # Each deformatter needs a list of which tags need the inline treatment
-# Deformatters have to turn <code><nowiki><p><b><i>foo</i> bar</b></p></nowiki></code> into something like <code><nowiki>[<p>][{<b><i>}]foo [{<b>}]bar[</p>]</nowiki></code>
+# Deformatters have to turn <code><nowiki><p><b><i>foo</i> bar</b></p></nowiki></code> into something like <code><nowiki>[<p>][{<b><i>}]foo[] [{<b>}]bar[</p>]</nowiki></code>
-#* Can it be as simple as <nowiki>[{}]</nowiki> or does it have to be more complicated? As it is, {} is escaped in regular superblanks, so an unescaped {} inside [] would have special meaning.
+#* As it is, {} is escaped in regular superblanks, so an unescaped {} inside [] would have this special inline-blank meaning.
+#* To avoid ambiguity with multiwords and inconditionals, an inline-blank is closed by the nearest following (possibly empty) <code><nowiki>[]</nowiki></code>
-#* Also, reformatters need to distribute the tags again; preferably merging consecutive tags, although that's probably not too important.
-# Pretransfer will have to distribute the tags as well, so <code><nowiki>[{<i>}]^foo<vblex>+bar<prn># fie$</nowiki></code> turns into <code><nowiki>[{<i>}]^foo# fie<vblex>$ [{<i>}]^bar<prn>$</nowiki></code>
+#* Also, reformatters need to close the tags again, turning <code><nowiki>[<p>][{<b><i>}]foo[] [{<b>}]bar[</p>]</nowiki></code> into <code><nowiki><p><b><i>foo</i></b> <b>bar</b></p></nowiki></code>
+# Pretransfer will have to distribute the tags when splitting, so <code><nowiki>[{<i>}]^foo<vblex>+bar<prn># fie$[]</nowiki></code> turns into <code><nowiki>[{<i>}]^foo# fie<vblex>$ [{<i>}]^bar<prn>$[]</nowiki></code>
 # Transfer modules have to treat the inline-blanks differently from other superblanks
+#* All '''regular superblanks''' are output before the rule-output
-#* They should ''not'' be in the <nowiki><b pos="N"/></nowiki> elements, but probably be part of the <nowiki><clip></nowiki>
+#** This means they cannot be reordered or deleted, solving the t2x/chunk-reordering issue mentioned above.  This also deals with the [https://sourceforge.net/p/apertium/mailman/apertium-stuff/thread/20cf28cd0904300204v45f35e51i118f4d146f83748@mail.gmail.com/ issue mentioned by Sergio] that transfer rule writers forget to output b elements, or output them in the wrong order.
-#** For example: <nowiki><clip pos="2" part="blank"/></nowiki> where "blank" is a special part (similar to lemh/lemq/whole/tags) and using "blank" as a def-attr leads to a compile-time error.
+#* Regular unmarked blanks, '''freeblanks''' (spaces, etc that are not inside <code><nowiki>[]</nowiki></code>) which are immediately before a word are output whenever there's a <code><nowiki><b/></nowiki></code> in the rule; each such blank is output exactly once. If they're not used up by b-elements in the rule, the remaining freeblanks are output after the rule output. Thus we output all and only those unanalysed chars that were in the input, and in the same order.
+#** The pos="N" in <code><nowiki><b pos="N"/></nowiki></code> no longer has any significance and is ignored.
+#* Inline '''format blanks''' are output before each <code><nowiki><lu/></nowiki></code>. Since they may be reordered, split or duplicated in the output, we look at the clips to find out which formatblank goes before which lu. So if e.g. the rule is about to output <code><nowiki><clip pos="1" part="lemh"/></nowiki></code>, we might use the inline-blank (if any) from the first word, We use a prioritised list, so prefer part="lemh" over part="tags", and the rule writer may override this by giving a position in the lu-attribute blankfrom.
 #** Note that if a word is deleted, we should be fine; removing an inline blank will not mess up HTML etc.
+# The reformatter also needs to know where to close the tags. We can't just close the tags on whitespace, since we can have inconditionals and multiwords. So we need the end of the pipeline to be something like <code><nowiki>foo [{<b>}]bar[]!</nowiki></code> so we can get <code><nowiki>foo <b>bar</b>!</nowiki></code> instead of <code><nowiki>foo <b>bar!</b></nowiki></code> (or vice versa) – similarly, for the tokenise-as-you-analyse to know how to distribute inline blanks on the correct tokens: <code><nowiki><b>foo</b>?</nowiki></code> vs <code><nowiki><b>foo?</b></nowiki></code> (if lt-proc sees just the opening tag, it doesn't know if the "!" should have a preceding <code><nowiki>[{<b>}]</nowiki></code> or not); see [[#Closing inline-blanks]] for more examples.
-#** Note also that a one-pattern rule will have zero superblanks, but one inline-blank. A two-pattern rule will have one superblank and two inline-blanks.
+===Closing inline-blanks===
+Consider the html input <code><nowiki><i>de jour?</i></nowiki></code>. This may be tokenised by lt-proc as "de jour" and "?" (with the apertium-deshtml|lt-proc we get <code><nowiki>[<i>]^de jour/de jour<adv>$^?/?<sent>$[<\/i>]</nowiki></code>, but our deformatter doesn't (and can't) know where the token borders are.
+If we turn this into <code><nowiki>[{<i>}]de [{<i>}]jour?</nowiki></code> then lt-proc will fail to notice the multiword expression (since there's more than just a simple space in between); it'll basically break all formatted multiwords.
+But what's worse is the "?" – if we just "split on spaces" and turn <code><nowiki><i>foo?</i> Bar</nowiki></code> into  <code><nowiki>[{<i>}]foo? Bar</nowiki></code> , how does lt-proc know that it's supposed to output <code><nowiki>[{<i>}]^foo/foo<ij>$[{<i>}]^?/?<sent>$ ^Bar/Bar<n>$</nowiki></code> and not <code><nowiki>[{<i>}]^foo/foo<ij>$^?/?<sent>$ ^Bar/Bar<n>$</nowiki></code>? (How does it know that the <code><nowiki>?</nowiki></code> also is in italics?).
+This is also an issue in generation – if the end of the pipeline is <code><nowiki>foo [{<b>}]bar!</nowiki></code> we don't know if that's <code><nowiki>foo <b>bar</b>!</nowiki></code> or <code><nowiki>foo <b>bar!</b></nowiki></code>.
+The simplest solution is that we treat inline blanks as unclosed until the next <code><nowiki>[</nowiki></code>. So when we reach the end-tag <code><nowiki></i></nowiki></code>, we have to we have to start on a superblank, even if the next character is a non-blank (if it is, we immediately close the superblank). So
+* <code><nowiki><i>de jour?</i> Yes</nowiki></code> →deform→ <code><nowiki>[{<i>}]de jour?[] Yes</nowiki></code> →translate→ <code><nowiki><i>dagens?</i> Ja</nowiki></code>
+* <code><nowiki><i>de jour</i>? Yes</nowiki></code> →deform→ <code><nowiki>[{<i>}]de jour[]? Yes</nowiki></code> →translate→ <code><nowiki><i>dagens</i>? Ja</nowiki></code>
+* <code><nowiki><i>de</i> jour? Yes</nowiki></code> →deform→ <code><nowiki>[{<i>}]de[] jour? Yes</nowiki></code> →translate→ <code><nowiki><i>av</i> dagen? Ja</nowiki></code>
+* <code><nowiki><i>de jour?</i><div>Yes</div></nowiki></code> →deform→ <code><nowiki>[{<i>}]de jour?[<div>]Yes[</div>]</nowiki></code> →translate→ <code><nowiki><i>dagens?</i><div>Ja</div></nowiki></code>
+(Originally discussed at https://github.com/junaidiiith/Apertium_Code/issues/7 )
+However, this should only matter when we're at the "edges" of the pipeline. When a word is tokenised, we know that an inline blank stops having effect on seeing the <code><nowiki>$</nowiki></code>. So e.g. pretransfer and transfer don't have to put a [] after every single token.
+The rule: inline blanks have effect until the next superblank or the <code><nowiki>$</nowiki></code> of a token.
+==Implementation(s)==
+* https://github.com/junaidiiith/apertium/tree/blank-handling GsoC2016 project
+* https://github.com/unhammer/apertium/tree/blank-handling older, unfinished implementation of the changes required in apertium-transfer, with notes at https://github.com/unhammer/apertium/blob/blank-handling/blank_notes.org#consequences-of-this-type-of-blank-handling
+==See also==
-==Ensuring transfer rules output all regular superblanks==
+* [[Ideas_for_Google_Summer_of_Code/Automatic_blank_handling]]
-A separate, but related problem is that transfer rules some times forget to include all (regular) superblanks from the input. This can of course mess up HTML, and it is frustrating that the developer has to ensure all rules have the right number of <code><nowiki><b pos="N"/></nowiki></code>, e.g. for a three-lu pattern we need to output both <code><nowiki><b pos="1"/></nowiki></code> and <code><nowiki><b pos="2"/></nowiki></code>.
+* [[Format handling]]
+* https://www.mediawiki.org/wiki/Content_translation/Markup#Annotation_mapping_using_translation_subsequence_approximation how mediawiki bravely works around Apertium's limitations
+** https://github.com/wikimedia/mediawiki-services-cxserver/tree/master/mt code
-This could be done mechanically by transfer at runtime instead of by the rule writer. Any rule will match a certain number of lu's, with one (super)blank between each lu (currently available in the b elements), and the action part will output a certain number of lu's.
-* For a 1-pattern rule, there can be no superblanks between patterns, so there are no superblanks to output. This is the simple case.
-* For a 2-pattern rule, there is exactly one superblank between patterns. Now we have to run the rule, and look at the output before printing it.
-** If output contains zero or one chunks, put the superblank after the output.
-** If output contains two or more chunks, put the superblank after the first chunk.
-* Generalising this, look at the output, and interleave chunks and superblanks, that is:
-** Read the first chunk, print that chunk, print the first superblank
-** Read the second chunk, print that chunk, print the second superblank
-** Etc. until all chunks are read, print remaining superblanks.
+[[Category:Documentation]]
-This can be made backwards compatible with existing rule files, by simply ignoring any existing &lt;b&gt; elements that have the pos attribute.
+[[Category:Formats]]
+[[Category:Documentation in English]]
