Ideas for Google Summer of Code/superblank handling algorithm

From Apertium
Jump to navigation Jump to search

Here is an attempt to formalize a bit an alternative to the actual superblank handling strategies used in Apertium, to try to solve the problem of Reordering superblanks, and the strategy is basically equivalent to User:Tino Didriksen's approach, which in turn is related to the approach in Gramtrans and also the one used by Wikipedia Content Translation. Unlike in User:Tino Didriksen's, the method here does not distinguish block tags from inline tags, and avoids outputting closing tags that will be opened again, but this may need to be checked

This will be illustrated with an example using XML-style tags:

  • Before any reordering of words takes place, tags are handled with a stack; tags between words are removed and the contents of the tag stack is associated to each words.
  • After all reordering has occurred, tag stacks for consecutive words are compared to decide what to output.

Stealing (and slightly modifying) User:Tino Didriksen's example:

<p><b><i>my sister</i><br/>lives</b> <u>in Wales</u></p>

the process before the reordering, step by step, would be:

Input Stack after it Output after it Comment
(start) 0 Number IDs in stacks may be pointers to where the actual tags are stored; the name of the tag is included for clarity
<p> 0 p 1 A new ID is generated whenever a new element is stacked
<b> 0 p 1 b 2
<i> 0 p 1 b 2 i 3
my 0 p 1 b 2 i 3 0 p 1 b 2 i 3 my the current stack is attached to the word as output
sister 0 p 1 b 2 i 3 0 p 1 b 2 i 3 sister
</i> 0 p 1 b 2 no output, and </i> pops (i 3) from the stack
<br/> 0 p 1 b 2 br 4# now, self-closing tags generate "ephemerous states" that will have special processing
lives 0 p 1 b 2 0 p 1 b 2 br #4 lives the ephemerous state is attached to the word but removed
</b> 0 p 1 no output, and </b> pops (b 2) from the stack
<u> 0 p 1 u 5 push <u> and generate a new id
in 0 p 1 u 5 0 p 1 u 5 in
Wales 0 p 1 u 5 0 p 1 u 5 Wales
</u> 0 p 1 no output, and </u> pops (u 5) from the stack
</p> 0 no output, and </u> pops (p 1) from the stack
(end) processing finished

Then, after reordering (for instance, into a Turkic-style language to generate sister my Wales in lives, the output would be processed as follows:

Input Intermediate storage Output Comment
(start) 0 Start with an empty stack
0 p 1 b 2 i 3 sister 0 p 1 b 2 i 3 <p><b><i>sister The stack before the word (0) and after the word (0 p 1 b 2 i 3) are compared, the longest common prefix (LCP), in this case 0, is computed, the suffix from the original stack (nothing in this case) is stripped, and its closing tags (none in this case) are output, and the opening or empty tags in the suffix from the new stack (here, p 1 b 2 i 3) are output. Then the word is written.
0 p 1 b 2 i 3 my 0 p 1 b 2 i 3 my The stack before the word (0 p 1 b 2 i 3) and after the word (0 p 1 b 2 i 3) are compared, the longest common prefix (LCP), in this case 0 p 1 b 2 i 3, is computed, the suffix from the original stack (nothing in this case) is stripped, and its closing tags (none in this case) are output, and the opening or empty tags in the suffix from the new stack (nothing in this case) are output. Then the word is written.
0 p 1 b 2 br #4 lives 0 p 1 b 2 i 3
lives
The stack before the word (0 p 1 b 2 i 3) and after the word (0 p 1 b 2 br #4) are compared, the longest common prefix (LCP), in this case 0 p 1 b 2, is computed, the suffix from the original stack ( i 3 ) is stripped, and its closing tags (</i> in this case) are output, and the opening or empty tags in the suffix from the new stack (br #4) are output. Then the word is written.
0 p 1 u 5 Wales 0 p 1 b 2 Wales The stack before the word (0 p 1 b 2) and after the word (0 p 1 u 5) are compared, the longest common prefix (LCP), in this case 0 p 1, is computed, the suffix from the original stack (b 2) is stripped, and its closing tags (</b> in this case) are output, and the opening or empty tags in the suffix from the new stack (u 5) are output. Then the word is written.
0 p 1 u 5 in 0 p 1 u 5 in The stack before the word (0 p 1 u 5) and after the word (0 p 1 u 5) are compared, the longest common prefix (LCP), in this case 0 p 1 u 5, is computed, the suffix from the original stack (nothing) is stripped, and its closing tags (nothing in this case) are output, and the opening or empty tags in the suffix from the new stack (nothing) are output. Then the word is written.
(end) 0 </u></p> All remaining closing tags are output at the end