Ideas for Google Summer of Code/superblank handling algorithm
Here is an attempt to formalize a bit an alternative to the actual superblank handling strategies used in Apertium, to try to solve the problem of Reordering superblanks, and the strategy is basically equivalent to User:Tino Didriksen's approach, which in turn is related to the approach in Gramtrans and also the one used by Wikipedia Content Translation. Unlike in User:Tino Didriksen's, the method here does not distinguish block tags from inline tags, and avoids outputting closing tags that will be opened again, but this may need to be checked
This will be illustrated with an example using XML-style tags:
- Before any reordering of words takes place, tags are handled with a stack; tags between words are removed and the contents of the tag stack is associated to each words.
- After all reordering has occurred, tag stacks for consecutive words are compared to decide what to output.
Stealing (and slightly modifying) User:Tino Didriksen's example:
<p><b><i>my sister</i><br/>lives</b> <u>in Wales</u></p>
the process before the reordering, step by step, would be:
Input | Stack after it | Output after it | Comment |
---|---|---|---|
(start) | 0 | Number IDs in stacks may be pointers to where the actual tags are stored; the name of the tag is included for clarity | |
<p> | 0 p 1 | A new ID is generated whenever a new element is stacked | |
<b> | 0 p 1 b 2 | ||
<i> | 0 p 1 b 2 i 3 | ||
my | 0 p 1 b 2 i 3 | 0 p 1 b 2 i 3 my | the current stack is attached to the word as output |
sister | 0 p 1 b 2 i 3 | 0 p 1 b 2 i 3 sister | |
</i> | 0 p 1 b 2 | no output, and </i> pops (i 3) from the stack | |
<br/> | 0 p 1 b 2 br 4# | now, self-closing tags generate "ephemerous states" that will have special processing | |
lives | 0 p 1 b 2 | 0 p 1 b 2 br #4 lives | the ephemerous state is attached to the word but removed |
</b> | 0 p 1 | no output, and </b> pops (b 2) from the stack | |
<u> | 0 p 1 u 4 | push <u> and generate a new id | |
in | 0 p 1 u 4 | 0 p 1 u 4 in | |
Wales | 0 p 1 u 4 | 0 p 1 u 4 Wales | |
</u> | 0 p 1 | no output, and </u> pops (u 4) from the stack | |
</p> | 0 | no output, and </u> pops (p 4) from the stack | |
(end) | processing finished |
Then, after reordering (for instance, into a Turkic-style language to generate sister my Wales in lives, the output would be processed as follows:
Input | Intermediate storage | Output | Comment |
---|---|---|---|
(start) | 0 | <nowiki>
|
Start with an empty stack |