Ideas for Google Summer of Code/superblank handling algorithm
Here is an attempt to formalize a bit an alternative to the actual superblank handling strategies used in Apertium, to try to solve the problem of Reordering superblanks, and the strategy is basically equivalent to User:Tino Didriksen's approach, which in turn is related to the approach in Gramtrans and also the one used by Wikipedia Content Translation. Unlike in User:Tino Didriksen's, the method here does not distinguish block tags from inline tags, and avoids outputting closing tags that will be opened again, but this may need to be checked
This will be illustrated with an example using XML-style tags:
- Before any reordering of words takes place, tags are handled with a stack; tags between words are removed and the contents of the tag stack is associated to each words.
- After all reordering has occurred, tag stacks for consecutive words are compared to decide what to output.
Stealing (and slightly modifying) User:Tino Didriksen's example:
<p><b><i>my sister</i><br/>lives</b> <u>in Wales</u></p>
the process before the reordering, step by step, would be:
Input | Stack after it | Output after it | Comment |
---|---|---|---|
(start) | 0 | Number IDs in stacks may be pointers to where the actual tags are stored; the name of the tag is included for clarity | |
<p> | 0 p 1 | A new ID is generated whenever a new element is stacked | |
<b> | 0 p 1 b 2 | ||
<i> | 0 p 1 b 2 i 3 | ||
my | 0 p 1 b 2 i 3 | 0 p 1 b 2 i 3 my | the current stack is attached to the word as output |
sister | 0 p 1 b 2 i 3 | 0 p 1 b 2 i 3 sister | |
</i> | 0 p 1 b 2 | no output, and </i> pops (i 3) from the stack | |
<br/> | 0 p 1 b 2 br 4# | now, self-closing tags generate "ephemerous states" that will have special processing | |
lives | 0 p 1 b 2 | 0 p 1 b 2 br #4 lives | the ephemerous state is attached to the word but removed |
</b> | 0 p 1 | no output, and </b> pops (b 2) from the stack | |
<u> | 0 p 1 u 5 | push <u> and generate a new id | |
in | 0 p 1 u 5 | 0 p 1 u 5 in | |
Wales | 0 p 1 u 5 | 0 p 1 u 5 Wales | |
</u> | 0 p 1 | no output, and </u> pops (u 5) from the stack | |
</p> | 0 | no output, and </u> pops (p 1) from the stack | |
(end) | processing finished |
Then, after reordering (for instance, into a Turkic-style language) to generate sister my Wales in lives, the output would be processed as follows:
Input | Intermediate storage | Output | Comment |
---|---|---|---|
(start) | 0 | Start with an empty stack | |
0 p 1 b 2 i 3 sister | 0 p 1 b 2 i 3 | <p><b><i>sister | The stack before the word (0) and after the word (0 p 1 b 2 i 3) are compared, the longest common prefix (LCP), in this case 0, is computed, the suffix from the original stack (nothing in this case) is stripped, and its closing tags (none in this case) are output, and the opening or empty tags in the suffix from the new stack (here, p 1 b 2 i 3) are output. Then the word is written. |
0 p 1 b 2 i 3 my | 0 p 1 b 2 i 3 | my | The stack before the word (0 p 1 b 2 i 3) and after the word (0 p 1 b 2 i 3) are compared, the LCP 0 p 1 b 2 i 3, is computed, the suffix from the original stack (nothing in this case) is stripped, and its closing tags (none in this case) are output, and the opening or empty tags in the suffix from the new stack (nothing in this case) are output. Then the word is written. |
0 p 1 b 2 br #4 lives | 0 p 1 b 2 i 3 | lives |
The stack before the word (0 p 1 b 2 i 3) and after the word (0 p 1 b 2 br #4) are compared, the LCP 0 p 1 b 2, is computed, the suffix from the original stack ( i 3 ) is stripped, and its closing tags (</i> in this case) are output, and the opening or empty tags in the suffix from the new stack (br #4) are output. Then the word is written. |
0 p 1 u 5 Wales | 0 p 1 b 2 | </b><u>Wales | The stack before the word (0 p 1 b 2) and after the word (0 p 1 u 5) are compared, the LCP 0 p 1, is computed, the suffix from the original stack (b 2) is stripped, and its closing tags (</b> in this case) are output, and the opening or empty tags in the suffix from the new stack (u 5) are output. Then the word is written. |
0 p 1 u 5 in | 0 p 1 u 5 | in | The stack before the word (0 p 1 u 5) and after the word (0 p 1 u 5) are compared, LCP 0 p 1 u 5, is computed, the suffix from the original stack (nothing) is stripped, and its closing tags (nothing in this case) are output, and the opening or empty tags in the suffix from the new stack (nothing) are output. Then the word is written. |
(end) | 0 | </u></p> | All remaining closing tags are output at the end |
The result is close to what one could desire
<p><b><i>sister my </i></br>lives </b><u>Wales in </u></p>
Comments and issues:
- Attaching tag stacks to words may necessitate a change in the lexical form formats that the structural transfer modules deal with.
- There would apparently be no need to separate block from inline tags, but this needs to be checked.
- Attributes containing text to be translated (for instance the alt="..." attribute of an img tag) would have to be dealt with using some hack (turning them into fake element content?)
- If an element has an attribute of type ID, it may be split in two elements with the same ID. There could be some kind of cache keeping track of IDs to avoid duplicating them.
- What happens if new words are added by a structural transfer rule? What should they be equipped with?
- Here's what I assumed when I looked at it last: if a rule says <code><out>…<lu>…<clip pos…"1" part…"lemh"/></code> then we output that wordblank (i/b/etc., not a div, not an unanalysed "space") in front of the <code><lu>. But if the clips are just used in concat with no lu, or the output is only lit or lit-tag, we can’t know where to output the wordblank. In those cases, the transfer rule writer has to actually use an attribute hint <code><lu blankfrom…"1">. From my old notes at https://github.com/unhammer/apertium/blob/3f80f35503656f70db2386c0b5316d502282a689/blank_notes.org --unhammer (talk) 09:49, 15 February 2016 (CET)
- What happens if words are deleted by a structural transfer rule?
- Not a problem if you ensure they only have inline tags (what I called wordblanks). So structural tags (phrasebound superblanks) are always output outside of rules, and unbound blanks (unanalysed spaces or characters like "€" which might not be in the analyser alphabet) have to be stored separately from the words and apertium-transfer just has to ensure any remaining non-space such are output after the rule if they weren't output by a <code><b/></code> instruction. --unhammer (talk) 09:53, 15 February 2016 (CET)
- If a new id is generated each time a tag is pushed to the stack, then some </tag><tag> sequences could occur due to different IDs.
- Documents may have very deep structure: User:Unhammer gives the example of http://www.vg.no which contains hundreds of div nestings. Most of these are block nestings, which User:Tino Didriksen treats as hard boundaries (like sentence boundaries), and no reordering crosses them. A catalog of these tags would be necessary. User:Tino Didriksen says: "I have several lists. One for inline tags, one for protected tags (where nothing inside may be touched, even if they are inline, e.g. <code>), one for which attributes should be considered textual (e.g. title), and some others that are relevant for other formats. The quirk with e.g. <code> is that's sometimes an inline tag, but the whole contents must be passed through untouched, so it needs to be anchored on an adjacent token and just let through."