User:Khannatanmai/GSoC2020 Final Report
This is the final report for the GSoC 2020 project initially titled "Modifying the apertium stream format and eliminating dictionary trimming", which ultimately became "Modifying the apertium stream format to introduce wordbound blanks and solve the markup reordering problem".
The problem
Format/markup handling has been a problem in Apertium and is described in detail here. Formatting used to be handled with superblanks, which protect markup information during the translation process. This works well to protect the formatting of the document, but during translation words and phrases move around, get deleted, split, merge, etc., and the markup on those words needs to go through these changes with them. Otherwise the output ends up with erroneous markup, which is what happened in Apertium:
$ echo '<i>Perro</i> <b>blanco</b>' | apertium spa-eng -f html
<i>White</i> <b>dog</b>
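The failure above can be reproduced in a toy model. The sketch below is not Apertium code: the two-word lexicon, the reordering list, and both function names are hypothetical and only illustrate why markup that stays in place goes wrong when words are reordered, while markup that travels with each word does not.

```python
# Toy illustration only; none of this is part of Apertium.
SOURCE = [("Perro", "i"), ("blanco", "b")]          # (word, markup tag) pairs
LEXICON = {"Perro": "dog", "blanco": "White"}       # hypothetical two-word lexicon
ORDER = [1, 0]                                      # transfer moves the adjective first


def positional_markup(source, order):
    """Markup stays at its original position; only the words are reordered."""
    words = [LEXICON[source[i][0]] for i in order]
    tags = [tag for _, tag in source]
    return " ".join(f"<{t}>{w}</{t}>" for w, t in zip(words, tags))


def word_bound_markup(source, order):
    """Markup is carried along with the word it was attached to."""
    parts = []
    for i in order:
        word, tag = source[i]
        parts.append(f"<{tag}>{LEXICON[word]}</{tag}>")
    return " ".join(parts)


print(positional_markup(SOURCE, ORDER))   # <i>White</i> <b>dog</b>
print(word_bound_markup(SOURCE, ORDER))   # <b>White</b> <i>dog</i>
```

The first line printed matches the erroneous output shown above; the second is what the markup should look like after reordering.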
The solution
A new kind of blank was proposed: the wordbound blank. This blank contains any information that needs to stay attached to a word/phrase during the translation process. At this point that information is markup, but wordbound blanks effectively introduce alignments into Apertium, so there are several other use cases for the information they carry.
[[...]]^...$
Wordbound blanks are identified by double square bracket delimiters and always appear immediately before the Lexical Unit to which they are attached, as in the pattern above.
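As a rough illustration of the format, the following sketch splits a stream into (wordbound blank, lexical unit) pairs. It is a simplified reading of the `[[...]]^...$` pattern, not the parser used by any Apertium module: the regular expression and the name `parse_units` are hypothetical, and it assumes no escaped brackets, carets, or dollar signs occur in the stream.

```python
import re

# Simplified pattern: an optional [[...]] blank directly followed by a ^...$ unit.
# Assumption: no escaped "[", "]", "^" or "$" characters appear in the stream.
WBLANK_LU = re.compile(r"(?:\[\[(?P<blank>[^\[\]]*)\]\])?\^(?P<lu>[^$]*)\$")


def parse_units(stream):
    """Return (wordbound_blank_or_None, lexical_unit) pairs in stream order."""
    return [(m.group("blank"), m.group("lu")) for m in WBLANK_LU.finditer(stream)]


stream = "[[t:b:ls9z2Q]]^White<adj><sint>$ [[t:i:4_tPUA]]^dog<n><sg>$ ^.<sent>$"
for blank, lu in parse_units(stream):
    print(blank, lu)
```

A lexical unit with no wordbound blank in front of it is returned with `None` in the first position.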
Method
A new deformatter and reformatter were written that convert markup into either superblanks or wordbound blanks, and each module in the entire pipeline was modified to work with wordbound blanks.
When wordbound blanks are on text that has not yet been tokenised into LUs, a closing wordbound blank [[/]] limits their span. After the input is tokenised by the analyser, the program apertium-wblank-attach attaches wordbound blanks to all the LUs inside the span of that wordbound blank and removes the closing wordbound blanks from the stream. Right before the generator, the opposite happens: apertium-wblank-detach puts the closing wordbound blanks back in the stream in preparation for the reformatter.
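The attach and detach steps can be sketched in a few lines. This is a toy model of the behaviour described above, not the source of apertium-wblank-attach or apertium-wblank-detach: the tokenising regular expression and the functions `attach` and `detach` are hypothetical, escaped characters are ignored, the closing blank is assumed to be written `[[/]]` as in the pipeline snippet in this report, and `detach` closes every LU individually.

```python
import re

# Hypothetical tokeniser: closing blank, opening blank, lexical unit, or other text.
# Assumption: no escaped brackets, carets or dollar signs in the stream.
TOKEN = re.compile(
    r"(?P<close>\[\[/\]\])"
    r"|\[\[(?P<open>[^\[\]]*)\]\]"
    r"|(?P<lu>\^[^$]*\$)"
    r"|(?P<text>[^\[^]+|.)",
    re.S,
)


def attach(stream):
    """Copy the open blank onto every LU in its span and drop the closing blanks."""
    out, current = [], None
    for m in TOKEN.finditer(stream):
        if m.group("close"):
            current = None                      # span ends; closer is removed
        elif m.group("open") is not None:
            current = m.group("open")           # remember the blank for later LUs
        elif m.group("lu"):
            lu = m.group("lu")
            out.append(f"[[{current}]]{lu}" if current is not None else lu)
        else:
            out.append(m.group(0))              # plain text passes through
    return "".join(out)


def detach(stream):
    """Put a closing blank after each LU that carries a wordbound blank."""
    out, pending = [], False
    for m in TOKEN.finditer(stream):
        out.append(m.group(0))
        if m.group("open") is not None:
            pending = True
        elif m.group("lu") and pending:
            out.append("[[/]]")
            pending = False
    return "".join(out)


print(attach("[[x]]^a/a<n>$ ^b/b<n>$[[/]] ^c/c<n>$"))
print(detach("[[t:b:ls9z2Q]]^White<adj><sint>$ [[t:i:4_tPUA]]^dog<n><sg>$"))
```

In the first call the blank `x` spans two LUs, so both receive a copy of it, while the third LU lies outside the span and is left untouched.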
Here's a snippet of the process:
Input:
<i>Perro</i> <b>blanco</b>

tf-extract:
[[t:i:4_tPUA]]Perro[[/]] [[t:b:ls9z2Q]]blanco[[/]]

Analyser (lt-proc -z):
[[t:i:4_tPUA]]^Perro/Perro<n><m><sg>$[[/]] [[t:b:ls9z2Q]]^blanco/blanco<n><m><sg>/blanco<adj><m><sg>$[[/]]

apertium-wblank-attach:
[[t:i:4_tPUA]]^Perro/Perro<n><m><sg>$ [[t:b:ls9z2Q]]^blanco/blanco<n><m><sg>/blanco<adj><m><sg>$

. . .

Transfer Output:
[[t:b:ls9z2Q]]^White<adj><sint>$ [[t:i:4_tPUA]]^dog<n><sg>$

apertium-wblank-detach:
[[t:b:ls9z2Q]]^White<adj><sint>$[[/]] [[t:i:4_tPUA]]^dog<n><sg>$[[/]]

Generator (lt-proc -g):
[[t:b:ls9z2Q]]White[[/]] [[t:i:4_tPUA]]dog[[/]]

tf-inject:
<b>White</b> <i>dog</i>