Ideas for Google Summer of Code/Automatic blank handling
- See also Superblank Handling Algorithm
In progress
See http://wiki.apertium.org/wiki/User:SilentFlame/Progress for GSoC 2017 progress on this task.
A superblank is material that we don't want to translate but do want to keep in the output, typically formatting tags.
Currently there is a major problem with how formatting / superblanks interact with word/chunk reordering in Apertium. Transfer rules can reorder words, but they never look at the tags themselves, and they have no general way of reordering the tags along with the words without potentially breaking the resulting markup (e.g. putting an end tag before a start tag).
There is a solution: Split what we currently call superblanks into "inline" and "block" blanks; let the format handler treat some tags as inline (e.g. <em/>, <b/>) and others as block-level (e.g. <div>, <p>). Inline tags always stick with their words. If one inline tag covers several words, the tag is duplicated onto each of those words, and since it is then "glued to the word", a transfer rule can move that word around without worrying about blanks. Block-level tags in the input to a transfer rule are always output before the output that the transfer rule gives, so a <p> will never be moved around by a rule.
This GSoC project involves changing the whole Apertium pipeline (lttoolbox, apertium, apertium-lex-tools, and if you have time also CG-3 and HFST) to do this new automatic blank handling, in particular the deformatters and the transfer module(s). A prototype of this work was implemented in 2016, but much work remains.
Read Format handling and Apertium stream format for background, then Reordering superblanks for some more explanations of the problem and solution.
Background
Currently, when you run the deformatter, the superblanks are delimited by [], e.g. (after Installation of apertium-all-dev):
$ echo '<p>one, <i>two and</i> three</p>' |apertium-deshtml -n
[][<p>]one,[ <i>]two and[<\/i> ]three[][<\/p> ]
When this is run through the translator pipeline, only "one two and three" will be translated, and the superblanks should be treated as if they were simply spaces.
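To make "treated as if they were simply spaces" concrete, here is a minimal Python sketch (not part of Apertium; the function name `translatable_text` is made up for illustration) that replaces every superblank in the deformatter output with a space. It assumes that a backslash inside a superblank escapes the following character, as in the `<\/i>` above.

```python
import re

# A superblank is "[...]"; inside it, a backslash escapes the next character.
SUPERBLANK = re.compile(r"\[(?:\\.|[^\\\]])*\]")

def translatable_text(deformatted: str) -> str:
    """Replace each superblank with a space, then normalise whitespace."""
    spaced = SUPERBLANK.sub(" ", deformatted)
    return " ".join(spaced.split())

stream = r"[][<p>]one,[ <i>]two and[<\/i> ]three[][<\/p> ]"
print(translatable_text(stream))
```

The real pipeline keeps the superblanks in the stream and passes them through untouched; the sketch only shows which text the translator is actually asked to translate.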
As an example, here's the first step of apertium-en-es:
$ apertium-get en-es
$ cd apertium-en-es
$ echo '<p>one, <i>two and</i> three</p>' |apertium-deshtml -n|lt-proc en-es.automorf.bin
[][<p>]^one/one<num><sg>/one<prn><tn><mf><sg>$^,/,<cm>$[ <i>]^two/two<num><sp>$ ^and/and<cnjcoo>$[<\/i> ]^three/three<num><sp>$[][<\/p> ]
With this format, there is no way for the individual modules like lt-proc or apertium-transfer to separate inline from block-level tags (without inspecting the superblanks and knowing about all the tags of HTML, LaTeX etc.).
With our new deformatter, we'd like to instead have
$ echo '<p>one, <i>two and</i> three</p>' | new/apertium-deshtml -n
[][<p>]one, [{<i>}]two and[] three[][<\/p> ]
where the new [{}] style blanks are inline-blanks, which should be bound to a word, while [] is as before. In the output from deformatters (or input to reformatters), a new blank always ends the previous inline-blank, which is why we have an empty superblank [] after "and" above. Inside the stream, an inline-blank is always bound to the lexical unit, so lt-proc needs to correctly disperse inline-blanks on each lexical unit covered by the inline blank:
$ echo '<p>one, <i>two and</i> three</p>' | new/apertium-deshtml -n | new/lt-proc en-es.automorf.bin
[][<p>]^one/one<num><sg>/one<prn><tn><mf><sg>$^,/,<cm>$ [{<i>}]^two/two<num><sp>$ [{<i>}]^and/and<cnjcoo>$ ^three/three<num><sp>$[][<\/p> ]
(We can't simply split on spaces: some words include spaces and, conversely, punctuation does not require spaces.)
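The dispersal rule can be sketched in Python as a pass over a stream that has already been split into lexical units. This is only an illustration of the intended behaviour under stated assumptions, not how lt-proc is (or would be) implemented: the function name `disperse_inline_blanks` is hypothetical, the regular expressions assume backslash escaping, and the sketch assumes that an empty [] which only terminates an inline blank is dropped, as in the example output above.

```python
import re

TOKEN = re.compile(
    r"(?P<inline>\[\{(?:\\.|[^\\\]])*\}\])"   # inline blank      [{...}]
    r"|(?P<blank>\[(?:\\.|[^\\\]])*\])"       # ordinary blank    [...]
    r"|(?P<lu>\^(?:\\.|[^\\$])*\$)"           # lexical unit      ^...$
    r"|(?P<text>(?:\\.|[^\[\^\\])+)"          # anything else (spaces etc.)
)

def disperse_inline_blanks(stream: str) -> str:
    """Copy an inline blank onto every lexical unit up to the next blank."""
    out = []
    current = ""  # inline blank currently in force, "" if none
    for m in TOKEN.finditer(stream):
        kind, tok = m.lastgroup, m.group()
        if kind == "inline":
            current = tok              # emitted later, once per lexical unit
        elif kind == "blank":
            # Any blank ends the inline blank; an empty "[]" that only
            # served as the terminator is dropped (assumption).
            if not (current and tok == "[]"):
                out.append(tok)
            current = ""
        elif kind == "lu":
            out.append(current + tok)
        else:
            out.append(tok)
    return "".join(out)

analysed = r"[{<i>}]^two/two<num><sp>$ ^and/and<cnjcoo>$[] ^three/three<num><sp>$"
print(disperse_inline_blanks(analysed))
```

Working on lexical units rather than on spaces is what keeps the sketch in line with the remark above: the inline blank follows `^...$` boundaries, whatever whitespace or punctuation sits between them.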
Tasks
- Make deformatters include a list of inline tags, and disperse these to the words covered by them.
  - prototypes exist for this in https://github.com/junaidiiith/apertium / https://github.com/junaidiiith/Apertium_Code
- Make lt-proc correctly disperse inline blanks onto each lexical unit until the next [
  - Not done yet
- Make pretransfer disperse tags when splitting lexical units
- Make transfer output the non-inline blanks before the rule output
- Make transfer handle inline-blanks, and ignore <b pos="N">
  - work in progress for this and the above: https://github.com/unhammer/apertium/commit/b5c73fbe82544d83a98eb16b921c2fa224f6d40c
- Make reformat turn inline-blanks back into real tags
  - [{<i>}]foo [{<i><b>}]bar should become <i>foo</i> <i><b>bar</b></i>
  - prototypes exist for this in https://github.com/junaidiiith/apertium / https://github.com/junaidiiith/Apertium_Code
- Ensure all other modules are fine with the new format for inline blanks
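The reformatter task can be illustrated with a small Python sketch that reproduces the `[{<i>}]foo [{<i><b>}]bar` example from the list above. It is not taken from the prototypes linked there; `restore_inline_tags` is a hypothetical name, and the sketch assumes that an inline blank covers the text up to the next `[`, minus trailing whitespace. It leaves ordinary blanks untouched and does not unescape anything.

```python
import re

# An inline blank followed by the text it covers (up to the next blank).
INLINE_SPAN = re.compile(r"\[\{((?:\\.|[^\\\]])*?)\}\]([^\[]*)")
# Opening tags inside the inline blank, e.g. "<i><b>" -> ["i", "b"].
TAG_NAME = re.compile(r"<([A-Za-z][A-Za-z0-9]*)[^>]*>")

def restore_inline_tags(text: str) -> str:
    """Turn each inline blank back into real opening and closing tags."""
    def rebuild(m: re.Match) -> str:
        tags, covered = m.group(1), m.group(2)
        core = covered.rstrip()
        trailing = covered[len(core):]          # keep spacing outside the tags
        closing = "".join(f"</{name}>" for name in reversed(TAG_NAME.findall(tags)))
        return f"{tags}{core}{closing}{trailing}"
    return INLINE_SPAN.sub(rebuild, text)

print(restore_inline_tags("[{<i>}]foo [{<i><b>}]bar"))
```

As in the example in the task list, each word gets its own pair of tags; merging adjacent words that carry the same inline blank back into a single element would be a further step.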
Coding challenges
deformatting
- Make the HTML format handler apertium-deshtml turn "<i>foo <b>bar</b></i>" into "[{<i>}]foo [{<i><b>}]bar"
  - The current way of creating apertium-deshtml, from an XML specification run through xsltproc and flex, is not likely to be used. If you don't want to mess with that, we recommend you start a new file apertium_deshtml2.cc and link to something like libgumbo1. This will make your coding challenge something you can build on in the project itself.
- If you've completed the previous challenge, make apertium-deshtml *not* wrap tags like <p> or <div> in {} (i.e. only wrap inline tags)
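Before writing the C++ version, it may help to prototype the expected behaviour. The following Python sketch uses the standard library's `html.parser`; it is not the existing deformatter and not the suggested apertium_deshtml2.cc. The set `INLINE_TAGS`, the class name and the `deformat` helper are assumptions made for illustration, attributes are dropped, reserved characters in the text are not escaped, and the exact whitespace and empty-blank conventions of the real apertium-deshtml are not reproduced.

```python
from html.parser import HTMLParser

INLINE_TAGS = {"i", "b", "em", "strong", "u"}  # assumed list, extend as needed

class InlineBlankDeformatter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_inline = []       # stack of currently open inline tags
        self.out = []
        self.after_inline = False   # last text run was covered by an inline blank

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:
            self.open_inline.append(tag)
        else:
            self.out.append(f"[<{tag}>]")       # block-level: ordinary blank
            self.after_inline = False

    def handle_endtag(self, tag):
        if tag in INLINE_TAGS:
            if tag in self.open_inline:
                # close the innermost open tag with this name
                idx = len(self.open_inline) - 1 - self.open_inline[::-1].index(tag)
                del self.open_inline[idx]
        else:
            self.out.append(f"[<\\/{tag}>]")    # "/" is escaped in the stream
            self.after_inline = False

    def handle_data(self, data):
        if self.open_inline:
            tags = "".join(f"<{t}>" for t in self.open_inline)
            self.out.append("[{" + tags + "}]")
            self.after_inline = True
        elif self.after_inline:
            self.out.append("[]")               # a new blank ends the inline blank
            self.after_inline = False
        self.out.append(data)

def deformat(html: str) -> str:
    parser = InlineBlankDeformatter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

print(deformat("<i>foo <b>bar</b></i>"))
```

The second challenge corresponds to the `else` branches: tags outside `INLINE_TAGS` are emitted as plain `[...]` blanks and never wrapped in `{}`.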
pretransfer
- Code cleanup:
  - Fork https://github.com/unhammer/apertium, then check out and compile the master branch
  - then in a different folder, do git clone -b blank-handling https://github.com/junaidiiith/apertium
  - from junaidiiith/blank-handling, copy over the changes that were made there to apertium_pretransfer.cc into your fork of unhammer/apertium, along with the pretransfer tests
  - ensure tests pass
  - send a pull request to https://github.com/unhammer/apertium
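For orientation, this is roughly what "disperse tags when splitting lexical units" means for pretransfer, sketched in Python. It is not the code in apertium_pretransfer.cc: `split_lexical_unit` is a hypothetical helper, the example analysis is only illustrative, and the sketch assumes that the parts of a joined lexical unit are separated by an unescaped `+` and are output with a single space between them. Multiwords with inflection inside (the `#` case) are ignored.

```python
import re

def split_lexical_unit(inline_blank: str, lu: str) -> str:
    """Split ^a+b$ into ^a$ ^b$, repeating the inline blank on each part."""
    inner = lu[1:-1]                          # strip the surrounding ^ and $
    parts = re.split(r"(?<!\\)\+", inner)     # split on "+" unless escaped
    return " ".join(f"{inline_blank}^{part}$" for part in parts)

print(split_lexical_unit("[{<i>}]", "^dar<vblex><inf>+lo<prn><enc>$"))
```

Without this step only the first part would carry the formatting once the lexical unit is split, which is the behaviour the pretransfer tests should guard against.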
transfer
- Build the first proof-of-concept:
  - Clone and compile the blank-handling branch: git clone -b blank-handling https://github.com/unhammer/apertium
  - check out apertium-en-es from svn and compile it
  - find some input to en-es that triggers a transfer rule that reorders words
  - Manually change the input to apertium-transfer to have inline formatting around words
  - and check if apertium-transfer keeps inline blanks on words, and phrase blanks outside the chunk. Show the input/output.
- Find a bug
  - If you didn't find anything amiss in the previous challenge, try triggering different transfer rules until you do (shouldn't take long)
  - Check out master of https://github.com/junaidiiith/apertium and see if git diff 374c46e90d7bd8494300dc364b00eb5a813ece79 --stat contains a fix (most likely in the Modifications directory)
- Fix a memory bug
  - uncommenting the line "// delete[] format;" at apertium/transfer.cc:1259 in the blank handling branch leads to a double-free; find out why and ensure we're correctly releasing memory
    - Install valgrind from your package manager or http://valgrind.org/, compile your program with -O0 -g3, then run valgrind -v --leak-check=full apertium/apertium-transfer and read the output
- Apply changes to transfer.cc to interchunk.cc
  - Check out the blank-handling branch: git clone -b blank-handling https://github.com/unhammer/apertium
  - Apply the diff (between that branch and master) from transfer.cc to interchunk.cc
  - Try to make it compile; report things that didn't seem to have a 1-1 correspondence
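When checking the proof-of-concept output, it helps to have the expected behaviour written down in executable form. The Python sketch below models it under assumptions: it is not how transfer.cc is structured, the `Unit` triple and `apply_reordering` are invented for illustration, and the lexical units are made-up examples. Each matched word is represented together with the block blank that preceded it and the inline blank bound to it; the rule output starts with all block blanks, then the reordered words with their inline blanks still attached.

```python
from typing import List, Tuple

# (block blank before the word, inline blank bound to the word, lexical unit)
Unit = Tuple[str, str, str]

def apply_reordering(units: List[Unit], new_order: List[int]) -> str:
    """Emit block blanks first, then the words in their new order."""
    block_blanks = "".join(block for block, _, _ in units)
    words = " ".join(units[i][1] + units[i][2] for i in new_order)
    return block_blanks + words

matched = [
    ("[<p>]", "", "^big<adj>$"),
    ("", "[{<i>}]", "^house<n><sg>$"),
]
# An adjective-noun rule that swaps the two words:
print(apply_reordering(matched, [1, 0]))
```

Comparing the actual apertium-transfer output against a model like this makes it easier to spot cases where a block blank ends up inside the rule output or an inline blank gets separated from its word.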
Frequently asked questions
- none yet, ask us something! :)