Difference between revisions of "Ideas for Google Summer of Code/Automatic blank handling"

From Apertium

Revision as of 11:14, 6 February 2017

A superblank is something that we don't want to translate but do want to keep in the output, often things like formatting tags. Currently there is a major problem with how formatting (superblanks) interacts with word/chunk reordering in Apertium. Transfer rules can reorder words, but they never look at the tags themselves, and there is no general way to reorder the tags along with the words without potentially messing up the resulting markup (e.g. putting an end tag before a start tag).

There is a solution: let the format handler treat some tags as "inline" (e.g. <em/>, <b/>) and others as "block-level" (e.g. <div>, <p>). Inline tags always stick with their words. If one inline tag covers several words, the tag is duplicated onto each of those words, and since it is "glued to the word", a transfer rule can move that word around without worrying about blanks. Block-level tags in the input to a transfer rule are always output before the rule's own output, so a <p> will never be moved around by a rule.
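As a rough illustration of the behaviour described above, here is a minimal Python sketch. The `(inline_tags, word)` tuple, the function name `apply_rule` and the `order` argument are hypothetical and exist only for this example; they are not Apertium's real transfer data structures. It only shows that block-level blanks are flushed first while inline blanks travel with their words.

```python
from typing import List, Tuple

# Hypothetical representation: (inline_tags, word). An empty string
# means the word is not covered by any inline tag.
Unit = Tuple[str, str]


def apply_rule(block_blanks: List[str], units: List[Unit], order: List[int]) -> str:
    """Emit block-level blanks first, then the words in the rule's order.

    `order` lists the input positions in the order the (imaginary)
    transfer rule wants to output them.
    """
    # Block-level blanks are never reordered: they go out before the rule output.
    out = "".join(f"[{blank}]" for blank in block_blanks)
    words = []
    for tags, word in (units[i] for i in order):
        # Inline tags stay glued to their word as an inline blank [{...}].
        words.append(f"[{{{tags}}}]{word}" if tags else word)
    return out + " ".join(words)


# A rule that swaps adjective and noun; <em> follows "red" to its new position.
print(apply_rule(["<p>"], [("<em>", "red"), ("", "car")], [1, 0]))
```

Because the `<em>` blank moves with "red", no end tag can end up before its start tag, whatever order the rule picks.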

This GSoC project involves changing the whole Apertium pipeline (lttoolbox, apertium, apertium-lex-tools, and if you have time also CG-3 and HFST) to do this new automatic blank handling, in particular the deformatters and the transfer module(s). A prototype of this work was implemented in 2016, but much work remains.

Read Format handling and Apertium stream format for background, then Reordering superblanks for a fuller explanation of the problem and solution.


Tasks

  • Make deformatters include a list of inline tags, and disperse these to the words covered by them
  • Make pretransfer disperse tags when splitting lexical units
  • Make transfer output the non-inline blanks before the rule output
  • Make transfer handle inline-blanks, and ignore <b pos="N">
  • Make reformat turn inline-blanks back into real tags
    • [{<i>}]foo [{<i><b>}]bar should become <i>foo</i> <i><b>bar</b></i>
  • Ensure all other modules are fine with the new format for inline blanks
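The reformatting step in the task list can be sketched in a few lines of Python. This is only an illustration of the expected input/output pair given above, not the actual reformatter: the function names `closing_tags` and `restore_inline` are made up for this example, and the sketch assumes that each inline blank is immediately followed by its word and that nothing needs unescaping.

```python
import re

# An inline blank "[{...}]" followed directly by the word it is glued to.
INLINE_BLANK = re.compile(r"\[\{(.*?)\}\]([^\s\[]+)")
# The element name of each opening tag inside an inline blank.
TAG_NAME = re.compile(r"<\s*([A-Za-z][A-Za-z0-9]*)")


def closing_tags(opening: str) -> str:
    """Build the closing tags for a run of opening tags, innermost first."""
    names = TAG_NAME.findall(opening)
    return "".join(f"</{name}>" for name in reversed(names))


def restore_inline(text: str) -> str:
    """Turn every inline blank back into real tags wrapped around its word."""
    def repl(match: re.Match) -> str:
        tags, word = match.group(1), match.group(2)
        return f"{tags}{word}{closing_tags(tags)}"
    return INLINE_BLANK.sub(repl, text)


print(restore_inline("[{<i>}]foo [{<i><b>}]bar"))
```

Each word gets its own opening and closing tags, so the sketch makes no attempt to merge adjacent words that share the same inline tag; this matches the example in the task list.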

Coding challenge

  1. Make the HTML format handler apertium-deshtml turn "<i>foo <b>bar</b></i>" into "[{<i>}]foo [{<i><b>}]bar"
  2. If you've completed 1., make apertium-deshtml *not* wrap tags like <p> or <div> in {} (i.e. wrap only inline tags)
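To make the expected behaviour concrete, here is a small Python sketch built on the standard library's `html.parser`. It is not how apertium-deshtml is implemented, and the class name `Deformatter`, the function `deshtml_sketch` and the contents of `INLINE_TAGS` are assumptions made for this example. It also skips the escaping of special characters that a real deformatter needs.

```python
import re
from html.parser import HTMLParser

# Assumed list of inline tags; a real list would need to be agreed on.
INLINE_TAGS = {"i", "b", "em", "strong", "u", "span"}


class Deformatter(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.open_inline = []  # (tag name, raw start tag) of open inline tags
        self.out = []

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        if tag in INLINE_TAGS:
            self.open_inline.append((tag, raw))
        else:
            self.out.append(f"[{raw}]")  # block-level: ordinary superblank

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted as ordinary superblanks.
        self.out.append(f"[{self.get_starttag_text()}]")

    def handle_endtag(self, tag):
        if tag in INLINE_TAGS:
            # Close the innermost open tag with this name, if there is one.
            for i in range(len(self.open_inline) - 1, -1, -1):
                if self.open_inline[i][0] == tag:
                    del self.open_inline[i]
                    break
        else:
            self.out.append(f"[</{tag}>]")

    def handle_data(self, data):
        prefix = "".join(raw for _, raw in self.open_inline)
        # Split into words and whitespace runs, keeping the whitespace.
        for piece in re.split(r"(\s+)", data):
            if not piece:
                continue
            if piece.isspace() or not prefix:
                self.out.append(piece)
            else:
                self.out.append(f"[{{{prefix}}}]{piece}")


def deshtml_sketch(html: str) -> str:
    parser = Deformatter()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


print(deshtml_sketch("<i>foo <b>bar</b></i>"))
print(deshtml_sketch("<p>hi <i>there</i></p>"))
```

The first call covers step 1 and the second covers step 2: `<p>` stays an ordinary superblank while `<i>` is dispersed onto the word it covers.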

Frequently asked questions

  • none yet, ask us something! :)

See also