User:Khannatanmai/GSoC2020 Final Report
This is the final report for the GSoC 2020 project initially titled "Modifying the apertium stream format and eliminating dictionary trimming", which ultimately became "Modifying the apertium stream format to introduce wordbound blanks and solving the markup reordering problem".
Contents
- 1 The problem
- 2 The solution
- 3 Development
- 3.1 Transfer (Pull Request 1, Pull Request 2, Commit 1, Commit 2)
- 3.2 Recursive Transfer (Pull Request)
- 3.3 Pretransfer (Pull Request)
- 3.4 Separable (Pull Request)
- 3.5 Analysis, Biltrans, Generation (Pull Request)
- 3.6 HFST Analysis, Generation (Pull Request)
- 3.7 Streamparser (Pull Request)
- 3.8 Postgeneration (Pull Request)
- 3.9 Tagger (Pull Request)
- 4 Project Iterations
- 5 Previous Attempts
- 6 References
The problem
Format and markup handling has been a problem in Apertium and is described in detail here. Formatting used to be handled with superblanks, which protect markup information during the translation process. This works well to protect the formatting of the document, but during translation words and phrases move around, get deleted, split, merge, etc., and the markup information on those words needs to go through the same operations along with them. Otherwise the output ends up with erroneous markup, which is what happened in Apertium:
$ echo "<i>El perro</i> <b>blanco</b>" | apertium-deshtml | apertium -f none -d . spa-eng | apertium-retxt
<i>The white</i> <b>dog</b>
The solution
A new kind of blank was proposed: the wordbound blank (wblank). A wblank contains any information that needs to stay attached to a word or phrase during the translation process. Right now this is markup information, but wordbound blanks effectively introduce alignments in Apertium, and there are several other use cases for the information they carry.
Wordbound blanks will be denoted by double square brackets and will always appear right before the Lexical Unit to which they are attached.
[[wordboundblank]]^LU<tags>$
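To make the notation concrete, here is a minimal Python sketch that reads a stream of this shape into (wblank, LU) pairs. The function name `read_units` and the regular expression are illustrative assumptions, not part of any Apertium tool, and escaped characters in the stream are not handled.

```python
import re

# Assumed token shape: an optional [[...]] wordbound blank immediately
# followed by a ^...$ lexical unit. Escaped characters are not handled.
LU_PATTERN = re.compile(r"(?:\[\[(?P<wblank>.*?)\]\])?\^(?P<lu>.*?)\$")


def read_units(stream):
    """Return (wblank, lexical_unit) pairs; wblank is None when absent."""
    return [(m.group("wblank"), m.group("lu"))
            for m in LU_PATTERN.finditer(stream)]


if __name__ == "__main__":
    for wblank, lu in read_units("[[t:i:4_tPUA]]^The<det><def><sg>$ ^dog<n><sg>$"):
        print(wblank, lu)
```

An LU without a preceding wblank simply yields `None` in the first position, which mirrors the fact that streams without wordbound blanks remain valid.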
If there is no Lexical Unit in the stream (before the morph analyser and after the generator), then we have an end wblank as well.
[[wordboundblank]]word[[/]] word2 word3 [[wordboundblank]]word4[[/]]
The commands to translate formatted input were updated to use the modified modules and the new deformatters and reformatters.
$ echo "<i>El perro</i> <b>blanco</b>" | apertium -f html -d . spa-eng
<i>The</i> <b>white</b> <i>dog</i>
Method
A new deformatter and reformatter were written that convert markup into either superblanks or wordbound blanks. Each module in the pipeline was then modified to work with wordbound blanks, so that any module that moves, deletes, adds, merges or splits words and phrases does the same to the corresponding wordbound blanks.
When wordbound blanks are on text that isn't yet tokenised into LUs, a closing wordbound blank ([[/]]) limits the span. After the input is tokenised by the analyser, the program apertium-wblank-attach attaches the wordbound blank to every LU inside its span and removes the closing wordbound blanks from the stream. Right before the generator, the opposite happens with apertium-wblank-detach: closing wordbound blanks are put back in the stream in preparation for the reformatter.
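The attach and detach steps can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, using regular expressions on the stream text; the function names `wblank_attach` and `wblank_detach` are hypothetical stand-ins, and the real programs may treat escapes, superblanks and edge cases differently.

```python
import re

# A span: [[wblank]] ... [[/]], where the opening blank is not the closing marker.
SPAN = re.compile(r"\[\[(?P<wb>(?!/\]\]).+?)\]\](?P<body>.*?)\[\[/\]\]", re.S)
# A lexical unit: ^...$ (escaped characters are not handled in this sketch).
LU = re.compile(r"\^.*?\$")
# An LU that already carries a wordbound blank.
ATTACHED = re.compile(r"(\[\[.+?\]\]\^.*?\$)")


def wblank_attach(stream):
    """Copy each span's wblank onto every LU inside it and drop the [[/]]."""
    def repl(match):
        wb = match.group("wb")
        return LU.sub(lambda lu: f"[[{wb}]]{lu.group(0)}", match.group("body"))
    return SPAN.sub(repl, stream)


def wblank_detach(stream):
    """Add a closing [[/]] after every LU that carries a wblank."""
    return ATTACHED.sub(r"\1[[/]]", stream)


if __name__ == "__main__":
    analysed = ("[[t:i:4_tPUA]]^El/El<det><def><m><sg>$ "
                "^perro/perro<n><m><sg>$[[/]]")
    print(wblank_attach(analysed))
```

Running `wblank_attach` on the analyser output gives every LU in the span its own copy of the wblank, which is what allows later modules to reorder LUs independently.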
Here's a snippet of the process:
Input: <i>El perro</i> <b>blanco</b>
tf-extract: [[t:i:4_tPUA]]El perro[[/]] [[t:b:ls9z2Q]]blanco[[/]]
Analyser (lt-proc -z): [[t:i:4_tPUA]]^El/El<det><def><m><sg>$ ^perro/perro<n><m><sg>$[[/]] [[t:b:ls9z2Q]]^blanco/blanco<n><m><sg>/blanco<adj><m><sg>$[[/]]
apertium-wblank-attach: [[t:i:4_tPUA]]^El/El<det><def><m><sg>$ [[t:i:4_tPUA]]^perro/perro<n><m><sg>$ [[t:b:ls9z2Q]]^blanco/blanco<n><m><sg>/blanco<adj><m><sg>$
. . .
Transfer Output: [[t:i:4_tPUA]]^The<det><def><sg>$ [[t:b:ls9z2Q]]^white<adj><sint>$ [[t:i:4_tPUA]]^dog<n><sg>$
apertium-wblank-detach: [[t:i:4_tPUA]]^The<det><def><sg>$[[/]] [[t:b:ls9z2Q]]^white<adj><sint>$[[/]] [[t:i:4_tPUA]]^dog<n><sg>$[[/]]
Generator (lt-proc -g): [[t:i:4_tPUA]]The[[/]] [[t:b:ls9z2Q]]white[[/]] [[t:i:4_tPUA]]dog[[/]]
tf-inject: <i>The</i> <b>white</b> <i>dog</i>
Here is the full pipeline:
$ echo "<i>El perro</i> <b>blanco</b>" | tf-extract |
    lt-proc -z 'spa-eng.automorf.bin' |
    apertium-wblank-attach |
    apertium-tagger -z -g $2 'spa-eng.prob' |
    apertium-pretransfer -z |
    lt-proc -z -b 'spa-eng.autobil.bin' |
    lrx-proc -z -m 'spa-eng.autolex.bin' |
    apertium-transfer -z -b 'apertium-eng-spa.spa-eng.t1x' 'spa-eng.t1x.bin' |
    apertium-interchunk -z 'apertium-eng-spa.spa-eng.t2x' 'spa-eng.t2x.bin' |
    apertium-postchunk -z 'apertium-eng-spa.spa-eng.t3x' 'spa-eng.t3x.bin' |
    apertium-wblank-detach |
    lt-proc -z -g 'spa-eng.autogen.bin' |
    lt-proc -z -p 'spa-eng.autopgen.bin' |
    tf-inject
<i>The</i> <b>white</b> <i>dog</i>
Or, just run:
$ echo "<i>El perro</i> <b>blanco</b>" | apertium -f html -d . spa-eng
Development
Here is a list of changes made to the Apertium codebase for this project during GSoC 2020. The progress was recorded systematically during the summer here.
Transfer (Pull Request 1, Pull Request 2, Commit 1, Commit 2)
Chunker/Single-stage transfer
- Wordbound blanks are part of the transfer word as a new side: blank.
- Wordbound blanks are ignored in pattern matching.
- Wordbound blanks are added just before the output LU, taken from the LU that the lem/lemh is clipped from.
- If the lem/lemh comes from a variable in the output, then the blank comes from the LU which the lemma comes from, by tracing its variable assignment in <let>.
- No regression. Streams without wordbound blanks work as-is.
- Normal blanks don't move around while wordbound blanks move around.
- When MLUs are formed the blanks are merged.
- Tests added
- If rule pattern has only one LU, the wordbound blank gets output with all output LUs of the rule
- When using apertium-transfer -n, the wblanks print as they're supposed to.
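The difference between normal blanks and wordbound blanks during reordering can be illustrated with a small Python sketch. It models LUs as (wblank, LU) tuples and assumes the rule is a pure reordering of its input; the function `reorder` and this data model are hypothetical and are not how apertium-transfer is implemented.

```python
def reorder(units, blanks, order):
    """Emit units in the given order.

    units  -- list of (wblank, lu) tuples; wblank may be None
    blanks -- normal blanks between units, len(units) - 1 items
    order  -- output positions as indices into units (a permutation)

    Wordbound blanks travel with their LU; normal blanks keep their
    position in the stream.
    """
    out = []
    for pos, index in enumerate(order):
        wblank, lu = units[index]
        if pos > 0:
            out.append(blanks[pos - 1])  # positional, does not move
        prefix = f"[[{wblank}]]" if wblank else ""
        out.append(f"{prefix}^{lu}$")
    return "".join(out)


if __name__ == "__main__":
    units = [("t:i:4_tPUA", "the<det>"),
             ("t:i:4_tPUA", "dog<n>"),
             ("t:b:ls9z2Q", "white<adj>")]
    print(reorder(units, [" ", "  "], [0, 2, 1]))
```

In the example, the noun and adjective swap places and each keeps its own wblank, while the single and double space stay where they were in the stream.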
Interchunk
- No change needed, as interchunk doesn't access LUs inside the chunk.
Postchunk
- Wordbound blanks are ignored in pattern matching
- Wordbound blanks are added just before the output LU from the LU that the lem/lemh/whole is clipped from.
- If the lem/lemh comes from a variable in the output, then the blank comes from the LU which the lemma comes from, by tracing its variable assignment in <let>.
- No regression. Streams without wordbound blanks work as-is.
- Normal blanks don't move around while wordbound blanks move around.
- When MLUs are formed the blanks are merged.
- Tests added
- If rule pattern chunk has only one LU, the wordbound blank gets output with all output LUs of the rule
Recursive Transfer (Pull Request)
- Wordbound blanks are read as part of LUs as a new side->wblank.
- Wblanks reorder with the LUs in transfer based on where the lemma is clipped from.
- Works even if lemma is clipped into a variable and the variable is later added in the output.
- No regression. Streams without wordbound blanks work as-is.
- Normal blanks don't move around while wordbound blanks move around.
- When MLUs are formed the blanks are merged.
- Tests added
Pretransfer (Pull Request)
- Wordbound blanks distribute across parts when compounds are split into individual LUs
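As a rough illustration, the Python sketch below splits a compound LU on `+` and gives each resulting LU a copy of the wblank. The function `split_compound` is hypothetical, the space separator between output LUs is an assumption, and other multiword handling done by apertium-pretransfer is left out.

```python
def split_compound(wblank, lu):
    """Split a compound LU on '+' and copy the wblank onto every part.

    Assumes '+' only appears as the joiner between parts and that the
    split LUs are separated by a single space in the output.
    """
    prefix = f"[[{wblank}]]" if wblank else ""
    return " ".join(f"{prefix}^{part}$" for part in lu.split("+"))


if __name__ == "__main__":
    print(split_compound("t:b:ls9z2Q", "dar<vblex><inf>+lo<prn><enc>"))
```

Because every part carries the same wblank, the markup still covers the whole original word even if later modules move the parts apart.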
Separable (Pull Request)
- Merge wordbound blanks and add to all LUs in rule output.
- Works for both autoseq and revautoseq.
Analysis, Biltrans, Generation (Pull Request)
- Parsing wordbound blanks as normal blanks for analysis, generation, biltrans.
- Added a test for wordbound blank analysis.
HFST Analysis, Generation (Pull Request)
- Parsing wordbound blanks as normal blanks for analysis and generation in hfst-proc.
Streamparser (Pull Request)
- Wordbound blanks parsed as part of a lexical unit in the stream parser.
- Can be accessed through the class member LexicalUnit.wordbound_blank.
Postgeneration (Pull Request)
- Wordbound blanks merge when words merge.
- Wordbound blanks apply to all output words when a postgen rule outputs more words than it received as input.
- No regression for postgeneration without wordbound blanks.
- Lots of tests added.
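The merge behaviour can be sketched in Python as follows. The helper names are hypothetical, and both the "; " separator and the removal of duplicate wblanks are assumptions of this sketch rather than documented behaviour of the postgenerator.

```python
def merge_wblanks(wblanks, sep="; "):
    """Join wblanks in order, skipping empty ones and duplicates.

    The separator and the de-duplication are assumptions of this sketch.
    """
    seen = []
    for wblank in wblanks:
        if wblank and wblank not in seen:
            seen.append(wblank)
    return sep.join(seen)


def merge_words(words, merged_surface):
    """Replace several (wblank, word) pairs with one merged surface form."""
    merged = merge_wblanks(wblank for wblank, _ in words)
    if not merged:
        return merged_surface  # no wblanks: the stream is unchanged
    return f"[[{merged}]]{merged_surface}[[/]]"


if __name__ == "__main__":
    print(merge_words([("t:i:4_tPUA", "de"), ("t:b:ls9z2Q", "el")], "del"))
```

The example models a contraction where two input words, each with its own markup, become one output word carrying both.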
Tagger (Pull Request)
- Parse wblanks as normal blanks
Project Iterations
This project went through several iterations to become what it ultimately did.
- Proposal: Modifying the apertium stream format and eliminating dictionary trimming: User:Khannatanmai/GSoC2020Proposal_Trimming
- Development of the stream extension: User:Khannatanmai/New_Apertium_stream_format
- Eliminating Dictionary Trimming: User:Khannatanmai/Eliminating_Dictionary_Trimming
- Documentation of features related to secondary tags: User:Khannatanmai/Secondary_tags_features
- Development of the updated stream extension: User:Khannatanmai/Secondary_info_apertium_stream_format
Initially the proposal focused on eliminating dictionary trimming using secondary tags. However, as is systematically documented here, the language developers did not agree on the need to eliminate dictionary trimming, nor on secondary tags as the best medium for doing it. An alternative proposal was then floated that uses wordbound blanks instead of secondary tags, and solving markup handling, which is considered a much bigger problem, was made the focus instead. The development of these was recorded here.
Previous Attempts
- https://wiki.apertium.org/wiki/User:SilentFlame/Progress
- https://github.com/junaidiiith/apertium/tree/blank-handling GSoC 2016 project
- https://github.com/unhammer/apertium/tree/blank-handling older, unfinished implementation of the changes required in apertium-transfer, with notes at https://github.com/unhammer/apertium/blob/blank-handling/blank_notes.org#consequences-of-this-type-of-blank-handling
- https://github.com/junaidiiith/apertium
- https://github.com/junaidiiith/Apertium_Code
- https://github.com/unhammer/apertium/commit/b5c73fbe82544d83a98eb16b921c2fa224f6d40c
References
- Reordering_superblanks
- Format_handling
- Ideas_for_Google_Summer_of_Code/Automatic_blank_handling
- Ideas_for_Google_Summer_of_Code/superblank_handling_algorithm
- https://www.mediawiki.org/wiki/Content_translation/Markup#Annotation_mapping_using_translation_subsequence_approximation
- https://www.mediawiki.org/wiki/Content_translation/Developers/Markup
- https://www.mediawiki.org/wiki/Content_translation/Product_Definition/LinearDoc
- https://sourceforge.net/p/apertium/mailman/apertium-stuff/thread/20cf28cd0904300204v45f35e51i118f4d146f83748@mail.gmail.com/