User:Khannatanmai/Wordbound blanks
This page follows the development of wordbound blanks in the Apertium stream format.
Features
Transfer (Pull Request)
Chunker
- Wordbound blanks are part of the transfer word, as a new side: blank.
- Wordbound blanks are ignored in pattern matching.
- Wordbound blanks are added just before the output LU, and are taken from the LU that the lem/lemh is clipped from.
- If the lem/lemh comes from a variable in the output, the blank comes from the LU which the lemma comes from, found by tracing the variable's assignment in <let>.
- No regression: streams without wordbound blanks work as-is.
- Normal blanks don't move around, while wordbound blanks move with their LU.
- When MLUs are formed, the blanks are merged.
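The movement rule in the list above can be sketched in a few lines of Python. This is only an illustration, not the apertium-transfer implementation: the `LU` class, the `render` function and the `order` list (standing in for a rule's output order) are all hypothetical names. It shows wordbound blanks travelling with their LU while normal blanks keep their position.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LU:
    """Hypothetical model of a lexical unit with an optional wordbound blank."""
    text: str
    blank: Optional[str] = None


def render(order: List[int], lus: List[LU], normal_blanks: List[str]) -> str:
    """Emit LUs in the given output order.

    Each wordbound blank is printed right before the LU it belongs to,
    so it follows that LU wherever it goes. Normal blanks are emitted by
    output position, so they do not move.
    """
    out = []
    for slot, idx in enumerate(order):
        lu = lus[idx]
        prefix = f"[[{lu.blank}]]" if lu.blank else ""
        out.append(f"{prefix}^{lu.text}$")
        if slot < len(normal_blanks):
            out.append(normal_blanks[slot])
    return "".join(out)


lus = [
    LU("el<det><def>"),
    LU("grande<adj>", "t:b:qum2bhp"),
    LU("rojo<adj>", "t:b:qum2bhp; t:i:M0JZW3Q"),
    LU("perro<n>"),
]
# Assumed rule output order: det, noun, second adjective, first adjective.
print(render([0, 3, 2, 1], lus, [" ", " ", " "]))
```

The reordering here mirrors the "big red dog" example on this page; a real rule would also change tags, which this sketch leaves out.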
Interchunk
- No change needed, as interchunk doesn't access the LUs inside the chunk.
Postchunk
- Wordbound blanks are ignored in pattern matching.
- Wordbound blanks are added just before the output LU, and are taken from the LU that the lem/lemh/whole is clipped from.
- If the lem/lemh comes from a variable in the output, the blank comes from the LU which the lemma comes from, found by tracing the variable's assignment in <let>.
- No regression: streams without wordbound blanks work as-is.
- Normal blanks don't move around, while wordbound blanks move with their LU.
- When MLUs are formed, the blanks are merged.
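A minimal sketch of the merging step, assuming (as the postchunk example on this page suggests) that merged blanks are joined with "; " and the LUs with "+". The function name `merge_mlu` is hypothetical and is not part of Apertium.

```python
from typing import List, Optional, Tuple


def merge_mlu(parts: List[Tuple[Optional[str], str]]) -> str:
    """Join (blank, lu) pairs into one multiword LU.

    The wordbound blanks of all parts are concatenated with "; " and
    placed once before the merged LU. Parts without a blank contribute
    nothing to the merged blank.
    """
    blanks = [blank for blank, _ in parts if blank]
    body = "+".join(lu for _, lu in parts)
    prefix = f"[[{'; '.join(blanks)}]]" if blanks else ""
    return f"{prefix}^{body}$"


print(merge_mlu([
    ("t:x:1234ab", "xyz<cnjadv>"),
    ("t:s:p2rthg", "abc<vbhaver><ger>"),
]))
```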
Rationale
Wordbound blanks store information about a lexical unit. They are useful in applications where we want to pass information through the pipeline but cannot send it as tags, because extra tags would break FST matching in the modules.
Formalism
Wordbound blanks are denoted by double square brackets and always appear immediately before a lexical unit (LU).
[[wordboundblank]]^LU<tags>$
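To make the notation concrete, here is a small Python sketch that reads a stream in this format and pairs each lexical unit with its wordbound blank. It is a simplified illustration, not the parser used in Apertium: the names `LU_PATTERN` and `parse_units` are hypothetical, and escaped characters and chunk syntax (`{...}`) are deliberately not handled.

```python
import re

# Optional [[...]] directly followed by ^...$ (simplified: no escapes, no chunks).
LU_PATTERN = re.compile(r"(?:\[\[(?P<blank>.*?)\]\])?\^(?P<lu>.*?)\$")


def parse_units(stream):
    """Return a list of (blank, lu) tuples; blank is None when the LU has none."""
    return [(m.group("blank"), m.group("lu")) for m in LU_PATTERN.finditer(stream)]


stream = "^The<det><def><sp>$ [[t:b:qum2bhp]]^big<adj><sint>$ ^dog<n><sg>$[ ]"
for blank, lu in parse_units(stream):
    print(blank, lu)
```

Normal blanks such as `[ ]` use single brackets, so the pattern skips them.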
Examples
Markup Handling
Working Examples
Transfer Input:
^The<det><def><sp>/El<det><def><GD><ND>$ [[tbqum2bhp]]^big<adj><sint>/grande<adj><mf>$ [[t:b:qum2bhp; t:i:M0JZW3Q]]^red<adj>/rojo<adj>$ ^dog<n><sg>/perro<n><GD><sg>$[ ]
Transfer Output:
^El<det><def><m><sg>$ ^perro<n><m><sg>$ [[t:b:qum2bhp; t:i:M0JZW3Q]]^rojo<adj><m><sg>$ [[tbqum2bhp]]^grande<adj><mf><sg>$[ ]

Postchunk Input:
^Det_adj<SA>{^el<det><def>$ [[t:b:qum2bhp]]^grande# test<adj>$}$ ^inf<SV><vblex><pres><p3><ND>{[[t:i:M0JZW3Q]]^vivir<vblex><3>$}$ ^default<default>{[[t:b:qum2bhp; t:i:M0JZW3Q]]^rojo<adj>$}$ ^nom<SN><sg>{^perro<n><3>$}$ ^nom<SN><sg>{[[t:s:123456]]^test<n><3># abc$}$ ^have_enc_pp<SV><tx><tps><PD><ND>{[[t:x:1234ab]]^xyz<cnjadv>$ [[t:s:p2rthg]]^abc<vbhaver><ger>$ [[t:x:y265hk]]^uvwx<vblex><pp>$}$ ^have_enc_pp<SV><tx><tps><PD><ND>{[[t:x:1234ab; t:y:poposj]]^xyz<cnjadv>$ [[t:s:p2rthg; t:b:123456]]^abc<vbhaver><ger>$ [[t:x:y265hk]]^uvwx<vblex><pp>$}$[ ]
Postchunk Output:
^El<det><def>$ [[t:b:qum2bhp]]^grande# test<adj>$ [[t:i:M0JZW3Q]]^vivir<vblex><pres><p3><ND>$ [[t:b:qum2bhp; t:i:M0JZW3Q]]^rojo<adj>$ ^perro<n>$ [[t:s:123456]]^test<n># abc$ [[t:x:1234ab; t:s:p2rthg]]^xyz<cnjadv>+abc<vbhaver><ger>$ [[t:x:y265hk]]^uvwx<vblex><pp>$ [[t:x:1234ab; t:y:poposj; t:s:p2rthg; t:b:123456]]^xyz<cnjadv>+abc<vbhaver><ger>$ [[t:x:y265hk]]^uvwx<vblex><pp>$[ ]
Examples that should work
$ echo 'legal <b>persons</b>' | apertium en-es -f html
Personas <b>legales</b>
Ideal: <b>Personas</b> legales

$ echo 'I <b>am</b> David' | apertium en-es -f html
Soy</b> David
Ideal: <b>Soy</b> David

Spanish: <p>Es <s>además</s> de Valencia.</p>
Catalan: <p>És <s>a més</s> de València.</p>

<p>Bees <b>cannot</b> swim</p>
<p>Las Abejas <b>no pueden</b> nadar</p>
<a href="Conway">Conway</a> stated that young <a href="children">children</a> <i>“understand <a href="Object_permanence">object permanence</a>. <a href="Concealment">Concealed</a> <a href="Object">objects</a> feature in their awareness.”</i><span typeof="mw:Extension/ref"><a href="#ref-5">[5]</a></span> <b>(<a href="Nielsen">Nielsen</a> equivalence).</b>
<p><b><i>my sister</i><br/>lives</b> <u>in Wales</u></p>
<a id="foobar" href="http://example.com">Foo <b>bar</b>.</a>
Ideal Output: <a id="foobar" href="http://example.com"><b>Бар</b> фоо.</a>
<b>The</b> <i>sister</i>'s <em>dog</em>
From [[1]]
source: '<p>A <b>Japanese</b> <i>BBC</i> article</p>',
target: '<p>Un artículo de <i>BBC</i> <b>japonés</b></p>',

source: '<div>A <b>modern</b> Britain.</div>',
target: '<div>Una Gran Bretaña <b>moderna</b>.</div>',

source: '<p>The <b>big <i>red</i></b> dog</p>',
target: '<p>El perro <b><i>rojo</i></b> <b>grande</b></p>',

source: '<p>He said "<i>I tile <a href="x">bathrooms</a>.</i>"</p>',
target: '<p>Diga que "<i>enladrillo</i> <i><a href="x">baños</a></i>."</p>',

source: '<p>The <b>big red</b> dog</p>',
target: '<p>El perro <b>rojo grande</b></p>',

source: '<p>The <b>big</b> <b>red</b> dog</p>',
target: '<p>El perro <b>rojo</b> <b>grande</b></p>',

source: '<p>The <a href="1">big</a> <a href="2">red</a> dog</p>',
target: '<p>El perro <a href="2">rojo</a> <a href="1">grande</a></p>',

source: '<p id="8"><span class="cx-segment" data-segmentid="9"><a class="cx-link" data-linkid="17" href="./The_New_York_Times" rel="mw:WikiLink" title="The New York Times">The New York Times</a>, which has an <b>executive editor</b> over the news pages and an <b>editorial page editor</b> over opinion pages.</span></p>',
target: '<p id="8"><span data-segmentid="9" class="cx-segment"><a title="The New York Times" rel="mw:WikiLink" href="./The_New_York_Times" data-linkid="17" class="cx-link">The New York Times</a>, el cual tiene un <b>editor ejecutivo</b> sobre las páginas noticiosas y un <b>editor de página del editorial</b> encima páginas de opinión.</span></p>',

# Tino says: There's no text. This would never even reach the pipe.
source: '<p id="8"><style>b{color:red;}</style></p>',
target: '<p id="8"><style>b{color:red;}</style></p>',
Pretransfer Tests:
input: [[<i>]]^a<vblex><pres>+c<po># b$ ^a<vblex><pres>+c<po># b$
output: [[<i>]]^a# b<vblex><pres>$ [[<i>]]^c<po>$ ^a# b<vblex><pres>$ ^c<po>$
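The pretransfer test above suggests two behaviours: when an MLU is split at "+", the wordbound blank is copied onto every resulting LU, and the trailing queue ("# b") moves to just after the first lemma. The following Python sketch reproduces that test under those assumptions. `split_mlu` is a hypothetical helper, not the apertium-pretransfer code, and it ignores escaped characters and any "+" or "#" inside tags.

```python
from typing import Optional


def split_mlu(blank: Optional[str], lu: str) -> str:
    """Split one joined LU into separate LUs, each carrying the same blank."""
    queue = ""
    if "#" in lu:
        # Detach the queue ("# b") from the end of the joined LU.
        lu, _, rest = lu.partition("#")
        queue = "#" + rest
    parts = lu.split("+")
    # Re-attach the queue between the first lemma and its tags.
    first = parts[0]
    tag_start = first.find("<")
    if tag_start == -1:
        tag_start = len(first)
    parts[0] = first[:tag_start] + queue + first[tag_start:]
    prefix = f"[[{blank}]]" if blank else ""
    return " ".join(f"{prefix}^{part}$" for part in parts)


print(split_mlu("<i>", "a<vblex><pres>+c<po># b"))
print(split_mlu(None, "a<vblex><pres>+c<po># b"))
```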
Tests
Input:
The [[t:b:qum2bhp]]big [[t:b:qum2bhp; t:i:M0JZW3Q]]red dog[]
Transfer Input:
^The<det><def><sp>/El<det><def><GD><ND>$ [[t:b:qum2bhp]]^big<adj><sint>/grande<adj><mf>$ [[t:b:qum2bhp; t:i:M0JZW3Q]]^red<adj>/rojo<adj>$ ^dog<n><sg>/perro<n><GD><sg>$
Transfer Output:
^Det_nom_adj_adj<SN><DET><GD><sg>{^el<det><def><3><4>$ [[t:b:qum2bhp]]^perro<n><3><4>$ [[t:b:qum2bhp; t:i:M0JZW3Q]]^rojo<adj><3><4>$ ^grande<adj><mf><4>$}$
Previous Attempts
- https://wiki.apertium.org/wiki/User:SilentFlame/Progress
- https://github.com/junaidiiith/apertium/tree/blank-handling (GSoC 2016 project)
- https://github.com/unhammer/apertium/tree/blank-handling older, unfinished implementation of the changes required in apertium-transfer, with notes at https://github.com/unhammer/apertium/blob/blank-handling/blank_notes.org#consequences-of-this-type-of-blank-handling
- https://github.com/junaidiiith/apertium
- https://github.com/junaidiiith/Apertium_Code
- Make transfer output the non-inline blanks before the rule output AND Make transfer handle inline-blanks, and ignore :: work in progress for this and the above: https://github.com/unhammer/apertium/commit/b5c73fbe82544d83a98eb16b921c2fa224f6d40c