User:Khannatanmai/Wordbound blanks

From Apertium

This page tracks the development of wordbound blanks in the Apertium stream format.

Features

Rationale

Wordbound blanks will store information about a lexical unit that has to travel through the pipeline but cannot be sent as tags, because extra tags would break the FST matching in the modules. Several applications need this, for example markup handling.

Formalism

Wordbound blanks will be written in double square brackets and will always appear immediately before a lexical unit.

[[wordboundblank]]^LU<tags>$
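
As an illustration of the notation above, here is a minimal Python sketch that reads a stream and pairs each lexical unit with the wordbound blank directly in front of it, if any. The function name `read_units` and the regular expression are assumptions made for this example, not part of any Apertium module, and the sketch ignores escaped characters and nested chunks.

```python
import re

# Assumed shape: an optional [[...]] blank immediately followed by ^...$.
# Escaped characters and nested chunks ({...}) are not handled in this sketch.
UNIT = re.compile(r"(?:\[\[(?P<blank>(?:(?!\]\]).)*)\]\])?\^(?P<lu>[^$]*)\$")


def read_units(stream):
    """Return (wordbound_blank_or_None, lexical_unit) pairs found in stream."""
    return [(m.group("blank"), m.group("lu")) for m in UNIT.finditer(stream)]


for blank, lu in read_units("[[wordboundblank]]^LU<tags>$ ^other<n>$"):
    print(repr(blank), repr(lu))
```

A lexical unit without a preceding blank is reported with `None`, so a consumer can tell "no blank" apart from an empty one.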

Examples

Markup Handling

$ echo 'legal <b>persons</b>' | apertium en-es -f html
Personas <b>legales</b>

Ideal:
<b>Personas</b> legales

$ echo 'I <b>am</b> David' | apertium en-es -f html
Soy</b> David

Ideal:
<b>Soy</b> David

Spanish: <p>Es <s>además</s> de Valencia.</p>
Catalan: <p>És <s>a més</s> de València.</p>

English: <p>The <b>big <i>red</i></b> dog</p>
Spanish: <p>El perro <b><i>rojo</i> grande</b></p>

<p>Bees <b>cannot</b> swim</p>
<p>Las Abejas <b>no pueden</b> nadar</p>

<a href="Conway">Conway</a> stated that young <a href="children">children</a>
<i>“understand <a href="Object_permanence">object permanence</a>.
<a href="Concealment">Concealed</a> <a href="Object">objects</a> feature in
their awareness.”</i><span typeof="mw:Extension/ref"><a href="#ref-5">[5]</a></span>
<b>(<a href="Nielsen">Nielsen</a> equivalence).</b>

<p><b><i>my sister</i><br/>lives</b> <u>in Wales</u></p>

<a id="foobar" href="http://example.com">Foo <b>bar</b>.</a>

Ideal Output:
<a id="foobar" href="http://example.com"><b>Бар</b> фоо.</a>

<b>The</b> <i>sister</i>'s <em>dog</em>

The following test cases are from [1]:

source: '<p>A <b>Japanese</b> <i>BBC</i> article</p>',
target: '<p>Un artículo de <i>BBC</i> <b>japonés</b></p>',

source: '<div>A <b>modern</b> Britain.</div>',
target: '<div>Una Gran Bretaña <b>moderna</b>.</div>',

source: '<p>The <b>big <i>red</i></b> dog</p>',
target: '<p>El perro <b><i>rojo</i></b> <b>grande</b></p>',

source: '<p>He said "<i>I tile <a href="x">bathrooms</a>.</i>"</p>',
target: '<p>Diga que "<i>enladrillo</i> <i><a href="x">baños</a></i>."</p>',

source: '<p>The <b>big red</b> dog</p>',
target: '<p>El perro <b>rojo grande</b></p>',

source: '<p>The <b>big</b> <b>red</b> dog</p>',
target: '<p>El perro <b>rojo</b> <b>grande</b></p>',

source: '<p>The <a href="1">big</a> <a href="2">red</a> dog</p>',
target: '<p>El perro <a href="2">rojo</a> <a href="1">grande</a></p>',
		
source: '<p id="8"><span class="cx-segment" data-segmentid="9"><a class="cx-link" data-linkid="17" href="./The_New_York_Times" rel="mw:WikiLink" title="The New York Times">The New York Times</a>, which has an <b>executive editor</b> over the news pages and an <b>editorial page editor</b> over opinion pages.</span></p>',
target: '<p id="8"><span data-segmentid="9" class="cx-segment"><a title="The New York Times" rel="mw:WikiLink" href="./The_New_York_Times" data-linkid="17" class="cx-link">The New York Times</a>, el cual tiene un <b>editor ejecutivo</b> sobre las páginas noticiosas y un <b>editor de página del editorial</b> encima páginas de opinión.</span></p>',

# Tino says: There's no text. This would never even reach the pipe.
source: '<p id="8"><style>b{color:red;}</style></p>',
target: '<p id="8"><style>b{color:red;}</style></p>',	

Pretransfer Tests:

input:  [[<i>]]^a<vblex><pres>+c<po># b$ ^a<vblex><pres>+c<po># b$
output: [[<i>]]^a# b<vblex><pres>$ [[<i>]]^c<po>$ ^a# b<vblex><pres>$ ^c<po>$
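
The behaviour this test expects can be sketched in Python as follows: when a lexical unit is split at `+`, the wordbound blank is copied in front of every resulting unit, and the `# ...` queue moves to just after the lemma of the first part. This is only an illustration of the input/output pair above, not the actual pretransfer implementation. The names `split_unit` and `pretransfer_sketch` are hypothetical, and the sketch assumes that `+` and `#` never occur inside a lemma or tag.

```python
import re

# Assumed shape: an optional [[...]] blank immediately followed by ^...$.
UNIT = re.compile(r"(?:\[\[(?P<blank>(?:(?!\]\]).)*)\]\])?\^(?P<lu>[^$]*)\$")


def split_unit(blank, lu):
    """Split one lexical unit at '+' and repeat its blank on every part."""
    body, sep, rest = lu.partition("#")
    queue = sep + rest                      # e.g. "# b", or "" if there is no queue
    parts = body.split("+")
    # Move the queue to just after the lemma of the first part.
    first = parts[0]
    tag_start = first.find("<")
    if tag_start == -1:
        tag_start = len(first)
    parts[0] = first[:tag_start] + queue + first[tag_start:]
    prefix = f"[[{blank}]]" if blank is not None else ""
    return " ".join(f"{prefix}^{part}$" for part in parts)


def pretransfer_sketch(stream):
    """Apply split_unit to every lexical unit, leaving other text untouched."""
    return UNIT.sub(lambda m: split_unit(m.group("blank"), m.group("lu")), stream)


print(pretransfer_sketch("[[<i>]]^a<vblex><pres>+c<po># b$ ^a<vblex><pres>+c<po># b$"))
```

Copying the blank onto each part, rather than keeping it only on the first, means that every unit produced by the split still carries the information attached to the original word.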

Tests

Input:
The [[t:b:qum2bhp]]big [[t:b:qum2bhp; t:i:M0JZW3Q]]red dog[]

Transfer Input:
^The<det><def><sp>/El<det><def><GD><ND>$ [[t:b:qum2bhp]]^big<adj><sint>/grande<adj><mf>$ [[t:b:qum2bhp; t:i:M0JZW3Q]]^red<adj>/rojo<adj>$ ^dog<n><sg>/perro<n><GD><sg>$

Transfer Output:
^Det_nom_adj_adj<SN><DET><GD><sg>{^el<det><def><3><4>$ [[t:b:qum2bhp]]^perro<n><3><4>$ [[t:b:qum2bhp; t:i:M0JZW3Q]]^rojo<adj><3><4>$ ^grande<adj><mf><4>$}$
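
To inspect which blank entries are attached to which word in a stream such as the transfer input above, a small Python helper along these lines may be useful. It assumes that entries inside a blank are separated by `;` and that each entry consists of colon-separated fields, as in `t:b:qum2bhp`. The meaning of the individual fields is not specified on this page, so the sketch returns them as plain tuples. The function names are hypothetical, and nested chunks such as those in the transfer output are not handled.

```python
import re

# Assumed shape: an optional [[...]] blank immediately followed by ^...$.
UNIT = re.compile(r"(?:\[\[(?P<blank>(?:(?!\]\]).)*)\]\])?\^(?P<lu>[^$]*)\$")


def blank_entries(blank):
    """Split 't:b:x; t:i:y' into [('t', 'b', 'x'), ('t', 'i', 'y')]."""
    if not blank:
        return []
    return [tuple(item.strip().split(":")) for item in blank.split(";") if item.strip()]


def blanks_by_word(stream):
    """Return (source_lemma, blank_entries) for every lexical unit in stream."""
    result = []
    for m in UNIT.finditer(stream):
        source_side = m.group("lu").split("/")[0]   # text before the first '/'
        lemma = source_side.split("<")[0]           # text before the first tag
        result.append((lemma, blank_entries(m.group("blank"))))
    return result


transfer_input = (
    "^The<det><def><sp>/El<det><def><GD><ND>$ "
    "[[t:b:qum2bhp]]^big<adj><sint>/grande<adj><mf>$ "
    "[[t:b:qum2bhp; t:i:M0JZW3Q]]^red<adj>/rojo<adj>$ "
    "^dog<n><sg>/perro<n><GD><sg>$"
)
for lemma, entries in blanks_by_word(transfer_input):
    print(lemma, entries)
```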

Previous Attempts

References