Difference between revisions of "User:Khannatanmai/Wordbound blanks"
Khannatanmai (talk | contribs) |
Khannatanmai (talk | contribs) |
||
(25 intermediate revisions by the same user not shown) | |||
Line 3: | Line 3: | ||
= Features = |
= Features = |
||
== Transfer ([https://github.com/apertium/apertium/pull/90 Pull Request 1], [https://github.com/apertium/apertium/pull/94 Pull Request 2]) == |
== Transfer ([https://github.com/apertium/apertium/pull/90 Pull Request 1], [https://github.com/apertium/apertium/pull/94 Pull Request 2], [https://github.com/apertium/apertium/commit/4b4d930011c080ae1aceec044402141a2635c0d3 Commit 1], [https://github.com/apertium/apertium/commit/ad78aa15b8101449e4d38430710e621ff1206378 Commit 2], [https://github.com/apertium/apertium/commit/c410a1bb39d5dd436d63c6f68f57b162056d8710 Commit 3], [https://github.com/apertium/apertium/pull/102 Pull Request 3], [https://github.com/apertium/apertium/commit/e67343b1d8843689198db4fd282a4357109a2664 Commit 4]) == |
||
=== Chunker/Single-stage transfer === |
=== Chunker/Single-stage transfer === |
||
* Wordbound blanks are a part of transfer word as a new side: blank. |
* Wordbound blanks are a part of transfer word as a new side: blank. |
||
Line 14: | Line 14: | ||
* Tests added |
* Tests added |
||
* If rule pattern has only one LU, the wordbound blank gets output with all output LUs of the rule |
* If rule pattern has only one LU, the wordbound blank gets output with all output LUs of the rule |
||
* When using apertium-transfer -n, the wblanks print as they're supposed to. |
|||
* Fix null flushing in transfer |
|||
* Store blanks in queue and output wherever user has <pre></b></pre> in the rule output, so that users don't have to define a blank position anymore. (Blank reordering is done by wordbound blanks now) |
|||
=== Interchunk === |
=== Interchunk === |
||
* No change needed as inter chunk doesn't access LUs inside the chunk. |
* No change needed as inter chunk doesn't access LUs inside the chunk. |
||
* Blank handling changed so the user doesn't have to worry about the blank position anymore. |
|||
=== Postchunk === |
=== Postchunk === |
||
Line 27: | Line 31: | ||
* Tests added |
* Tests added |
||
* If rule pattern chunk has only one LU, the wordbound blank gets output with all output LUs of the rule |
* If rule pattern chunk has only one LU, the wordbound blank gets output with all output LUs of the rule |
||
* Blank handling changed so the user doesn't have to worry about the blank position anymore. |
|||
== Recursive Transfer ([https://github.com/apertium/apertium-recursive/pull/65 Pull Request]) == |
|||
* Wordbound blanks are read as part of LUs as a new side->wblank. |
|||
* Wblanks reorder with the LUs in transfer based on where the lemma is clipped from. |
|||
* Works even if lemma is clipped into a variable and the variable is later added in the output. |
|||
* No regression. Stream without wordbound blanks work as-is. |
|||
* Normal blanks don't move around while wordbound blanks move around. |
|||
* When MLUs are formed the blanks are merged. |
|||
* Tests added |
|||
== Pretransfer ([https://github.com/apertium/apertium/pull/93 Pull Request]) == |
== Pretransfer ([https://github.com/apertium/apertium/pull/93 Pull Request]) == |
||
Line 38: | Line 52: | ||
* Parsing wordbound blanks as normal blanks for analysis, generation, biltrans. |
* Parsing wordbound blanks as normal blanks for analysis, generation, biltrans. |
||
* Added a test for wordbound blank analysis. |
* Added a test for wordbound blank analysis. |
||
== HFST Analysis, Generation ([https://github.com/hfst/hfst/pull/478 Pull Request]) == |
|||
* Parsing wordbound blanks as normal blanks for analysis and generation in hfst-proc. |
|||
== Streamparser ([https://github.com/apertium/streamparser/pull/37 Pull Request]) == |
== Streamparser ([https://github.com/apertium/streamparser/pull/37 Pull Request]) == |
||
Line 43: | Line 60: | ||
* Can be accessed by class member: <code>LexicalUnit.wordbound_blank</code>. |
* Can be accessed by class member: <code>LexicalUnit.wordbound_blank</code>. |
||
== Postgeneration ([https://github.com/apertium/lttoolbox/pull/102 Pull Request]) == |
== Postgeneration ([https://github.com/apertium/lttoolbox/pull/102 Pull Request], [https://github.com/apertium/lttoolbox/commit/3706d959b657b53dd094a75ac336e24a8f1739b2 Commit 1], [https://github.com/apertium/lttoolbox/commit/61d7c0d8f8e4ab4a5ad33f44db4dea60dc0a2422 Commit 2]) == |
||
* Wordbound blanks merge when words merge. |
* Wordbound blanks merge when words merge. |
||
* Wordbound blanks apply to all output words when a postgen rule outputs more words than it takes as input. |
* Wordbound blanks apply to all output words when a postgen rule outputs more words than it takes as input. |
||
* No regression for postgeneration without wordbound blanks. |
* No regression for postgeneration without wordbound blanks. |
||
* Lots of tests added. |
* Lots of tests added. |
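The merge and fan-out behaviour listed above can be sketched in a few lines of Python. This is only an illustration, not the lttoolbox implementation: the helper names <code>merge_wblanks</code> and <code>attach</code> are hypothetical, and the "; " separator is taken from the merged blanks that appear in the postchunk examples on this page, so its use in postgeneration is an assumption.

```python
def merge_wblanks(*wblanks):
    """Join the contents of several wordbound blanks with '; ', skipping empty ones."""
    parts = [wb for wb in wblanks if wb]
    return "; ".join(parts)


def attach(wblank, word):
    """Prefix a word with its wordbound blank, or return it unchanged if there is none."""
    return f"[[{wblank}]]{word}" if wblank else word


# Two words merge into one: the merged word carries both blanks.
merged = attach(
    merge_wblanks("t:x:1234ab", "t:s:p2rthg"),
    "^xyz<cnjadv>+abc<vbhaver><ger>$",
)

# One word becomes two: every output word carries the original blank.
split = [
    attach("t:i:NaFC2iv", word)
    for word in ("^no<adv>$", "^poder<vbmod><pri><p3><pl>$")
]

print(merged)
print(" ".join(split))
```

Words without a wordbound blank pass through <code>attach</code> unchanged, which mirrors the "no regression" point above.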
||
== Tagger ([https://github.com/apertium/apertium/pull/98 Pull Request]) == |
|||
* Parse wblanks as normal blanks |
|||
= Rationale = |
= Rationale = |
||
Line 63: | Line 83: | ||
<pre>[[wordboundblank]]word[[/]] word2 word3 [[wordboundblank]]word4[[/]]</pre> |
<pre>[[wordboundblank]]word[[/]] word2 word3 [[wordboundblank]]word4[[/]]</pre> |
||
= Examples = |
|||
== Markup Handling == |
|||
=== Working Examples === |
|||
==== full pipe ==== |
|||
= Full Pipe Testing = |
|||
==== lt-proc to postchunk ==== |
|||
<pre> |
|||
Transfer Input: |
|||
^The<det><def><sp>/El<det><def><GD><ND>$ [[tbqum2bhp]]^big<adj><sint>/grande<adj><mf>$ [[t:b:qum2bhp; t:i:M0JZW3Q]]^red<adj>/rojo<adj>$ ^dog<n><sg>/perro<n><GD><sg>$[ |
|||
] |
|||
Current Translation Command: <code>apertium-deshtml < html_input-eng.in | apertium -f none -d $PREFIX/apertium-eng-spa eng-spa | apertium-retxt</code> |
|||
Transfer Output: |
|||
^El<det><def><m><sg>$ ^perro<n><m><sg>$ [[t:b:qum2bhp; t:i:M0JZW3Q]]^rojo<adj><m><sg>$ [[tbqum2bhp]]^grande<adj><mf><sg>$[ |
|||
] |
|||
</pre> |
|||
<pre> |
|||
Postchunk Input: |
|||
^Det_adj<SA>{^el<det><def>$ [[t:b:qum2bhp]]^grande# test<adj>$}$ ^inf<SV><vblex><pres><p3><ND>{[[t:i:M0JZW3Q]]^vivir<vblex><3>$}$ ^default<default>{[[t:b:qum2bhp; t:i:M0JZW3Q]]^rojo<adj>$}$ ^nom<SN><sg>{^perro<n><3>$}$ ^nom<SN><sg>{[[t:s:123456]]^test<n><3># abc$}$ ^have_enc_pp<SV><tx><tps><PD><ND>{[[t:x:1234ab]]^xyz<cnjadv>$ [[t:s:p2rthg]]^abc<vbhaver><ger>$ [[t:x:y265hk]]^uvwx<vblex><pp>$}$ ^have_enc_pp<SV><tx><tps><PD><ND>{[[t:x:1234ab; t:y:poposj]]^xyz<cnjadv>$ [[t:s:p2rthg; t:b:123456]]^abc<vbhaver><ger>$ [[t:x:y265hk]]^uvwx<vblex><pp>$}$[ |
|||
] |
|||
Postchunk Output: |
|||
^El<det><def>$ [[t:b:qum2bhp]]^grande# test<adj>$ [[t:i:M0JZW3Q]]^vivir<vblex><pres><p3><ND>$ [[t:b:qum2bhp; t:i:M0JZW3Q]]^rojo<adj>$ ^perro<n>$ [[t:s:123456]]^test<n># abc$ [[t:x:1234ab; t:s:p2rthg]]^xyz<cnjadv>+abc<vbhaver><ger>$ [[t:x:y265hk]]^uvwx<vblex><pp>$ [[t:x:1234ab; t:y:poposj; t:s:p2rthg; t:b:123456]]^xyz<cnjadv>+abc<vbhaver><ger>$ [[t:x:y265hk]]^uvwx<vblex><pp>$[ |
|||
] |
|||
</pre> |
|||
<pre> |
|||
*********** |
|||
lt-proc output: |
|||
^legal/legal<adj>$ ^persons/person<n><pl>$[] |
|||
postchunk output: |
|||
^Persona<n><f><pl>$ ^legal<adj><mf><pl>$[] |
|||
*********** |
|||
lt-proc output: |
|||
^legal/legal<adj>$ [[t:b:e4XkhY]]^persons/person<n><pl>$[] |
|||
postchunk output: |
|||
[[t:b:e4XkhY]]^persona<n><f><pl>$ ^legal<adj><mf><pl>$[] |
|||
*********** |
|||
lt-proc output: |
|||
^I/I<num><mf><sg>/prpers<prn><subj><p1><mf><sg>$ ^am/be<vbser><pri><p1><sg>$ ^David/David<np><ant><m><sg>$^./.<sent>$[] |
|||
postchunk output: |
|||
^ser<vbser><pri><p1><sg>$ ^David<np><ant><m><sg>$ ^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
[[t:b:Steu7o1]]^I/I<num><mf><sg>/prpers<prn><subj><p1><mf><sg>$ [[t:b:Steu7o2]]^am/be<vbser><pri><p1><sg>$ ^David/David<np><ant><m><sg>$^./.<sent>$[] |
|||
postchunk output: |
|||
[[t:b:Steu7o2]]^Ser<vbser><pri><p1><sg>$ ^David<np><ant><m><sg>$ ^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
^I/I<num><mf><sg>/prpers<prn><subj><p1><mf><sg>$ [[t:b:Steu7o1]]^am/be<vbser><pri><p1><sg>$ [[t:b:Steu7o2]]^David/David<np><ant><m><sg>$^./.<sent>$[] |
|||
postchunk output: |
|||
[[t:b:Steu7o1]]^Ser<vbser><pri><p1><sg>$ [[t:b:Steu7o2]]^David<np><ant><m><sg>$ ^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
[[t:b:Steu7o1]]^I/I<num><mf><sg>/prpers<prn><subj><p1><mf><sg>$ ^am/be<vbser><pri><p1><sg>$ [[t:b:Steu7o2]]^David/David<np><ant><m><sg>$^./.<sent>$[] |
|||
postchunk output: |
|||
^Ser<vbser><pri><p1><sg>$ [[t:b:Steu7o2]]^David<np><ant><m><sg>$ ^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
^Bees/Bee<n><pl>$ ^cannot/can<vaux><pres>+not<adv>$ ^swim/swim<vblex><inf>/swim<vblex><pres>$^./.<sent>$[] |
|||
postchunk output: |
|||
^El<det><def><f><pl>$ ^abeja<n><f><pl>$ ^no<adv>$ ^poder<vbmod><pri><p3><pl>$ ^nadar<vblex><inf>$^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
^Bees/Bee<n><pl>$ [[t:i:NaFC2iv]]^cannot/can<vaux><pres>+not<adv>$ ^swim/swim<vblex><inf>/swim<vblex><pres>$^./.<sent>$[] |
|||
postchunk output: |
|||
^El<det><def><f><pl>$ ^abeja<n><f><pl>$ [[t:i:NaFC2iv]]^no<adv>$ [[t:i:NaFC2iv]]^poder<vbmod><pri><p3><pl>$ ^nadar<vblex><inf>$^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
^Conway/*Conway$ ^stated/state<vblex><past>/state<vblex><pp>$ ^that/that<cnjsub>/that<det><dem><sg>/that<prn><tn><mf><sg>/that<rel><an><mf><sp>$ ^young/young<adj><sint>$ ^children/child<n><pl>$ "^understand/understand<vblex><inf>/understand<vblex><pres>$ ^object/object<n><sg>/object<vblex><inf>/object<vblex><pres>$ ^permanence/permanence<n><sg>$^./.<sent>$ ^Concealed/Conceal<vblex><past>/Conceal<vblex><pp>$ ^objects/object<n><pl>/object<vblex><pri><p3><sg>$ ^feature/feature<n><sg>/feature<vblex><inf>/feature<vblex><pres>$ ^in/in<pr>$ ^their/their<det><pos><sp>$ ^awareness/awareness<n><sg>$^./.<sent>$" ^(/(<lpar>$^Nielsen/*Nielsen$ ^equivalence/equivalence<n><sg>$^)/)<rpar>$^./.<sent>$[] |
|||
postchunk output: |
|||
^*Conway$ ^Declarar<vblex><ifi><p3><sg>$ ^que<cnjsub>$ ^el<det><def><m><pl>$ ^niño<n><m><pl>$ ^joven<adj><mf><pl>$ "^entender<vblex><pri><p3><pl>$ ^permanencia<n><f><sg>$ ^de<pr>$ ^objeto<n><m><sg>$^.<sent>$ ^Encubrir<vblex><pp><m><sg>$ ^objetar<vblex><pri><p3><sg>$ ^característica<n><f><sg>$ ^en<pr>$ ^suyo<det><pos><mf><sg>$ ^concienciación<n><f><sg>$^.<sent>$" ^(<lpar>$^*Nielsen$ ^Equivalencia<n><f><sg>$^)<rpar>$^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
[[t:a:NaFC2iv]]^Conway/*Conway$ ^stated/state<vblex><past>/state<vblex><pp>$ ^that/that<cnjsub>/that<det><dem><sg>/that<prn><tn><mf><sg>/that<rel><an><mf><sp>$ ^young/young<adj><sint>$ ^children/child<n><pl>$ "[[t:i:M0JZW3Q]]^understand/understand<vblex><inf>/understand<vblex><pres>$ [[t:i:M0JZW3Q1; t:a:qN1fD2pi1]]^object/object<n><sg>/object<vblex><inf>/object<vblex><pres>$ [[t:i:M0JZW3Q2; t:a:qN1fD2pi2]]^permanence/permanence<n><sg>$[[t:i:M0JZW3Q3]]^./.<sent>$ [[t:i:M0JZW3Q4; t:a:ZVcC0MJ]]^Concealed/Conceal<vblex><past>/Conceal<vblex><pp>$ [[t:i:M0JZW3Q5; t:a:xDp3Y3y]]^objects/object<n><pl>/object<vblex><pri><p3><sg>$ [[t:i:M0JZW3Q6]]^feature/feature<n><sg>/feature<vblex><inf>/feature<vblex><pres>$ [[t:i:M0JZW3Q7]]^in/in<pr>$ [[t:i:M0JZW3Q8]]^their/their<det><pos><sp>$ [[t:i:M0JZW3Q9]]^awareness/awareness<n><sg>$[[t:i:M0JZW3Q10]]^./.<sent>$" ^(/(<lpar>$^Nielsen/*Nielsen$ ^equivalence/equivalence<n><sg>$^)/)<rpar>$^./.<sent>$[] |
|||
postchunk output: |
|||
[[t:a:NaFC2iv]]^*Conway$ ^Declarar<vblex><ifi><p3><sg>$ ^que<cnjsub>$ ^el<det><def><m><pl>$ ^niño<n><m><pl>$ ^joven<adj><mf><pl>$ "[[t:i:M0JZW3Q]]^entender<vblex><pri><p3><pl>$ [[t:i:M0JZW3Q2; t:a:qN1fD2pi2]]^permanencia<n><f><sg>$ ^de<pr>$ [[t:i:M0JZW3Q1; t:a:qN1fD2pi1]]^objeto<n><m><sg>$[[t:i:M0JZW3Q3]]^.<sent>$ [[t:i:M0JZW3Q4; t:a:ZVcC0MJ]]^Encubrir<vblex><pp><m><sg>$ [[t:i:M0JZW3Q5; t:a:xDp3Y3y]]^objetar<vblex><pri><p3><sg>$ [[t:i:M0JZW3Q6]]^característica<n><f><sg>$ [[t:i:M0JZW3Q7]]^en<pr>$ [[t:i:M0JZW3Q8]]^suyo<det><pos><mf><sg>$ [[t:i:M0JZW3Q9]]^concienciación<n><f><sg>$[[t:i:M0JZW3Q10]]^.<sent>$" ^(<lpar>$^*Nielsen$ ^Equivalencia<n><f><sg>$^)<rpar>$^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
^My/My<det><pos><sp>$ ^sister/sister<n><sg>$ ^lives/life<n><pl>/live<vblex><pri><p3><sg>$ ^in/in<pr>$ ^Wales/Wales<np><loc><sg>$^./.<sent>$[] |
|||
postchunk output: |
|||
^Mío<det><pos><mf><pl>$ ^vida<n><f><pl>$ ^de<pr>$ ^hermano<n><f><sg>$ ^en<pr>$ ^Gales<np><loc><m><sg>$^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
[[t:b:qum2bhp1]]^My/My<det><pos><sp>$ [[t:b:qum2bhp2; t:i:KPL7B551]]^sister/sister<n><sg>$ [[t:b:qum2bhp3; t:i:KPL7B552]]^lives/life<n><pl>/live<vblex><pri><p3><sg>$ [[t:u:WyW2HW1]]^in/in<pr>$ [[t:u:WyW2HW2]]^Wales/Wales<np><loc><sg>$^./.<sent>$[] |
|||
postchunk output: |
|||
[[t:b:qum2bhp1]]^Mío<det><pos><mf><pl>$ [[t:b:qum2bhp3; t:i:KPL7B552]]^vida<n><f><pl>$ ^de<pr>$ [[t:b:qum2bhp2; t:i:KPL7B551]]^hermano<n><f><sg>$ [[t:u:WyW2HW1]]^en<pr>$ [[t:u:WyW2HW2]]^Gales<np><loc><m><sg>$^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
^The/The<det><def><sp>$ ^sister/sister<n><sg>$ ^'s/'s<gen>/be<vbser><pri><p3><sg>$ ^dog/dog<n><sg>$^./.<sent>$[] |
|||
postchunk output: |
|||
^El<det><def><m><sg>$ ^perro<n><m><sg>$ ^de<pr>$ ^el<det><def><f><sg>$ ^hermano<n><f><sg>$^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
[[t:b:8gaY]]^The/The<det><def><sp>$ [[t:i:QypP0e]]^sister/sister<n><sg>$ ^'s/'s<gen>/be<vbser><pri><p3><sg>$ ^dog/dog<n><sg>$^./.<sent>$[] |
|||
postchunk output: |
|||
^El<det><def><m><sg>$ ^perro<n><m><sg>$ ^de<pr>$ [[t:b:8gaY]]^el<det><def><f><sg>$ [[t:i:QypP0e]]^hermano<n><f><sg>$^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
^A/A<det><ind><sg>$ ^Japanese/japanese<adj>/Japanese<n><sg>/Japanese<n><pl>$ ^BBC/BBC<n><acr><sg>$ ^article/article<n><sg>$^./.<sent>$[] |
|||
postchunk output: |
|||
^Uno<det><ind><f><sg>$ ^prenda<n><f><sg>$ ^de<pr>$ ^BBC<n><acr><f><sg>$ ^japonés<adj><f><sg>$^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
^A/A<det><ind><sg>$ [[t:b:qum2bhp]]^Japanese/japanese<adj>/Japanese<n><sg>/Japanese<n><pl>$ [[t:i:M0JZW3Q]]^BBC/BBC<n><acr><sg>$ ^article/article<n><sg>$^./.<sent>$[] |
|||
postchunk output: |
|||
^Uno<det><ind><f><sg>$ ^prenda<n><f><sg>$ ^de<pr>$ [[t:i:M0JZW3Q]]^BBC<n><acr><f><sg>$ [[t:b:qum2bhp]]^japonés<adj><f><sg>$^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
^A/A<det><ind><sg>$ ^modern/modern<adj>$ ^Britain/Britain<np><loc><sg>$^./.<sent>$[] |
|||
postchunk output: |
|||
^Uno<det><ind><f><sg>$ ^Gran Bretaña<np><loc><f><sg>$ ^moderno<adj><f><sg>$^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
^A/A<det><ind><sg>$ [[t:b:qum2bhp]]^modern/modern<adj>$ ^Britain/Britain<np><loc><sg>$^./.<sent>$[] |
|||
postchunk output: |
|||
^Uno<det><ind><f><sg>$ ^Gran Bretaña<np><loc><f><sg>$ [[t:b:qum2bhp]]^moderno<adj><f><sg>$^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
^The/The<det><def><sp>$ ^big/big<adj><sint>$ ^red/red<adj>/red<n><sg>$ ^dog/dog<n><sg>$^./.<sent>$[] |
|||
postchunk output: |
|||
^El<det><def><m><sg>$ ^perro<n><m><sg>$ ^rojo<adj><m><sg>$ ^grande<adj><mf><sg>$^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
^The/The<det><def><sp>$ [[t:b:qum2bhp1]]^big/big<adj><sint>$ [[t:b:qum2bhp2; t:i:M0JZW3Q]]^red/red<adj>/red<n><sg>$ ^dog/dog<n><sg>$^./.<sent>$[] |
|||
postchunk output: |
|||
^El<det><def><m><sg>$ ^perro<n><m><sg>$ [[t:b:qum2bhp2; t:i:M0JZW3Q]]^rojo<adj><m><sg>$ [[t:b:qum2bhp1]]^grande<adj><mf><sg>$^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
^He/Prpers<prn><subj><p3><m><sg>$ ^said/say<vblex><past>/say<vblex><pp>$ "^I/I<num><mf><sg>/prpers<prn><subj><p1><mf><sg>$ ^tile/tile<n><sg>/tile<vblex><inf>/tile<vblex><pres>$ ^bathrooms/bathroom<n><pl>$^./.<sent>$"[] |
|||
postchunk output: |
|||
^Decir<vblex><prs><p3><sg>$ "^I<num><mf><pl>$ ^baño<n><m><pl>$ ^de<pr>$ ^azulejo<n><m><sg>$^.<sent>$"[] |
|||
*********** |
|||
lt-proc output: |
|||
^He/Prpers<prn><subj><p3><m><sg>$ ^said/say<vblex><past>/say<vblex><pp>$ "[[t:i:M0JZW3Q1]]^I/I<num><mf><sg>/prpers<prn><subj><p1><mf><sg>$ [[t:i:M0JZW3Q2]]^tile/tile<n><sg>/tile<vblex><inf>/tile<vblex><pres>$ [[t:i:M0JZW3Q3; t:a:NaFC2iv]]^bathrooms/bathroom<n><pl>$[[t:i:M0JZW3Q4]]^./.<sent>$"[] |
|||
postchunk output: |
|||
^Decir<vblex><prs><p3><sg>$ "[[t:i:M0JZW3Q1]]^I<num><mf><pl>$ [[t:i:M0JZW3Q3; t:a:NaFC2iv]]^baño<n><m><pl>$ ^de<pr>$ [[t:i:M0JZW3Q2]]^azulejo<n><m><sg>$[[t:i:M0JZW3Q4]]^.<sent>$"[] |
|||
*********** |
|||
lt-proc output: |
|||
^The New York Times/The New York Times<np><al><sg>$^,/,<cm>$ ^which/which<det><itg><sp>/which<prn><itg><m><sp>/which<rel><an><mf><sp>$ ^has/have<vbhaver><pri><p3><sg>/have<vblex><pri><p3><sg>$ ^an/a<det><ind><sg>$ ^executive/executive<adj>/executive<n><sg>$ ^editor/editor<n><sg>$ ^over/over<adv>/over<pr>$ ^the/the<det><def><sp>$ ^news/news<adj>/news<n><sg>/news<n><pl>$ ^pages/page<n><pl>$ ^and/and<cnjcoo>$ ^an/a<det><ind><sg>$ ^editorial/editorial<n><sg>$ ^page/page<n><sg>$ ^editor/editor<n><sg>$ ^over/over<adv>/over<pr>$ ^opinion/opinion<n><sg>$ ^pages/page<n><pl>$^./.<sent>$^./.<sent>$[] |
|||
postchunk output: |
|||
^The New York Times<np><al><m><sg>$^,<cm>$ ^el cual<rel><nn><m><sg>$ ^tener<vblex><pri><p3><sg>$ ^uno<det><ind><m><sg>$ ^editor<n><m><sg>$ ^ejecutivo<adj><m><sg>$ ^sobre<pr>$ ^el<det><def><f><pl>$ ^página<n><f><pl>$ ^noticioso<adj><f><pl>$ ^y<cnjcoo>$ ^uno<det><ind><m><sg>$ ^editor<n><m><sg>$ ^de<pr>$ ^página<n><f><sg>$ ^de<pr>$ ^el<det><def><m><sg>$ ^editorial<n><m><sg>$ ^encima<adv>$ ^página<n><f><pl>$ ^de<pr>$ ^opinión<n><f><sg>$^.<sent>$^.<sent>$[] |
|||
*********** |
|||
lt-proc output: |
|||
[[t:a:ETwYHMW]]^The New York Times/The New York Times<np><al><sg>$^,/,<cm>$ ^which/which<det><itg><sp>/which<prn><itg><m><sp>/which<rel><an><mf><sp>$ ^has/have<vbhaver><pri><p3><sg>/have<vblex><pri><p3><sg>$ ^an/a<det><ind><sg>$ [[t:b:QjxgZ1]]^executive/executive<adj>/executive<n><sg>$ [[t:b:QjxgZ2]]^editor/editor<n><sg>$ ^over/over<adv>/over<pr>$ ^the/the<det><def><sp>$ ^news/news<adj>/news<n><sg>/news<n><pl>$ ^pages/page<n><pl>$ ^and/and<cnjcoo>$ ^an/a<det><ind><sg>$ [[t:b:QjxgZ3]]^editorial/editorial<n><sg>$ [[t:b:QjxgZ4]]^page/page<n><sg>$ [[t:b:QjxgZ5]]^editor/editor<n><sg>$ ^over/over<adv>/over<pr>$ ^opinion/opinion<n><sg>$ ^pages/page<n><pl>$^./.<sent>$^./.<sent>$[] |
|||
postchunk output: |
|||
[[t:a:ETwYHMW]]^The New York Times<np><al><m><sg>$^,<cm>$ ^el cual<rel><nn><m><sg>$ ^tener<vblex><pri><p3><sg>$ ^uno<det><ind><m><sg>$ [[t:b:QjxgZ2]]^editor<n><m><sg>$ [[t:b:QjxgZ1]]^ejecutivo<adj><m><sg>$ ^sobre<pr>$ ^el<det><def><f><pl>$ ^página<n><f><pl>$ ^noticioso<adj><f><pl>$ ^y<cnjcoo>$ ^uno<det><ind><m><sg>$ [[t:b:QjxgZ5]]^editor<n><m><sg>$ ^de<pr>$ [[t:b:QjxgZ4]]^página<n><f><sg>$ ^de<pr>$ ^el<det><def><m><sg>$ [[t:b:QjxgZ3]]^editorial<n><m><sg>$ ^encima<adv>$ ^página<n><f><pl>$ ^de<pr>$ ^opinión<n><f><sg>$^.<sent>$^.<sent>$[] |
|||
</pre> |
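When reading paired outputs like the ones above, it helps to check mechanically which wordbound blanks survived the pipeline. The following Python sketch does that with a regular expression. The function names are hypothetical, and the pattern is a simplification that assumes blank contents never contain <code>]</code> and ignores stream escaping.

```python
import re

# Simplified pattern for [[...]] wordbound blanks (assumes no ']' inside the blank).
WBLANK_RE = re.compile(r"\[\[([^\]]*)\]\]")


def wblank_ids(stream):
    """Collect the individual ';'-separated entries of every wordbound blank in a stream."""
    ids = set()
    for content in WBLANK_RE.findall(stream):
        ids.update(part.strip() for part in content.split(";") if part.strip())
    return ids


def dropped_wblanks(before, after):
    """Return the blank entries present in `before` but missing from `after`."""
    return wblank_ids(before) - wblank_ids(after)


before = (
    "[[t:b:Steu7o1]]^I/I<num><mf><sg>/prpers<prn><subj><p1><mf><sg>$ "
    "[[t:b:Steu7o2]]^am/be<vbser><pri><p1><sg>$ "
    "^David/David<np><ant><m><sg>$^./.<sent>$[]"
)
after = (
    "[[t:b:Steu7o2]]^Ser<vbser><pri><p1><sg>$ "
    "^David<np><ant><m><sg>$ ^.<sent>$[]"
)

print(dropped_wblanks(before, after))
```

For this pair the check reports <code>t:b:Steu7o1</code>, the blank attached to the pronoun that is dropped in translation.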
|||
Wordbound blank with Transfuse Command: <code>tf-html-fragment $PREFIX/apertium-eng-spa/modes/eng-spa.mode < html_input-eng.in</code> |
|||
=== Examples that should work === |
|||
== Spanish - Catalan == |
|||
<pre> |
<pre> |
||
Source: <p>Es <s>además</s> de Valencia.</p> |
Source: <p>Es <s>además</s> de Valencia.</p> |
||
Current Translation: <p>És <s>a més de</s> València.</p> |
Current Translation: <p>És <s>a més de</s> València.</p> |
||
Ideal Translation: <p>Es <s>además</s> de Valencia.</p> |
Ideal Translation: <p>Es <s>además</s> de Valencia.</p> |
||
After wordbound blanks: <p>És <s> |
After wordbound blanks: <p>És <s>a més</s> de València.</p> |
||
</pre> |
</pre> |
||
== Spanish - English == |
|||
<pre> |
<pre> |
||
Source: legal <b>persons</b> |
Source: legal <b>persons</b> |
||
Current Translation: Personas jurídicas <b></b> |
Current Translation: Personas jurídicas <b></b> |
||
Ideal Translation: <b>Personas</b> legales |
Ideal Translation: <b>Personas</b> legales |
||
After wordbound blanks: |
After wordbound blanks: <b>Personas</b> legales |
||
Note: Multiword not recognised because of multiple blanks between the words. Can be updated if needed. |
|||
Source: I <b>am</b> David |
Source: I <b>am</b> David |
||
Current Translation: <b>soy David</b> |
Current Translation: <b>soy David</b> |
||
Ideal Translation: <b>Soy</b> David |
Ideal Translation: <b>Soy</b> David |
||
After wordbound blanks: |
After wordbound blanks: <b>Soy</b> David |
||
Source: <p>Bees <b>cannot</b> swim</p> |
Source: <p>Bees <b>cannot</b> swim</p> |
||
Current Translation: <p>Las abejas <b>no pueden</b> nadar</p> |
Current Translation: <p>Las abejas <b>no pueden</b> nadar</p> |
||
Ideal Translation: <p>Las Abejas <b>no pueden</b> nadar</p> |
Ideal Translation: <p>Las Abejas <b>no pueden</b> nadar</p> |
||
After wordbound blanks: |
After wordbound blanks: <p>Las abejas <b>no pueden</b> nadar</p> |
||
Source: <a href="Conway">Conway</a> stated that young <a href="children">children</a><i>“understand <a href="Object_permanence">object permanence</a>. <a href="Concealment">Concealed</a> <a href="Object">objects</a> feature in their awareness.”</i><span typeof="mw:Extension/ref"><a href="#ref-5">[5]</a></span><b>(<a href="Nielsen">Nielsen</a> equivalence).</b> |
Source: <a href="Conway">Conway</a> stated that young <a href="children">children</a><i>“understand <a href="Object_permanence">object permanence</a>. <a href="Concealment">Concealed</a> <a href="Object">objects</a> feature in their awareness.”</i><span typeof="mw:Extension/ref"><a href="#ref-5">[5]</a></span><b>(<a href="Nielsen">Nielsen</a> equivalence).</b> |
||
Current Translation: <a href="Conway">*Conway</a> Declaró que los niños <a href="children">jóvenes</a><i>“entienden <a href="Object_permanence">permanencia de objeto</a>. <a href="Concealment">Encubierto</a> <a href="Object">objeta</a> característica en su concienciación.”</i><span typeof="mw:Extension/ref"><a href="#ref-5">[5]</a></span><b>(<a href="Nielsen">*Nielsen</a> equivalencia).</b> |
Current Translation: <a href="Conway">*Conway</a> Declaró que los niños <a href="children">jóvenes</a><i>“entienden <a href="Object_permanence">permanencia de objeto</a>. <a href="Concealment">Encubierto</a> <a href="Object">objeta</a> característica en su concienciación.”</i><span typeof="mw:Extension/ref"><a href="#ref-5">[5]</a></span><b>(<a href="Nielsen">*Nielsen</a> equivalencia).</b> |
||
Ideal Translation: |
Ideal Translation: |
||
After wordbound blanks: <a href="Conway">*Conway</a> Declaró que los <a href="children">niños</a> jóvenes“<i>entienden <a href="Object_permanence">permanencia</a></i> de <i> <a href="Object_permanence">objeto</a></i><i>. <a href="Concealment">Encubierto</a> </i> <i><a href="Object">objeta</a> </i> <i>característica en</i> <i>su concienciación</i><i>.</i>”<span typeof="mw:Extension/ref"><a href="#ref-5">\[</a></span><span typeof="mw:Extension/ref"><a href="#ref-5">5</a></span><span typeof="mw:Extension/ref"><a href="#ref-5">\]</a></span><b>(<a href="Nielsen">*Nielsen</a> </b> <b>equivalencia)</b><b>.</b> |
|||
After wordbound blanks: |
|||
Source: <p><b><i>my sister</i><br/>lives</b> <u>in Wales</u></p> |
Source: <p><b><i>my sister</i><br/>lives</b> <u>in Wales</u></p> |
||
Current Translation: <p><b><i>Mis vidas</i><br/>de hermana</b> <u>en Gales</u></p> |
Current Translation: <p><b><i>Mis vidas</i><br/>de hermana</b> <u>en Gales</u></p> |
||
Ideal Translation: |
Ideal Translation: |
||
After wordbound blanks: |
After wordbound blanks: <p><b><i>Mis</i></b> <b>vidas</b> <br>de <b><i>hermana</i></b> <u>en Gales</u></p> |
||
Source: <b>The</b> <i>sister</i>'s <em>dog</em> |
Source: <b>The</b> <i>sister</i>'s <em>dog</em> |
||
Current Translation: <b>El perro</i> de la <em></b> <i>hermana</em> |
Current Translation: <b>El perro</i> de la <em></b> <i>hermana</em> |
||
Ideal Translation: |
Ideal Translation: |
||
After wordbound blanks: |
After wordbound blanks: <em>El perro</em> <b>de l</b> <i>a hermana</i> |
||
Note: Need to check! |
|||
</pre> |
</pre> |
||
Line 428: | Line 141: | ||
Current Translation: <p>Una <b>prenda</b> <i>de BBC</i> japonesa</p> |
Current Translation: <p>Una <b>prenda</b> <i>de BBC</i> japonesa</p> |
||
Ideal Translation: |
Ideal Translation: |
||
After wordbound blanks: |
After wordbound blanks: <p>Una prenda de <i>BBC</i> <b>japonesa</b> </p> |
||
Source: <div>A <b>modern</b> Britain.</div> |
Source: <div>A <b>modern</b> Britain.</div> |
||
Current Translation: <div>Una <b>Gran Bretaña</b> moderna.</div> |
Current Translation: <div>Una <b>Gran Bretaña</b> moderna.</div> |
||
Ideal Translation: <div>Una Gran Bretaña <b>moderna</b>.</div> |
Ideal Translation: <div>Una Gran Bretaña <b>moderna</b>.</div> |
||
After wordbound blanks: |
After wordbound blanks: <div>Una Gran Bretaña <b>moderna</b> .</div> |
||
Source: <p>The <b>big <i>red</i></b> dog</p> |
Source: <p>The <b>big <i>red</i></b> dog</p> |
||
Current Translation: <p>El <b>perro <i>rojo</i></b> grande</p> |
Current Translation: <p>El <b>perro <i>rojo</i></b> grande</p> |
||
Ideal Translation: <p>El perro <b><i>rojo</i></b> <b>grande</b></p> |
Ideal Translation: <p>El perro <b><i>rojo</i></b> <b>grande</b></p> |
||
After wordbound blanks: |
After wordbound blanks: <p>El perro <b> <i>rojo</i></b> <b>grande</b> </p> |
||
Source: <p>He said "<i>I tile <a href="x">bathrooms</a>.</i>"</p> |
Source: <p>He said "<i>I tile <a href="x">bathrooms</a>.</i>"</p> |
||
Current Translation: <p> Diga "<i>#I baños <a href="x">de azulejo</a>.</i>"</p> |
Current Translation: <p> Diga "<i>#I baños <a href="x">de azulejo</a>.</i>"</p> |
||
Ideal Translation: <p>Diga que "<i>enladrillo</i> <i><a href="x">baños</a></i>."</p> |
Ideal Translation: <p>Diga que "<i>enladrillo</i> <i><a href="x">baños</a></i>."</p> |
||
After wordbound blanks: |
After wordbound blanks: <p>Diga "<i>#I <a href="x">baños</a></i> de <i>azulejo.</i>"</p> |
||
Source: <p>The <b>big red</b> dog</p> |
Source: <p>The <b>big red</b> dog</p> |
||
Current Translation: <p>El <b>perro rojo</b> grande</p> |
Current Translation: <p>El <b>perro rojo</b> grande</p> |
||
Ideal Translation: <p>El perro <b>rojo grande</b></p> |
Ideal Translation: <p>El perro <b>rojo grande</b></p> |
||
After wordbound blanks: |
After wordbound blanks: <p>El perro <b>rojo grande</b> </p> |
||
Source: <p>The <b>big</b> <b>red</b> dog</p> |
Source: <p>The <b>big</b> <b>red</b> dog</p> |
||
Current Translation: <p>El <b>perro</b> <b>rojo</b> grande</p> |
Current Translation: <p>El <b>perro</b> <b>rojo</b> grande</p> |
||
Ideal Translation: <p>El perro <b>rojo</b> <b>grande</b></p> |
Ideal Translation: <p>El perro <b>rojo</b> <b>grande</b></p> |
||
After wordbound blanks: |
After wordbound blanks: <p>El perro <b>rojo</b> <b>grande</b> </p> |
||
Source: <p>The <a href="1">big</a> <a href="2">red</a> dog</p> |
Source: <p>The <a href="1">big</a> <a href="2">red</a> dog</p> |
||
Current Translation: <p>El <a href="1">perro</a> <a href="2">rojo</a> grande</p> |
Current Translation: <p>El <a href="1">perro</a> <a href="2">rojo</a> grande</p> |
||
Ideal Translation: <p>El perro <a href="2">rojo</a> <a href="1">grande</a></p> |
Ideal Translation: <p>El perro <a href="2">rojo</a> <a href="1">grande</a></p> |
||
After wordbound blanks: |
After wordbound blanks: <p>El perro <a href="2">rojo</a> <a href="1">grande</a> </p> |
||
Source: <p id="8"><span class="cx-segment" data-segmentid="9"><a class="cx-link" data-linkid="17" href="./The_New_York_Times" rel="mw:WikiLink" title="The New York Times">The New York Times</a>, which has an <b>executive editor</b> over the news pages and an <b>editorial page editor</b> over opinion pages.</span></p> |
Source: <p id="8"><span class="cx-segment" data-segmentid="9"><a class="cx-link" data-linkid="17" href="./The_New_York_Times" rel="mw:WikiLink" title="The New York Times">The New York Times</a>, which has an <b>executive editor</b> over the news pages and an <b>editorial page editor</b> over opinion pages.</span></p> |
||
Current Translation: <p id="8"><span class="cx-segment" data-segmentid="9"><a class="cx-link" data-linkid="17" href="./The_New_York_Times" rel="mw:WikiLink" title="The New York Times">The New York Times</a>, el cual tiene un <b>editor ejecutivo</b> sobre las páginas noticiosas y un <b>editor de página del editorial</b> encima páginas de opinión.</span></p> |
Current Translation: <p id="8"><span class="cx-segment" data-segmentid="9"><a class="cx-link" data-linkid="17" href="./The_New_York_Times" rel="mw:WikiLink" title="The New York Times">The New York Times</a>, el cual tiene un <b>editor ejecutivo</b> sobre las páginas noticiosas y un <b>editor de página del editorial</b> encima páginas de opinión.</span></p> |
||
Ideal Translation: <p id="8"><span data-segmentid="9" class="cx-segment"><a title="The New York Times" rel="mw:WikiLink" href="./The_New_York_Times" data-linkid="17" class="cx-link">The New York Times</a>, el cual tiene un <b>editor ejecutivo</b> sobre las páginas noticiosas y un <b>editor de página del editorial</b> encima páginas de opinión.</span></p> |
Ideal Translation: <p id="8"><span data-segmentid="9" class="cx-segment"><a title="The New York Times" rel="mw:WikiLink" href="./The_New_York_Times" data-linkid="17" class="cx-link">The New York Times</a>, el cual tiene un <b>editor ejecutivo</b> sobre las páginas noticiosas y un <b>editor de página del editorial</b> encima páginas de opinión.</span></p> |
||
After wordbound blanks: <p id="8"><span class="cx-segment" data-segmentid="9"><a class="cx-link" data-linkid="17" href="./The_New_York_Times" rel="mw:WikiLink" title="The New York Times">The New York Times</a>, el cual tiene un <b>editor ejecutivo</b> sobre las páginas noticiosas y un <b>editor</b> de <b>página</b> del <b>editorial</b> encima páginas de opinión.</span></p> |
|||
After wordbound blanks: |
|||
</pre> |
</pre> |
||
= Previous Attempts = |
|||
== Tests == |
|||
<pre> |
|||
Input: |
|||
The [[t:b:qum2bhp]]big [[t:b:qum2bhp; t:i:M0JZW3Q]]red dog[] |
|||
Transfer Input: |
|||
^The<det><def><sp>/El<det><def><GD><ND>$ [[t:b:qum2bhp]]^big<adj><sint>/grande<adj><mf>$ [[t:b:qum2bhp; t:i:M0JZW3Q]]^red<adj>/rojo<adj>$ ^dog<n><sg>/perro<n><GD><sg>$ |
|||
Transfer Output: |
|||
^Det_nom_adj_adj<SN><DET><GD><sg>{^el<det><def><3><4>$ [[t:b:qum2bhp]]^perro<n><3><4>$ [[t:b:qum2bhp; t:i:M0JZW3Q]]^rojo<adj><3><4>$ ^grande<adj><mf><4>$}$ |
|||
</pre> |
|||
* https://github.com/unhammer/apertium/blob/blank-handling/tests/pretransfer/__init__.py |
|||
== Previous Attempts == |
|||
* https://wiki.apertium.org/wiki/User:SilentFlame/Progress |
* https://wiki.apertium.org/wiki/User:SilentFlame/Progress |
||
Line 490: | Line 188: | ||
* https://github.com/unhammer/apertium/commit/b5c73fbe82544d83a98eb16b921c2fa224f6d40c |
* https://github.com/unhammer/apertium/commit/b5c73fbe82544d83a98eb16b921c2fa224f6d40c |
||
References |
= References = |
||
* https://wiki.apertium.org/wiki/Reordering_superblanks |
|||
* https://wiki.apertium.org/wiki/Format_handling |
|||
* https://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Automatic_blank_handling |
|||
* https://www.mediawiki.org/wiki/Content_translation/Markup#Annotation_mapping_using_translation_subsequence_approximation |
|||
* https://www.mediawiki.org/wiki/Content_translation/Developers/Markup |
* https://www.mediawiki.org/wiki/Content_translation/Developers/Markup |
||
* https://www.mediawiki.org/wiki/Content_translation/Product_Definition/LinearDoc |
* https://www.mediawiki.org/wiki/Content_translation/Product_Definition/LinearDoc |
||
* https://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/superblank_handling_algorithm |
* https://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/superblank_handling_algorithm |
||
* https://sourceforge.net/p/apertium/mailman/apertium-stuff/thread/20cf28cd0904300204v45f35e51i118f4d146f83748@mail.gmail.com/ |
Latest revision as of 09:52, 30 August 2020
This page tracks the development of wordbound blanks in the Apertium stream format.
Contents
- 1 Features
- 1.1 Transfer (Pull Request 1, Pull Request 2, Commit 1, Commit 2, Commit 3, Pull Request 3, Commit 4)
- 1.2 Recursive Transfer (Pull Request)
- 1.3 Pretransfer (Pull Request)
- 1.4 Separable (Pull Request)
- 1.5 Analysis, Biltrans, Generation (Pull Request)
- 1.6 HFST Analysis, Generation (Pull Request)
- 1.7 Streamparser (Pull Request)
- 1.8 Postgeneration (Pull Request, Commit 1, Commit 2)
- 1.9 Tagger (Pull Request)
- 2 Rationale
- 3 Formalism
- 4 Full Pipe Testing
- 5 Previous Attempts
- 6 References
Features
Transfer (Pull Request 1, Pull Request 2, Commit 1, Commit 2, Commit 3, Pull Request 3, Commit 4)
Chunker/Single-stage transfer
- Wordbound blanks are a part of transfer word as a new side: blank.
- Are ignored in pattern matching
- Wordbound blanks are added just before the output LU from the LU that the lem/lemh is clipped from.
- If the lem/lemh comes from a variable in the output, then the blank comes from the LU that the lemma comes from, found by tracing its variable assignment in <let>.
- No regression. Streams without wordbound blanks work as-is.
- Normal blanks don't move around while wordbound blanks move around.
- When MLUs are formed the blanks are merged.
- Tests added
- If rule pattern has only one LU, the wordbound blank gets output with all output LUs of the rule
- When using apertium-transfer -n, the wblanks print as they're supposed to.
- Fix null flushing in transfer
- Store blanks in a queue and output them wherever the user has <b/> in the rule output, so that users don't have to define a blank position anymore. (Blank reordering is done by wordbound blanks now.)
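The behaviour in the list above can be illustrated with a minimal sketch. It is not the apertium-transfer code: the function name `apply_rule`, the index list standing in for a rule's output, and the way leftover blanks are flushed are all assumptions made for illustration. Each matched word carries its own wordbound blank, while normal blanks sit in a queue and are emitted in their original order.

```python
from collections import deque


def apply_rule(words, blanks, output_order):
    """Reorder matched words while keeping normal blanks in input order.

    words: list of (wblank, lu) pairs matched by the rule pattern;
           wblank is None when the word has no wordbound blank.
    blanks: normal blanks found between the matched words, in input order.
    output_order: 0-based input positions in output order (a hypothetical
                  stand-in for the clips in a rule's output).
    """
    queue = deque(blanks)
    out = []
    for n, idx in enumerate(output_order):
        if n > 0:
            # Stand-in for <b/>: take the next queued blank, else a space.
            out.append(queue.popleft() if queue else " ")
        wblank, lu = words[idx]
        if wblank:
            # The wordbound blank travels with the word it was read with.
            out.append("[[" + wblank + "]]")
        out.append("^" + lu + "$")
    # Assumption: blanks never consumed are flushed after the rule output.
    out.extend(queue)
    return "".join(out)


words = [
    (None, "el<det>"),
    ("t:b:qum2bhp", "grande<adj>"),
    ("t:b:qum2bhp; t:i:M0JZW3Q", "rojo<adj>"),
    (None, "perro<n>"),
]
print(apply_rule(words, [" ", " ", " "], [0, 3, 2, 1]))
```

With the determiner-adjective-adjective-noun input reordered to determiner-noun-adjective-adjective, the two wordbound blanks end up in front of `rojo` and `grande` respectively, and the three normal blanks stay between the words in their original order.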
Interchunk
- No change needed, as interchunk doesn't access LUs inside the chunk.
- Blank handling changed so the user doesn't have to worry about the blank position anymore.
Postchunk
- Wordbound blanks are ignored in pattern matching
- Wordbound blanks are added just before the output LU from the LU that the lem/lemh/whole is clipped from.
- If the lem/lemh comes from a variable in the output, then the blank comes from the LU that the lemma comes from, found by tracing its variable assignment in <let>.
- No regression. Streams without wordbound blanks work as-is.
- Normal blanks don't move around while wordbound blanks move around.
- When MLUs are formed the blanks are merged.
- Tests added
- If rule pattern chunk has only one LU, the wordbound blank gets output with all output LUs of the rule
- Blank handling changed so the user doesn't have to worry about the blank position anymore.
Recursive Transfer (Pull Request)
- Wordbound blanks are read as part of LUs as a new side->wblank.
- Wblanks reorder with the LUs in transfer based on where the lemma is clipped from.
- Works even if lemma is clipped into a variable and the variable is later added in the output.
- No regression. Streams without wordbound blanks work as-is.
- Normal blanks don't move around while wordbound blanks move around.
- When MLUs are formed the blanks are merged.
- Tests added
Pretransfer (Pull Request)
- Wordbound blanks distribute across parts when compounds are split into individual LUs
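As a rough illustration of this distribution (a sketch only, not the apertium-pretransfer implementation), the hypothetical helper `split_compound` below copies the wordbound blank onto every part of a compound joined with `+`. It assumes the parts are separated by an unescaped `+` and ignores invariable multiword parts marked with `#`.

```python
def split_compound(wblank, lu):
    """Split a compound LU on '+' and repeat the wordbound blank on each part.

    wblank: contents of the wordbound blank, or None if there is none.
    lu: contents of the lexical unit, without the surrounding ^ and $.
    """
    prefix = "[[" + wblank + "]]" if wblank else ""
    parts = lu.split("+")
    return " ".join(prefix + "^" + part + "$" for part in parts)


print(split_compound("t:b:qum2bhp", "can<vaux>+not<adv>"))
```

Both resulting LUs carry the same blank, so whatever the blank encodes still covers every word that came from the original compound.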
Separable (Pull Request)
- Merge wordbound blanks and add to all LUs in rule output.
- Works for both autoseq and revautoseq.
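The merge-and-distribute step can be sketched as follows. The function names are hypothetical and this is not the apertium-separable code; the `; ` separator is an assumption taken from the merged blank `[[t:b:qum2bhp; t:i:M0JZW3Q]]` that appears in the transfer example on this page.

```python
def merge_wblanks(wblanks):
    """Join the non-empty wordbound blanks with '; ', keeping input order."""
    return "; ".join(wb for wb in wblanks if wb)


def attach_to_all(input_wblanks, output_lus):
    """Put the merged wordbound blank in front of every output LU."""
    merged = merge_wblanks(input_wblanks)
    prefix = "[[" + merged + "]]" if merged else ""
    return " ".join(prefix + "^" + lu + "$" for lu in output_lus)


print(attach_to_all(["t:b:qum2bhp", None, "t:i:M0JZW3Q"], ["take out<vblex>"]))
```

Because every output LU receives the full merged blank, no information is lost regardless of how many input words the rule consumed or how many LUs it produces.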
Analysis, Biltrans, Generation (Pull Request)
- Parsing wordbound blanks as normal blanks for analysis, generation, biltrans.
- Added a test for wordbound blank analysis.
HFST Analysis, Generation (Pull Request)
- Parsing wordbound blanks as normal blanks for analysis and generation in hfst-proc.
Streamparser (Pull Request)
- Wordbound blanks parsed as part of a lexical unit in the stream parser.
- Can be accessed through the class member LexicalUnit.wordbound_blank.
Postgeneration (Pull Request, Commit 1, Commit 2)
- Wordbound blanks merge when words merge.
- Wordbound blanks apply to all output words when a postgen rule outputs more words than it takes as input.
- No regression for postgeneration without wordbound blanks.
- Lots of tests added.
Tagger (Pull Request)
- Parse wblanks as normal blanks
Rationale
Wordbound blanks store information about a lexical unit that needs to travel through the pipeline but can't be sent as tags, because tags would break the FST matching in the modules. Several applications need this.
The information is stored with the lexical unit because lexical units are split, merged, deleted and added throughout the pipe, and the information has to follow them: it distributes over multiple output words, merges when words merge, and so on.
Formalism
Wordbound blanks will be denoted by double square brackets and will always appear right before a Lexical Unit.
[[wordboundblank]]^LU<tags>$
If there is no Lexical Unit in the stream (before the morph analyser and after the generator), then we have an end wblank as well.
[[wordboundblank]]word[[/]] word2 word3 [[wordboundblank]]word4[[/]]
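The two notations above can be recognised with a short regular-expression sketch. This is an illustration only, not the parser used by any Apertium module: the names `parse_lus` and `parse_text` are hypothetical, and escaped characters such as `\[` or `\^` are not handled.

```python
import re

# Blank contents: any characters up to the first "]]".
_WB = r"(?:(?!\]\]).)"

# [[wblank]]^lexical unit$ ; the wordbound blank is optional.
LU_RE = re.compile(r"(?:\[\[(?P<wblank>" + _WB + r"*)\]\])?\^(?P<lu>.*?)\$")

# [[wblank]]surface text[[/]] ; used when the stream has no lexical units.
TEXT_RE = re.compile(r"\[\[(?P<wblank>" + _WB + r"+)\]\](?P<text>.*?)\[\[/\]\]")


def parse_lus(stream):
    """Return (wblank, lexical_unit) pairs; wblank is None when absent."""
    return [(m.group("wblank"), m.group("lu")) for m in LU_RE.finditer(stream)]


def parse_text(stream):
    """Return (wblank, surface_text) pairs for the form closed by [[/]]."""
    return [(m.group("wblank"), m.group("text")) for m in TEXT_RE.finditer(stream)]


print(parse_lus("[[wordboundblank]]^LU<tags>$ ^other<tags>$"))
print(parse_text("[[wordboundblank]]word[[/]] word2 word3 [[wordboundblank]]word4[[/]]"))
```

In the second form the closing `[[/]]` is what delimits the span, since without `^...$` there is nothing else to show where the word that owns the blank ends.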
Full Pipe Testing
Current Translation Command: apertium-deshtml < html_input-eng.in | apertium -f none -d $PREFIX/apertium-eng-spa eng-spa | apertium-retxt
Wordbound blank with Transfuse Command: tf-html-fragment $PREFIX/apertium-eng-spa/modes/eng-spa.mode < html_input-eng.in
Spanish - Catalan
Source: <p>Es <s>además</s> de Valencia.</p>
Current Translation: <p>És <s>a més de</s> València.</p>
Ideal Translation: <p>Es <s>además</s> de Valencia.</p>
After wordbound blanks: <p>És <s>a més</s> de València.</p>
English - Spanish
Source: legal <b>persons</b>
Current Translation: Personas jurídicas <b></b>
Ideal Translation: <b>Personas</b> legales
After wordbound blanks: <b>Personas</b> legales
Note: Multiword not recognised because of multiple blanks between the words. Can be updated if needed.

Source: I <b>am</b> David
Current Translation: <b>soy David</b>
Ideal Translation: <b>Soy</b> David
After wordbound blanks: <b>Soy</b> David

Source: <p>Bees <b>cannot</b> swim</p>
Current Translation: <p>Las abejas <b>no pueden</b> nadar</p>
Ideal Translation: <p>Las Abejas <b>no pueden</b> nadar</p>
After wordbound blanks: <p>Las abejas <b>no pueden</b> nadar</p>

Source: <a href="Conway">Conway</a> stated that young <a href="children">children</a><i>“understand <a href="Object_permanence">object permanence</a>. <a href="Concealment">Concealed</a> <a href="Object">objects</a> feature in their awareness.”</i><span typeof="mw:Extension/ref"><a href="#ref-5">[5]</a></span><b>(<a href="Nielsen">Nielsen</a> equivalence).</b>
Current Translation: <a href="Conway">*Conway</a> Declaró que los niños <a href="children">jóvenes</a><i>“entienden <a href="Object_permanence">permanencia de objeto</a>. <a href="Concealment">Encubierto</a> <a href="Object">objeta</a> característica en su concienciación.”</i><span typeof="mw:Extension/ref"><a href="#ref-5">[5]</a></span><b>(<a href="Nielsen">*Nielsen</a> equivalencia).</b>
Ideal Translation:
After wordbound blanks: <a href="Conway">*Conway</a> Declaró que los <a href="children">niños</a> jóvenes“<i>entienden <a href="Object_permanence">permanencia</a></i> de <i> <a href="Object_permanence">objeto</a></i><i>. <a href="Concealment">Encubierto</a> </i> <i><a href="Object">objeta</a> </i> <i>característica en</i> <i>su concienciación</i><i>.</i>”<span typeof="mw:Extension/ref"><a href="#ref-5">\[</a></span><span typeof="mw:Extension/ref"><a href="#ref-5">5</a></span><span typeof="mw:Extension/ref"><a href="#ref-5">\]</a></span><b>(<a href="Nielsen">*Nielsen</a> </b> <b>equivalencia)</b><b>.</b>

Source: <p><b><i>my sister</i><br/>lives</b> <u>in Wales</u></p>
Current Translation: <p><b><i>Mis vidas</i><br/>de hermana</b> <u>en Gales</u></p>
Ideal Translation:
After wordbound blanks: <p><b><i>Mis</i></b> <b>vidas</b> <br>de <b><i>hermana</i></b> <u>en Gales</u></p>

Source: <b>The</b> <i>sister</i>'s <em>dog</em>
Current Translation: <b>El perro</i> de la <em></b> <i>hermana</em>
Ideal Translation:
After wordbound blanks: <em>El perro</em> <b>de l</b> <i>a hermana</i>
Note: Need to check!
From [tests]:
Source: <p>A <b>Japanese</b> <i>BBC</i> article</p>
Current Translation: <p>Una <b>prenda</b> <i>de BBC</i> japonesa</p>
Ideal Translation:
After wordbound blanks: <p>Una prenda de <i>BBC</i> <b>japonesa</b> </p>

Source: <div>A <b>modern</b> Britain.</div>
Current Translation: <div>Una <b>Gran Bretaña</b> moderna.</div>
Ideal Translation: <div>Una Gran Bretaña <b>moderna</b>.</div>
After wordbound blanks: <div>Una Gran Bretaña <b>moderna</b> .</div>

Source: <p>The <b>big <i>red</i></b> dog</p>
Current Translation: <p>El <b>perro <i>rojo</i></b> grande</p>
Ideal Translation: <p>El perro <b><i>rojo</i></b> <b>grande</b></p>
After wordbound blanks: <p>El perro <b> <i>rojo</i></b> <b>grande</b> </p>

Source: <p>He said "<i>I tile <a href="x">bathrooms</a>.</i>"</p>
Current Translation: <p> Diga "<i>#I baños <a href="x">de azulejo</a>.</i>"</p>
Ideal Translation: <p>Diga que "<i>enladrillo</i> <i><a href="x">baños</a></i>."</p>
After wordbound blanks: <p>Diga "<i>#I <a href="x">baños</a></i> de <i>azulejo.</i>"</p>

Source: <p>The <b>big red</b> dog</p>
Current Translation: <p>El <b>perro rojo</b> grande</p>
Ideal Translation: <p>El perro <b>rojo grande</b></p>
After wordbound blanks: <p>El perro <b>rojo grande</b> </p>

Source: <p>The <b>big</b> <b>red</b> dog</p>
Current Translation: <p>El <b>perro</b> <b>rojo</b> grande</p>
Ideal Translation: <p>El perro <b>rojo</b> <b>grande</b></p>
After wordbound blanks: <p>El perro <b>rojo</b> <b>grande</b> </p>

Source: <p>The <a href="1">big</a> <a href="2">red</a> dog</p>
Current Translation: <p>El <a href="1">perro</a> <a href="2">rojo</a> grande</p>
Ideal Translation: <p>El perro <a href="2">rojo</a> <a href="1">grande</a></p>
After wordbound blanks: <p>El perro <a href="2">rojo</a> <a href="1">grande</a> </p>

Source: <p id="8"><span class="cx-segment" data-segmentid="9"><a class="cx-link" data-linkid="17" href="./The_New_York_Times" rel="mw:WikiLink" title="The New York Times">The New York Times</a>, which has an <b>executive editor</b> over the news pages and an <b>editorial page editor</b> over opinion pages.</span></p>
Current Translation: <p id="8"><span class="cx-segment" data-segmentid="9"><a class="cx-link" data-linkid="17" href="./The_New_York_Times" rel="mw:WikiLink" title="The New York Times">The New York Times</a>, el cual tiene un <b>editor ejecutivo</b> sobre las páginas noticiosas y un <b>editor de página del editorial</b> encima páginas de opinión.</span></p>
Ideal Translation: <p id="8"><span data-segmentid="9" class="cx-segment"><a title="The New York Times" rel="mw:WikiLink" href="./The_New_York_Times" data-linkid="17" class="cx-link">The New York Times</a>, el cual tiene un <b>editor ejecutivo</b> sobre las páginas noticiosas y un <b>editor de página del editorial</b> encima páginas de opinión.</span></p>
After wordbound blanks: <p id="8"><span class="cx-segment" data-segmentid="9"><a class="cx-link" data-linkid="17" href="./The_New_York_Times" rel="mw:WikiLink" title="The New York Times">The New York Times</a>, el cual tiene un <b>editor ejecutivo</b> sobre las páginas noticiosas y un <b>editor</b> de <b>página</b> del <b>editorial</b> encima páginas de opinión.</span></p>
Previous Attempts
- https://wiki.apertium.org/wiki/User:SilentFlame/Progress
- https://github.com/junaidiiith/apertium/tree/blank-handling GSoC 2016 project
- https://github.com/unhammer/apertium/tree/blank-handling older, unfinished implementation of the changes required in apertium-transfer, with notes at https://github.com/unhammer/apertium/blob/blank-handling/blank_notes.org#consequences-of-this-type-of-blank-handling
- https://github.com/junaidiiith/apertium
- https://github.com/junaidiiith/Apertium_Code
- https://github.com/unhammer/apertium/commit/b5c73fbe82544d83a98eb16b921c2fa224f6d40c
References
- https://wiki.apertium.org/wiki/Reordering_superblanks
- https://wiki.apertium.org/wiki/Format_handling
- https://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Automatic_blank_handling
- https://www.mediawiki.org/wiki/Content_translation/Markup#Annotation_mapping_using_translation_subsequence_approximation
- https://www.mediawiki.org/wiki/Content_translation/Developers/Markup
- https://www.mediawiki.org/wiki/Content_translation/Product_Definition/LinearDoc
- https://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/superblank_handling_algorithm
- https://sourceforge.net/p/apertium/mailman/apertium-stuff/thread/20cf28cd0904300204v45f35e51i118f4d146f83748@mail.gmail.com/