User:Junzay/Blank handling
Latest revision as of 20:06, 16 August 2016
GSoC 2016 project
Code at https://github.com/junaidiiith/Apertium / https://github.com/junaidiiith/Apertium_Code
What works currently
The deformatter and the reformatter work, but both still need more testing. The FST processor (lt-proc) distributes the tags to the words efficiently and correctly. Pretransfer works and its testing phase is complete. Transfer, interchunk and postchunk are implemented but still need more testing. This is how the chain works as of now:
Deformatter
The deformatter links every word to its inline tag by placing the tag immediately before the word.
Before deformatter:
<p><i>Hello brother</i> How are you <u>doing</u> Do you see <b>the point</b> I <u>couldn't</u> do it</p>
After deformatter:
[5][{1}]Hello brother[] How are you [{2}]doing[] Do you see [{3}]the point[] I [{4}]couldn't[] do it[6]
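The mapping from HTML to blanks shown above can be sketched in a few lines of Python. This is only a toy illustration, not the project's deformatter: the function name `deformat`, the set of tags treated as inline, and the numbering order (inline openers first, then all remaining tags) are assumptions chosen so that the sketch reproduces the example output.

```python
import re

# Assumption: only these tags are treated as inline formatting.
INLINE = {"i", "u", "b"}
TAG = re.compile(r"<(/?)(\w+)[^>]*>")


def deformat(html):
    """Replace inline openers with [{n}], inline closers with [],
    and every other tag with a numbered superblank [n]."""
    tags = list(TAG.finditer(html))
    ids = {}
    n = 0
    # Number the inline opening tags first (assumed order).
    for m in tags:
        if m.group(2) in INLINE and not m.group(1):
            n += 1
            ids[m.start()] = n
    # Then number the remaining (non-inline) tags.
    for m in tags:
        if m.group(2) not in INLINE:
            n += 1
            ids[m.start()] = n

    def repl(m):
        closing, name = m.group(1), m.group(2)
        if name in INLINE:
            return "[]" if closing else "[{%d}]" % ids[m.start()]
        return "[%d]" % ids[m.start()]

    return TAG.sub(repl, html)


print(deformat("<p><i>Hello brother</i> How are you <u>doing</u></p>"))
```

The real deformatter must also remember which number stands for which tag so that the reformatter can restore it; that bookkeeping is left out here.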
Lt-proc
lt-proc distributes the inline tags efficiently to all the words they cover and also handles inline tags that span multiword expressions (MWEs).
After lt-proc:
[5][{1}]^Hello<ij>$[{1}]^brother<n><sg>$[] ^How<adv><itg>$ ^be<vbser><pres>$ ^prpers<prn><obj><p2><mf><sp>$ [{2}]^do<vblex><ger>$[] ^Do<vbdo><pres>$ ^prpers<prn><subj><p2><mf><sp>$ [{3}]^see<vblex><pres># the point$[] ^prpers<prn><subj><p1><mf><sg>$ [{4}]^can<vaux><past>+not<adv>$[] ^do<vbdo><pres>$ ^prpers<prn><subj><p3><nt><sg>$[6]
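The distribution step on its own can be sketched as follows. This is a minimal sketch under stated assumptions: the helper `distribute` is hypothetical, it works directly on the deformatted text, and it performs no morphological analysis or MWE handling, both of which the real lt-proc does.

```python
import re

# An inline span: an opening [{n}], its text, and the closing [].
SPAN = re.compile(r"(\[\{\d+\}\])(.*?)\[\]")


def distribute(text):
    """Copy the opening inline tag of each span onto every word in it."""
    def repl(m):
        tag = m.group(1)
        words = m.group(2).split(" ")
        return " ".join(tag + w for w in words) + "[]"

    return SPAN.sub(repl, text)


print(distribute("[5][{1}]Hello brother[] How are you [{2}]doing[]"))
```

Because every word carries its own copy of the tag, later stages can move or split words without losing track of their formatting.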
Pretransfer
The inline tag before a lexical unit that contains '#' or '+' is distributed to the other words of that unit as well, e.g. [{4}]^can<vaux><past>$ [{4}]^not<adv>$
After pretransfer:
[5][{1}]^Hello<ij>$[{1}]^brother<n><sg>$[] ^How<adv><itg>$ ^be<vbser><pres>$ ^prpers<prn><obj><p2><mf><sp>$ [{2}]^do<vblex><ger>$[] ^Do<vbdo><pres>$ ^prpers<prn><subj><p2><mf><sp>$ [{3}]^see# the point<vblex><pres>$[] ^prpers<prn><subj><p1><mf><sg>$ [{4}]^can<vaux><past>$ [{4}]^not<adv>$[] ^do<vbdo><pres>$ ^prpers<prn><subj><p3><nt><sg>$[6]
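The two pretransfer effects visible above (splitting on '+' with the tag copied to each part, and moving the '#' part in front of the morphological tags) can be sketched like this. It is a toy model, not the project's code: the names `pretransfer` and `move_queue` are hypothetical, and units that combine '+' and '#' are not handled carefully.

```python
import re

# A lexical unit with an optional inline tag directly in front of it.
LU = re.compile(r"(\[\{\d+\}\])?\^([^$]*)\$")


def move_queue(lu):
    """see<vblex><pres># the point -> see# the point<vblex><pres>"""
    if "#" not in lu:
        return lu
    head, queue = lu.split("#", 1)
    i = head.find("<")
    if i == -1:
        return lu
    return head[:i] + "#" + queue + head[i:]


def pretransfer(stream):
    def repl(m):
        tag = m.group(1) or ""
        parts = m.group(2).split("+")
        # Each part of a joined unit receives a copy of the inline tag.
        return " ".join(tag + "^" + move_queue(p) + "$" for p in parts)

    return LU.sub(repl, stream)


print(pretransfer("[{4}]^can<vaux><past>+not<adv>$[]"))
print(pretransfer("[{3}]^see<vblex><pres># the point$[]"))
```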
Transfer
Inside each chunk, every inline tag stays linked to its word.
After transfer:
[5]^default<default>{[{1}]^Hola<ij>$[]}$^Nom<SN><UNDET><m><sg>{[{1}]^hermano<n><3><4>$[]}$ ^adv<adv><itg>{^Cómo<adv><itg>$}$ ^verbcj<SV><vbser><pri><p2><sg>{^ser<vbser><3><4><5>$ ^prpers<prn><subj><p2><mf><sg>$}$ ^ger<SV><vblex><ger><PD><ND>{[{2}]^hacer<vblex><3>$[]}$ ^prnsubj<SN><tn><p2><mf><sg>{^prpers<prn><2><p2><4><sg>$}$ ^verbcj<SV><vblex><pri><PD><ND>{[{3}]^coger<vblex><3><4><5># la gracia$[]}$ ^prnsubj<SN><tn><p1><mf><sg>{^prpers<prn><2><p1><4><sg>$}$ ^mod<SV><vbmod><cni><PD><ND>{[{4}]^poder<vbmod><3><4><5>$[]}$ ^adv<adv><NEG>{[{4}]^no<adv>$[]}$ ^prnsubj<SN><tn><p3><m><sg>{^prpers<prn><2><p3><4><sg>$}$ [6]
Interchunk
In interchunk, all the superblanks belonging to a chunk are output before the chunk is reordered, which avoids the superblank reordering problem.
After interchunk:
[5]^default<default>{[{1}]^Hola<ij>$[]}$ ^Nom<SN><PDET><m><sg>{[{1}]^hermano<n><3><4>$[]}$ ^adv<adv><itg>{^Cómo<adv><itg>$}$ ^verbcj<SV><vbser><pri><p2><sg>{^ser<vbser><3><4><5>$ ^prpers<prn><subj><p2><mf><sg>$}$ ^ger<SV><vblex><ger><PD><ND>{[{2}]^hacer<vblex><3>$[]}$ ^verbcj<SV><vblex><pri><p2><sg>{[{3}]^coger<vblex><3><4><5># la gracia$[]}$ ^mod<SV><vbmod><cni><p1><sg>{[{4}]^poder<vbmod><3><4><5>$[]}$ ^adv<adv><NEG>{[{4}]^no<adv>$[]}$ ^prnsubj<SN><tn><p3><m><sg>{^prpers<prn><2><p3><4><sg>$}$ [6]
Postchunk
After postchunk:
[5][{1}]^Hola<ij>$[] ^El<det><def><m><sg>$ [{1}]^hermano<n><m><sg>$ ^Cómo<adv><itg>$ ^ser<vbser><pri><p2><sg>$ [{2}]^hacer<vblex><ger>$ [{3}]^coger<vblex><pri><p2><sg># la gracia$ [{4}]^poder<vbmod><cni><p1><sg>$[] [{4}]^no<adv>$[] ^prpers<prn><tn><p3><m><sg>$ [6]
Generator
After generator:
[5][{1}]Hola[] El [{1}]hermano Cómo eres [{2}]haciendo [{3}]coges la gracia [{4}]podría[] [{4}]no[] él [6]
Reformatter
The libtidy module beautifies the input and reformats it to give the output:
<html>
<head>
<title></title>
</head>
<body>
<p>
<i>Hola</i> El
<i>hermano Cómo eres</i>
<u>haciendo</u>
<b>coges la gracia</b>
<u>podría</u>
<u>no</u> él</p>
</body>
</html>
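How the blanks might be turned back into tags can be sketched as below. This is an illustration under assumptions, not the project's reformatter: the function `reformat` and the two lookup tables are hypothetical (the real chain has to carry this information from the deformatter), no libtidy clean-up is applied, and an open inline tag is assumed to close at the next blank of any kind, which is consistent with `<i>hermano Cómo eres</i>` in the output above.

```python
import re

# Matches [{n}] (inline opener), [] (inline closer) or [n] (superblank).
TOKEN = re.compile(r"\[\{(\d+)\}\]|\[\]|\[(\d+)\]")


def reformat(text, inline, blocks):
    """inline: id -> tag name; blocks: id -> literal markup (both assumed)."""
    out = []
    open_tag = None
    pos = 0
    for m in TOKEN.finditer(text):
        out.append(text[pos:m.start()])
        pos = m.end()
        # Any blank closes the inline tag that is currently open.
        if open_tag is not None:
            out.append("</%s>" % open_tag)
            open_tag = None
        if m.group(1):
            open_tag = inline[int(m.group(1))]
            out.append("<%s>" % open_tag)
        elif m.group(2):
            out.append(blocks[int(m.group(2))])
    out.append(text[pos:])
    if open_tag is not None:
        out.append("</%s>" % open_tag)
    return "".join(out)


inline = {1: "i", 2: "u", 3: "b", 4: "u"}
blocks = {5: "<p>", 6: "</p>"}
generated = ("[5][{1}]Hola[] El [{1}]hermano Cómo eres [{2}]haciendo "
             "[{3}]coges la gracia [{4}]podría[] [{4}]no[] él [6]")
print(reformat(generated, inline, blocks))
```

The sketch leaves whitespace where it found it, so trailing spaces end up inside some tags; tidying that up is the kind of job the page assigns to libtidy.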
Reordering Superblank issue
The problem described at http://wiki.apertium.org/wiki/Reordering_superblanks is now handled.
Running the following example through the chain:
<p><i>Perro</i> <b>blanco</b></p>
The output:
<html>
<head>
<title></title>
</head>
<body>
<p>
<b>White</b>
<i>dog</i>
</p>
</body>
</html>
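The reason the formatting survives the noun/adjective swap is that each word carries its own inline tag, so the tag moves with the word. A minimal sketch of that idea, assuming a hypothetical helper `reorder_tagged`, a two-word toy lexicon and a hard-coded reordering (none of which come from the project's code):

```python
def reorder_tagged(units, lexicon, order):
    """units: (inline_tag, source_word) pairs in source order.
    order: indices of the units in target order."""
    moved = [units[i] for i in order]
    words = [lexicon[word.lower()] for _, word in moved]
    # Capitalise whichever word ends up first in the target order.
    words[0] = words[0].capitalize()
    return " ".join(
        "<%s>%s</%s>" % (tag, w, tag) for (tag, _), w in zip(moved, words)
    )


units = [("i", "Perro"), ("b", "blanco")]
lexicon = {"perro": "dog", "blanco": "white"}  # toy lexicon (assumption)
print(reorder_tagged(units, lexicon, [1, 0]))
```

If the tags were stored between the words instead of on them, the same swap would leave the bold and italic markup attached to the wrong words.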
Repositories
Apertium: https://github.com/junaidiiith/apertium/tree/blank-handling
lttoolbox: https://github.com/junaidiiith/lttoolbox