Difference between revisions of "User:Khannatanmai/Secondary tags features"

From Apertium
Line 14: Line 14:
 
* Works with MLUs.
 
* If there is a lemq in the LU, sectags appear before the lemq, even if the lemq comes from a variable.
 
'''Example Usage (Here the secondary tags show the surface form):'''
 
 
SPA-ENG
 
Input:
 
<pre>
 
^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>$ ^.<sent><sf:.>/.<sent><sf:.>$ ^.<sent><sf:.>/.<sent><sf:.>$[][ ]
 
</pre>
 
 
Output:
 
<pre>
 
^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][ ]
 
</pre>
 
 
CAT-ITA
 
Input:
 
<pre>
 
^Apertium<np><al><m><sg><sf:Apertium>/Apertium<np><org><m><sg><sf:Apertium>$ ^ser<vbser><pri><p3><sg><sf:és>/essere<vbser><pri><p3><sg><sf:és>$ ^un<det><ind><f><sg><sf:una>/un<det><ind><f><sg><sf:una>$ ^plataforma<n><f><sg><sf:plataforma>/piattaforma<n><f><sg><sf:plataforma>$ ^de<pr><sf:de>/di<pr><sf:de>$ ^traducció<n><f><sg><sf:traducció>/traduzione<n><f><sg><sf:traducció>$ ^automàtic<adj><f><sg><sf:automàtica>/automatico<adj><f><sg><sf:automàtica>$ ^lliure<adj><mf><sg><sf:lliure>/libero<adj><GD><sg><sf:lliure>$ ^i<cnjcoo><sf:i>/e<cnjcoo><sf:i>$ ^obert<adj><f><sg><sf:oberta>/aperto<adj><f><sg><sf:oberta>$^.<sent><sf:.>/.<sent><sf:.>$
 
</pre>
 
 
Output without modification:
 
<pre>
 
^np<SN><m><sg>{^Apertium<np><org><m><sg>$}$ ^essere<SV><vbser><pri><p3><sg>{^essere<vbser><pri><p3><5>$}$ ^det_nom<SN><f><sg><sl_f><sl_sg>{^un<det><ind><2><3>$ ^piattaforma<n><2><3>$}$ ^default<default>{^di<pr><sf:de>$}$ ^nom_adj_adj<SN><f><sg><sl_f><sl_sg>{^traduzione<n><2><3>$ ^automatico<adj><2><3>$ ^libero<adj><2><3>$}$ ^default<default>{^e<cnjcoo><sf:i>$}$ ^adj<SA><f><sg>{^aperto<adj><2><3>$}$^default<default>{^.<sent><sf:.>$}$
 
</pre>
 
 
Output after modification:
 
<pre>
 
^np<SN><m><sg>{^Apertium<np><org><m><sg><sf:Apertium>$}$ ^essere<SV><vbser><pri><p3><sg>{^essere<vbser><pri><p3><5><sf:és>$}$ ^det_nom<SN><f><sg><sl_f><sl_sg>{^un<det><ind><2><3><sf:una>$ ^piattaforma<n><2><3><sf:plataforma>$}$ ^di<pr>{^di<pr><sf:de>$}$ ^nom_adj_adj<SN><f><sg><sl_f><sl_sg>{^traduzione<n><2><3><sf:traducció>$ ^automatico<adj><2><3><sf:automàtica>$ ^libero<adj><2><3><sf:lliure>$}$ ^cnjcoo<cnjcoo>{^e<cnjcoo><sf:i>$}$ ^adj<SA><f><sg>{^aperto<adj><2><3><sf:oberta>$}$^.<sent>{^.<sent><sf:.>$}$
 
</pre>
 
 
'''MLU and Lemq examples'''
 
 
Input (No secondary tags):
 
<pre>
 
^xyz# a bb<n><sg>/xyz# a vv<n><sg>$ ^to<pr>/ to<pr>$ ^abc# hg kg<vblex><inf>/abc# hg kg<vblex><inf>$ ^he<prn><obj><sg>/he<prn><obj><sg>$
 
</pre>
 
Output:
 
<pre>
 
^nom_to_inf<SN><UNDET><sg><inf>{^xyz<n><4># a vv$ ^para<pr>$ ^abc<vblex><5>+he<prn><enc><sg># hg kg$}$
 
</pre>
 
Input (With secondary tags but no multiwords with invariable part):
 
<pre>
 
^xyz<n><sg><sf:adasd>/xyz<n><sg><sf:adasd>$ ^to<pr><sf:to>/ to<pr><sf:to>$ ^abc<vblex><inf><sf:yada>/abc<vblex><inf><sf:yada>$ ^he<prn><obj><sg><sf:blah><id:2>/he<prn><obj><sg><sf:blah><id:2>$
 
</pre>
 
Output:
 
<pre>
 
^nom_to_inf<SN><UNDET><sg><inf>{^xyz<n><4><sf:adasd>$ ^para<pr><sf:to>$ ^abc<vblex><5><sf:yada>+he<prn><enc><sg><sf:blah><id:2>$}$
 
</pre>
 
Input (With secondary tags and multiwords with invariable part):
 
<pre>
 
^xyz# a bb<n><sg><sf:adasd>/xyz# a vv<n><sg><sf:adasd>$ ^to<pr><sf:to>/ to<pr><sf:to>$ ^abc# hg kg<vblex><inf><sf:yada>/abc# hg kg<vblex><inf><sf:yada>$ ^he<prn><obj><sg><sf:blah><id:2>/he<prn><obj><sg><sf:blah><id:2>$
 
</pre>
 
Output:
 
<pre>
 
^Nom_to_inf<SN><UNDET><sg><inf>{^xyz<n><4><sf:adasd># a vv$ ^para<pr><sf:to>$ ^abc<vblex><5><sf:yada>+he<prn><enc><sg><sf:blah><id:2># hg kg$}$
 
</pre>
 
   
 
== Generator: [https://github.com/apertium/lttoolbox/pull/83 Pull Request] ==
 
Line 77: Line 21:
   
 
This is needed for generation.
 
Input:
 
<pre>
 
^The<det><def><pl><sf:Los>$ ^dog<n><pl><sf:perros>$ ^of<pr><sf:del>$ ^the<det><def><sg><sf:del>$ ^boy<n><sg><sf:chico>$ ^run<vblex><pres><sf:corren>$ ^fast<adj><sint><sf:rápido>$^.<sent><sf:.>$^.<sent><sf:.>$[][
 
]
 
</pre>
 
Earlier Output:
 
<pre>
 
#The #dog #of #the #boy #run #fast#.#.[][
 
]
 
</pre>
 
New Output:
 
<pre>
 
The dogs of the boy run fast..[][
 
]
 
</pre>
 
Tests:
 
<pre>
 
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
Stroke of genius
 
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg><sf:4\#sabasa><id:2\#:># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
Stroke of genius
 
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg><sf:sabasa><id:2># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
Stroke of genius
 
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg><sf:4#sabasa><id:2#:># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
Stroke of genius
 
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg><sf:$$4#saba$sa><id:2#:$># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
Stroke of genius
 
</pre>
 
 
* Prefixes can have unescaped special characters as well:
 
<pre>
 
echo "^Stroke<n><sg><$$s#^f:$$4#saba$sa><i#$$#^d:2#:$># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
Stroke of genius
 
</pre>
 
 
* Works with compounds:
 
<pre>
 
Tanmais-MacBook-Pro:lt_proc khannatanmai$ echo "^be<vblex><subs>+not<adv># sorry$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
being not sorry
 
Tanmais-MacBook-Pro:lt_proc khannatanmai$ echo "^be<vblex><subs><sf:xyz>+not<adv><sf:abc># sorry$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
being not sorry
 
Tanmais-MacBook-Pro:lt_proc khannatanmai$ echo "^be<vblex><subs><sf:xyz><id:++$$#>+not<adv><s$f:$+$a##bc># sorry$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
being not sorry
 
</pre>
 

Revision as of 20:53, 17 May 2020

This page lists the features being added to the pipeline to handle secondary tags. To follow updates on the development, see User:Khannatanmai/New_Apertium_stream_format. This work was done as part of Google Summer of Code 2020; see User:Khannatanmai/GSoC2020Proposal_Trimming and User:Khannatanmai/GSoC2020Progress.

For examples and tests, see the talk page.

Module-specific features

Chunker (t1x): Pull Request

  • Secondary tags (sectags) are ignored during rule pattern matching.
  • The "tags" attribute (in t1x) returns only primary tags, not secondary tags, so existing rules do not regress.
  • The "whole" attribute returns the whole LU, including secondary tags.
  • The new "sectags" attribute returns all secondary tags and can be used in clip.
  • Secondary tags in an output LU are copied from the LU that its lem/lemh is clipped from.
  • If the lem/lemh in the output comes from a variable, the sectags come from the LU that the lemma originated from, found by tracing the variable's assignment in <let>.
  • No regression: streams without secondary tags work as-is.
  • Works with MLUs.
  • If there is a lemq in the LU, sectags appear before the lemq, even if the lemq comes from a variable.
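
The split between "tags" and "sectags", and the placement of sectags before the lemq, can be illustrated with a small Python sketch. This is not the chunker's actual code. The helper names `split_lu` and `build_lu` are hypothetical, and the sketch assumes that a secondary tag is any tag of the form `<prefix:value>`, that the LU is a single (non-joined) unit without the surrounding `^` and `$`, and that no tag contains an escaped `<` or `>`.

```python
import re

# Assumption: every tag is written <...> with no escaped angle brackets inside.
TAG_RE = re.compile(r"<([^<>]*)>")


def split_lu(lu):
    """Split 'lemh# lemq<tag>...' into (lemh, primary tags, secondary tags, lemq).

    Hypothetical helper: a tag containing ':' is treated as a secondary tag.
    """
    first = lu.find("<")
    lemma, tag_part = (lu, "") if first == -1 else (lu[:first], lu[first:])
    hash_pos = lemma.find("#")
    if hash_pos == -1:
        lemh, lemq = lemma, ""
    else:
        lemh, lemq = lemma[:hash_pos], lemma[hash_pos:]
    tags = TAG_RE.findall(tag_part)
    primary = [t for t in tags if ":" not in t]    # what "tags" would return
    secondary = [t for t in tags if ":" in t]      # what "sectags" would return
    return lemh, primary, secondary, lemq


def build_lu(lemh, primary, secondary, lemq):
    """Assemble an output LU with sectags after the primary tags and before the lemq."""
    tags = "".join(f"<{t}>" for t in primary + secondary)
    return f"{lemh}{tags}{lemq}"


lemh, primary, secondary, lemq = split_lu("xyz# a vv<n><sg><sf:adasd>")
print(primary, secondary)
print(build_lu(lemh, ["n", "4"], secondary, lemq))
```

In the last call the primary tags are replaced by the chunk-internal `<n><4>` while the sectags and lemq are carried over from the source LU, which mirrors how an output LU keeps the sectags of the LU its lemma came from.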

Generator: Pull Request

  • Removes all trailing secondary tags from the input before passing it to FST matching.
  • Input without secondary tags is processed as before, with no regression.
  • Escaped characters inside secondary tags are ignored, as are unescaped special characters ($, #, etc.). This also applies to the tag prefix.

This is needed for generation.
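
The generator behaviour described above can be approximated in a few lines of Python. This is only a sketch of the idea, not the lttoolbox implementation. The function name `strip_sectags` is hypothetical, and the sketch assumes that a secondary tag is any `<...>` tag containing a colon and that no tag contains an escaped `<` or `>`.

```python
import re

# Assumption: a secondary tag is <prefix:value>; the prefix and the value may
# contain any characters except angle brackets, including $, #, ^, + and backslashes.
SECTAG_RE = re.compile(r"<[^<>]*:[^<>]*>")


def strip_sectags(lu):
    """Return the LU with every secondary tag removed, ready for FST matching."""
    return SECTAG_RE.sub("", lu)


for lu in [
    "^Stroke<n><sg># of genius$",                        # no sectags: unchanged
    "^Stroke<n><sg><sf:sabasa><id:2># of genius$",       # plain sectags
    "^Stroke<n><sg><sf:$$4#saba$sa><id:2#:$># of genius$",  # special characters in values
    "^be<vblex><subs><sf:xyz>+not<adv><sf:abc># sorry$",  # joined LUs
]:
    print(strip_sectags(lu))
```

After stripping, each input is identical to its sectag-free form, so the generator's FST sees exactly what it would have seen before secondary tags were introduced.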