Difference between revisions of "User:Khannatanmai/Secondary tags features"

From Apertium
 
(3 intermediate revisions by the same user not shown)
Line 1: Line 1:
  +
<strong style="color:maroon;font-size:1.5em;">The secondary tags project was shelved because the need for reading specific secondary information wasn't established. This work continued as LU-bound secondary information using [[User:Khannatanmai/Wordbound_blanks | wordbound blanks]].</strong>
  +
 
This page will list all the features being added to the pipe to deal with secondary tags. To follow updates on the development, see [[User:Khannatanmai/New_Apertium_stream_format]]. This was done as part of the Google Summer of Code 2020. [[User:Khannatanmai/GSoC2020Proposal_Trimming]]. [[User:Khannatanmai/GSoC2020Progress]].
 
  +
  +
For examples and tests, see the [[User_talk:Khannatanmai/Secondary tags features|talk page]]
   
 
= Module-specific features =
 
Line 12: Line 16:
 
* Works with MLUs.
 
 
* If there is a lemq in the LU, sectags appear before the lemq, even if the lemq comes from a variable.
 
'''Example Usage (Here the secondary tags show the surface form):'''
 
 
SPA-ENG
 
Input:
 
<pre>
 
^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>$ ^.<sent><sf:.>/.<sent><sf:.>$ ^.<sent><sf:.>/.<sent><sf:.>$[][ ]
 
</pre>
 
 
Output:
 
<pre>
 
^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][ ]
 
</pre>
 
 
CAT-ITA
 
Input:
 
<pre>
 
^Apertium<np><al><m><sg><sf:Apertium>/Apertium<np><org><m><sg><sf:Apertium>$ ^ser<vbser><pri><p3><sg><sf:és>/essere<vbser><pri><p3><sg><sf:és>$ ^un<det><ind><f><sg><sf:una>/un<det><ind><f><sg><sf:una>$ ^plataforma<n><f><sg><sf:plataforma>/piattaforma<n><f><sg><sf:plataforma>$ ^de<pr><sf:de>/di<pr><sf:de>$ ^traducció<n><f><sg><sf:traducció>/traduzione<n><f><sg><sf:traducció>$ ^automàtic<adj><f><sg><sf:automàtica>/automatico<adj><f><sg><sf:automàtica>$ ^lliure<adj><mf><sg><sf:lliure>/libero<adj><GD><sg><sf:lliure>$ ^i<cnjcoo><sf:i>/e<cnjcoo><sf:i>$ ^obert<adj><f><sg><sf:oberta>/aperto<adj><f><sg><sf:oberta>$^.<sent><sf:.>/.<sent><sf:.>$
 
</pre>
 
 
Output without modification:
 
<pre>
 
^np<SN><m><sg>{^Apertium<np><org><m><sg>$}$ ^essere<SV><vbser><pri><p3><sg>{^essere<vbser><pri><p3><5>$}$ ^det_nom<SN><f><sg><sl_f><sl_sg>{^un<det><ind><2><3>$ ^piattaforma<n><2><3>$}$ ^default<default>{^di<pr><sf:de>$}$ ^nom_adj_adj<SN><f><sg><sl_f><sl_sg>{^traduzione<n><2><3>$ ^automatico<adj><2><3>$ ^libero<adj><2><3>$}$ ^default<default>{^e<cnjcoo><sf:i>$}$ ^adj<SA><f><sg>{^aperto<adj><2><3>$}$^default<default>{^.<sent><sf:.>$}$
 
</pre>
 
 
Output after modification:
 
<pre>
 
^np<SN><m><sg>{^Apertium<np><org><m><sg><sf:Apertium>$}$ ^essere<SV><vbser><pri><p3><sg>{^essere<vbser><pri><p3><5><sf:és>$}$ ^det_nom<SN><f><sg><sl_f><sl_sg>{^un<det><ind><2><3><sf:una>$ ^piattaforma<n><2><3><sf:plataforma>$}$ ^di<pr>{^di<pr><sf:de>$}$ ^nom_adj_adj<SN><f><sg><sl_f><sl_sg>{^traduzione<n><2><3><sf:traducció>$ ^automatico<adj><2><3><sf:automàtica>$ ^libero<adj><2><3><sf:lliure>$}$ ^cnjcoo<cnjcoo>{^e<cnjcoo><sf:i>$}$ ^adj<SA><f><sg>{^aperto<adj><2><3><sf:oberta>$}$^.<sent>{^.<sent><sf:.>$}$
 
</pre>
 
 
'''MLU and Lemq examples'''
 
 
Input (No secondary tags):
 
<pre>
 
^xyz# a bb<n><sg>/xyz# a vv<n><sg>$ ^to<pr>/ to<pr>$ ^abc# hg kg<vblex><inf>/abc# hg kg<vblex><inf>$ ^he<prn><obj><sg>/he<prn><obj><sg>$
 
</pre>
 
Output:
 
<pre>
 
^nom_to_inf<SN><UNDET><sg><inf>{^xyz<n><4># a vv$ ^para<pr>$ ^abc<vblex><5>+he<prn><enc><sg># hg kg$}$
 
</pre>
 
Input (With secondary tags but no multiwords with invariable part):
 
<pre>
 
^xyz<n><sg><sf:adasd>/xyz<n><sg><sf:adasd>$ ^to<pr><sf:to>/ to<pr><sf:to>$ ^abc<vblex><inf><sf:yada>/abc<vblex><inf><sf:yada>$ ^he<prn><obj><sg><sf:blah><id:2>/he<prn><obj><sg><sf:blah><id:2>$
 
</pre>
 
Output:
 
<pre>
 
^nom_to_inf<SN><UNDET><sg><inf>{^xyz<n><4><sf:adasd>$ ^para<pr><sf:to>$ ^abc<vblex><5><sf:yada>+he<prn><enc><sg><sf:blah><id:2>$}$
 
</pre>
 
Input (With secondary tags and multiwords with invariable part):
 
<pre>
 
^xyz# a bb<n><sg><sf:adasd>/xyz# a vv<n><sg><sf:adasd>$ ^to<pr><sf:to>/ to<pr><sf:to>$ ^abc# hg kg<vblex><inf><sf:yada>/abc# hg kg<vblex><inf><sf:yada>$ ^he<prn><obj><sg><sf:blah><id:2>/he<prn><obj><sg><sf:blah><id:2>$
 
</pre>
 
Output:
 
<pre>
 
^Nom_to_inf<SN><UNDET><sg><inf>{^xyz<n><4><sf:adasd># a vv$ ^para<pr><sf:to>$ ^abc<vblex><5><sf:yada>+he<prn><enc><sg><sf:blah><id:2># hg kg$}$
 
</pre>
 
   
 
== Generator: [https://github.com/apertium/lttoolbox/pull/83 Pull Request] ==
 
Line 75: Line 23:
   
 
This is needed for generation.
 
Input:
 
<pre>
 
^The<det><def><pl><sf:Los>$ ^dog<n><pl><sf:perros>$ ^of<pr><sf:del>$ ^the<det><def><sg><sf:del>$ ^boy<n><sg><sf:chico>$ ^run<vblex><pres><sf:corren>$ ^fast<adj><sint><sf:rápido>$^.<sent><sf:.>$^.<sent><sf:.>$[][
 
]
 
</pre>
 
Earlier Output:
 
<pre>
 
#The #dog #of #the #boy #run #fast#.#.[][
 
]
 
</pre>
 
New Output:
 
<pre>
 
The dogs of the boy run fast..[][
 
]
 
</pre>
 
Tests:
 
<pre>
 
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
Stroke of genius
 
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg><sf:4\#sabasa><id:2\#:># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
Stroke of genius
 
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg><sf:sabasa><id:2># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
Stroke of genius
 
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg><sf:4#sabasa><id:2#:># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
Stroke of genius
 
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg><sf:$$4#saba$sa><id:2#:$># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
Stroke of genius
 
</pre>
 
 
* Prefixes can have unescaped special characters as well:
 
<pre>
 
echo "^Stroke<n><sg><$$s#^f:$$4#saba$sa><i#$$#^d:2#:$># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
Stroke of genius
 
</pre>
 
 
* Works with compounds:
 
<pre>
 
Tanmais-MacBook-Pro:lt_proc khannatanmai$ echo "^be<vblex><subs>+not<adv># sorry$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
being not sorry
 
Tanmais-MacBook-Pro:lt_proc khannatanmai$ echo "^be<vblex><subs><sf:xyz>+not<adv><sf:abc># sorry$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
being not sorry
 
Tanmais-MacBook-Pro:lt_proc khannatanmai$ echo "^be<vblex><subs><sf:xyz><id:++$$#>+not<adv><s$f:$+$a##bc># sorry$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
 
being not sorry
 
</pre>
 

Latest revision as of 09:24, 17 July 2020

The secondary tags project was shelved as the need for reading specific secondary information wasn't established. This work continued as LU-bound secondary information using wordbound blanks.

This page will list all the features being added to the pipe to deal with secondary tags. To follow updates on the development, see User:Khannatanmai/New_Apertium_stream_format. This was done as part of the Google Summer of Code 2020. User:Khannatanmai/GSoC2020Proposal_Trimming. User:Khannatanmai/GSoC2020Progress.

For examples and tests, see the talk page

Module-specific features

Chunker (t1x): Pull Request

  • Secondary tags (sectags) are ignored while pattern matching for rules.
  • The "tags" attribute (in t1x) returns only primary tags, not secondary tags (this ensures no regression).
  • "whole" gets the whole LU, including secondary tags.
  • The new "sectags" attribute returns all secondary tags (it can be used in clip).
  • Secondary tags are added to the output LU from the LU that the lem/lemh is clipped from.
  • If the lem/lemh comes from a variable in the output, the sectags come from the LU that the lemma comes from, found by tracing the variable's assignment in <let>.
  • No regression: streams without secondary tags work as-is.
  • Works with MLUs.
  • If there is a lemq in the LU, sectags appear before the lemq, even if the lemq comes from a variable.
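
The ordering described in the list above (lemh, then primary tags, then sectags, then lemq) can be illustrated with a small Python sketch. This is not the code from the pull request. The names split_lu and join_lu are hypothetical, and the sketch assumes that a secondary tag is any tag containing a colon (as in <sf:perros> or <id:2>), that tags contain no nested angle brackets, and that nothing is escaped.

```python
import re

# Assumption: a tag is "<...>" with no nested angle brackets, and it counts
# as secondary exactly when it contains a colon (e.g. <sf:perros>, <id:2>).
TAG = re.compile(r"<([^<>]*)>")


def split_lu(lu):
    """Split one side of an LU (without ^, / or $) into its parts.

    Expects the input-side layout, where the lemq ("# ...") follows the
    lemma head and precedes the tags, e.g. "xyz# a vv<n><sg><sf:adasd>".
    Returns (lemh, lemq, primary_tags, secondary_tags).
    """
    first_tag = lu.find("<")
    lemma = lu if first_tag == -1 else lu[:first_tag]
    rest = "" if first_tag == -1 else lu[first_tag:]
    lemh, sep, queue = lemma.partition("#")
    lemq = sep + queue  # empty string when there is no lemq
    tags, sectags = [], []
    for match in TAG.finditer(rest):
        (sectags if ":" in match.group(1) else tags).append(match.group(0))
    return lemh, lemq, tags, sectags


def join_lu(lemh, lemq, tags, sectags):
    """Rebuild the LU with the sectags placed before the lemq."""
    return lemh + "".join(tags) + "".join(sectags) + lemq


print(join_lu(*split_lu("xyz# a vv<n><sg><sf:adasd>")))
# prints: xyz<n><sg><sf:adasd># a vv
```

The sketch only reorders the parts of a single LU. It does not model rule matching, clips, or the tracing of variable assignments in <let>.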

Generator: Pull Request

  • Removes all trailing secondary tags from the input before giving it to FST matching.
  • Input without secondary tags is processed as before, with no regression.
  • All escaped characters are ignored inside secondary tags, as well as unescaped special characters ($, #, etc.). This also applies to the tag prefix.

This is needed for generation.
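
The stripping step can be sketched in Python as follows. This is only an illustration of the idea, not the lt-proc implementation. The function name strip_sectags is hypothetical, and the sketch assumes that a secondary tag is any <...> tag containing a colon and no nested angle brackets. It removes every such tag wherever it occurs, which is a simplification of the "trailing" behaviour described above.

```python
import re

# Assumption: secondary tags look like <prefix:value>. Any characters other
# than angle brackets (including $, #, + and ^) may appear on either side
# of the colon, so they need no special handling here.
SECTAG = re.compile(r"<[^<>]*:[^<>]*>")


def strip_sectags(lu):
    """Return the LU with every secondary tag removed.

    Primary tags such as <n> or <sg> contain no colon and are kept, so
    input without secondary tags comes back unchanged.
    """
    return SECTAG.sub("", lu)


print(strip_sectags("^Stroke<n><sg><sf:sabasa><id:2># of genius$"))
# prints: ^Stroke<n><sg># of genius$
```

After this step the remaining string contains only the lemma, the primary tags and the lemq, which is the form a generator without secondary tag support would expect.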