User:Khannatanmai/Secondary tags features


This page lists the features being added to the pipeline to handle secondary tags. To follow development updates, see User:Khannatanmai/New_Apertium_stream_format. The work was done as part of Google Summer of Code 2020; see also User:Khannatanmai/GSoC2020Proposal_Trimming and User:Khannatanmai/GSoC2020Progress.

Module-specific features

Chunker (t1x): Pull Request

  • Secondary tags (sectags) are ignored when matching rule patterns.
  • The "tags" attribute (in t1x) returns only primary tags, not secondary tags, which ensures no regression.
  • "whole" returns the whole LU, including secondary tags.
  • A new attribute, "sectags", returns all secondary tags (it can be used in clip).
  • Secondary tags are added to the output LU from the LU that the lem/lemh is clipped from.
  • If the lem/lemh in the output comes from a variable, the sectags come from the LU that the lemma originated from, found by tracing the variable's assignment in <let>.
  • No regression: streams without secondary tags work as before.
  • Works with MLUs.
  • If the LU has a lemq, the sectags appear before the lemq, even if the lemq comes from a variable.
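
The split between primary and secondary tags described above can be sketched in Python. This is only an illustration, not the chunker's actual code: it assumes a secondary tag is any tag containing a colon (as in the <sf:...> and <id:...> examples on this page), and it ignores escapes and the lemq. The names split_tags and pattern_key are hypothetical.

```python
import re

# Matches one tag such as <n> or <sf:perros> (no escape handling in this sketch).
TAG = re.compile(r"<([^<>]+)>")

def split_tags(reading):
    """Split a reading like 'perro<n><m><pl><sf:perros>' into
    (lemma, primary tags, secondary tags).
    Assumption: a secondary tag is a tag that contains a colon."""
    lemma = reading.split("<", 1)[0]
    tags = TAG.findall(reading)
    primary = [t for t in tags if ":" not in t]
    secondary = [t for t in tags if ":" in t]
    return lemma, primary, secondary

def pattern_key(reading):
    """What a rule pattern would see: lemma plus primary tags only."""
    lemma, primary, _ = split_tags(reading)
    return lemma + "".join(f"<{t}>" for t in primary)

print(split_tags("perro<n><m><pl><sf:perros>"))
print(pattern_key("perro<n><m><pl><sf:perros>"))
```

With this split, a reading with and without sectags yields the same pattern key, which is the "no regression" property listed above.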

Example usage (here the secondary tags carry the surface form):

SPA-ENG Input:

^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>$ ^.<sent><sf:.>/.<sent><sf:.>$ ^.<sent><sf:.>/.<sent><sf:.>$[][ ]

Output:

^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][ ]

CAT-ITA Input:

^Apertium<np><al><m><sg><sf:Apertium>/Apertium<np><org><m><sg><sf:Apertium>$ ^ser<vbser><pri><p3><sg><sf:és>/essere<vbser><pri><p3><sg><sf:és>$ ^un<det><ind><f><sg><sf:una>/un<det><ind><f><sg><sf:una>$ ^plataforma<n><f><sg><sf:plataforma>/piattaforma<n><f><sg><sf:plataforma>$ ^de<pr><sf:de>/di<pr><sf:de>$ ^traducció<n><f><sg><sf:traducció>/traduzione<n><f><sg><sf:traducció>$ ^automàtic<adj><f><sg><sf:automàtica>/automatico<adj><f><sg><sf:automàtica>$ ^lliure<adj><mf><sg><sf:lliure>/libero<adj><GD><sg><sf:lliure>$ ^i<cnjcoo><sf:i>/e<cnjcoo><sf:i>$ ^obert<adj><f><sg><sf:oberta>/aperto<adj><f><sg><sf:oberta>$^.<sent><sf:.>/.<sent><sf:.>$

Output without modification:

^np<SN><m><sg>{^Apertium<np><org><m><sg>$}$ ^essere<SV><vbser><pri><p3><sg>{^essere<vbser><pri><p3><5>$}$ ^det_nom<SN><f><sg><sl_f><sl_sg>{^un<det><ind><2><3>$ ^piattaforma<n><2><3>$}$ ^default<default>{^di<pr><sf:de>$}$ ^nom_adj_adj<SN><f><sg><sl_f><sl_sg>{^traduzione<n><2><3>$ ^automatico<adj><2><3>$ ^libero<adj><2><3>$}$ ^default<default>{^e<cnjcoo><sf:i>$}$ ^adj<SA><f><sg>{^aperto<adj><2><3>$}$^default<default>{^.<sent><sf:.>$}$

Output after modification:

^np<SN><m><sg>{^Apertium<np><org><m><sg><sf:Apertium>$}$ ^essere<SV><vbser><pri><p3><sg>{^essere<vbser><pri><p3><5><sf:és>$}$ ^det_nom<SN><f><sg><sl_f><sl_sg>{^un<det><ind><2><3><sf:una>$ ^piattaforma<n><2><3><sf:plataforma>$}$ ^di<pr>{^di<pr><sf:de>$}$ ^nom_adj_adj<SN><f><sg><sl_f><sl_sg>{^traduzione<n><2><3><sf:traducció>$ ^automatico<adj><2><3><sf:automàtica>$ ^libero<adj><2><3><sf:lliure>$}$ ^cnjcoo<cnjcoo>{^e<cnjcoo><sf:i>$}$ ^adj<SA><f><sg>{^aperto<adj><2><3><sf:oberta>$}$^.<sent>{^.<sent><sf:.>$}$

MLU and Lemq examples

Input (No secondary tags):

^xyz# a bb<n><sg>/xyz# a vv<n><sg>$  ^to<pr>/ to<pr>$  ^abc# hg kg<vblex><inf>/abc# hg kg<vblex><inf>$  ^he<prn><obj><sg>/he<prn><obj><sg>$

Output:

^nom_to_inf<SN><UNDET><sg><inf>{^xyz<n><4># a vv$  ^para<pr>$  ^abc<vblex><5>+he<prn><enc><sg># hg kg$}$  

Input (With secondary tags but no multiwords with invariable part):

^xyz<n><sg><sf:adasd>/xyz<n><sg><sf:adasd>$  ^to<pr><sf:to>/ to<pr><sf:to>$  ^abc<vblex><inf><sf:yada>/abc<vblex><inf><sf:yada>$  ^he<prn><obj><sg><sf:blah><id:2>/he<prn><obj><sg><sf:blah><id:2>$

Output:

^nom_to_inf<SN><UNDET><sg><inf>{^xyz<n><4><sf:adasd>$  ^para<pr><sf:to>$  ^abc<vblex><5><sf:yada>+he<prn><enc><sg><sf:blah><id:2>$}$  

Input (With secondary tags and multiwords with invariable part):

^xyz# a bb<n><sg><sf:adasd>/xyz# a vv<n><sg><sf:adasd>$  ^to<pr><sf:to>/ to<pr><sf:to>$  ^abc# hg kg<vblex><inf><sf:yada>/abc# hg kg<vblex><inf><sf:yada>$  ^he<prn><obj><sg><sf:blah><id:2>/he<prn><obj><sg><sf:blah><id:2>$

Output:

^Nom_to_inf<SN><UNDET><sg><inf>{^xyz<n><4><sf:adasd># a vv$  ^para<pr><sf:to>$  ^abc<vblex><5><sf:yada>+he<prn><enc><sg><sf:blah><id:2># hg kg$}$ 
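
The ordering seen in these outputs (primary tags, then sectags, then the lemq at the end of the whole LU) can be sketched as follows. This is a hypothetical helper written for illustration, not code from the chunker; build_lu and its parameters are assumed names.

```python
def build_lu(parts, lemq=""):
    """Assemble an output LU.
    parts: list of (lemh, primary_tags, sectags), one entry per joined word.
    lemq:  invariable part of a multiword, e.g. '# a vv', placed after all tags."""
    joined = "+".join(
        lemh + "".join(f"<{t}>" for t in list(primary) + list(sectags))
        for lemh, primary, sectags in parts
    )
    return f"^{joined}{lemq}$"

# Reproduces the first and third LUs inside the chunk of the last output above.
print(build_lu([("xyz", ["n", "4"], ["sf:adasd"])], "# a vv"))
print(build_lu(
    [("abc", ["vblex", "5"], ["sf:yada"]),
     ("he", ["prn", "enc", "sg"], ["sf:blah", "id:2"])],
    "# hg kg",
))
```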

Generator: Pull Request

  • Removes all trailing secondary tags from the input before passing it to FST matching.
  • Input without secondary tags is processed as before, with no regression.
  • Escaped characters inside secondary tags are ignored, as are unescaped special characters ($, #, etc.). This applies to the tag prefix as well.
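
The stripping step can be approximated in Python as below. This is a rough sketch and not the lt-proc implementation: it assumes a secondary tag is any tag containing a colon, and it removes every such tag rather than only trailing ones (the two coincide in the examples on this page). The name strip_sectags is hypothetical.

```python
import re

# Either an escaped character (kept untouched) or a whole <...> tag.
# Inside a tag, escaped characters and unescaped $, #, ^, + are all allowed.
TOKEN = re.compile(r"(\\.)|<((?:\\.|[^<>\\])*)>")

def strip_sectags(lu):
    """Drop every tag that contains a colon; leave everything else as-is."""
    def repl(m):
        if m.group(1) is not None:      # escaped character outside a tag
            return m.group(0)
        if ":" in m.group(2):           # assumed marker of a secondary tag
            return ""
        return m.group(0)               # primary tag
    return TOKEN.sub(repl, lu)

print(strip_sectags("^Stroke<n><sg><sf:$$4#saba$sa><id:2#:$># of genius$"))
print(strip_sectags("^be<vblex><subs><sf:xyz><id:++$$#>+not<adv><s$f:$+$a##bc># sorry$"))
```

Both calls should leave only the lemma, the primary tags and the invariable part, which is the form the generator can match against the FST.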

This is needed for generation to work on input that carries secondary tags. Input:

^The<det><def><pl><sf:Los>$ ^dog<n><pl><sf:perros>$ ^of<pr><sf:del>$ ^the<det><def><sg><sf:del>$ ^boy<n><sg><sf:chico>$ ^run<vblex><pres><sf:corren>$ ^fast<adj><sint><sf:rápido>$^.<sent><sf:.>$^.<sent><sf:.>$[][
]

Earlier Output:

 #The #dog #of #the #boy #run #fast#.#.[][
]

New Output:

The dogs of the boy run fast..[][
]

Tests:

Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
Stroke of genius
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg><sf:4\#sabasa><id:2\#:># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
Stroke of genius
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg><sf:sabasa><id:2># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
Stroke of genius
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg><sf:4#sabasa><id:2#:># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
Stroke of genius
Tanmais-MacBook-Pro:transfer khannatanmai$ echo "^Stroke<n><sg><sf:$$4#saba$sa><id:2#:$># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
Stroke of genius
  • Prefixes can have unescaped special characters as well:
echo "^Stroke<n><sg><$$s#^f:$$4#saba$sa><i#$$#^d:2#:$># of genius$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
Stroke of genius
  • Works with compounds:
Tanmais-MacBook-Pro:lt_proc khannatanmai$ echo "^be<vblex><subs>+not<adv># sorry$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
being not sorry
Tanmais-MacBook-Pro:lt_proc khannatanmai$ echo "^be<vblex><subs><sf:xyz>+not<adv><sf:abc># sorry$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
being not sorry
Tanmais-MacBook-Pro:lt_proc khannatanmai$ echo "^be<vblex><subs><sf:xyz><id:++$$#>+not<adv><s$f:$+$a##bc># sorry$" | lt-proc -g ../../apertium-eng-spa/spa-eng.autogen.bin
being not sorry