User:Khannatanmai/Secondary tags features

From Apertium

The secondary tags project was shelved because the need for reading specific secondary information was not established. The work continued as LU-bound secondary information using wordbound blanks.

This page lists the features added to the pipeline to deal with secondary tags. To follow updates on the development, see User:Khannatanmai/New_Apertium_stream_format. This work was done as part of Google Summer of Code 2020; see also User:Khannatanmai/GSoC2020Proposal_Trimming and User:Khannatanmai/GSoC2020Progress.

For examples and tests, see the talk page.

Module-specific features

Chunker (t1x): Pull Request

  • Secondary tags (sectags) are ignored when matching rule patterns.
  • The "tags" attribute (in t1x) gets only primary tags, not secondary tags, which ensures no regression.
  • "whole" gets the whole LU including secondary tags.
  • The new attribute "sectags" gets all secondary tags (it can be used in clip).
  • Secondary tags in the output LU are copied from the LU that the lem/lemh is clipped from.
  • If the lem/lemh in the output comes from a variable, the sectags come from the LU that the lemma originally came from, found by tracing the variable's assignment in <let>.
  • No regression: streams without secondary tags work as before.
  • Works with MLUs.
  • If there is a lemq in the LU, sectags appear before the lemq, even if the lemq comes from a variable.
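
The difference between the "tags", "sectags" and "whole" parts listed above can be illustrated with a small Python sketch. This is not the t1x implementation. It assumes, purely for illustration, that a secondary tag is written as <prefix:value> and a primary tag as a plain <tag>; this page does not spell out the syntax. The function names clip and pattern_key are hypothetical, and escapes, lemq and MLUs are left out.

```python
import re

# Assumption: every tag is written <...> and contains no angle brackets.
TAG = re.compile(r"<([^<>]*)>")


def is_secondary(tag: str) -> bool:
    # Assumption for this sketch: a secondary tag has the form prefix:value.
    return ":" in tag


def clip(lu: str, part: str) -> str:
    """Return one part of a lexical unit such as 'dog<n><pl><sf:dogs>'."""
    lemma = lu.split("<", 1)[0]
    tags = TAG.findall(lu)
    if part == "lem":
        return lemma
    if part == "tags":  # primary tags only
        return "".join(f"<{t}>" for t in tags if not is_secondary(t))
    if part == "sectags":  # secondary tags only
        return "".join(f"<{t}>" for t in tags if is_secondary(t))
    if part == "whole":  # the LU unchanged, secondary tags included
        return lu
    raise ValueError(f"unknown part: {part}")


def pattern_key(lu: str) -> str:
    # What rule patterns would be matched against: sectags are dropped.
    return clip(lu, "lem") + clip(lu, "tags")
```

For an input such as dog<n><pl><sf:dogs>, the "tags" part in this sketch is <n><pl>, the "sectags" part is <sf:dogs>, and pattern matching sees dog<n><pl>, the same as for a stream without secondary tags.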

Generator: Pull Request

  • Removes all trailing secondary tags from the input before giving it to FST matching.
  • Input without secondary tags is handled as before, with no regression.
  • All escaped characters inside secondary tags are ignored, as are unescaped special characters ($, #, etc.). This applies to the tag prefix as well.

This is needed for generation.
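
A minimal Python sketch of the stripping step described above, not the generator's actual code. It again assumes a hypothetical <prefix:value> syntax for secondary tags, and the name strip_trailing_sectags is made up for this example. Escaped characters and unescaped special characters such as $ and # are allowed inside both the prefix and the value.

```python
import re

# Assumed shape of one secondary tag: <prefix:value>.
# "\\." consumes any escaped character (for example "\>"), so it cannot
# end the tag early; "$", "#" and similar characters are ordinary here.
_SECTAG = r"<(?:\\.|[^<>\\:])+:(?:\\.|[^<>\\])*>"

# One or more secondary tags sitting at the very end of the LU.
_TRAILING = re.compile(r"(?:" + _SECTAG + r")+\Z")


def strip_trailing_sectags(lu: str) -> str:
    """Drop trailing secondary tags before the LU is matched by the FST.

    Simplification: escaped angle brackets in the lemma are not handled.
    """
    return _TRAILING.sub("", lu)
```

Under these assumptions, dog<n><pl><sf:dogs> becomes dog<n><pl>, and an LU without secondary tags is returned unchanged.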