User:Khannatanmai/Alternate stream modification

From Apertium
< User:Khannatanmai
Revision as of 07:05, 1 June 2020 by Khannatanmai (talk | contribs) (Created page with "The original proposal for modification to the apertium stream format involved using secondary tags in LUs that would carry an arbitrary amount of information. This information...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

The original proposal for modification to the apertium stream format involved using secondary tags in LUs that would carry an arbitrary amount of information. This information can help in a variety of tasks, the most important ones being: eliminating dictionary trimming, markup handling, alignments generation, etc.

The biggest benefit of such secondary tags is that to the best of their ability, they will move around with their LUs in transfer, and hence this gives us the ability to transport information through the entire pipeline. Transporting the surface form is a fundamental part of our solution to eliminate trimming, and transporting markup tags is essential to ensure accurate reordering. These secondary tags are a part of the Lexical Unit.

Format: ^lemma<tags><sec:tags>$

Objections

This proposal attracted several objections from the Apertium community. A large part of this discussion was about whether eliminating trimming was even necessary, but on this page I will contain the discussion to be just about this stream modification.

- Readability: Apertium currently has a very readable and debug-friendly stream. Adding a lot of secondary tags to each LU would definitely make the stream less readable, and hence make it harder to debug for the language pair creator. This is made worse by the <feature:value> ness of the secondary tags.

- Information availability: The transfer of information in the Apertium pipeline is largely on a need basis, which means that a ton of arbitrary information that a lot of people might not need will be available to the language developer.

- Unapertium: An explosion of non-linguistic information in the lexical unit doesn't seem very consistent with the way Apertium works. The information represented by these tags is mostly about the lexical unit and not necessarily linguistic information needed as part of the translation process. Information in the Lexical Unit should be linguistic information about the lexical unit".

- FST pattern matching: All modules that use FSTs for pattern matching need to be modified to ignore secondary tags.

Benefits

There are however several benefits of secondary tags, as pointed out by several people.

- Reading specific information: They give us the ability to attach information to each reading. This is especially helpful if we want to represent lttoolbox weights as secondary tags to reduce compile time of the dictionary.

- Attached to the LU: Being attached with the LU makes it easier to move them around with the LU in transfer, which is arguably the biggest bottle neck in transporting information through the pipeline.

Alternate Proposal