User:Khannatanmai/Alternate stream modification
The secondary tags project was shelved as the need for reading specific secondary information wasn't established. This work continued as LU-bound secondary information using wordbound blanks.
Original Proposal
The original proposal for modifying the Apertium stream format involved using secondary tags in LUs (lexical units) that would carry an arbitrary amount of information. This information can help in a variety of tasks, the most important being eliminating dictionary trimming, markup handling, and alignment generation.
The biggest benefit of such secondary tags is that, as far as possible, they move around with their LUs in transfer, which lets us transport information through the entire pipeline. Transporting the surface form is a fundamental part of our solution to eliminate trimming, and transporting markup tags is essential to ensure accurate reordering. These secondary tags are a part of the Lexical Unit.
Format: ^lemma<tags><sec:tags>$
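As a rough illustration of the format above, the following Python sketch splits one LU into its lemma, primary tags and secondary tags. It assumes that a secondary tag is any tag containing a colon (the feature:value shape); the proposal does not fix this rule, and parse_lu is a hypothetical helper, not part of any Apertium tool.

```python
import re

# Assumption: a secondary tag is any <...> tag that contains a colon
# (<feature:value>); every other tag is treated as a primary tag.
TAG_RE = re.compile(r"<([^<>]+)>")


def parse_lu(lu):
    """Split '^lemma<tags><sec:tags>$' into (lemma, primary, secondary)."""
    if not (lu.startswith("^") and lu.endswith("$")):
        raise ValueError("not a lexical unit: %r" % lu)
    body = lu[1:-1]
    lemma = body.split("<", 1)[0]
    primary, secondary = [], {}
    for tag in TAG_RE.findall(body):
        if ":" in tag:
            # Split on the first colon only, so 't:s:s1234' keeps 's:s1234'.
            feature, value = tag.split(":", 1)
            secondary[feature] = value
        else:
            primary.append(tag)
    return lemma, primary, secondary


print(parse_lu("^plan<n><sf:intrastruktuurontwikkelingsplan><t:s:s1234>$"))
```

The sketch handles a single reading without escaped characters; ambiguous LUs with several readings would need to be split on the reading separator first.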
Objections
This proposal attracted several objections from the Apertium community. A large part of this discussion was about whether eliminating trimming was even necessary, but on this page I will restrict the discussion to this stream modification.
- Readability: Apertium currently has a very readable and debug-friendly stream. Adding a lot of secondary tags to each LU would definitely make the stream less readable, and hence harder to debug for the language pair creator. This is made worse by the <feature:value> style of the secondary tags.
- Information availability: Information in the Apertium pipeline is passed along largely on a need basis. Secondary tags would instead expose a large amount of arbitrary information to the language developer, much of which many people might not need.
- Unapertium: An explosion of non-linguistic information in the lexical unit doesn't seem very consistent with the way Apertium works. The information represented by these tags is mostly about the lexical unit and not necessarily linguistic information needed as part of the translation process. Information in the Lexical Unit should be linguistic information about the lexical unit.
- FST pattern matching: All modules that use FSTs for pattern matching need to be modified to ignore secondary tags.
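To make the FST pattern matching objection concrete, here is a small Python sketch of what "ignoring secondary tags" could look like for a regex-based matcher. The colon rule for recognising secondary tags and the noun pattern are both assumptions made for illustration; real modules match with compiled transducers, not regular expressions.

```python
import re

# Assumption: secondary tags are the tags of the form <feature:value>.
SECONDARY_TAG_RE = re.compile(r"<[^<>:]+:[^<>]*>")

# Hypothetical pattern: a single-reading LU whose only primary tag is <n>.
NOUN_RE = re.compile(r"\^[^<>/$]+<n>\$")


def strip_secondary_tags(lu):
    """Remove every <feature:value> tag, leaving primary tags in place."""
    return SECONDARY_TAG_RE.sub("", lu)


def is_bare_noun(lu):
    """Match the noun pattern against the LU with secondary tags ignored."""
    return NOUN_RE.fullmatch(strip_secondary_tags(lu)) is not None


lu = "^plan<n><sf:intrastruktuurontwikkelingsplan><id:3>$"
print(NOUN_RE.fullmatch(lu) is not None)  # pattern fails on the raw LU
print(is_bare_noun(lu))                   # pattern succeeds once tags are ignored
```

Every module that matches on tags would need an equivalent of the stripping step, which is the cost this objection points at.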
Benefits
There are, however, several benefits of secondary tags, as pointed out by several people.
- Reading specific information: They give us the ability to attach information to each reading. This is especially helpful if we want to represent lttoolbox weights as secondary tags to reduce compile time of the dictionary.
- Attached to the LU: Being attached to the LU makes it easier to move them around with the LU in transfer, which is arguably the biggest bottleneck in transporting information through the pipeline.
Alternate Proposal
Given these objections, and after one-on-one discussions with language developers, there is an alternate proposal for transporting secondary information.
The solution is to have secondary information in word-bound blanks. These differ from the usual blanks in that they are reordered during transfer, whereas normal blanks stay where they are. Of course, there is still the issue of reading-specific information, but in that regard, we can either have secondary tags for weights (which will either be redundant by the time we reach transfer, or will be shifted to the blank), or the weights can just use primary tags (it won't be ambiguous since no other tag has floating point numbers). All secondary information will be in word-bound blanks before the pipeline reaches transfer.
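The difference between the two kinds of blank can be sketched in Python as follows. This is only an illustration under stated assumptions: the stream is modelled as a list of (LU, word-bound blank) pairs plus the normal blanks between them, the word-bound blank is written directly after its LU, and reorder is a hypothetical helper rather than an existing transfer function.

```python
def reorder(units, blanks, order):
    """Reorder LUs while keeping normal blanks in their original slots.

    units:  list of (lu, word_bound_blank) pairs
    blanks: normal blanks between consecutive units (len(units) - 1 items)
    order:  the new order, as a permutation of the unit indices
    """
    if len(blanks) != len(units) - 1:
        raise ValueError("expected one normal blank between each pair of units")
    if sorted(order) != list(range(len(units))):
        raise ValueError("order must be a permutation of the unit indices")
    out = []
    for position, index in enumerate(order):
        lu, word_bound_blank = units[index]
        # The word-bound blank travels with its LU ...
        out.append(lu + word_bound_blank)
        # ... while the normal blank stays at its original position.
        if position < len(blanks):
            out.append(blanks[position])
    return "".join(out)


units = [("^red<adj>$", "[{@sf:red}]"), ("^car<n>$", "[{@sf:car}]")]
print(reorder(units, ["[<br/>] "], [1, 0]))
```

In the example, the two LUs swap places together with their surface-form information, while the normal blank between them stays in the middle.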
Update: All the secondary information will be in word-bound blanks, but each reading will have a special trailing tag which gives it a unique id in a window. Because of this, we can have reading specific information as secondary information if needed. The source reading ID will be copied over to the target reading ID automatically in the bidix lookup. Each LU will get an ID (1,2,3, etc.), and each reading inside an LU will get an id of the form token/reading (1/1, 1/2, 1/3, etc.). Compounds will get individual tags and not specific IDs (^abc<tags><!1>+xyz<tags><!2>$) because they will be broken up in pretransfer.
By the time we reach bidix lookup, there will only be one source reading, so we can then use ids of the form token/reading for the multiple TL readings that the bidix produces. This model will ensure that throughout the pipeline we can have unique IDs for each reading.
Updated Format: ^lemma<tags><!id>$[{..secondary_information..}]
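The numbering scheme described in the update can be sketched as below. The sketch only computes the IDs: how a token/reading ID such as 1/2 would be written inside a tag without clashing with the reading separator is not specified here, so nothing is serialised back into the stream. assign_ids is a hypothetical helper.

```python
def assign_ids(lus):
    """Number LUs 1, 2, 3, ... and their readings token/reading.

    lus: list of LUs, each given as a list of reading strings.
    Returns a list of (lu_id, [(reading_id, reading), ...]) tuples.
    """
    result = []
    for token_no, readings in enumerate(lus, start=1):
        reading_ids = [
            ("%d/%d" % (token_no, reading_no), reading)
            for reading_no, reading in enumerate(readings, start=1)
        ]
        result.append((str(token_no), reading_ids))
    return result


window = [["de<pr>", "of<pr>", "from<pr>"], ["plan<n>"]]
for lu_id, readings in assign_ids(window):
    print(lu_id, readings)
```

The same function could be applied a second time after bidix lookup, where each LU has a single source reading and the numbered readings are the target-language ones.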
Objections
Here are some pre-emptive objections:
- Information Access: It's harder to access information inside the blanks than inside secondary tags. But this is a practical issue, and it can definitely be implemented.
- Philosophical: Blanks are used to store information that is ignored during the translation of the words. It is debatable whether secondary information is this kind of information, but it's not entirely incompatible.
Benefits
- Readability: Readability benefits as all the information is in blanks and the LU is untouched.
- FST Pattern Matching: Blanks pass through most modules, so this won't be an issue anymore.
- Arbitrary amount of information: While secondary tags can in principle store any amount of information, they are limited by what a tag can hold. With blanks, there is no limit to the amount and kind of information that is stored. In fact, we can store full markup tags instead of references. If we want to have word embeddings in the future, it certainly wouldn't be apt to store them in secondary tags, but word-bound blanks will do the job quite well.
- Not much regression: Since these are bound to LUs, they give almost all the benefits of secondary tags apart from reading specific information.
- Philosophically Apertium: LUs continue to have only linguistic information, and all secondary information goes in word bound blanks. It is more consistent with Apertium's philosophy about providing information.
- Compound/Contraction Problem: With secondary tags, splitting a compound copies the surface form into every resulting LU:
^abcuxyz/abc<tags>+xyz<tags>$
-> ^abc<tags><sf:abcuxyz>$ ^xyz<tags><sf:abcuxyz>$
With larger compounds this adds a lot of redundant information inside Lexical Units. Having this inside blanks:
^abc<tags>$[{@sf:abcuxyz}] ^xyz<tags>$[{@sf:abcuxyz}]
leaves the LU untouched, and is better for language pair developers.
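The compound case can be sketched in Python as follows. The sketch assumes a single analysis of the form ^surface/part+part$ with no escaped characters, and uses the [{@sf:...}] notation from the example; split_compound is a hypothetical helper and does not reproduce what pretransfer actually does.

```python
def split_compound(lu):
    """Split '^surface/a<tags>+b<tags>$' into one LU per part.

    Each resulting LU is followed by a word-bound blank carrying the
    surface form of the whole compound, so the LU itself stays untouched.
    """
    if not (lu.startswith("^") and lu.endswith("$")):
        raise ValueError("not a lexical unit: %r" % lu)
    surface, analysis = lu[1:-1].split("/", 1)
    blank = "[{@sf:%s}]" % surface
    return " ".join("^%s$%s" % (part, blank) for part in analysis.split("+"))


print(split_compound("^abcuxyz/abc<tags>+xyz<tags>$"))
```

The surface form is still repeated once per part, but it sits in the blanks rather than inside the LUs that language pair developers read and match on.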
Formalism
The formalism for storing the secondary information in these blanks can be worked out once the idea itself has been discussed.
Comparative Examples
Compounds & Surface Form
Secondary Tags:
^intrastruktuur<n><sf:intrastruktuurontwikkelingsplan>$ ^ontwikkelings<n><sf:intrastruktuurontwikkelingsplan>$ ^plan<n><sf:intrastruktuurontwikkelingsplan>$
Word bound blanks:
^intrastruktuur<n>$[{@sf:intrastruktuurontwikkelingsplan}] ^ontwikkelings<n>$[{@sf:intrastruktuurontwikkelingsplan}] ^plan<n>$[{@sf:intrastruktuurontwikkelingsplan}]
Markup tags
Secondary Tags:
^intrastruktuur<n><sf:intrastruktuurontwikkelingsplan><t:s:s1234>$ ^ontwikkelings<n><sf:intrastruktuurontwikkelingsplan><t:s:s1234>$ ^plan<n><sf:intrastruktuurontwikkelingsplan><t:s:s1234>$
Word bound blanks:
^intrastruktuur<n>$[{@sf:intrastruktuurontwikkelingsplan; @t:s:s1234}] ^ontwikkelings<n>$[{@sf:intrastruktuurontwikkelingsplan; @t:s:s1234}] ^plan<n>$[{@sf:intrastruktuurontwikkelingsplan; @t:s:s1234}]
Word bound blanks (full markup instead of a reference):
^intrastruktuur<n>$[{@sf:intrastruktuurontwikkelingsplan; @markup:<span class="foo" id="bar" xml:lang="ckt">}] ^ontwikkelings<n>$[{@sf:intrastruktuurontwikkelingsplan; @markup:<span class="foo" id="bar" xml:lang="ckt">}] ^plan<n>$[{@sf:intrastruktuurontwikkelingsplan; @markup:<span class="foo" id="bar" xml:lang="ckt">}]
Embeddings
Secondary Tags:
^intrastruktuur<n><sf:intrastruktuurontwikkelingsplan><t:s:s1234><emb:0.01,0.05,0.67>$ ^ontwikkelings<n><sf:intrastruktuurontwikkelingsplan><t:s:s1234><emb:0.97,0.50,0.23>$ ^plan<n><sf:intrastruktuurontwikkelingsplan><t:s:s1234><emb:0.12,0.56,0.89>$
Word bound blanks:
^intrastruktuur<n>$[{@sf:intrastruktuurontwikkelingsplan; @t:s:s1234; @emb:0.01,0.05,0.67}] ^ontwikkelings<n>$[{@sf:intrastruktuurontwikkelingsplan; @t:s:s1234; @emb:0.97,0.50,0.23}] ^plan<n>$[{@sf:intrastruktuurontwikkelingsplan; @t:s:s1234; @emb:0.12,0.56,0.89}]
IDs / Alignments
Secondary Tags:
^intrastruktuur<n><sf:intrastruktuurontwikkelingsplan><t:s:s1234><emb:0.01,0.05,0.67><id:1>$ ^ontwikkelings<n><sf:intrastruktuurontwikkelingsplan><t:s:s1234><emb:0.97,0.50,0.23><id:2>$ ^plan<n><sf:intrastruktuurontwikkelingsplan><t:s:s1234><emb:0.12,0.56,0.89><id:3>$
Word bound blanks:
^intrastruktuur<n>$[{@sf:intrastruktuurontwikkelingsplan; @t:s:s1234; @emb:0.01,0.05,0.67; @id:1}] ^ontwikkelings<n>$[{@sf:intrastruktuurontwikkelingsplan; @t:s:s1234; @emb:0.97,0.50,0.23; @id:2}] ^plan<n>$[{@sf:intrastruktuurontwikkelingsplan; @t:s:s1234; @emb:0.12,0.56,0.89; @id:3}]
Secondary Tags:
^de<pr><sf:del><id:11>/of<pr><sf:del><id:11>/from<pr><sf:del><id:11>$
Word bound blanks:
^de<pr>/of<pr>/from<pr>$[{@sf:del; @id:11}]
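The mapping between the two notations in these examples is mechanical, as the following Python sketch shows. It assumes that secondary tags are the tags of the form <feature:value>, that values repeated across readings are stored once, and that entries in the blank are separated by "; " as in the examples; to_word_bound_blank is a hypothetical helper.

```python
import re

# Assumption: secondary tags are the tags of the form <feature:value>.
SECONDARY_TAG_RE = re.compile(r"<([^<>:]+:[^<>]*)>")


def to_word_bound_blank(lu):
    """Move all secondary tags of one LU into a trailing word-bound blank."""
    if not (lu.startswith("^") and lu.endswith("$")):
        raise ValueError("not a lexical unit: %r" % lu)
    body = lu[1:-1]
    # Collect secondary tags in order; values repeated across readings
    # are kept once (dict preserves insertion order in Python 3.7+).
    secondary = list(dict.fromkeys(SECONDARY_TAG_RE.findall(body)))
    stripped = SECONDARY_TAG_RE.sub("", body)
    if not secondary:
        return "^%s$" % stripped
    blank = "; ".join("@" + item for item in secondary)
    return "^%s$[{%s}]" % (stripped, blank)


print(to_word_bound_blank(
    "^de<pr><sf:del><id:11>/of<pr><sf:del><id:11>/from<pr><sf:del><id:11>$"
))
```

Because the blank belongs to the whole LU, values that differ between readings, such as weights, would lose their link to a specific reading under this conversion.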
Weights
Secondary Tags:
^de<pr><sf:del><id:11><W:1.6787>/of<pr><sf:del><id:11><W:5.0984>/from<pr><sf:del><id:11><W:0.0065>$
Word bound blanks: Not Reading Specific
Feedback
I invite your suggestions, comments, and any feedback that can help :)