Difference between revisions of "User:Khannatanmai/New Apertium stream format"

 
(25 intermediate revisions by 3 users not shown)
Line 1: Line 1:
<strong style="color:maroon;font-size:1.5em;">The secondary tags project was shelved as the need for reading specific secondary information wasn't established. This work continued as LU-bound secondary information using [[User:Khannatanmai/Wordbound_blanks | wordbound blanks]]. </strong>

Here I will provide updates about the development of the new Apertium stream format, which will include an arbitrary amount of optional secondary information.
Here I will provide updates about the development of the new Apertium stream format, which will include an arbitrary amount of optional secondary information.


Line 5: Line 7:


= Rationale =
= Rationale =
To eliminate trimming, we need to modify the apertium stream format so that it can include the surface form of words as well. This would first need a formalism for a new stream format and then a modification to all the parsers in the pipeline.
This project was in a way born out of the project to eliminate dictionary trimming. To do that, we need to modify the apertium stream format so that it can include the surface form of words as well. This would first need a formalism for a new stream format and then a modification to all the parsers in the pipeline.


However, if we are going to modify all the parsers to include the surface form in the lexical unit, we concluded in our discussion (available on the discussion page) that it would be worthwhile to modify the stream format such that '''each program can include and process an arbitrary amount of information in the apertium stream, not just the surface form.''' With this proposal we're trying to prepare the apertium stream for the future. Today we realised that we need the surface form in the stream, and tomorrow we might need semantic tags, sentiment tags, etc. If we don't do this now, we will have to modify all the parsers in the pipeline each time we need more information in the pipe. This is why it's a good idea to modify the parsers so that they can handle an arbitrary amount of information.
However, if we are going to modify all the parsers to include the surface form in the lexical unit, we concluded in our discussion (available on the discussion page) that it would be worthwhile to modify the stream format such that '''each program can include and process an arbitrary amount of information in the apertium stream, not just the surface form.''' With this proposal we're trying to prepare the apertium stream for the future. Today we realised that we need the surface form in the stream, and tomorrow we might need semantic tags, sentiment tags, etc. If we don't do this now, we will have to modify all the parsers in the pipeline each time we need more information in the pipe. This is why it's a good idea to modify the parsers so that they can handle an arbitrary amount of information.

Another concrete benefit of secondary tags is the ability to include information in the stream that isn't a pre-defined list. This is discussed in detail later.

= Benefits =
== Eliminating Dictionary Trimming ==
* Elaborated in [[User:Khannatanmai/Eliminating_Dictionary_Trimming]]
* Enables us to propagate surface form throughout the pipeline, which makes it possible to eliminate dictionary trimming

== Markup handling ==
Markup handling is a major problem because we can't attach arbitrary information, such as markup, to a lexical unit. As a result, when lexical units get moved around, the markup doesn't move with them.

Using secondary information we can attach markup information to the lexical units, and hence move them around with the LU during transfer.


= Formalism =
= Formalism =
Line 20: Line 34:
* The secondary tags contain a ":" that would help distinguish them from primary tags.
* The secondary tags contain a ":" that would help distinguish them from primary tags.


This is just an example, but the idea is that any program in the pipeline can add as well as read this secondary information from the stream, and in the future one can add any amount of information in the language models or the translation modules.
This is just an example, but the idea is that any program in the pipeline can add as well as read this secondary information from the stream, and in the future one can add any amount of information in the language models or the translation modules. Later you can see how this formalism looks at every step in the Apertium pipeline.


Instead of looking at this as modifying or extending the apertium stream format, we could also look at this as '''making tags more versatile by creating a new kind of tags which have a feature:value pair.'''
Instead of looking at this as modifying or extending the apertium stream format, we could also look at this as '''making tags more versatile by creating a new kind of tags which have a feature:value pair.'''

<pre>
# Lower-case prefixes
отец<n><sg><gen><@subj><§agent><sf:отца><s:human><s:kin><t:a:ef31><t:span:fcd32>

# Upper-case prefixes
отец<n><sg><gen><@subj><§agent><S:отца><E:human><E:kin><T:a:ef31><T:span:fcd32>

# Symbol prefixes
отец<n><sg><gen><@subj><§agent><%:отца><:human><:kin><!:a:ef31><!:span:fcd32>
</pre>
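As a rough illustration of the ":" rule, the sketch below splits the tags of one lexical unit into primary tags and secondary feature:value pairs. It assumes the lower-case prefix variant shown above; the function name <code>split_tags</code> is hypothetical and is not part of any Apertium tool.

```python
import re

# One tag is anything between "<" and ">" that contains no angle brackets.
TAG_RE = re.compile(r"<([^<>]+)>")

def split_tags(lu):
    """Return (lemma, primary_tags, secondary_tags) for one lexical unit.

    A tag containing ":" is treated as secondary and split at the first
    ":" into (prefix, value); every other tag is primary.
    """
    lemma = lu.split("<", 1)[0]
    primary, secondary = [], []
    for tag in TAG_RE.findall(lu):
        if ":" in tag:
            prefix, value = tag.split(":", 1)
            secondary.append((prefix, value))
        else:
            primary.append(tag)
    return lemma, primary, secondary

lemma, primary, secondary = split_tags(
    "отец<n><sg><gen><@subj><§agent><sf:отца><s:human><s:kin><t:a:ef31><t:span:fcd32>"
)
print(lemma, primary, secondary)
```

Splitting only at the first ":" keeps values such as <code>a:ef31</code> intact, which matters for the markup tags in the example.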


= What is secondary information and why does the Apertium stream need it? =
= What is secondary information and why does the Apertium stream need it? =


== Primary vs. Secondary Information ==
== Primary vs. Secondary Information ==
Adding the ability to have an arbitrary amount of information in the Apertium stream may seem redundant since we can already have as many tags as we want. However, there are a few limitations with the current apertium tags, which we will be calling '''primary tags''':
* They are order dependent (due to the nature of pattern matching in FSTs)
* They need to be a pre-defined list (See [[Tags]])


However, there are several types of information that don't fit pre-defined lists. They are open sets, such as surface forms or markup tags. Primary tags cannot deal with this kind of information, and hence the ability to deal with arbitrary information that doesn't need to be fully pre-defined makes the stream significantly more powerful.


=== Pattern Matching in FSTs ===
=== Pattern Matching in FSTs ===
Pattern matching in FSTs is pretty strict, and in several files (dix, bidix, t*x), if the users haven't written a ".*" at the end of their pattern, any input with secondary tags will not match, as these tags are always trailing. To deal with this, we have decided to make the FSTs ignore secondary tags throughout the pipe. FSTs are also order dependent, and secondary tags cannot have a pre-defined order due to the fact that they're supposed to handle an arbitrary amount of information.

Once the FSTs have ignored secondary tags, we will have a separate system to pattern match with secondary information. This will be discussed further in the Implementation section.
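The effect of ignoring secondary tags can be sketched outside the FSTs as follows. This is only an illustration under the assumption that a secondary tag is any tag containing ":"; the helper names are hypothetical and the strict list comparison merely stands in for an FST pattern written without a trailing ".*".

```python
import re

# Assumption: a secondary tag is any <...> tag that contains a ":".
SECONDARY_TAG = re.compile(r"<[^<>]+:[^<>]*>")

def strip_secondary(lu):
    """Remove every secondary tag from a lexical unit."""
    return SECONDARY_TAG.sub("", lu)

def matches_exact_tags(lu, tags):
    """Strict, order-dependent match on primary tags only.

    Mimics a pattern such as "pr" (not "pr.*"): the unit matches only if
    its primary tags are exactly `tags`, in the same order.
    """
    found = re.findall(r"<([^<>]+)>", strip_secondary(lu))
    return found == list(tags)

print(matches_exact_tags("of<pr><sf:del>", ["pr"]))
print(matches_exact_tags("perro<n><m><pl><sf:perros>", ["n", "m", "pl"]))
```

Without the stripping step, the same strict comparison would see <code>sf:del</code> as an extra trailing tag and fail to match.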


== Potential benefits ==
== Potential benefits ==
While optional secondary information in the stream sounds great, this project isn't just about abstract future benefits. As part of this project, after implementing this modification to the stream, we will experiment by including the surface form in the stream and avoid trimming, as described above. If the results are satisfactory, we can move on to other kinds of information. These could be, but aren't limited to:
While optional secondary information in the stream sounds great, this project isn't just about abstract future benefits. As part of this project, after implementing this modification to the stream, we will experiment by including the surface form in the stream and avoid trimming, as described above. If the results are satisfactory, we can move on to other kinds of information. '''The biggest benefit of secondary tags will be the ability to link information to LUs that aren't a pre-defined finite list.''' These could be, but aren't limited to:
* Markup tags: If we can attach markup tags in the lexical unit, they will move around with the unit in transfer.
* Markup tags: If we can attach markup tags in the lexical unit, they will move around with the unit in transfer.
* Semantic information
* Semantic information
* Theta roles
* Subcategorisation info
* Subcategorisation info
* Dependency
* Dependency
* Capitalisation case
* Sentiment

We would probably create a wiki page listing all types of secondary info, and the associated prefixes to be used with each of them. This list would be extendable based on the task.
We would probably create a wiki page listing all types of secondary info, and the associated prefixes to be used with each of them. This list would be extendable based on the task.


'''Note that as part of this project, I will not be adding any secondary info to data files, such as monodix or bidix. The secondary information will only be information that programs can output in the stream and will mostly be dynamic. All current pipes will continue to work as-is, unmodified. All old data and files remain valid.'''
'''Note that as part of this project, I will not be adding any secondary info to data files, such as monodix or bidix. The secondary information will only be information that programs can output in the stream and will mostly be dynamic. All current pipes will continue to work as-is, unmodified. All old data and files remain valid.'''

= Proof of Concept and No regression =
I've talked earlier about the benefits as well as the potential benefits that will come from including secondary information in the Apertium stream. Apart from these benefits, this project also promises no regression and complete backwards compatibility.

Hundreds of language pair translation systems and several other systems work on the current apertium stream format, and hence any modification that leads to any possible regression is completely unacceptable. This is why the project will be following a test-driven development format. There are several decisions that I've taken after a thorough discussion with Apertium experts, which ensure that we can get adequate benefits of secondary information without affecting systems that don't or will not use secondary tags.

* Secondary tags in FSTs: Finite State Transducers are order dependent and strict with pattern matching, and in several cases adding secondary tags to LUs would make them not match in FSTs. Due to this, we have decided to not include secondary tags anywhere we are using FSTs for pattern matching in the pipeline and have a different method of matching for secondary tags (discussed in Implementation). The FSTs will ignore secondary tags, which ensures that there is no regression in pattern matching.

* As part of this project, we aren't adding any secondary information in data files (monodix, bidix) to ensure that this works with old data files as well.

* Tags will be separated in our understanding of them into primary tags and secondary tags. The difference is discussed above. In sections where it's possible to refer to tags, such as clipping tags in transfer files, the definition (regex match) will be modified such that it only matches primary tags to ensure no regression in already written t*x files.

* Secondary tags are optional in the stream.

* Since the secondary tags come dynamically from the modules, i.e. they aren't present in the data files, this will work with old data files as well.

= Modifications needed =
= Modifications needed =
The following modules will need no modification:
The following modules will need no modification:
Line 58: Line 102:
== Apertium stream at each module ==
== Apertium stream at each module ==


'''INPUT: Los perros del chico corren rápido..'''
'''INPUT: Los perros del chico corren rápido.'''


=== Morph Analyser ===
=== Morph Analyser ===
Line 64: Line 108:
'''Output:'''
'''Output:'''


<code>
<pre>
^Los/El<det><def><m><pl>/Prpers<prn><pro><p3><m><pl>$ ^perros/perro<n><m><pl>$ ^del/de<pr>+el<det><def><m><sg>$ ^chico/chico<n><m><sg>$ ^corren/correr<vblex><pri><p3><pl>$ ^rápido/rápido<adj><m><sg>$^./.<sent>$^./.<sent>$[][
^Los/El<det><def><m><pl>/Prpers<prn><pro><p3><m><pl>$ ^perros/perro<n><m><pl>$ ^del/de<pr>+el<det><def><m><sg>$ ^chico/chico<n><m><sg>$ ^corren/correr<vblex><pri><p3><pl>$ ^rápido/rápido<adj><m><sg>$^./.<sent>$^./.<sent>$[][
]
]
</code>
</pre>


The Morph Analyser takes the surface form of words as input and, using the monodix, outputs the surface form, lemma and its analysis. Since (for now) we aren't planning to put secondary tags in the dix, we don't need to modify the morph analyser for this project.
The Morph Analyser takes the surface form of words as input and, using the monodix, outputs the surface form, lemma and its analysis. Since (for now) we aren't planning to put secondary tags in the dix, we don't need to modify the morph analyser for this project.
Line 74: Line 118:
'''Current Output:'''
'''Current Output:'''
<code>
<pre>
^El<det><def><m><pl>$ ^perro<n><m><pl>$ ^de<pr>+el<det><def><m><sg>$ ^chico<n><m><sg>$ ^correr<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>$^.<sent>$^.<sent>$[][
^El<det><def><m><pl>$ ^perro<n><m><pl>$ ^de<pr>+el<det><def><m><sg>$ ^chico<n><m><sg>$ ^correr<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>$^.<sent>$^.<sent>$[][
]
]
</code>
</pre>


'''Proposed Output:'''
'''Proposed Output:'''


<code>
<pre>
^El<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>$ ^de<pr><sf:del>+el<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>$ ^.<sent><sf:.>$ ^.<sent><sf:.>$[][
^El<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>$ ^de<pr><sf:del>+el<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>$ ^.<sent><sf:.>$ ^.<sent><sf:.>$[][
]
]
</code>
</pre>
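The proposed output above was written by hand. A minimal sketch of how a surface-form tag could be attached automatically is given below, assuming the input is an analyser-style unit of the form <code>^surface/analysis1/analysis2$</code> and that the chosen analysis is simply given by index. The function names are hypothetical, and escaped "/" or "+" characters inside lemmas are not handled.

```python
def attach_surface(surface, analysis):
    """Append <sf:surface> to every part of a (possibly compound) analysis."""
    parts = analysis.split("+")
    return "+".join(f"{part}<sf:{surface}>" for part in parts)

def tag_unit(unit, choice=0):
    """Turn '^surface/analysis/...$' into '^analysis<sf:surface>$'.

    `choice` selects which analysis to keep; a real tagger would decide
    this itself, here it is only a parameter for illustration.
    """
    body = unit[1:-1]  # drop the leading "^" and trailing "$"
    surface, *analyses = body.split("/")
    return "^" + attach_surface(surface, analyses[choice]) + "$"

print(tag_unit("^del/de<pr>+el<det><def><m><sg>$"))
print(tag_unit("^Los/El<det><def><m><pl>/Prpers<prn><pro><p3><m><pl>$"))
```

For the compound <code>del</code>, both parts receive the same surface form, as in the proposed output.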


* Here I have manually added secondary tags (surface forms of the words), and for compounds I have added the surface form on both parts.
* Here I have manually added secondary tags (surface forms of the words), and for compounds I have added the surface form on both parts.
Line 93: Line 137:
'''Current Output:'''
'''Current Output:'''


<code>
<pre>
^El<det><def><m><pl>$ ^perro<n><m><pl>$ ^de<pr>$ ^el<det><def><m><sg>$ ^chico<n><m><sg>$ ^correr<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>$^.<sent>$^.<sent>$[][
^El<det><def><m><pl>$ ^perro<n><m><pl>$ ^de<pr>$ ^el<det><def><m><sg>$ ^chico<n><m><sg>$ ^correr<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>$^.<sent>$^.<sent>$[][
]
]
</code>
</pre>


'''Output with modified input''' (i.e. it gave this output when I gave the modified POS tagger output):
'''Output with modified input''' (i.e. it gave this output when I gave the modified POS tagger output):


<code>
<pre>
^El<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>$ ^de<pr><sf:del>$ ^el<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>$ ^.<sent><sf:.>$ ^.<sent><sf:.>$[][
^El<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>$ ^de<pr><sf:del>$ ^el<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>$ ^.<sent><sf:.>$ ^.<sent><sf:.>$[][
]
]
</code>
</pre>


* It doesn't seem like the parser needs any modifications. Works like it's supposed to.
* It doesn't seem like the parser needs any modifications. Works like it's supposed to.
Line 111: Line 155:
'''Current Output:'''
'''Current Output:'''


<code>
<pre>
^El<det><def><m><pl>/The<det><def><m><pl>$ ^perro<n><m><pl>/dog<n><m><pl>$ ^de<pr>/of<pr>/from<pr>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^chico<n><m><sg>/boy<n><sg>$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$[][
^El<det><def><m><pl>/The<det><def><m><pl>$ ^perro<n><m><pl>/dog<n><m><pl>$ ^de<pr>/of<pr>/from<pr>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^chico<n><m><sg>/boy<n><sg>$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$[][
]
]
</code>
</pre>


'''Output with modified input:'''
'''Output with modified input:'''


<code>
<pre>
^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>$ ^.<sent><sf:.>/.<sent><sf:.>$ ^.<sent><sf:.>/.<sent><sf:.>$[][
^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>$ ^.<sent><sf:.>/.<sent><sf:.>$ ^.<sent><sf:.>/.<sent><sf:.>$[][
]
]
</code>
</pre>


* Biltrans does what it should do: it copies the secondary tags to the TL side.
* Biltrans does what it should do: it copies the secondary tags to the TL side.
Line 130: Line 174:
'''Current Output:'''
'''Current Output:'''


<code>
<pre>
^El<det><def><m><pl>/The<det><def><m><pl>$ ^perro<n><m><pl>/dog<n><m><pl>$ ^de<pr>/of<pr>/from<pr>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^chico<n><m><sg>/boy<n><sg>$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$[][
^El<det><def><m><pl>/The<det><def><m><pl>$ ^perro<n><m><pl>/dog<n><m><pl>$ ^de<pr>/of<pr>/from<pr>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^chico<n><m><sg>/boy<n><sg>$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$[][
]
]
</code>
</pre>


'''Output with modified input:'''
'''Output with modified input:'''


<code>
<pre>
^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>$ ^.<sent><sf:.>/.<sent><sf:.>$ ^.<sent><sf:.>/.<sent><sf:.>$[][
^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>$ ^.<sent><sf:.>/.<sent><sf:.>$ ^.<sent><sf:.>/.<sent><sf:.>$[][
]
]
</code>
</pre>


* Doesn't seem like the lexical selection was used here - '''needs further experimentation.'''
* Doesn't seem like the lexical selection was used here - '''needs further experimentation.'''
Line 150: Line 194:
'''Current Output:'''
'''Current Output:'''


<code>
<pre>
^El<det><def><m><pl>/The<det><def><m><pl>/$ ^perro<n><m><pl>/dog<n><m><pl>/$ ^de<pr>/of<pr>/from<pr>/$ ^el<det><def><m><sg>/the<det><def><m><sg>/$ ^chico<n><m><sg>/boy<n><sg>/$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>/$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>/$^.<sent>/.<sent>/$^.<sent>/.<sent>/$[][
^El<det><def><m><pl>/The<det><def><m><pl>/$ ^perro<n><m><pl>/dog<n><m><pl>/$ ^de<pr>/of<pr>/from<pr>/$ ^el<det><def><m><sg>/the<det><def><m><sg>/$ ^chico<n><m><sg>/boy<n><sg>/$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>/$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>/$^.<sent>/.<sent>/$^.<sent>/.<sent>/$[][
]
]
</code>
</pre>


'''Output with modified input:'''
'''Output with modified input:'''


<code>
<pre>
^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>/$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>/$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>/$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>/$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>/$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>/$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>/$ ^.<sent><sf:.>/.<sent><sf:.>/$ ^.<sent><sf:.>/.<sent><sf:.>/$[][
^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>/$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>/$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>/$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>/$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>/$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>/$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>/$ ^.<sent><sf:.>/.<sent><sf:.>/$ ^.<sent><sf:.>/.<sent><sf:.>/$[][
]
]
</code>
</pre>


* Anaphora Resolution wasn't used here but the idea is clear.
* Anaphora Resolution wasn't used here but the idea is clear.
Line 168: Line 212:


=== Chunker (t1x) ===
=== Chunker (t1x) ===
'''Output with original input(without secondary tags):'''
'''Current Output:'''


<code>
<pre>
^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$^punt<sent>{^.<sent>$}$^punt<sent>{^.<sent>$}$[][
^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$^punt<sent>{^.<sent>$}$^punt<sent>{^.<sent>$}$[][
]
]
</code>
</pre>


'''Output with modified input:'''
'''Proposed output with modified input(with secondary tags):'''


<code>
<pre>
^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^default<default>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$ ^default<default>{^.<sent><sf:.>$}$ ^default<default>{^.<sent><sf:.>$}$[][
^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][
]
]
</code>
</pre>


'''Output with modified input (Reached proposed output):'''
* It stops matching prepositions and delimiters. This is probably because the patterns were defined as "pr" instead of "pr.*". To deal with this issue we could modify transfer pattern matching such that it ignores secondary tags during pattern matching. This would assume we cannot add secondary tags in t1x, and is consistent with our policy to not add tags in dixes. This was discussed earlier in section outlining why we need secondary tags.
* We also ignore secondary tags in pattern matching as FSTs are order dependent and secondary tags will not be. Most likely we will let the FSTs ignore secondary tags and make a new pattern matching system just for the secondary tags.
* The parser will be modified to detect secondary tags and access them.
* The output system of the transfer will need to be modified such that the secondary tags are output no matter what (since they won't be mentioned in the <out> section).
* It will also be given the ability to add secondary tags in the output.


<pre>
'''Output modified manually (Proposed output):'''

<code>
^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][
^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][
]
]
</code>
</pre>

* <s>It stops matching prepositions and delimiters. This is probably because the patterns were defined as "pr" instead of "pr.*". To deal with this issue we could modify transfer pattern matching such that it ignores secondary tags during pattern matching.</s>
* <s>We also ignore secondary tags in pattern matching as FSTs are order dependent and secondary tags will not be. Most likely we will let the FSTs ignore secondary tags and make a new pattern matching system just for the secondary tags.</s>
* The parser will be modified to detect secondary tags and access them.
* <s>The output system of the transfer will need to be modified such that the secondary tags are output no matter what (since they won't be mentioned in the <out> section). These secondary tags in the TL LU will be taken from wherever the lemma of a tag comes from. If the lemma comes from a variable then wherever the variable gets the lemma, that's where the secondary tags will come from.</s>
* <s>It will also be given the ability to add secondary tags in the output.</s>
* <s>A pseudo-attribute will be added which gets a string of all secondary tags. Regex: <code>((?:<[^<>]+:[^<>]*>)+)</code> or <code>((<[^>]+:[^>]+>)+)</code> (courtesy [[User:popcorndude]]).</s>
* <s>The pseudo-attribute tags will need to be modified such that it doesn't include secondary tags, to ensure backwards compatibility.</s>
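To show what the first regex in the list above captures, here is a small sketch that applies it to a lexical unit using Python's <code>re</code> module. The second pattern, which captures only the run of primary tags, is an assumption added for illustration and is not taken from the transfer code.

```python
import re

# First regex quoted in the list above: a run of consecutive secondary tags.
STAGS_RE = re.compile(r"((?:<[^<>]+:[^<>]*>)+)")

# Assumed counterpart for the "tags" pseudo-attribute: a run of tags
# that contain no ":" at all, i.e. primary tags only.
PRIMARY_TAGS_RE = re.compile(r"((?:<[^<>:]+>)+)")

def secondary_string(lu):
    """Return the string of all secondary tags, or "" if there are none."""
    match = STAGS_RE.search(lu)
    return match.group(1) if match else ""

def primary_string(lu):
    """Return the string of primary tags, or "" if there are none."""
    match = PRIMARY_TAGS_RE.search(lu)
    return match.group(1) if match else ""

lu = "отец<n><sg><gen><@subj><§agent><sf:отца><s:human><s:kin>"
print(secondary_string(lu))
print(primary_string(lu))
```

Both helpers rely on secondary tags always trailing the primary tags, as stated earlier on this page.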


=== Interchunk (t2x) ===
=== Interchunk (t2x) ===
'''Current Output:'''
'''Current Output:'''
<code>
<pre>
^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$^punt<sent>{^.<sent>$}$^punt<sent>{^.<sent>$}$[][
^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$^punt<sent>{^.<sent>$}$^punt<sent>{^.<sent>$}$[][
]
]
</code>
</pre>


'''Output with modified input:'''
'''Output with modified input:'''


<code>
<pre>
^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][
^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][
]
]
</code>
</pre>


* The order of the chunks didn't change here but it seems like it's not touching the secondary tags.
* The order of the chunks didn't change here but it seems like it's not touching the secondary tags.
Line 217: Line 263:
'''Current Output:'''
'''Current Output:'''


<code>
<pre>
^The<det><def><pl>$ ^dog<n><pl>$ ^of<pr>$ ^the<det><def><sg>$ ^boy<n><sg>$ ^run<vblex><pres>$ ^fast<adj><sint>$^.<sent>$^.<sent>$[][
^The<det><def><pl>$ ^dog<n><pl>$ ^of<pr>$ ^the<det><def><sg>$ ^boy<n><sg>$ ^run<vblex><pres>$ ^fast<adj><sint>$^.<sent>$^.<sent>$[][
]
]
</code>
</pre>


'''Output with modified input:'''
'''Output with modified input:'''


<code>
<pre>
^The<det><def><pl><sf:Los>$ ^dog<n><pl><sf:perros>$ ^of<pr><sf:del>$ ^the<det><def><sg><sf:del>$ ^boy<n><sg><sf:chico>$ ^run<vblex><pres><sf:corren>$ ^fast<adj><sint><sf:rápido>$^.<sent><sf:.>$^.<sent><sf:.>$[][
^The<det><def><pl><sf:Los>$ ^dog<n><pl><sf:perros>$ ^of<pr><sf:del>$ ^the<det><def><sg><sf:del>$ ^boy<n><sg><sf:chico>$ ^run<vblex><pres><sf:corren>$ ^fast<adj><sint><sf:rápido>$^.<sent><sf:.>$^.<sent><sf:.>$[][
]
]
</code>
</pre>


* The role of the postchunker is to change the stream into a format the generator would accept, which it does even with secondary tags.
* The role of the postchunker is to change the stream into a format the generator would accept, which it does even with secondary tags.
Line 235: Line 281:
'''Current Output:'''
'''Current Output:'''


<code>
<pre>
The dogs of the boy run fast..[][
The dogs of the boy run fast..[][
]
]
</code>
</pre>


'''Output with modified input:'''
'''Output with modified input:'''


<code>
<pre>
#The #dog #of #the #boy #run #fast#.#.[][
#The #dog #of #the #boy #run #fast#.#.[][
]
]
</code>
</pre>


* This happens since FSTs don't match the words to their surface forms due to the extra secondary tags. To deal with this, the FSTs will ignore secondary tags in their monodix matching.
* This happens since FSTs don't match the words to their surface forms due to the extra secondary tags. To deal with this, the FSTs will ignore secondary tags in their monodix matching.
* The parser will once again be modified to access secondary tags and work with them. The output has no tags so nothing to change there.
* The parser will once again be modified to access secondary tags and work with them. The output has no tags so nothing to change there.

= Implementation =
== Accessing Secondary Tags ==
After a thorough discussion, we decided that access to secondary tags will be implemented through a <code>flat_multimap<Tag,size_t></code>, where <code>Tag</code> is <code>{string_view prefix, string_view value}</code> and the <code>size_t</code> is the position of the secondary tag in the list of secondary tags.

* This enables us to query tags using their prefix.
* It also preserves the position of the tags if a user should need it.

This will also be used to do pattern matching for secondary tags.
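The idea behind this structure can be sketched in Python. This is not the planned C++ implementation: the dictionary below is keyed by prefix alone, which is a simplification of the <code>Tag</code>-to-position multimap described above, and the name <code>index_secondary</code> is hypothetical.

```python
import re
from collections import defaultdict
from typing import NamedTuple

class Tag(NamedTuple):
    prefix: str
    value: str

# A secondary tag: prefix (no ":"), then ":", then the value.
SECONDARY = re.compile(r"<([^<>:]+):([^<>]*)>")

def index_secondary(lu):
    """Map each prefix to a list of (Tag, position) pairs.

    `position` is the index of the tag within the unit's secondary tags,
    so the original order can be recovered if a user needs it.
    """
    index = defaultdict(list)
    for position, match in enumerate(SECONDARY.finditer(lu)):
        tag = Tag(match.group(1), match.group(2))
        index[tag.prefix].append((tag, position))
    return index

idx = index_secondary("отец<n><sg><sf:отца><s:human><s:kin><t:a:ef31>")
print(idx["s"])
```

Querying by prefix returns every tag with that prefix, which covers the case where one unit carries several tags of the same kind (for example two <code>s:</code> tags).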

== Outputting Secondary Tags ==
Outputting secondary tags will use the current system of outputting tags, since they're still of the format <..>.

= Progress =
The first thing to do is to modify modules such that secondary tags pass through without any problems through the stream.
'''Further updates can be found [[User:Khannatanmai/Secondary_tags_features|here]].'''

== Transfer ==
'''9 May 2020'''

* Modified transfer.cc so that while pattern matching, secondary tags are ignored.

'''Earlier output:'''
<pre>
^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^default<default>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$ ^default<default>{^.<sent><sf:.>$}$ ^default<default>{^.<sent><sf:.>$}$[][
]
</pre>

'''New output:'''
<pre>
^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$ ^punt<sent>{^.<sent><sf:.>$}$ ^punt<sent>{^.<sent><sf:.>$}$[][
]
</pre>

'''11 May 2020'''

* Modified regex of tags so that it ignores secondary tags and added regex for stags in transfer_data.cc
* Transfer gets secondary tags from the LU where the lem or lemh comes from in the <code><lu>..</lu></code>, given that this lu block is in <code><out>..</out></code>. Still need to deal with variables.

'''Earlier output:'''
<pre>
^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$ ^punt<sent>{^.<sent><sf:.>$}$ ^punt<sent>{^.<sent><sf:.>$}$[][
]
</pre>

'''New output:'''
<pre>
^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$ ^punt<sent>{^.<sent><sf:.>$}$ ^punt<sent>{^.<sent><sf:.>$}$[][
]
</pre>

Note, the verb didn't get its secondary tag because it uses a variable to print the lemma.

'''12 May 2020'''

* Modified transfer so that a map is created between variable_name and secondary tags from the lemma that this variable clips (only if the variable clips a lem/lemh), which is then added in the output if the variable is called in output.

'''We have reached our desired transfer output.'''

'''Current output:'''
<pre>
^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][
]
</pre>

* Input without secondary tags shows no regression in transfer output.
* Need to test more thoroughly with a one-step transfer system, with MLUs, and with more variables.


== Generator ==
* Modified the generator such that it removes all trailing secondary tags before giving the input to the FST.

'''Earlier Output:'''
<pre>
#The #dog #of #the #boy #run #fast#.#.[][
]
</pre>

'''New Output:'''
<pre>
The dogs of the boy run fast..[][
]
</pre>

'''Further updates can be found [[User:Khannatanmai/Secondary_tags_features|here]].'''

Latest revision as of 09:20, 17 July 2020

The secondary tags project was shelved as the need for reading specific secondary information wasn't established. This work continued as LU-bound secondary information using wordbound blanks.

Here I will provide updates about the development of the new Apertium stream format, which will include an arbitrary amount of optional secondary information.

All discussions on IRC about this can be found in the discussion page of this wiki.


Rationale

This project was in a way born out of the project to eliminate dictionary trimming. To do that, we need to modify the apertium stream format so that it can include the surface form of words as well. This would first need a formalism for a new stream format and then a modification to all the parsers in the pipeline.

However, if we are going to modify all the parsers to include the surface form in the lexical unit, we concluded in our discussion (which can be found on the discussion page) that it would be worthwhile to modify the stream format such that each program can include and process an arbitrary amount of information in the Apertium stream, not just the surface form. With this proposal we're trying to prepare the Apertium stream for the future. Today we realised that we need the surface form in the stream, and tomorrow we might need semantic tags, sentiment tags, etc. If we don't do this now, we will have to modify all the parsers in the pipeline each time we need more information in the pipe. This is why it's a good idea to modify the parsers so that they can handle an arbitrary amount of information.

Another concrete benefit of secondary tags is the ability to include information in the stream that doesn't come from a pre-defined list. This is discussed in detail later.

Benefits

Eliminating Dictionary Trimming

Markup handling

Markup handling is a huge issue due to the fact that we can't attach arbitrary information in each lexical unit, such as markups, and hence when lexical units get moved around, the markups don't move with them.

Using secondary information we can attach markup information to the lexical units, and hence move them around with the LU during transfer.

Formalism

The stream will now have primary information: all the information currently available in the stream, such as lemma and analysis. It will also have optional secondary information, in a feature:value format. We discussed several possible syntaxes for this new stream format, and the one that seems best is something like this:

^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>/patata<n><f><pl><more:other>$

Note that case here refers to capitalisation, not morphological case which is already a tag and hence primary information.

  • This doesn't mess with the current stream format too much. The primary information syntax is unchanged, and not prefixed. Since secondary info is optional, this will be fully backwards compatible.
  • Secondary information tags will always be trailing.
  • The number of tags is already arbitrary so that helps.
  • The secondary tags contain a ":" that would help distinguish them from primary tags.

This is just an example, but the idea is that any program in the pipeline can add as well as read this secondary information from the stream, and in the future one can add any amount of information in the language models or the translation modules. Later you can see how this formalism looks at every step in the Apertium pipeline.

Instead of looking at this as modifying or extending the apertium stream format, we could also look at this as making tags more versatile by creating a new kind of tags which have a feature:value pair.
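
To make the formalism concrete, here is a minimal sketch of how a reading could be split into lemma, primary tags and secondary tags. It is only an illustration: the function name parse_reading is hypothetical, the only rule it applies is the one stated above (a tag containing ":" is secondary), and it ignores escaped characters in the stream.

```python
import re

# A tag is anything between "<" and ">" that contains no angle brackets.
TAG_RE = re.compile(r"<([^<>]+)>")


def parse_reading(reading):
    """Split one reading, e.g. 'potato<n><pl><sf:potatoes>', into its parts.

    Returns (lemma, primary_tags, secondary_tags), where secondary_tags is a
    list of (prefix, value) pairs in stream order. Hypothetical helper, not
    part of any Apertium module.
    """
    lemma = reading.split("<", 1)[0]
    primary, secondary = [], []
    for tag in TAG_RE.findall(reading):
        if ":" in tag:
            # Split on the first ":" only, so values may contain ":" themselves.
            prefix, value = tag.split(":", 1)
            secondary.append((prefix, value))
        else:
            primary.append(tag)
    return lemma, primary, secondary
```

For the reading potato<n><pl><case:aa><sf:potatoes> this gives the lemma potato, the primary tags n and pl, and the secondary pairs (case, aa) and (sf, potatoes).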

# Lower-case prefixes
отец<n><sg><gen><@subj><§agent><sf:отца><s:human><s:kin><t:a:ef31><t:span:fcd32>

# Upper-case prefixes
отец<n><sg><gen><@subj><§agent><S:отца><E:human><E:kin><T:a:ef31><T:span:fcd32>

# Symbol prefixes
отец<n><sg><gen><@subj><§agent><%:отца><:human><:kin><!:a:ef31><!:span:fcd32>

What is secondary information and why does the Apertium stream need it?

Primary vs. Secondary Information

Adding the ability to have an arbitrary amount of information in the Apertium stream may seem redundant since we can already have as many tags as we want. However, there are a few limitations with the current Apertium tags, which we will be calling primary tags:

  • They are order dependent (due to the nature of pattern matching in FSTs)
  • They need to be a pre-defined list (See Tags)

However, there are several types of information that aren't fit for pre-defined lists. They are open sets, such as surface forms or markup tags. Primary tags cannot deal with this kind of information, and hence the ability to deal with arbitrary information that doesn't need to be fully pre-defined makes the stream significantly more powerful.

Pattern Matching in FSTs

Pattern matching in FSTs is pretty strict, and in several files (dix, bidix, t*x), if the users haven't written a ".*" at the end of their pattern, any input with secondary tags will not match, as these tags are always trailing. To deal with this, we have decided to make the FSTs ignore secondary tags throughout the pipe. FSTs are also order dependent, and secondary tags cannot have a pre-defined order due to the fact that they're supposed to handle an arbitrary amount of information.

Once the FSTs have ignored secondary tags, we will have a separate system to pattern match with secondary information. This will be discussed further in the Implementation section.
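
The effect of ignoring secondary tags can be sketched as follows. This is not how the FSTs work internally: a strict regular-expression match stands in for the transducer, and strip_secondary and matches are hypothetical names used only for this illustration.

```python
import re

# A secondary tag is a tag whose content contains ":".
SECONDARY_RE = re.compile(r"<[^<>]+:[^<>]*>")


def strip_secondary(reading):
    """Return the reading with every secondary tag removed."""
    return SECONDARY_RE.sub("", reading)


def matches(pattern, reading):
    """Strict whole-string match against the primary information only."""
    return re.fullmatch(pattern, strip_secondary(reading)) is not None


# A pattern written without a trailing ".*", like "pr" in a t1x file.
PREPOSITION = r"[^<>]+<pr>"
```

With this pattern, a direct strict match on of<pr><sf:del> fails because of the trailing tag, while matches(PREPOSITION, "of<pr><sf:del>") succeeds, and readings without secondary tags behave exactly as before.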

Potential benefits

While optional secondary information in the stream sounds great, this project isn't just about abstract future benefits. As part of this project, after implementing this modification to the stream, we will experiment by including the surface form in the stream and avoid trimming, as described above. If the results are satisfactory, we can move on to other kinds of information. The biggest benefit of secondary tags will be the ability to link information to LUs that doesn't come from a pre-defined finite list. This could be, but isn't limited to:

  • Markup tags: If we can attach markup tags in the lexical unit, they will move around with the unit in transfer.
  • Semantic information
  • Subcategorisation info
  • Dependency

We would probably create a wiki page listing all types of secondary info, and the associated prefixes to be used with each of them. This list would be extendable based on the task.

Note that as part of this project, I will not be adding any secondary info to data files, such as monodix or bidix. The secondary information will only be information that programs can output in the stream and will mostly be dynamic. All current pipes will continue to work as-is, unmodified. All old data and files remain valid.

Proof of Concept and No regression

I've talked earlier about the benefits as well as the potential benefits that will come from including secondary information in the Apertium stream. Apart from these benefits, this project also promises no regression and complete backwards compatibility.

Hundreds of language pair translation systems and several other systems work on the current apertium stream format, and hence any modification that leads to any possible regression is completely unacceptable. This is why the project will be following a test-driven development format. There are several decisions that I've taken after a thorough discussion with Apertium experts, which ensure that we can get adequate benefits of secondary information without affecting systems that don't or will not use secondary tags.

  • Secondary tags in FSTs: Finite State Transducers are order dependent and strict in their pattern matching, and in several cases adding secondary tags to LUs would stop them from matching in FSTs. Because of this, we have decided not to include secondary tags anywhere FSTs are used for pattern matching in the pipeline, and to have a different method of matching for secondary tags (discussed in Implementation). The FSTs will ignore secondary tags, which ensures there is no regression in pattern matching.
  • As part of this project, we aren't adding any secondary information in data files (monodix, bidix) to ensure that this works with old data files as well.
  • Tags will be separated in our understanding of them into primary tags and secondary tags. The difference is discussed above. In sections where it's possible to refer to tags, such as clipping tags in transfer files, the definition (regex match) will be modified such that it only matches primary tags, to ensure no regression in already written t*x files.
  • Secondary tags are optional in the stream.
  • Since the secondary tags come dynamically from the modules, i.e. they aren't present in the data files, this will work with old data files as well.

Modifications needed

The following modules will need no modification:

  • Deformatter
  • Morph Analyser: Doesn't need any modification since for now we aren't considering putting secondary info in the dix, and even if we did, it would work as-is.
  • Pre-transfer
  • Post-generator
  • Reformatter

Some of the other modules' parsers need to be modified for the secondary tags and all the other modules need to be modified to be able to access the secondary info in the stream.

The next section will include a detailed account of the current stream input/output for each module, and what modifications are needed, if any.

Apertium stream at each module

INPUT: Los perros del chico corren rápido.

Morph Analyser

Output:

^Los/El<det><def><m><pl>/Prpers<prn><pro><p3><m><pl>$ ^perros/perro<n><m><pl>$ ^del/de<pr>+el<det><def><m><sg>$ ^chico/chico<n><m><sg>$ ^corren/correr<vblex><pri><p3><pl>$ ^rápido/rápido<adj><m><sg>$^./.<sent>$^./.<sent>$[][
]

The Morph Analyser takes the surface form of words as input and, using the monodix, outputs the surface form, the lemma and its analysis. Since (for now) we aren't planning to put secondary tags in the dix, we don't need to modify the morph analyser for this project.

POS Tagger

Current Output:

^El<det><def><m><pl>$ ^perro<n><m><pl>$ ^de<pr>+el<det><def><m><sg>$ ^chico<n><m><sg>$ ^correr<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>$^.<sent>$^.<sent>$[][
]

Proposed Output:

^El<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>$ ^de<pr><sf:del>+el<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>$ ^.<sent><sf:.>$ ^.<sent><sf:.>$[][
]
  • Here I have manually added secondary tags (surface forms of the words), and for compounds I have added the surface form on both the parts.
  • Will need to modify code such that it can add trailing secondary tags (surface forms, markup tags, etc.)
  • Parser needs no modification
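
The manual annotation above can be sketched in a few lines. This is only an illustration of the intended output, not tagger code: add_surface_form is a hypothetical name, the split on "+" is naive (it assumes "+" only ever joins the parts of a compound), and surface forms containing "<", ">" or ":" would need escaping that is not handled here.

```python
def add_surface_form(surface, analysis):
    """Append <sf:surface> to the chosen analysis of a word.

    For compounds joined by "+", the surface form is added to every part,
    as in the proposed output above. Hypothetical helper for illustration.
    """
    parts = analysis.split("+")
    return "+".join(f"{part}<sf:{surface}>" for part in parts)


def format_lu(surface, analysis):
    """Wrap the annotated analysis in the ^...$ lexical unit delimiters."""
    return f"^{add_surface_form(surface, analysis)}$"
```

For example, format_lu("del", "de<pr>+el<det><def><m><sg>") gives ^de<pr><sf:del>+el<det><def><m><sg><sf:del>$, which is the form shown in the proposed output.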

Pre transfer

Current Output:

^El<det><def><m><pl>$ ^perro<n><m><pl>$ ^de<pr>$ ^el<det><def><m><sg>$ ^chico<n><m><sg>$ ^correr<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>$^.<sent>$^.<sent>$[][
]

Output with modified input (i.e. it gave this output when I gave the modified POS tagger output):

^El<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>$ ^de<pr><sf:del>$ ^el<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>$ ^.<sent><sf:.>$ ^.<sent><sf:.>$[][
]
  • It doesn't seem like the parser needs any modifications. Works like it's supposed to.
  • We could modify the code so that it can add and access secondary tags, but this can be discussed, as it doesn't seem like it really needs it.

Bidix Lookup

Current Output:

^El<det><def><m><pl>/The<det><def><m><pl>$ ^perro<n><m><pl>/dog<n><m><pl>$ ^de<pr>/of<pr>/from<pr>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^chico<n><m><sg>/boy<n><sg>$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$[][
]

Output with modified input:

^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>$ ^.<sent><sf:.>/.<sent><sf:.>$ ^.<sent><sf:.>/.<sent><sf:.>$[][
]
  • Biltrans does what it should do: it copies the secondary tags onto the TL side.
  • The parser needs to be modified to be able to recognise secondary tags and access them.
  • It should be given the ability to add secondary tags in the stream (might need based on bidix information).
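
The copying behaviour seen above can be sketched like this. The real bidix is a compiled transducer that also carries unmatched trailing tags across; here a plain dictionary (TOY_BIDIX, hypothetical) stands in for it, bilingual_lookup and split_secondary are made-up names, and unknown-word handling is left out.

```python
import re

# One or more secondary tags at the very end of a reading.
SECONDARY_TAIL_RE = re.compile(r"((?:<[^<>]+:[^<>]*>)+)$")

# Hypothetical stand-in for a compiled bidix: SL reading -> TL readings.
TOY_BIDIX = {
    "de<pr>": ["of<pr>", "from<pr>"],
    "perro<n><m><pl>": ["dog<n><m><pl>"],
}


def split_secondary(reading):
    """Return (primary_part, trailing_secondary_tags) of a reading."""
    match = SECONDARY_TAIL_RE.search(reading)
    if match is None:
        return reading, ""
    return reading[:match.start()], match.group(1)


def bilingual_lookup(sl_reading, bidix):
    """Look up the primary part and copy the secondary tags to each translation."""
    primary, secondary = split_secondary(sl_reading)
    translations = [tl + secondary for tl in bidix.get(primary, [])]
    return "/".join([sl_reading] + translations)
```

With the toy dictionary, bilingual_lookup("de<pr><sf:del>", TOY_BIDIX) returns de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>, matching the corresponding unit in the output above.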

Lexical Selection

Current Output:

^El<det><def><m><pl>/The<det><def><m><pl>$ ^perro<n><m><pl>/dog<n><m><pl>$ ^de<pr>/of<pr>/from<pr>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^chico<n><m><sg>/boy<n><sg>$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$[][
]

Output with modified input:

^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>$ ^.<sent><sf:.>/.<sent><sf:.>$ ^.<sent><sf:.>/.<sent><sf:.>$[][
]
  • Doesn't seem like the lexical selection was used here - needs further experimentation.
  • The parser needs to be modified to be able to recognise secondary tags and access them.
  • It should be given the ability to add secondary tags in the stream.

Anaphora Resolution

Current Output:

^El<det><def><m><pl>/The<det><def><m><pl>/$ ^perro<n><m><pl>/dog<n><m><pl>/$ ^de<pr>/of<pr>/from<pr>/$ ^el<det><def><m><sg>/the<det><def><m><sg>/$ ^chico<n><m><sg>/boy<n><sg>/$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>/$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>/$^.<sent>/.<sent>/$^.<sent>/.<sent>/$[][
]

Output with modified input:

^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>/$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>/$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>/$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>/$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>/$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>/$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>/$ ^.<sent><sf:.>/.<sent><sf:.>/$ ^.<sent><sf:.>/.<sent><sf:.>/$[][
]
  • Anaphora Resolution wasn't used here but the idea is clear.
  • Code can be modified to put anaphora info as secondary tags instead of another separator.
  • The parser needs to be modified to be able to recognise secondary tags and access them.
  • It should be given the ability to add secondary tags in the stream.

Chunker (t1x)

Output with original input (without secondary tags):

^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$^punt<sent>{^.<sent>$}$^punt<sent>{^.<sent>$}$[][
]

Proposed output with modified input (with secondary tags):

^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][
]

Output with modified input (Reached proposed output):

^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][
]
  • Before any modification, the chunker stops matching prepositions and delimiters when they carry secondary tags. This is probably because the patterns were defined as "pr" instead of "pr.*". To deal with this issue we could modify transfer pattern matching such that it ignores secondary tags during pattern matching.
  • We also ignore secondary tags in pattern matching as FSTs are order dependent and secondary tags will not be. Most likely we will let the FSTs ignore secondary tags and make a new pattern matching system just for the secondary tags.
  • The parser will be modified to detect secondary tags and access them.
  • The output system of the transfer will need to be modified such that the secondary tags are output no matter what (since they won't be mentioned in the <out> section). The secondary tags in a TL LU will be taken from the LU that supplies its lemma. If the lemma comes from a variable, the secondary tags will come from wherever that variable got the lemma.
  • It will also be given the ability to add secondary tags in the output.
  • A pseudo-attribute will be added which gets a string of all secondary tags. Regex: ((?:<[^<>]+:[^<>]*>)+) or ((<[^>]+:[^>]+>)+) (courtesy User:popcorndude).
  • The pseudo-attribute tags will need to be modified such that it doesn't include secondary tags, to ensure backwards compatibility.
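
As a quick check of the first regex above, the sketch below applies it to a reading, next to a primary-only pattern. The name STAGS_RE, the primary-only pattern TAGS_RE and the two clip functions are assumptions made for this illustration; they are not the expressions actually used in transfer.

```python
import re

# First regex quoted above: a run of tags that each contain ":".
STAGS_RE = re.compile(r"((?:<[^<>]+:[^<>]*>)+)")

# Assumed primary-only counterpart: a run of tags that contain no ":".
TAGS_RE = re.compile(r"((?:<[^<>:]+>)+)")


def clip_stags(reading):
    """Return the string of secondary tags, or '' if there are none."""
    match = STAGS_RE.search(reading)
    return match.group(1) if match else ""


def clip_tags(reading):
    """Return the string of primary tags only, or '' if there are none."""
    match = TAGS_RE.search(reading)
    return match.group(1) if match else ""
```

On run<vblex><pres><sf:corren>, clip_stags returns <sf:corren> and clip_tags returns <vblex><pres>, so a rule that clips tags sees the same string whether or not secondary tags are present.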

Interchunk (t2x)

Current Output:

^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$^punt<sent>{^.<sent>$}$^punt<sent>{^.<sent>$}$[][
]

Output with modified input:

^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][
]
  • The order of the chunks didn't change here but it seems like it's not touching the secondary tags.
  • We will need to make a decision about whether we need secondary tags for chunks. If not, there's not much to change here.
  • If we do, then the parser will be modified to access secondary tags and add them in the stream if needed.

Postchunk (t3x)

Current Output:

^The<det><def><pl>$ ^dog<n><pl>$ ^of<pr>$ ^the<det><def><sg>$ ^boy<n><sg>$ ^run<vblex><pres>$ ^fast<adj><sint>$^.<sent>$^.<sent>$[][
]

Output with modified input:

^The<det><def><pl><sf:Los>$ ^dog<n><pl><sf:perros>$ ^of<pr><sf:del>$ ^the<det><def><sg><sf:del>$ ^boy<n><sg><sf:chico>$ ^run<vblex><pres><sf:corren>$ ^fast<adj><sint><sf:rápido>$^.<sent><sf:.>$^.<sent><sf:.>$[][
]
  • The role of the postchunker is to change the stream into a format the generator would accept, which it does even with secondary tags.
  • It can be modified to access and work with secondary tags as well.

Generator

Current Output:

The dogs of the boy run fast..[][
]

Output with modified input:

 #The #dog #of #the #boy #run #fast#.#.[][
]
  • This happens since FSTs don't match the words to their surface forms due to the extra secondary tags. To deal with this, the FSTs will ignore secondary tags in their monodix matching.
  • The parser will once again be modified to access secondary tags and work with them. The output has no tags so nothing to change there.

Implementation

Accessing Secondary Tags

After a thorough discussion, we decided to implement access to secondary tags through a flat_multimap<Tag,size_t>, where Tag is {string_view prefix, string_view value} and size_t is the position of the secondary tag in the list of secondary tags.

  • This enables us to query tags using their prefix.
  • It also preserves the position of the tags if a user should need it.

This will also be used to do pattern matching for secondary tags.
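
The idea behind this container can be sketched in Python, which has no flat_multimap of its own. Here a sorted list of ((prefix, value), position) entries plays that role, under the assumption that the flat_multimap is a sorted, contiguous container that allows duplicate keys; build_index and find_by_prefix are hypothetical names.

```python
import bisect


def build_index(secondary_tags):
    """Build a sorted index from 'prefix:value' strings given in stream order.

    Each entry is ((prefix, value), position), mirroring the Tag key and the
    size_t position described above.
    """
    entries = []
    for position, tag in enumerate(secondary_tags):
        prefix, value = tag.split(":", 1)
        entries.append(((prefix, value), position))
    entries.sort()
    return entries


def find_by_prefix(index, prefix):
    """Return every (value, position) stored under the given prefix."""
    # The empty value and position -1 sort before any real entry with this
    # prefix, so bisect_left lands on the first such entry.
    start = bisect.bisect_left(index, ((prefix, ""), -1))
    results = []
    for (entry_prefix, value), position in index[start:]:
        if entry_prefix != prefix:
            break
        results.append((value, position))
    return results
```

Building the index from sf:отца, s:human, s:kin, t:a:ef31, t:span:fcd32 and querying the prefix s returns human at position 1 and kin at position 2, so both the prefix lookup and the original position are available.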

Outputting Secondary Tags

Outputting secondary tags will use the current system of outputting tags, since they're still of the format <..>.

Progress

The first thing to do is to modify modules such that secondary tags pass through the stream without any problems. Further updates can be found here.

Transfer

9 May 2020

  • Modified transfer.cc so that while pattern matching, secondary tags are ignored.

Earlier output:

^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^default<default>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$ ^default<default>{^.<sent><sf:.>$}$ ^default<default>{^.<sent><sf:.>$}$[][
]

New output:

^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$ ^punt<sent>{^.<sent><sf:.>$}$ ^punt<sent>{^.<sent><sf:.>$}$[][
]

11 May 2020

  • Modified regex of tags so that it ignores secondary tags and added regex for stags in transfer_data.cc
  • Transfer gets secondary tags from the LU where the lem or lemh comes from in the <lu>..</lu>, given that this lu block is in <out>..</out>. Still need to deal with variables.

Earlier output:

^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$ ^punt<sent>{^.<sent><sf:.>$}$ ^punt<sent>{^.<sent><sf:.>$}$[][
]

New output:

^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$ ^punt<sent>{^.<sent><sf:.>$}$ ^punt<sent>{^.<sent><sf:.>$}$[][
]

Note, the verb didn't get its secondary tag because it uses a variable to print the lemma.

12 May 2020

  • Modified transfer so that a map is created between variable_name and secondary tags from the lemma that this variable clips (only if the variable clips a lem/lemh), which is then added in the output if the variable is called in output.

We have reached our desired transfer output.

Current output:

^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][
]
  • Input without secondary tags shows no regression in transfer output.
  • Need to test more thoroughly with a one-step transfer system, with MLUs, and with more variables.


Generator

  • Modified the generator such that it removes all trailing secondary tags before giving the input to the FST.

Earlier Output:

 #The #dog #of #the #boy #run #fast#.#.[][
]

New Output:

The dogs of the boy run fast..[][
]

Further updates can be found here.