User:Khannatanmai/New Apertium stream format


Here I will provide updates about the development of the new Apertium stream format, which will include an arbitrary amount of optional secondary information.

All discussions on IRC about this can be found in the discussion page of this wiki.


This project grew out of the project to eliminate dictionary trimming. To do that, we need to modify the Apertium stream format so that it can also carry the surface form of words. This first requires a formalism for the new stream format, and then a modification to all the parsers in the pipeline.

However, if we are going to modify all the parsers to include the surface form in the lexical unit anyway, we concluded in our discussion (available on the discussion page) that it is worthwhile to modify the stream format so that each program can include and process an arbitrary amount of information in the Apertium stream, not just the surface form. With this proposal we are trying to prepare the Apertium stream for the future. Today we realised that we need the surface form in the stream; tomorrow we might need semantic tags, sentiment tags, etc. If we don't do this now, we will have to modify all the parsers in the pipeline each time we need more information in the pipe. This is why it is a good idea to modify the parsers now so that they can handle an arbitrary amount of information.

Another concrete benefit of secondary tags is the ability to include information in the stream that isn't a pre-defined list. This is discussed in detail later.


Eliminating Dictionary Trimming

Markup handling

Markup handling is a huge issue because we cannot attach arbitrary information, such as markup, to a lexical unit. As a result, when lexical units get moved around, the markup does not move with them.

Using secondary information we can attach markup information to the lexical units, so that it moves around with the LU during transfer.


The stream will now have primary information: all the information currently available in the stream, such as lemma and analysis. It will also have optional secondary information, in a feature:value format. We discussed several possible syntaxes for this new stream format, and the one that seems the best is something like this:


Note that case here refers to capitalisation, not morphological case which is already a tag and hence primary information.

  • This doesn't mess with the current stream format too much. The primary information syntax is unchanged, and not prefixed. Since secondary info is optional, this will be fully backwards compatible.
  • Secondary information tags will always be trailing.
  • The number of tags is already arbitrary so that helps.
  • The secondary tags contain a ":" that would help distinguish them from primary tags.
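
As a rough illustration of the last point, the following Python sketch separates the tags of a single reading into primary and secondary ones, based only on whether a tag contains a ":". The function name split_tags and the regular expression are assumptions made for this example; they are not part of any Apertium tool.

```python
import re

# Matches the content of any <...> tag in a reading.
TAG_RE = re.compile(r"<([^<>]+)>")

def split_tags(reading):
    """Split one reading, e.g. 'perro<n><m><pl><sf:perros>', into
    (lemma, primary_tags, secondary_tags).

    Hypothetical helper: a tag counts as secondary if it contains ':'.
    Secondary tags are returned as (feature, value) pairs, in stream order.
    """
    lemma = reading.split("<", 1)[0]
    primary, secondary = [], []
    for tag in TAG_RE.findall(reading):
        if ":" in tag:
            feature, value = tag.split(":", 1)
            secondary.append((feature, value))
        else:
            primary.append(tag)
    return lemma, primary, secondary
```

A reading without secondary tags simply yields an empty list of pairs, which is the backwards-compatible case.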

This is just an example, but the idea is that any program in the pipeline can add as well as read this secondary information from the stream, and in the future one can add any amount of information in the language models or the translation modules. Later you can see how this formalism looks at every step in the Apertium pipeline.

Instead of looking at this as modifying or extending the apertium stream format, we could also look at this as making tags more versatile by creating a new kind of tags which have a feature:value pair.


What is secondary information and why does the Apertium stream need it?

Primary vs. Secondary Information

Adding the ability to have an arbitrary amount of information in the Apertium stream may seem redundant, since we can already have as many tags as we want. However, there are a few limitations with the current Apertium tags, which we will be calling primary tags:

  • They are order dependent (due to the nature of pattern matching in FSTs)
  • They need to be a pre-defined list (See Tags)

However, there are several types of information that are not suited to pre-defined lists. They are open sets, such as surface forms or markup tags. Primary tags cannot deal with this kind of information, and hence the ability to deal with arbitrary information that doesn't need to be fully pre-defined makes the stream significantly more powerful.

Pattern Matching in FSTs

Pattern matching in FSTs is pretty strict, and in several files (dix, bidix, t*x), if the users haven't written a ".*" at the end of their pattern, any input with secondary tags will not match, as these tags are always trailing. To deal with this, we have decided to make the FSTs ignore secondary tags throughout the pipe. FSTs are also order dependent, and secondary tags cannot have a pre-defined order due to the fact that they're supposed to handle an arbitrary amount of information.

Once the FSTs have ignored secondary tags, we will have a separate system to pattern match with secondary information. This will be discussed further in the Implementation section.
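
The effect of trailing secondary tags on strict patterns, and of ignoring them, can be sketched in Python as follows. Here strict_match is only a toy stand-in for a pattern written without a trailing ".*" (it is not how the lttoolbox FSTs are implemented), and strip_secondary is a hypothetical helper name.

```python
import re

# A secondary tag is any <...> tag that contains a ':'.
SECONDARY_TAG = re.compile(r"<[^<>]+:[^<>]*>")

def strip_secondary(reading):
    """Return the reading with all secondary tags removed."""
    return SECONDARY_TAG.sub("", reading)

def strict_match(pattern_tags, reading):
    """Toy strict matcher: the tags of the reading must equal the
    pattern exactly, in order, with nothing trailing."""
    return re.findall(r"<([^<>]+)>", reading) == pattern_tags

reading = "de<pr><sf:del>"
# With the secondary tag present, a pattern for just <pr> does not match.
matched_raw = strict_match(["pr"], reading)
# Once secondary tags are ignored, the same pattern matches again.
matched_stripped = strict_match(["pr"], strip_secondary(reading))
```

In this sketch matched_raw is False and matched_stripped is True, which is the behaviour the FSTs are meant to have once they ignore secondary tags.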

Potential benefits

While optional secondary information in the stream sounds great, this project isn't just about abstract future benefits. As part of this project, after implementing this modification to the stream, we will experiment by including the surface form in the stream and avoiding trimming, as described above. If the results are satisfactory, we can move on to other kinds of information. The biggest benefit of secondary tags will be the ability to link to LUs information that does not come from a pre-defined finite list. This could be, but isn't limited to:

  • Markup tags: If we can attach markup tags in the lexical unit, they will move around with the unit in transfer.
  • Semantic information
  • Subcategorisation info
  • Dependency

We would probably create a wiki page listing all types of secondary info, and the associated prefixes to be used with each of them. This list would be extendable based on the task.

Note that as part of this project, I will not be adding any secondary info to data files, such as monodix or bidix. The secondary information will only be information that programs can output in the stream and will mostly be dynamic. All current pipes will continue to work as-is, unmodified. All old data and files remain valid.

Proof of Concept and No regression

I've talked earlier about the benefits as well as the potential benefits that will come from including secondary information in the Apertium stream. Apart from these benefits, this project also promises no regression and complete backwards compatibility.

Hundreds of language pair translation systems and several other systems work on the current apertium stream format, and hence any modification that leads to any possible regression is completely unacceptable. This is why the project will be following a test-driven development format. There are several decisions that I've taken after a thorough discussion with Apertium experts, which ensure that we can get adequate benefits of secondary information without affecting systems that don't or will not use secondary tags.

  • Secondary tags in FSTs: Finite State Transducers are order dependent and pretty strict with pattern matching, and in several cases adding secondary tags to LUs would make them not match in FSTs. Due to this, we have decided to not include secondary tags anywhere we are using FSTs for pattern matching in the pipeline, and to have a different method of matching for secondary tags (discussed in Implementation). The FSTs will ignore secondary tags and hence ensure no regression in pattern matching.
  • As part of this project, we aren't adding any secondary information in data files (monodix, bidix) to ensure that this works with old data files as well.
  • Tags will be separated in our understanding of them into primary tags and secondary tags. The difference is discussed above. In sections where it's possible to refer to tags, such as clipping tags in transfer files, the definition (regex match) will be modified such that it only matches primary tags, to ensure no regression in already written t*x files.
  • Secondary tags are optional in the stream.
  • Since the secondary tags come dynamically from the modules, i.e. they aren't present in the data files, this will work with old data files as well.

Modifications needed

The following modules will need no modification:

  • Deformatter
  • Morph Analyser: Doesn't need any modification since for now we aren't considering putting secondary info in the dix, and even if we did, it would work as-is.
  • Pre-transfer
  • Post-generator
  • Reformatter

Some of the other modules' parsers need to be modified for the secondary tags and all the other modules need to be modified to be able to access the secondary info in the stream.

The next section will include a detailed account of the current stream input/output for each module, and what modifications are needed, if any.

Apertium stream at each module

INPUT: Los perros del chico corren rápido.

Morph Analyser


^Los/El<det><def><m><pl>/Prpers<prn><pro><p3><m><pl>$ ^perros/perro<n><m><pl>$ ^del/de<pr>+el<det><def><m><sg>$ ^chico/chico<n><m><sg>$ ^corren/correr<vblex><pri><p3><pl>$ ^rápido/rápido<adj><m><sg>$^./.<sent>$^./.<sent>$[][

The Morph Analyser takes the surface form of words as input and, using the monodix, outputs the surface form, the lemma and its analysis. Since (for now) we aren't planning to put secondary tags in the dix, we don't need to modify the morph analyser for this project.

POS Tagger

Current Output:

^El<det><def><m><pl>$ ^perro<n><m><pl>$ ^de<pr>+el<det><def><m><sg>$ ^chico<n><m><sg>$ ^correr<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>$^.<sent>$^.<sent>$[][

Proposed Output:

^El<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>$ ^de<pr><sf:del>+el<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>$ ^.<sent><sf:.>$ ^.<sent><sf:.>$[][
  • Here I have manually added secondary tags (surface forms of the words), and for compounds I have added the surface form on both the parts.
  • Will need to modify code such that it can add trailing secondary tags (surface forms, markup tags, etc.)
  • Parser needs no modification
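
A minimal Python sketch of how the tagger might append the surface form as a trailing secondary tag is given below. The helper name add_surface_form is hypothetical, and the sketch assumes that "+" only ever separates the parts of a compound and that the surface form contains no reserved stream characters.

```python
def add_surface_form(surface, analysis):
    """Append <sf:surface> to every part of the chosen analysis.

    Hypothetical helper. For compounds joined with '+', the surface
    form is added to both parts, as in the proposed output above.
    """
    parts = analysis.split("+")
    tagged = "+".join(f"{part}<sf:{surface}>" for part in parts)
    return "^" + tagged + "$"
```

For the compound in the example sentence, add_surface_form("del", "de<pr>+el<det><def><m><sg>") gives ^de<pr><sf:del>+el<det><def><m><sg><sf:del>$.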

Pre transfer

Current Output:

^El<det><def><m><pl>$ ^perro<n><m><pl>$ ^de<pr>$ ^el<det><def><m><sg>$ ^chico<n><m><sg>$ ^correr<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>$^.<sent>$^.<sent>$[][

Output with modified input (i.e. it gave this output when I gave the modified POS tagger output):

^El<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>$ ^de<pr><sf:del>$ ^el<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>$ ^.<sent><sf:.>$ ^.<sent><sf:.>$[][
  • It doesn't seem like the parser needs any modifications. Works like it's supposed to.
  • We could modify the code so that it can add and access secondary tags, but this can be discussed, as it doesn't seem like it really needs it.

Bidix Lookup

Current Output:

^El<det><def><m><pl>/The<det><def><m><pl>$ ^perro<n><m><pl>/dog<n><m><pl>$ ^de<pr>/of<pr>/from<pr>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^chico<n><m><sg>/boy<n><sg>$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$[][

Output with modified input:

^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>$ ^.<sent><sf:.>/.<sent><sf:.>$ ^.<sent><sf:.>/.<sent><sf:.>$[][
  • Biltrans does what it should do: it copies the secondary tags to the TL side.
  • The parser needs to be modified to be able to recognise secondary tags and access them.
  • It should be given the ability to add secondary tags in the stream (this might be needed based on bidix information).
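
The copying behaviour described above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the bidix is modelled as a plain dictionary from a full source reading to a list of target readings (the real bidix is a compiled transducer), and the function name bidix_lookup is hypothetical.

```python
import re

# One or more secondary tags at the very end of a reading.
TRAILING_SECONDARY = re.compile(r"(?:<[^<>]+:[^<>]*>)+$")

def bidix_lookup(sl_reading, bidix):
    """Look up the primary part of the SL reading in a toy bidix and
    copy the trailing secondary tags onto every TL translation."""
    match = TRAILING_SECONDARY.search(sl_reading)
    secondary = match.group(0) if match else ""
    primary = sl_reading[:len(sl_reading) - len(secondary)]
    translations = [tl + secondary for tl in bidix[primary]]
    return "^" + "/".join([sl_reading] + translations) + "$"

# Toy dictionary with the two translations of 'de' from the example.
toy_bidix = {"de<pr>": ["of<pr>", "from<pr>"]}
```

With this toy dictionary, bidix_lookup("de<pr><sf:del>", toy_bidix) gives ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>$, matching the corresponding unit in the output above.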

Lexical Selection

Current Output:

^El<det><def><m><pl>/The<det><def><m><pl>$ ^perro<n><m><pl>/dog<n><m><pl>$ ^de<pr>/of<pr>/from<pr>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^chico<n><m><sg>/boy<n><sg>$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$[][

Output with modified input:

^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>$ ^.<sent><sf:.>/.<sent><sf:.>$ ^.<sent><sf:.>/.<sent><sf:.>$[][
  • Doesn't seem like the lexical selection was used here - needs further experimentation.
  • The parser needs to be modified to be able to recognise secondary tags and access them.
  • It should be given the ability to add secondary tags in the stream.

Anaphora Resolution

Current Output:

^El<det><def><m><pl>/The<det><def><m><pl>/$ ^perro<n><m><pl>/dog<n><m><pl>/$ ^de<pr>/of<pr>/from<pr>/$ ^el<det><def><m><sg>/the<det><def><m><sg>/$ ^chico<n><m><sg>/boy<n><sg>/$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>/$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>/$^.<sent>/.<sent>/$^.<sent>/.<sent>/$[][

Output with modified input:

^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>/$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>/$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>/$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>/$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>/$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>/$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>/$ ^.<sent><sf:.>/.<sent><sf:.>/$ ^.<sent><sf:.>/.<sent><sf:.>/$[][
  • Anaphora Resolution wasn't used here but the idea is clear.
  • Code can be modified to put anaphora info as secondary tags instead of another separator.
  • The parser needs to be modified to be able to recognise secondary tags and access them.
  • It should be given the ability to add secondary tags in the stream.

Chunker (t1x)

Current Output:

^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$^punt<sent>{^.<sent>$}$^punt<sent>{^.<sent>$}$[][

Output with modified input:

^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^default<default>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$ ^default<default>{^.<sent><sf:.>$}$ ^default<default>{^.<sent><sf:.>$}$[][
  • It stops matching prepositions and delimiters. This is probably because the patterns were defined as "pr" instead of "pr.*". To deal with this issue we could modify transfer pattern matching such that it ignores secondary tags during pattern matching. This would assume we cannot add secondary tags in t1x, and is consistent with our policy to not add tags in dixes. This was discussed earlier in the section outlining why we need secondary tags.
  • We also ignore secondary tags in pattern matching as FSTs are order dependent and secondary tags will not be. Most likely we will let the FSTs ignore secondary tags and make a new pattern matching system just for the secondary tags.
  • The parser will be modified to detect secondary tags and access them.
  • The output system of the transfer will need to be modified such that the secondary tags are output no matter what (since they won't be mentioned in the <out> section). These secondary tags in the TL LU will be taken from wherever the lemma of a tag comes from. If the lemma comes from a variable then naturally it won't have any secondary tags.
  • It will also be given the ability to add secondary tags in the output.
  • A pseudo-attribute will be added which gets a string of all secondary tags. Regex: ((?:<[^<>]+:[^<>]*>)+) or ((<[^>]+:[^>]+>)+) (courtesy User:popcorndude).
  • The pseudo-attribute tags will need to be modified such that it doesn't include secondary tags, to ensure backwards compatibility.
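
The two pseudo-attributes can be tried out with a short Python sketch. The secondary-tag expression is the first regex quoted above; the primary-only expression and the helper name clip are assumptions made for this example, not the final definitions.

```python
import re

# Regex quoted in the list above: a run of secondary (feature:value) tags.
SECONDARY_ATTR = re.compile(r"((?:<[^<>]+:[^<>]*>)+)")
# Assumed primary-only definition: a run of tags that contain no ':'.
PRIMARY_TAGS_ATTR = re.compile(r"((?:<[^<>:]+>)+)")

def clip(reading, attr_regex):
    """Return the first part of the reading matched by the attribute
    regex, or an empty string if nothing matches."""
    match = attr_regex.search(reading)
    return match.group(1) if match else ""

reading = "perro<n><m><pl><sf:perros>"
secondary_part = clip(reading, SECONDARY_ATTR)   # "<sf:perros>"
primary_part = clip(reading, PRIMARY_TAGS_ATTR)  # "<n><m><pl>"
```

Because the primary-only expression excludes ":" inside a tag, existing rules that clip tags would see the same string whether or not secondary tags are present in the stream.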

Output modified manually (Proposed output):

^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][

Interchunk (t2x)

Current Output:

^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$^punt<sent>{^.<sent>$}$^punt<sent>{^.<sent>$}$[][

Output with modified input:

^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][
  • The order of the chunks didn't change here but it seems like it's not touching the secondary tags.
  • We will need to make a decision about whether we need secondary tags for chunks. If not, there's not much to change here.
  • If we do, then the parser will be modified to access secondary tags and add them in the stream if needed.

Postchunk (t3x)

Current Output:

^The<det><def><pl>$ ^dog<n><pl>$ ^of<pr>$ ^the<det><def><sg>$ ^boy<n><sg>$ ^run<vblex><pres>$ ^fast<adj><sint>$^.<sent>$^.<sent>$[][

Output with modified input:

^The<det><def><pl><sf:Los>$ ^dog<n><pl><sf:perros>$ ^of<pr><sf:del>$ ^the<det><def><sg><sf:del>$ ^boy<n><sg><sf:chico>$ ^run<vblex><pres><sf:corren>$ ^fast<adj><sint><sf:rápido>$^.<sent><sf:.>$^.<sent><sf:.>$[][
  • The role of the postchunker is to change the stream into a format the generator would accept, which it does even with secondary tags.
  • It can be modified to access and work with secondary tags as well.


Generator

Current Output:

The dogs of the boy run fast..[][

Output with modified input:

 #The #dog #of #the #boy #run #fast#.#.[][
  • This happens since FSTs don't match the words to their surface forms due to the extra secondary tags. To deal with this, the FSTs will ignore secondary tags in their monodix matching.
  • The parser will once again be modified to access secondary tags and work with them. The output has no tags so nothing to change there.


Accessing Secondary Tags

After a thorough discussion, we decided that access to secondary tags will be implemented through a flat_multimap<Tag,size_t>, where Tag is {string_view prefix, string_view value} and size_t is the position of the secondary tag in the list of secondary tags.

  • This enables us to query tags using their prefix.
  • It also preserves the position of the tags if a user should need it.

This will also be used to do pattern matching for secondary tags.
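
The idea behind this structure can be sketched in Python, using a sorted list of ((prefix, value), position) entries as a stand-in for the flat_multimap. The function names build_index and query are hypothetical, and the <case:aa> tag in the usage example is an invented value used only for illustration.

```python
import bisect
import re

# Captures the prefix (up to the first ':') and the value of a secondary tag.
SECONDARY_TAG = re.compile(r"<([^<>:]+):([^<>]*)>")

def build_index(reading):
    """Return a sorted list of ((prefix, value), position) entries,
    where position is the index of the tag among the secondary tags."""
    entries = [
        ((m.group(1), m.group(2)), pos)
        for pos, m in enumerate(SECONDARY_TAG.finditer(reading))
    ]
    entries.sort()
    return entries

def query(index, prefix):
    """Return (value, position) for every tag with the given prefix,
    locating the start of the range by binary search."""
    start = bisect.bisect_left(index, ((prefix, ""), -1))
    result = []
    for (tag_prefix, value), pos in index[start:]:
        if tag_prefix != prefix:
            break
        result.append((value, pos))
    return result

index = build_index("perro<n><m><pl><sf:perros><case:aa>")
```

Here query(index, "sf") returns [("perros", 0)] and query(index, "case") returns [("aa", 1)], so a tag can be found by its prefix while its original position is still available.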

Outputting Secondary Tags

Outputting secondary tags will use the current system of outputting tags, since they're still of the format <..>.