User:Khannatanmai/New Apertium stream format


Here I will provide updates about the development of the new Apertium stream format, which will include an arbitrary amount of optional secondary information.

All discussions on IRC about this can be found in the discussion page of this wiki.


Rationale

This project was in a way born out of the project to eliminate dictionary trimming. To do that, we need to modify the Apertium stream format so that it can include the surface form of words as well. This first requires a formalism for the new stream format and then a modification to all the parsers in the pipeline.

However, if we are going to modify all the parsers to include the surface form in the lexical unit, we concluded in our discussion (see the discussion page) that it is worthwhile to modify the stream format so that each program can include and process an arbitrary amount of information in the Apertium stream, not just the surface form. With this proposal we are trying to prepare the Apertium stream for the future. Today we realised that we need the surface form in the stream, and tomorrow we might need semantic tags, sentiment tags, etc. If we don't do this now, we will have to modify all the parsers in the pipeline each time we need more information in the pipe. This is why it's a good idea to modify the parsers so that they can handle an arbitrary amount of information.

Another concrete benefit of secondary tags is the ability to include information in the stream that isn't a pre-defined list. This is discussed in detail later.

Formalism

The stream will now have primary information: everything currently available in the stream, such as the lemma and analysis. It will also have optional secondary information, in a feature:value format. We discussed several possible syntaxes for this new stream format, and the one that seems best is something like this:

^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>/patata<n><f><pl><more:other>$

Note that case here refers to capitalisation, not morphological case which is already a tag and hence primary information.

  • This doesn't mess with the current stream format too much. The primary information syntax is unchanged, and not prefixed. Since secondary info is optional, this will be fully backwards compatible.
  • Secondary information tags will always be trailing.
  • The number of tags is already arbitrary so that helps.
  • The secondary tags contain a ":" that would help distinguish them from primary tags.
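
To make the distinction concrete, here is a minimal sketch of how a program could separate the two kinds of tags in a single reading. It is not taken from any Apertium module: the function name split_reading is hypothetical, and the sketch assumes that tag bodies never contain escaped "<" or ">" characters.

```python
import re

# Matches one <...> tag; the tag body may not contain '<' or '>'.
TAG_RE = re.compile(r"<([^<>]+)>")


def split_reading(reading):
    """Split 'lemma<tag>...' into (lemma, primary tags, secondary tags).

    A tag is treated as secondary when its body contains ':'.
    """
    lemma = reading.split("<", 1)[0]
    primary, secondary = [], []
    for body in TAG_RE.findall(reading):
        if ":" in body:
            # Split on the first ':' only, so the value itself may contain ':'.
            feature, value = body.split(":", 1)
            secondary.append((feature, value))
        else:
            primary.append(body)
    return lemma, primary, secondary


lemma, primary, secondary = split_reading(
    "potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>"
)
print(lemma, primary, secondary)
```

Because secondary tags are kept as a list of pairs, their order in the stream is preserved even though nothing depends on it.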

This is just an example, but the idea is that any program in the pipeline can add as well as read this secondary information from the stream, and in the future one can add any amount of information in the language models or the translation modules. Later you can see how this formalism looks at every step in the Apertium pipeline.

Instead of looking at this as modifying or extending the Apertium stream format, we could also look at it as making tags more versatile by creating a new kind of tag that holds a feature:value pair.

What is secondary information and why does the Apertium stream need it?

Primary vs. Secondary Information

Adding the ability to have an arbitrary amount of information in the Apertium stream may seem redundant since we can already have as many tags as we want. However, there are a few limitations with the current Apertium tags, which we will be calling primary tags:

  • They are order dependent (due to the nature of pattern matching in FSTs)
  • They need to be a pre-defined list (See Tags)

However, there are several types of information that are not suited to pre-defined lists. They are open sets, such as surface forms or markup tags. Primary tags cannot deal with this kind of information, so the ability to handle arbitrary information that doesn't need to be fully pre-defined makes the stream significantly more powerful.

Pattern Matching in FSTs

Pattern matching in FSTs is pretty strict, and in several files (dix, bidix, t*x), if the users haven't written a ".*" at the end of their pattern, any input with secondary tags will not match, as these tags are always trailing. To deal with this, we have decided to make the FSTs ignore secondary tags throughout the pipe. FSTs are also order dependent, and secondary tags cannot have a pre-defined order due to the fact that they're supposed to handle an arbitrary amount of information.

Once the FSTs have ignored secondary tags, we will have a separate system to pattern match with secondary information. This will be discussed further in the Implementation section.
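
The effect of ignoring secondary tags can be illustrated with a small sketch. It does not use a real FST: matches_exact_tags is a hypothetical stand-in for a strict pattern such as "pr" written without a trailing ".*", and strip_secondary is an assumed helper that removes every tag containing ":".

```python
import re

# A secondary tag is any <...> tag whose body contains ':'.
SECONDARY_TAG_RE = re.compile(r"<[^<>:]+:[^<>]*>")
TAG_RE = re.compile(r"<([^<>]+)>")


def strip_secondary(text):
    """Return the text with all secondary tags removed."""
    return SECONDARY_TAG_RE.sub("", text)


def matches_exact_tags(reading, pattern_tags):
    """Strict match: the reading's tags must equal pattern_tags, in order."""
    return TAG_RE.findall(reading) == pattern_tags


reading = "of<pr><sf:del>"
# The strict pattern fails because of the trailing secondary tag...
print(matches_exact_tags(reading, ["pr"]))
# ...and succeeds once secondary tags are ignored.
print(matches_exact_tags(strip_secondary(reading), ["pr"]))
```

In this sketch the secondary tags are simply dropped before matching; a real implementation would need to keep them aside so they can be matched separately and written back to the stream.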

Potential benefits

While optional secondary information in the stream sounds great, this project isn't just about abstract future benefits. As part of this project, after implementing this modification to the stream, we will experiment by including the surface form in the stream and avoid trimming, as described above. If the results are satisfactory, we can move on to other kinds of information. These could be, but aren't limited to:

  • Markup tags: If we can attach markup tags in the lexical unit, they will move around with the unit in transfer.
  • Semantic information
  • Subcategorisation info
  • Dependency

We would probably create a wiki page listing all types of secondary info, and the associated prefixes to be used with each of them. This list would be extendable based on the task.

Note that as part of this project, I will not be adding any secondary info to data files, such as monodix or bidix. The secondary information will only be information that programs can output in the stream and will mostly be dynamic. All current pipes will continue to work as-is, unmodified. All old data and files remain valid.

Modifications needed

The following modules will need no modification:

  • Deformatter
  • Morph Analyser: Doesn't need any modification since for now we aren't considering putting secondary info in the dix, and even if we did, it would work as-is.
  • Pre-transfer
  • Post-generator
  • Reformatter

Of the remaining modules, some need their parsers modified to recognise secondary tags, and all of them need to be modified so that they can access the secondary info in the stream.

The next section will include a detailed account of the current stream input/output for each module, and what modifications are needed, if any.

Apertium stream at each module

INPUT: Los perros del chico corren rápido..

Morph Analyser

Output:

^Los/El<det><def><m><pl>/Prpers<prn><pro><p3><m><pl>$ ^perros/perro<n><m><pl>$ ^del/de<pr>+el<det><def><m><sg>$ ^chico/chico<n><m><sg>$ ^corren/correr<vblex><pri><p3><pl>$ ^rápido/rápido<adj><m><sg>$^./.<sent>$^./.<sent>$[][ ]

The Morph Analyser takes the surface form of words as input and, using the monodix, outputs the surface form, the lemma and its analysis. Since (for now) we aren't planning to put secondary tags in the dix, we don't need to modify the morph analyser for this project.

POS Tagger

Current Output:

^El<det><def><m><pl>$ ^perro<n><m><pl>$ ^de<pr>+el<det><def><m><sg>$ ^chico<n><m><sg>$ ^correr<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>$^.<sent>$^.<sent>$[][ ]

Proposed Output:

^El<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>$ ^de<pr><sf:del>+el<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>$ ^.<sent><sf:.>$ ^.<sent><sf:.>$[][ ]

  • Here I have manually added secondary tags (surface forms of the words), and for compounds I have added the surface form on both the parts.
  • Will need to modify code such that it can add trailing secondary tags (surface forms, markup tags, etc.)
  • Parser needs no modification
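
As a rough sketch of the step described above, the following function appends the surface form as a trailing secondary tag to a reading the tagger has already chosen. The name attach_surface_form is hypothetical, and the sketch assumes that "+" only ever appears as the separator between the parts of a compound.

```python
def attach_surface_form(surface, reading):
    """Append <sf:surface> to a chosen reading.

    For compounds joined with '+', the surface form is added to every part,
    mirroring the proposed output above.
    """
    parts = reading.split("+")
    return "+".join(f"{part}<sf:{surface}>" for part in parts)


print("^" + attach_surface_form("perros", "perro<n><m><pl>") + "$")
print("^" + attach_surface_form("del", "de<pr>+el<det><def><m><sg>") + "$")
```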

Pre transfer

Current Output:

^El<det><def><m><pl>$ ^perro<n><m><pl>$ ^de<pr>$ ^el<det><def><m><sg>$ ^chico<n><m><sg>$ ^correr<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>$^.<sent>$^.<sent>$[][ ]

Output with modified input (i.e. it gave this output when I gave the modified POS tagger output):

^El<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>$ ^de<pr><sf:del>$ ^el<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>$ ^.<sent><sf:.>$ ^.<sent><sf:.>$[][ ]

  • It doesn't seem like the parser needs any modifications. Works like it's supposed to.
  • We could modify the code so that it can add and access secondary tags, but this can be discussed, as it doesn't seem like it really needs it.

Bidix Lookup

Current Output:

^El<det><def><m><pl>/The<det><def><m><pl>$ ^perro<n><m><pl>/dog<n><m><pl>$ ^de<pr>/of<pr>/from<pr>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^chico<n><m><sg>/boy<n><sg>$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$[][ ]

Output with modified input:

^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>$ ^.<sent><sf:.>/.<sent><sf:.>$ ^.<sent><sf:.>/.<sent><sf:.>$[][ ]

  • Biltrans does what it should do: it copies the secondary tags to the TL side.
  • The parser needs to be modified to be able to recognise secondary tags and access them.
  • It should be given the ability to add secondary tags in the stream (this might be needed based on bidix information).
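
The copying behaviour seen in this output can be sketched as follows. TOY_BIDIX is a hypothetical two-entry dictionary built from the example above, not a real bidix, and lookup_with_secondary is an assumed name; the sketch looks up the reading without its secondary tags and then appends them to every target-language translation.

```python
import re

# A secondary tag is any <...> tag whose body contains ':'.
SECONDARY_TAG_RE = re.compile(r"<[^<>:]+:[^<>]*>")

# Toy bilingual dictionary (hypothetical, copied from the example output).
TOY_BIDIX = {
    "perro<n><m><pl>": ["dog<n><m><pl>"],
    "de<pr>": ["of<pr>", "from<pr>"],
}


def lookup_with_secondary(source_reading):
    """Look up a reading while ignoring its secondary tags, then copy them to the TL side."""
    secondary = "".join(SECONDARY_TAG_RE.findall(source_reading))
    key = SECONDARY_TAG_RE.sub("", source_reading)
    targets = TOY_BIDIX[key]  # raises KeyError for entries missing from the toy dictionary
    return "^" + "/".join([source_reading] + [t + secondary for t in targets]) + "$"


print(lookup_with_secondary("de<pr><sf:del>"))
```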

Lexical Selection

Current Output:

^El<det><def><m><pl>/The<det><def><m><pl>$ ^perro<n><m><pl>/dog<n><m><pl>$ ^de<pr>/of<pr>/from<pr>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^chico<n><m><sg>/boy<n><sg>$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$[][ ]

Output with modified input:

^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>$ ^.<sent><sf:.>/.<sent><sf:.>$ ^.<sent><sf:.>/.<sent><sf:.>$[][ ]

  • Doesn't seem like the lexical selection was used here - needs further experimentation.
  • The parser needs to be modified to be able to recognise secondary tags and access them.
  • It should be given the ability to add secondary tags in the stream.

Anaphora Resolution

Current Output:

^El<det><def><m><pl>/The<det><def><m><pl>/$ ^perro<n><m><pl>/dog<n><m><pl>/$ ^de<pr>/of<pr>/from<pr>/$ ^el<det><def><m><sg>/the<det><def><m><sg>/$ ^chico<n><m><sg>/boy<n><sg>/$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>/$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>/$^.<sent>/.<sent>/$^.<sent>/.<sent>/$[][ ]

Output with modified input:

^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>/$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>/$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>/$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>/$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>/$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>/$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>/$ ^.<sent><sf:.>/.<sent><sf:.>/$ ^.<sent><sf:.>/.<sent><sf:.>/$[][ ]

  • Anaphora Resolution wasn't used here but the idea is clear.
  • Code can be modified to put anaphora info as secondary tags instead of another separator.
  • The parser needs to be modified to be able to recognise secondary tags and access them.
  • It should be given the ability to add secondary tags in the stream.

Chunker (t1x)

Current Output:

^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$^punt<sent>{^.<sent>$}$^punt<sent>{^.<sent>$}$[][ ]

Output with modified input:

^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^default<default>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$ ^default<default>{^.<sent><sf:.>$}$ ^default<default>{^.<sent><sf:.>$}$[][ ]

  • It stops matching prepositions and delimiters. This is probably because the patterns were defined as "pr" instead of "pr.*". To deal with this issue we could modify transfer pattern matching so that it ignores secondary tags. This would assume that we cannot add secondary tags in t1x, which is consistent with our policy of not adding tags in dixes. This was discussed earlier in the section outlining why we need secondary tags.
  • We also ignore secondary tags in pattern matching as FSTs are order dependent and secondary tags will not be. Most likely we will let the FSTs ignore secondary tags and make a new pattern matching system just for the secondary tags.
  • The parser will be modified to detect secondary tags and access them.
  • The output system of the transfer will need to be modified such that the secondary tags are output no matter what (since they won't be mentioned in the <out> section).
  • It will also be given the ability to add secondary tags in the output.
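
The requirement that secondary tags are output even though the rule never mentions them could look roughly like this. The function transfer_word and its parameters are hypothetical and do not correspond to the actual transfer code; out_tags stands for whatever primary tags the rule's output section produces.

```python
import re

TAG_RE = re.compile(r"<([^<>]+)>")


def transfer_word(source_reading, target_lemma, out_tags):
    """Build an output word from rule-chosen tags, then carry secondary tags over.

    Secondary tags (those containing ':') are copied from the source reading
    unconditionally and are always placed last.
    """
    secondary = [tag for tag in TAG_RE.findall(source_reading) if ":" in tag]
    tags = list(out_tags) + secondary
    return target_lemma + "".join(f"<{tag}>" for tag in tags)


print(transfer_word("perro<n><m><pl><sf:perros>", "dog", ["n", "3"]))
```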

Output modified manually (Proposed output):

^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][ ]

Interchunk (t2x)

Current Output:

^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$^punt<sent>{^.<sent>$}$^punt<sent>{^.<sent>$}$[][ ]

Output with modified input:

^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][ ]

  • The order of the chunks didn't change here but it seems like it's not touching the secondary tags.
  • We will need to make a decision about whether we need secondary tags for chunks. If not, there's not much to change here.
  • If we do, then the parser will be modified to access secondary tags and add them in the stream if needed.

Postchunk (t3x)

Current Output:

^The<det><def><pl>$ ^dog<n><pl>$ ^of<pr>$ ^the<det><def><sg>$ ^boy<n><sg>$ ^run<vblex><pres>$ ^fast<adj><sint>$^.<sent>$^.<sent>$[][ ]

Output with modified input:

^The<det><def><pl><sf:Los>$ ^dog<n><pl><sf:perros>$ ^of<pr><sf:del>$ ^the<det><def><sg><sf:del>$ ^boy<n><sg><sf:chico>$ ^run<vblex><pres><sf:corren>$ ^fast<adj><sint><sf:rápido>$^.<sent><sf:.>$^.<sent><sf:.>$[][ ]

  • The role of the postchunker is to change the stream into a format the generator would accept, which it does even with secondary tags.
  • It can be modified to access and work with secondary tags as well.

Generator

Current Output:

The dogs of the boy run fast..[][ ]

Output with modified input:

#The #dog #of #the #boy #run #fast#.#.[][ ]

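  • This happens because the FSTs cannot match the lexical units to their surface forms when the extra secondary tags are present, so every word is output with the "#" mark the generator uses for forms it could not generate. To deal with this, the FSTs will ignore secondary tags in their monodix matching.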
  • The parser will once again be modified to access secondary tags and work with them. The output has no tags so nothing to change there.
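
A toy version of the generator shows both the failure and the proposed fix. TOY_GENERATOR is a hypothetical two-entry lookup table standing in for the compiled monodix, and the ignore_secondary parameter is an assumption made for illustration only.

```python
import re

# A secondary tag is any <...> tag whose body contains ':'.
SECONDARY_TAG_RE = re.compile(r"<[^<>:]+:[^<>]*>")

# Toy generation table (hypothetical stand-in for the monodix FST).
TOY_GENERATOR = {
    "dog<n><pl>": "dogs",
    "boy<n><sg>": "boy",
}


def generate(reading, ignore_secondary=True):
    """Generate a surface form, marking failures with '#' plus the lemma."""
    key = SECONDARY_TAG_RE.sub("", reading) if ignore_secondary else reading
    if key in TOY_GENERATOR:
        return TOY_GENERATOR[key]
    return "#" + key.split("<", 1)[0]


# Without ignoring secondary tags the lookup fails, as in the output above.
print(generate("dog<n><pl><sf:perros>", ignore_secondary=False))
# Ignoring them restores the expected surface form.
print(generate("dog<n><pl><sf:perros>"))
```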

Implementation

Accessing Secondary Tags

After a thorough discussion, we decided that access to secondary tags will be implemented through a flat_multimap<Tag,size_t>, where Tag is {string_view prefix, string_view value} and size_t is the position of the secondary tag in the list of secondary tags.

  • This enables us to query tags using their prefix.
  • It also preserves the position of the tags if a user should need it.

This will also be used to do pattern matching for secondary tags.
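
The idea behind this data structure can be sketched in Python, which has no flat_multimap of its own. A flat multimap is generally a multimap stored as a sorted contiguous array, so the analogue below keeps a sorted list of ((prefix, value), position) entries and finds a prefix by binary search. The names build_index and find_by_prefix and the "mark" prefix are hypothetical; the planned implementation is in C++ and may differ in detail.

```python
from bisect import bisect_left


def build_index(secondary_tags):
    """Build a sorted list of ((prefix, value), position) entries.

    secondary_tags is a list of (prefix, value) pairs in stream order, so the
    position records where each tag appeared among the secondary tags.
    """
    entries = [((prefix, value), pos) for pos, (prefix, value) in enumerate(secondary_tags)]
    entries.sort()
    return entries


def find_by_prefix(index, prefix):
    """Return every (value, position) stored under the given prefix."""
    # (prefix, "") with position -1 sorts before every real entry for this prefix.
    start = bisect_left(index, ((prefix, ""), -1))
    result = []
    for (entry_prefix, value), pos in index[start:]:
        if entry_prefix != prefix:
            break
        result.append((value, pos))
    return result


index = build_index([("case", "aa"), ("sf", "potatoes"), ("mark", "b"), ("mark", "i")])
print(find_by_prefix(index, "mark"))
print(find_by_prefix(index, "sf"))
```

Several tags may share a prefix, which is why a multimap is needed rather than a plain map, and the stored position lets a caller recover the original order.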

Outputting Secondary Tags

Outputting secondary tags will use the current system of outputting tags, since they're still of the format <..>.