User:Khannatanmai/New Apertium stream format

Here I will provide updates about the development of the new Apertium stream format, which will include an arbitrary amount of optional secondary information.

All discussions on IRC about this can be found on the discussion page of this wiki.


Rationale

To eliminate trimming, we need to modify the Apertium stream format so that it can also carry the surface form of words. This first requires a formalism for the new stream format, and then modifications to all the parsers in the pipeline.

However, if we are going to modify all the parsers to include the surface form in the lexical unit, it is worth going further. In our discussion (see the discussion page) we concluded that the stream format should be modified so that each program can include and process an arbitrary amount of information in the Apertium stream, not just the surface form. With this proposal we are trying to prepare the Apertium stream for the future. Today we realised that we need the surface form in the stream; tomorrow we might need semantic tags, sentiment tags, etc. If we don't do this now, we will have to modify all the parsers in the pipeline each time we need more information in the pipe. This is why it is a good idea to modify the parsers so that they can handle an arbitrary amount of information.

Formalism

The stream will now have primary information: all the information currently available in the stream, such as lemma and analysis. It will also have optional secondary information, in a feature:value format. We discussed several possible syntaxes for this new stream format, and the one that seems best looks like this:

^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>/patata<n><f><pl><more:other>$

Note that case here refers to capitalisation, not morphological case, which is already a tag and hence primary information.

  • This doesn't mess with the current stream format too much. The primary information syntax is unchanged, and not prefixed. Since secondary info is optional, this will be fully backwards compatible.
  • Secondary information tags will always be trailing.
  • The number of tags in a lexical unit is already arbitrary, so that helps.
  • Secondary tags contain a ":", which helps distinguish them from primary tags.

This is just an example, but the idea is that any program in the pipeline can add as well as read this secondary information from the stream, and in the future one can add any amount of information in the language models or the translation modules.

Instead of looking at this as modifying or extending the Apertium stream format, we could also look at it as making tags more versatile by creating a new kind of tag that holds a feature:value pair.
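To make the primary/secondary distinction concrete, here is a minimal Python sketch of how a program might read a lexical unit in the proposed syntax. The function names `split_reading` and `parse_lexical_unit` are hypothetical and not part of any Apertium tool. The sketch assumes that any tag containing ":" is secondary, and it ignores escaped characters, so it is an illustration rather than a real stream parser.

```python
import re

# Matches one <...> tag and captures its content.
TAG_RE = re.compile(r"<([^<>]*)>")


def split_reading(reading):
    """Split one reading, e.g. 'potato<n><pl><sf:potatoes>', into
    (lemma, primary_tags, secondary_info)."""
    first_tag = reading.find("<")
    lemma = reading if first_tag == -1 else reading[:first_tag]
    primary, secondary = [], {}
    for tag in TAG_RE.findall(reading):
        if ":" in tag:
            # Assumption: a ":" marks a secondary feature:value tag.
            feature, value = tag.split(":", 1)
            secondary[feature] = value
        else:
            primary.append(tag)
    return lemma, primary, secondary


def parse_lexical_unit(unit):
    """Parse '^reading/reading$' into a list of parsed readings."""
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("not a lexical unit: " + unit)
    return [split_reading(r) for r in body[1:-1].split("/")]
```

Applied to the example unit above, the first reading would come back with lemma `potato`, primary tags `n` and `pl`, and a secondary dictionary holding `case`, `sf` and `other-prefix`. A unit with no secondary tags yields an empty dictionary, which is the backwards-compatible case.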

What is secondary information and why does the Apertium stream need it?

Primary vs. Secondary Information

Pattern Matching in FSTs

Potential benefits

While optional secondary information in the stream sounds great, this project isn't just about abstract future benefits. As part of this project, after implementing this modification to the stream, we will experiment by including the surface form in the stream and avoiding trimming, as described above. If the results are satisfactory, we can move on to other kinds of information. These could be, but aren't limited to:

  • Markup tags: If we can attach markup tags in the lexical unit, they will move around with the unit in transfer.
  • Semantic information
  • Theta roles
  • Subcategorisation info
  • Dependency
  • Capitalisation case
  • Sentiment

We would probably create a wiki page listing all types of secondary info, and the associated prefixes to be used with each of them. This list would be extendable based on the task.

Note that as part of this project, I will not be adding any secondary info to data files, such as monodix or bidix. The secondary information will only be information that programs can output in the stream and will mostly be dynamic. All current pipes will continue to work as-is, unmodified. All old data and files remain valid.

Modifications needed

The following modules will need no modification:

  • Deformatter
  • Morph Analyser: Doesn't need any modification since for now we aren't considering putting secondary info in the dix, and even if we did, it would work as-is.
  • Pre-transfer
  • Post-chunk
  • Post-generator
  • Reformatter

Of the remaining modules, some need their parsers modified to recognise secondary tags, and all of them need to be modified so that they can access the secondary info in the stream.

The next section will include a detailed account of the current stream input/output for each module, and what modifications are needed, if any.

Apertium stream at each module

INPUT: Los perros del chico corren rápido..

Morph Analyser

Output:

^Los/El<det><def><m><pl>/Prpers<prn><pro><p3><m><pl>$ ^perros/perro<n><m><pl>$ ^del/de<pr>+el<det><def><m><sg>$ ^chico/chico<n><m><sg>$ ^corren/correr<vblex><pri><p3><pl>$ ^rápido/rápido<adj><m><sg>$^./.<sent>$^./.<sent>$[][ ]

The Morph Analyser takes the surface form of words as input and, using the monodix, outputs the surface form, the lemma and its analysis. Since (for now) we aren't planning to put secondary tags in the dix, we don't need to modify the morph analyser for this project.

POS Tagger

Current Output:

^El<det><def><m><pl>$ ^perro<n><m><pl>$ ^de<pr>+el<det><def><m><sg>$ ^chico<n><m><sg>$ ^correr<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>$^.<sent>$^.<sent>$[][ ]

Proposed Output:

^El<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>$ ^de<pr><sf:del>+el<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>$ ^.<sent><sf:.>$ ^.<sent><sf:.>$[][ ]

  • Here I have manually added secondary tags (surface forms of the words), and for compounds I have added the surface form to both parts.
  • The code will need to be modified so that it can add trailing secondary tags (surface forms, markup tags, etc.).
  • The parser needs no modification.
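As a rough illustration of the first two points, the following Python sketch rebuilds the proposed tagger output from the analyser output. The names `add_surface_form`, `tag_with_surface` and the `choose` callback are hypothetical. `choose` stands in for the tagger's real disambiguation and, purely for illustration, picks the first reading by default. Escaped characters and lemmas that themselves contain "+" are not handled.

```python
import re

# One analysed unit: ^surface/reading1/reading2$
UNIT_RE = re.compile(r"\^([^/$]*)/([^$]*)\$")


def add_surface_form(surface, analysis):
    """Append <sf:surface> to every part of a (possibly compound) analysis."""
    return "+".join(f"{part}<sf:{surface}>" for part in analysis.split("+"))


def tag_with_surface(analysed, choose=lambda readings: readings[0]):
    """Replace each analysed unit with the chosen reading plus its surface form.

    `choose` is a placeholder for real disambiguation (assumption)."""
    def repl(match):
        surface = match.group(1)
        readings = match.group(2).split("/")
        return "^" + add_surface_form(surface, choose(readings)) + "$"

    return UNIT_RE.sub(repl, analysed)
```

Given `^del/de<pr>+el<det><def><m><sg>$`, this produces `^de<pr><sf:del>+el<det><def><m><sg><sf:del>$`, which matches the compound handling in the proposed output above.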

Pre transfer

Current Output:

^El<det><def><m><pl>$ ^perro<n><m><pl>$ ^de<pr>$ ^el<det><def><m><sg>$ ^chico<n><m><sg>$ ^correr<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>$^.<sent>$^.<sent>$[][ ]

Output with modified input (i.e. the output it gave when fed the modified POS tagger output):

^El<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>$ ^de<pr><sf:del>$ ^el<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>$ ^.<sent><sf:.>$ ^.<sent><sf:.>$[][ ]

  • The parser doesn't seem to need any modifications; it works as it's supposed to.
  • We could modify the code so that it can add and access secondary tags, but this can be discussed, as it doesn't seem like it really needs it.
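The behaviour observed here follows from the fact that splitting a compound on "+" leaves each part's trailing tags attached. The Python sketch below imitates only that one step. `split_compounds` is a hypothetical name, and the real pre-transfer does more than this (for example, it handles multiword lemmas), so this is an assumption-laden toy rather than a reimplementation.

```python
import re

# One disambiguated unit: ^reading$ (possibly reading+reading)
UNIT_RE = re.compile(r"\^([^$]*)\$")


def split_compounds(stream):
    """Split every '+'-joined unit into separate lexical units.

    Secondary tags travel with the part they trail, so nothing
    special is needed to preserve them."""
    def repl(match):
        return " ".join(f"^{part}$" for part in match.group(1).split("+"))

    return UNIT_RE.sub(repl, stream)
```

One caveat this makes visible: a secondary value that itself contained "+" (or "/", "<", "$") would be split or misparsed, so such characters would presumably need escaping in secondary values.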

Bidix Lookup

Current Output:

^El<det><def><m><pl>/The<det><def><m><pl>$ ^perro<n><m><pl>/dog<n><m><pl>$ ^de<pr>/of<pr>/from<pr>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^chico<n><m><sg>/boy<n><sg>$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$[][ ]

Output with modified input:

^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>$ ^.<sent><sf:.>/.<sent><sf:.>$ ^.<sent><sf:.>/.<sent><sf:.>$[][ ]

  • Biltrans does what it should do: it copies the secondary tags to the TL side.
  • The parser needs to be modified to be able to recognise secondary tags and access them.
  • It should be given the ability to add secondary tags to the stream (this might be needed based on bidix information).
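A parser that recognises secondary tags could look up only the primary part of a reading and then copy the secondary tags onto each translation. The Python sketch below shows that idea with a toy dictionary. `split_secondary`, `lookup_with_secondary` and the `bidix` mapping are all hypothetical. The real bidix lookup is a transducer, not a dict of whole readings, and the "@" fallback mirrors the marker Apertium uses for words missing from the bidix.

```python
import re

# A secondary tag is any <feature:value> tag (assumption: ":" marks it).
SECONDARY_RE = re.compile(r"<[^<>:]+:[^<>]*>")


def split_secondary(reading):
    """Return (reading without secondary tags, concatenated secondary tags)."""
    secondary = "".join(SECONDARY_RE.findall(reading))
    return SECONDARY_RE.sub("", reading), secondary


def lookup_with_secondary(sl_reading, bidix):
    """Look up the primary part only, then copy secondary tags to the TL side."""
    primary, secondary = split_secondary(sl_reading)
    translations = bidix.get(primary, ["@" + primary])
    tl_side = [t + secondary for t in translations]
    return "^" + "/".join([sl_reading] + tl_side) + "$"
```

With a toy entry mapping `de<pr>` to `of<pr>` and `from<pr>`, the reading `de<pr><sf:del>` yields `^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>$`, the same shape as the output shown above.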

Lexical Selection

Current Output:

^El<det><def><m><pl>/The<det><def><m><pl>$ ^perro<n><m><pl>/dog<n><m><pl>$ ^de<pr>/of<pr>/from<pr>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^chico<n><m><sg>/boy<n><sg>$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$[][ ]

Output with modified input:

^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>$ ^.<sent><sf:.>/.<sent><sf:.>$ ^.<sent><sf:.>/.<sent><sf:.>$[][ ]

  • Lexical selection doesn't seem to have been used here; this needs further experimentation.
  • The parser needs to be modified to be able to recognise secondary tags and access them.
  • It should be given the ability to add secondary tags in the stream.

Anaphora Resolution

Current Output:

^El<det><def><m><pl>/The<det><def><m><pl>/$ ^perro<n><m><pl>/dog<n><m><pl>/$ ^de<pr>/of<pr>/from<pr>/$ ^el<det><def><m><sg>/the<det><def><m><sg>/$ ^chico<n><m><sg>/boy<n><sg>/$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>/$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>/$^.<sent>/.<sent>/$^.<sent>/.<sent>/$[][ ]

Output with modified input:

^El<det><def><m><pl><sf:Los>/The<det><def><m><pl><sf:Los>/$ ^perro<n><m><pl><sf:perros>/dog<n><m><pl><sf:perros>/$ ^de<pr><sf:del>/of<pr><sf:del>/from<pr><sf:del>/$ ^el<det><def><m><sg><sf:del>/the<det><def><m><sg><sf:del>/$ ^chico<n><m><sg><sf:chico>/boy<n><sg><sf:chico>/$ ^correr<vblex><pri><p3><pl><sf:corren>/run<vblex><pri><p3><pl><sf:corren>/$ ^rápido<adj><m><sg><sf:rápido>/fast<adj><sint><m><sg><sf:rápido>/$ ^.<sent><sf:.>/.<sent><sf:.>/$ ^.<sent><sf:.>/.<sent><sf:.>/$[][ ]

  • Anaphora Resolution wasn't used here but the idea is clear.
  • Code can be modified to put anaphora info as secondary tags instead of another separator.
  • The parser needs to be modified to be able to recognise secondary tags and access them.
  • It should be given the ability to add secondary tags in the stream.

Chunker (t1x)

Current Output:

^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$^punt<sent>{^.<sent>$}$^punt<sent>{^.<sent>$}$[][ ]

Output with modified input:

^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^default<default>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$ ^default<default>{^.<sent><sf:.>$}$ ^default<default>{^.<sent><sf:.>$}$[][ ]

  • It stops matching prepositions and delimiters. This is probably because the patterns were defined as "pr" instead of "pr.*". To deal with this issue we could modify transfer pattern matching so that it ignores secondary tags. This would assume we cannot add secondary tags in t1x, and is consistent with our policy of not adding tags in dixes. This was discussed earlier, in the section outlining why we need secondary tags.
  • Another reason to ignore secondary tags in pattern matching is that FSTs are order-dependent, whereas secondary tags will not be. Most likely we will let the FSTs ignore secondary tags and make a new pattern matching system just for the secondary tags.
  • The parser will be modified to detect secondary tags and access them.
  • The output system of the transfer will need to be modified such that the secondary tags are output no matter what (since they won't be mentioned in the <out> section).
  • It will also be given the ability to add secondary tags in the output.
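The matching problem and the proposed fix can be sketched in a few lines of Python. This is a simplified stand-in for t1x category matching, not the real algorithm: patterns are dot-separated tag names, and only a trailing "*" (accepting any remaining tags) is supported. The names `primary_tags`, `matches` and `secondary_suffix` are hypothetical.

```python
import re

TAG_RE = re.compile(r"<([^<>]*)>")
SECONDARY_RE = re.compile(r"<[^<>:]+:[^<>]*>")


def primary_tags(reading):
    """Tags of a reading with secondary (feature:value) tags left out."""
    return [t for t in TAG_RE.findall(reading) if ":" not in t]


def matches(pattern, reading):
    """Match a simplified tag pattern such as 'pr' or 'n.*' against the
    primary tags only, so that secondary tags cannot break the match."""
    wanted = pattern.split(".")
    have = primary_tags(reading)
    if wanted[-1] == "*":
        prefix = wanted[:-1]
        return have[:len(prefix)] == prefix
    return have == wanted


def secondary_suffix(reading):
    """Secondary tags of a reading, ready to be re-attached on output."""
    return "".join(SECONDARY_RE.findall(reading))
```

With this approach `matches("pr", "of<pr><sf:del>")` succeeds even though the pattern is "pr" rather than "pr.*", and `secondary_suffix` gives the output stage something to append regardless of what the rule's output section mentions.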

Output modified manually (Proposed output):

^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][ ]

Interchunk (t2x)

Current Output:

^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$^punt<sent>{^.<sent>$}$^punt<sent>{^.<sent>$}$[][ ]

Output with modified input:

^Det_nom<SN><m><pl>{^the<det><def><3><sf:Los>$ ^dog<n><3><sf:perros>$}$ ^de<PREP>{^of<pr><sf:del>$}$ ^det_nom<SN><m><sg>{^the<det><def><3><sf:del>$ ^boy<n><3><sf:chico>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres><sf:corren>$}$ ^adj<SA><m><sg>{^fast<adj><sint><sf:rápido>$}$^punt<sent>{^.<sent><sf:.>$}$^punt<sent>{^.<sent><sf:.>$}$[][ ]

  • The order of the chunks didn't change here, but interchunk doesn't seem to touch the secondary tags.
  • We will need to make a decision about whether we need secondary tags for chunks. If not, there's not much to change here.
  • If we do, then the parser will be modified to access secondary tags and add them in the stream if needed.

Postchunk (t3x)

Current Output:

^The<det><def><pl>$ ^dog<n><pl>$ ^of<pr>$ ^the<det><def><sg>$ ^boy<n><sg>$ ^run<vblex><pres>$ ^fast<adj><sint>$^.<sent>$^.<sent>$[][ ]

Output with modified input:

^The<det><def><pl><sf:Los>$ ^dog<n><pl><sf:perros>$ ^of<pr><sf:del>$ ^the<det><def><sg><sf:del>$ ^boy<n><sg><sf:chico>$ ^run<vblex><pres><sf:corren>$ ^fast<adj><sint><sf:rápido>$^.<sent><sf:.>$^.<sent><sf:.>$[][ ]

  • The role of the postchunker is to change the stream into a format the generator would accept, which it does even with secondary tags.
  • It doesn't seem like we need to give a postchunker access to secondary tags, so it doesn't need any modification.

Generator

The dogs of the boy run fast..[][
]