User:Khannatanmai/New Apertium stream format

From Apertium
Jump to navigation Jump to search

Here I will provide updates about the development of the new Apertium stream format, which will include an arbitrary amount of optional secondary information.

All discussions on IRC about this can be found in the discussion page of this wiki.


Rationale

To eliminate trimming, we need to modify the apertium stream format so that it can include the surface form of words as well. This would first need a formalism for a new stream format and then a modification to all the parsers in the pipeline.

However, if we are going to modify all the parsers to include the surface form in the lexical unit, in our discussion(can be found on the discussion page) we concluded that it will be a worthwhile exercise to modify the stream format such that each program can include and process an arbitrary amount of information in the apertium stream, not just the surface form. With this proposal we're trying to prepare the apertium stream for the future. Today we realised that we need the surface form in the stream, and tomorrow we might need semantic tags, sentiment tags, etc. If we don't do this now, we will have to modify all the parsers in the pipeline each time we need more information in the pipe. This is why it's a good idea to modify the parsers so that it can handle an arbitrary amount of information.

Formalism

The stream will now have primary information - all information available in the stream currently, such as lemma and analysis. It will also have optional secondary information, in a feature:value format. We discussed several possible syntax for this new stream format, and the one that seems the best is something like this:

^potato<n><pl><case:aa><sf:potatoes><other-prefix:other-value>/patata<n><f><pl><more:other>$

Note that case here refers to capitalisation, not morphological case which is already a tag and hence primary information.

  • This doesn't mess with the current stream format too much. The primary information syntax is unchanged, and not prefixed. Since secondary info is optional, this will be fully backwards compatible.
  • Secondary information tags will always be trailing.
  • The number of tags is already arbitrary so that helps.
  • The secondary tags contain a ":" that would help distinguish them from primary tags.

This is just an example, but the idea is that any program in the pipeline can add as well as read this secondary information from the stream, and in the future one can add any amount of information in the language models or the translation modules.

Instead of looking at this as modifying or extending the apertium stream format, we could also look at this as making tags more versatile by creating a new kind of tags which have a feature:value pair.

What is secondary information and why does the Apertium stream need it?

Primary vs. Secondary Information

Potential benefits

While optional secondary information in the stream sounds great, this project isn't just about abstract future benefits. As part of this project, after implementing this modification to the stream, we will experiment by including the surface form in the stream and avoid trimming, as described above. If the results are satisfactory, we can move on to other kinds of information. These could be, but aren't limited to:

  • Markup tags: If we can attach markup tags in the lexical unit, they will move around with the unit in transfer.
  • Semantic information
  • Theta roles
  • Subcategorisation info
  • Dependency
  • Capitalisation case
  • Sentiment

We would probably create a wiki page listing all types of secondary info, and the associated prefixes to be used with each of them. This list would be extendable based on the task.

Note that as part of this project, I will not be adding any secondary info to data files, such as monodix or bidix. The secondary information will only be information that programs can output in the stream and will mostly be dynamic. All current pipes will continue to work as-is, unmodified. All old data and files remain valid.

Modifications needed

The following modules will need no modification:

  • Deformatter
  • Morph Analyser: Doesn't need any modification since for now we aren't considering putting secondary info in the dix, and even if we did, it would work as-is.
  • Post-generator
  • Reformatter

Some of the other modules' parsers need to be modified for the secondary tags and all the other modules need to be modified to be able to access the secondary info in the stream.

The next section will include a detailed account of the current stream input/output for each module, and what modifications are needed, if any.

Apertium stream at each module

INPUT: Los perros del chico corren rápido..

Morph Analyser

Output:

^Los/El<det><def><m><pl>/Prpers<prn><pro><p3><m><pl>$ ^perros/perro<n><m><pl>$ ^del/de<pr>+el<det><def><m><sg>$ ^chico/chico<n><m><sg>$ ^corren/correr<vblex><pri><p3><pl>$ ^rápido/rápido<adj><m><sg>$^./.<sent>$^./.<sent>$[][ ]

The Morph Analyser takes the surface form of words as input and using the monodix, outputs the surface form, lemma and it's analysis. Since (for now) we aren't planning to put secondary tags in the dix, we don't need to modify the morph analyser for this project.

POS Tagger

Current Output:

^El<det><def><m><pl>$ ^perro<n><m><pl>$ ^de<pr>+el<det><def><m><sg>$ ^chico<n><m><sg>$ ^correr<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>$^.<sent>$^.<sent>$[][ ]

Proposed Output:

^El<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>$ ^de<pr><sf:del>+el<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>$ ^.<sent><sf:.>$ ^.<sent><sf:.>$[][ ]

  • Here I have manually added secondary tags (surface forms of the words), and for compounds I have added the surface form on both the parts.
  • Will need to modify code such that it can add trailing secondary tags (surface forms, markup tags, etc.)
  • Parser needs no modification

Pre transfer

Original Output:

^El<det><def><m><pl>$ ^perro<n><m><pl>$ ^de<pr>$ ^el<det><def><m><sg>$ ^chico<n><m><sg>$ ^correr<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>$^.<sent>$^.<sent>$[][ ]

Output with modified input (i.e. it gave this output when I gave the modified POS tagger output):

^El<det><def><m><pl><sf:Los>$ ^perro<n><m><pl><sf:perros>$ ^de<pr><sf:del>$ ^el<det><def><m><sg><sf:del>$ ^chico<n><m><sg><sf:chico>$ ^correr<vblex><pri><p3><pl><sf:corren>$ ^rápido<adj><m><sg><sf:rápido>$ ^.<sent><sf:.>$ ^.<sent><sf:.>$[][ ]

  • It doesn't seem like the parser needs any modifications. Works like it's supposed to.
  • We could modify the code so that it can add and access secondary tags, but this can be discussed, as it doesn't seem like it really needs it.

Bidix Lookup

^El<det><def><m><pl>/The<det><def><m><pl>$ ^perro<n><m><pl>/dog<n><m><pl>$ ^de<pr>/of<pr>/from<pr>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^chico<n><m><sg>/boy<n><sg>$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$[][
]

Lexical Selection

^El<det><def><m><pl>/The<det><def><m><pl>$ ^perro<n><m><pl>/dog<n><m><pl>$ ^de<pr>/of<pr>/from<pr>$ ^el<det><def><m><sg>/the<det><def><m><sg>$ ^chico<n><m><sg>/boy<n><sg>$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>$^.<sent>/.<sent>$^.<sent>/.<sent>$[][
]

Anaphora Resolution

^El<det><def><m><pl>/The<det><def><m><pl>/$ ^perro<n><m><pl>/dog<n><m><pl>/$ ^de<pr>/of<pr>/from<pr>/$ ^el<det><def><m><sg>/the<det><def><m><sg>/$ ^chico<n><m><sg>/boy<n><sg>/$ ^correr<vblex><pri><p3><pl>/run<vblex><pri><p3><pl>/$ ^rápido<adj><m><sg>/fast<adj><sint><m><sg>/$^.<sent>/.<sent>/$^.<sent>/.<sent>/$[][
]

Chunker (t1x)

^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$^punt<sent>{^.<sent>$}$^punt<sent>{^.<sent>$}$[][
]

Interchunk (t2x)

^Det_nom<SN><m><pl>{^the<det><def><3>$ ^dog<n><3>$}$ ^de<PREP>{^of<pr>$}$ ^det_nom<SN><m><sg>{^the<det><def><3>$ ^boy<n><3>$}$ ^verbcj<SV><vblex><pri><p3><pl>{^run<vblex><pres>$}$ ^adj<SA><m><sg>{^fast<adj><sint>$}$^punt<sent>{^.<sent>$}$^punt<sent>{^.<sent>$}$[][
]

Postchunk (t3x)

^The<det><def><pl>$ ^dog<n><pl>$ ^of<pr>$ ^the<det><def><sg>$ ^boy<n><sg>$ ^run<vblex><pres>$ ^fast<adj><sint>$^.<sent>$^.<sent>$[][
]

Generator

The dogs of the boy run fast..[][
]