Rethinking tokenisation in the pipeline

From Apertium

Revision as of 07:12, 24 February 2023 by Unhammer (talk | contribs)


At the moment, tokenisation is done by longest-match, left-to-right lookup in the morphological analyser, with a defined alphabet used to break up unknown words.
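As a rough illustration of what longest-match left-to-right means here, the following is a minimal sketch, not the actual lttoolbox implementation. The function name tokenise, the toy lexicon, the token labels and the rule that a match must end at a non-alphabetic character are all assumptions made for the example.

```python
import string


def tokenise(text, lexicon, alphabet):
    """Toy longest-match left-to-right tokeniser (illustrative only)."""
    tokens = []
    i = 0
    max_len = max((len(entry) for entry in lexicon), default=0)
    while i < len(text):
        match = None
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            at_boundary = j == len(text) or text[j] not in alphabet
            if at_boundary and text[i:j] in lexicon:
                match = text[i:j]
                break
        if match is not None:
            # Known entry, possibly a multiword containing spaces.
            tokens.append(("known", match))
            i += len(match)
        elif text[i] in alphabet:
            # Unknown word: consume a run of alphabet characters.
            j = i
            while j < len(text) and text[j] in alphabet:
                j += 1
            tokens.append(("unknown", text[i:j]))
            i = j
        else:
            # Anything else (spaces, punctuation) is passed through.
            tokens.append(("other", text[i]))
            i += 1
    return tokens


# Hypothetical lexicon containing one multiword entry.
lexicon = {"in", "in front of", "the", "house"}
alphabet = set(string.ascii_letters)
print(tokenise("in front of the hovel", lexicon, alphabet))
```

With this toy lexicon, "in front of" comes out as a single known token rather than three, and "hovel" is delimited as an unknown word using the alphabet.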

This has some advantages:

Multiwords can be included in the dictionary and tokenised as a single unit.
It combines two parts of the pipeline that have complementary information: tokenisation and morphological analysis.

However, it also has some disadvantages:

Not having a standard, fixed tokenisation scheme makes it harder for us to use other resources, and harder for other people to use ours.
Ambiguous paths need to be encoded in the lexicon (using +).

Historically, the situation was different:

We did not have apertium-separable or any other way of merging analyses outside of transfer.
Unicode support was not as widespread.

Current issues

Tokens get split at characters that are alphabetic but not listed in the <alphabet>, e.g. ^Bala$š^evi$ć.
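The effect can be reproduced with a small sketch. This is an assumed simplification of how unknown words are delimited, not the real analyser code: the function mark_unknown and its is_letter parameter are hypothetical names, and the ASCII-only alphabet stands in for an <alphabet> that lacks š and ć.

```python
import string


def mark_unknown(text, is_letter):
    """Wrap each run of 'letters' in ^...$ and pass other characters through."""
    out = []
    i = 0
    while i < len(text):
        if is_letter(text[i]):
            j = i
            while j < len(text) and is_letter(text[j]):
                j += 1
            out.append("^" + text[i:j] + "$")
            i = j
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# Alphabet that omits š and ć (assumption for this example).
ascii_alphabet = set(string.ascii_letters)

# A fixed alphabet breaks the word at the unlisted letters.
print(mark_unknown("Balašević", ascii_alphabet.__contains__))  # ^Bala$š^evi$ć

# A Unicode-aware letter test keeps the word whole.
print(mark_unknown("Balašević", str.isalpha))  # ^Balašević$
```

The second call uses Python's str.isalpha only as one possible Unicode-aware test; which character classes a real tokeniser should treat as word-internal is a separate design decision.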

Thoughts

There should be several stages to tokenisation:

separating whitespace from non-whitespace, so that we no longer get e.g. ^Bala$š^evi$ć
morphologically analysing tokens, including those which only appear as part of a multiword
- contractions (1 token, >1 word)
- compounds (1 token, >1 word)
retokenisation for multiwords
- (>1 token, 1 word)
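The stages listed above could be sketched roughly as follows. This is only an illustration under assumptions: the function names, the dictionary-based lexicon and the multiword table are invented for the example and do not correspond to existing Apertium modules. For brevity the sketch also discards whitespace after stage 1, which a real pipeline would need to preserve.

```python
import re


def split_whitespace(text):
    """Stage 1: alternate runs of non-whitespace and whitespace."""
    return re.findall(r"\S+|\s+", text)


def analyse(token, lexicon):
    """Stage 2: look up one token; it may yield more than one word."""
    # Unknown tokens are marked with a leading asterisk.
    return lexicon.get(token, ["*" + token])


def retokenise(analysed, multiwords):
    """Stage 3: merge adjacent tokens that form a listed multiword."""
    out = []
    i = 0
    max_n = max((len(key) for key in multiwords), default=1)
    while i < len(analysed):
        # Try the longest token sequence first.
        for n in range(min(max_n, len(analysed) - i), 1, -1):
            key = tuple(surface for surface, _ in analysed[i:i + n])
            if key in multiwords:
                out.append((" ".join(key), [multiwords[key]]))
                i += n
                break
        else:
            # No multiword starts here; keep the token as analysed.
            out.append(analysed[i])
            i += 1
    return out


def pipeline(text, lexicon, multiwords):
    tokens = [t for t in split_whitespace(text) if not t.isspace()]
    analysed = [(t, analyse(t, lexicon)) for t in tokens]
    return retokenise(analysed, multiwords)


# Hypothetical toy data: one contraction and one multiword.
lexicon = {"I": ["I"], "don't": ["do", "not"], "like": ["like"],
           "New": ["new"], "York": ["York"]}
multiwords = {("New", "York"): "New York"}
print(pipeline("I don't like New York", lexicon, multiwords))
```

Here "don't" is one token yielding two words, while "New" and "York" are two tokens that are both analysed on their own and then merged into one word.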

We should also be able to deal with languages that don't write spaces.
