Difference between revisions of "Rethinking tokenisation in the pipeline"

Revision as of 11:39, 27 June 2020

At the moment, tokenisation is done by longest-match left-to-right scanning, using the morphological analyser together with an alphabet that defines where unknown words are broken up.
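
As a rough illustration of longest-match left-to-right tokenisation, here is a minimal Python sketch. It is not the actual analyser code: LEXICON, ALPHABET, and the tokenise function are hypothetical stand-ins, and the rule that a match must end at a non-alphabetic character is an assumption of the sketch.

```python
# Toy stand-ins for the morphological analyser and the alphabet (hypothetical).
LEXICON = {"in", "front", "of", "in front of", "the", "house"}
ALPHABET = set("abcdefghijklmnopqrstuvwxyz")


def tokenise(text):
    """Longest-match left-to-right tokenisation over a toy lexicon."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest lexicon entry that starts at position i.
        match = None
        for j in range(len(text), i, -1):
            candidate = text[i:j]
            # Assumption: a match must end at the end of input or before
            # a character that is not in the alphabet.
            at_boundary = j == len(text) or text[j] not in ALPHABET
            if candidate in LEXICON and at_boundary:
                match = candidate
                break
        if match is not None:
            tokens.append(("known", match))
            i += len(match)
        elif text[i] in ALPHABET:
            # Unknown word: consume a run of alphabetic characters.
            j = i
            while j < len(text) and text[j] in ALPHABET:
                j += 1
            tokens.append(("unknown", text[i:j]))
            i = j
        else:
            # Anything outside the alphabet is passed through one character at a time.
            tokens.append(("blank", text[i]))
            i += 1
    return tokens


print(tokenise("in front of the barn"))
print(tokenise("balašević"))
```

In this sketch the multiword "in front of" comes out as a single unit because it is the longest lexicon match. The second call shows how an alphabet that lacks š and ć breaks an unknown word into pieces, which is the kind of split seen in ^Bala$š^evi$ć.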

This has some advantages:

It means that multiwords can be included in the dictionary and tokenised as a single unit.
It combines two parts of the pipeline that have complementary information, tokenisation and morphological analysis.

However, it also has some disadvantages:

Not having a standard fixed tokenisation scheme makes using other resources harder, and makes it harder for other people to use our resources.
Ambiguous paths need to be encoded in the lexicon (using +).

Historically:

We did not have apertium-separable or any other way of merging analyses outside of transfer.
Unicode support was not as widely used.

Thoughts

There should be various stages to tokenisation:

separating whitespace from non-whitespace; we should no longer get output such as ^Bala$š^evi$ć
morphologically analysing tokens, including those which only appear as part of a multiword
- contractions (1 token, >1 word)
- compounds (1 token, >1 word)
retokenisation for multiwords
- (>1 token, 1 word)
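
The stages above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed implementation: ANALYSES, MULTIWORDS, and the function names are hypothetical toy data and helpers, and the leading * used to mark an unknown token is just a convention chosen for the sketch.

```python
import re

# Hypothetical toy analyser: surface token -> list of words it contains.
ANALYSES = {
    "don't": ["do", "not"],  # contraction: 1 token, >1 word
    "stand": ["stand"],
    # These tokens get an analysis even if they only matter as multiword parts.
    "in": ["in"],
    "front": ["front"],
    "of": ["of"],
}

# Hypothetical multiword entries: >1 token, 1 word.
MULTIWORDS = {("in", "front", "of"): "in front of"}
MAX_MW = max(len(key) for key in MULTIWORDS)


def split_whitespace(text):
    """Stage 1: split into alternating whitespace and non-whitespace runs."""
    # For str patterns, \s and \S are Unicode-aware in Python 3.
    return re.findall(r"\s+|\S+", text)


def analyse(token):
    """Stage 2: look the token up; unknown tokens are marked with '*'."""
    return ANALYSES.get(token, ["*" + token])


def pipeline(text):
    tokens = [chunk for chunk in split_whitespace(text) if not chunk.isspace()]
    analysed = [(token, analyse(token)) for token in tokens]

    # Stage 3: retokenise, merging the longest matching multiword first.
    out = []
    i = 0
    while i < len(analysed):
        for n in range(MAX_MW, 1, -1):
            key = tuple(token for token, _ in analysed[i:i + n])
            if len(key) == n and key in MULTIWORDS:
                out.append((" ".join(key), [MULTIWORDS[key]]))
                i += n
                break
        else:
            # No multiword starts here; keep the single analysed token.
            out.append(analysed[i])
            i += 1
    return out


for surface, words in pipeline("don't stand in front of Balašević"):
    print(surface, words)
```

Because the first stage only looks at whitespace, an unknown word such as Balašević stays in one piece regardless of any alphabet definition, and multiwords are handled in a separate step after analysis.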
