Rethinking tokenisation in the pipeline

From Apertium

At the moment, tokenisation is done by longest-match, left-to-right lookup in the morphological analyser, together with an alphabet that is defined for breaking up unknown words.
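As a rough illustration of longest-match, left-to-right tokenisation, here is a minimal sketch with a toy lexicon and alphabet. The function name, the data and the boundary rule (a match must not be followed by an alphabet character) are assumptions made for the example; this is not how lt-proc is actually implemented.

```python
def lrlm_tokenise(text, lexicon, alphabet):
    """Toy longest-match, left-to-right tokeniser.

    lexicon  -- set of known surface forms; entries may contain spaces
                (multiwords)
    alphabet -- set of characters allowed inside an unknown word
    """
    tokens = []
    max_len = max((len(entry) for entry in lexicon), default=0)
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        match = None
        # Try the longest candidate first, then shorter ones.
        for end in range(min(len(text), i + max_len), i, -1):
            candidate = text[i:end]
            at_boundary = end == len(text) or text[end] not in alphabet
            if candidate in lexicon and at_boundary:
                match = candidate
                break
        if match is None:
            # Unknown word: consume a run of alphabet characters,
            # or a single character if it is not in the alphabet.
            end = i
            while end < len(text) and text[end] in alphabet:
                end += 1
            match = text[i:max(end, i + 1)]
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy data, for illustration only.
lexicon = {"in", "in front of", "the", "house"}
alphabet = set("abcdefghijklmnopqrstuvwxyz")
print(lrlm_tokenise("in front of the house", lexicon, alphabet))
# ['in front of', 'the', 'house']
```

Because the longest entry wins, "in front of" comes out as one unit even though "in" is also in the lexicon.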

This has some advantages:

  • It means that multiwords can be included in the dictionary and tokenised as a single unit.
  • It combines two parts of the pipeline that have complementary information: tokenisation and morphological analysis.

However, it also has some disadvantages:

  • Not having a standard fixed tokenisation scheme makes using other resources harder, and makes it harder for other people to use our resources.
  • Ambiguous paths need to be encoded in the lexicon (using +)

Historically:

  • We did not have apertium-separable or any other way of merging analyses outside of transfer.
  • Unicode support was not as widespread.

Current issues

  • Tokens containing characters that are alphabetic but not listed in the <alphabet> get split, e.g. ^Bala$š^evi$ć.
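The split above can be reproduced with a small sketch. It wraps each run of "word characters" in ^...$ and passes everything else through, mirroring the notation used on this page rather than the full stream format. The function name and the ASCII-only alphabet are assumptions for illustration; the second call, which uses Python's Unicode-aware str.isalpha, is just one possible alternative test and not something the current pipeline does.

```python
import string


def wrap_word_runs(text, is_word_char):
    """Wrap each maximal run of word characters in ^...$.

    Characters for which is_word_char returns False are copied
    through unchanged, outside any token.
    """
    out = []
    run = []
    for ch in text:
        if is_word_char(ch):
            run.append(ch)
        else:
            if run:
                out.append("^" + "".join(run) + "$")
                run = []
            out.append(ch)
    if run:
        out.append("^" + "".join(run) + "$")
    return "".join(out)


# Hypothetical alphabet that lacks the letters with diacritics.
ascii_alphabet = set(string.ascii_letters)

print(wrap_word_runs("Balašević", lambda ch: ch in ascii_alphabet))
# ^Bala$š^evi$ć
print(wrap_word_runs("Balašević", str.isalpha))
# ^Balašević$
```

The second result assumes the input uses precomposed characters; combining marks in decomposed text are not alphabetic according to str.isalpha.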

Thoughts

Tokenisation should happen in several stages:

  • separating whitespace from non-whitespace, so that we no longer get splits such as ^Bala$š^evi$ć
  • morphologically analysing tokens, including those which only appear as part of a multiword
    • contractions (1 token, >1 word)
    • compounds (1 token, >1 word)
  • retokenisation for multiwords
    • (>1 token, 1 word)
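The stages listed above could be sketched roughly as follows. All names and data here are hypothetical: the "analyser" is a plain dictionary from surface form to a list of words, the multiword table maps tuples of tokens to a single word, and punctuation is ignored. It is only meant to show the shape of the idea, not a proposed implementation.

```python
def split_on_whitespace(text):
    """Stage 1: separate whitespace from non-whitespace."""
    return text.split()


def analyse(tokens, analyser):
    """Stage 2: map each token to one or more words.

    A contraction or compound is one token with more than one word.
    Unknown tokens are kept as a single word.
    """
    return [(tok, analyser.get(tok, [tok])) for tok in tokens]


def merge_multiwords(analysed, multiwords):
    """Stage 3: merge runs of tokens that form a known multiword.

    Longer multiwords are tried before shorter ones.
    """
    max_len = max((len(key) for key in multiwords), default=1)
    merged = []
    i = 0
    while i < len(analysed):
        for n in range(min(max_len, len(analysed) - i), 1, -1):
            surfaces = tuple(surface for surface, _ in analysed[i:i + n])
            if surfaces in multiwords:
                merged.append((" ".join(surfaces), [multiwords[surfaces]]))
                i += n
                break
        else:
            # No multiword starts here: keep the token as it is.
            merged.append(analysed[i])
            i += 1
    return merged


# Hypothetical toy data, for illustration only.
analyser = {"don't": ["do", "not"]}
multiwords = {("in", "front", "of"): "in front of"}

tokens = split_on_whitespace("I don't stand in front of it")
result = merge_multiwords(analyse(tokens, analyser), multiwords)
for surface, words in result:
    print(surface, words)
```

In this sketch "don't" stays one token with two words, while "in", "front" and "of" are three tokens that end up as one word.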

We should also be able to deal with languages that don't write spaces.