Rethinking tokenisation in the pipeline
At the moment, tokenisation is done longest-match left-to-right using the morphological analyser, together with an alphabet defined for breaking up unknown words.
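To make the current behaviour concrete, here is a minimal Python sketch of longest-match left-to-right tokenisation. The lexicon, alphabet, and function names are toy stand-ins, not the analyser's actual code or data structures:

```python
# Toy sketch of longest-match left-to-right tokenisation. A real analyser
# works over a transducer; this dictionary-and-loop version only mirrors
# the observable behaviour, including the splitting of unknown letters
# that fall outside the alphabet.

LEXICON = {"take", "take out", "out"}          # known forms, incl. a multiword
ALPHABET = set("abcdefghijklmnopqrstuvwxyz")   # letters used to group unknowns

def tokenise_lmlr(text):
    """Repeatedly take the longest lexicon match at the current position;
    fall back to maximal runs of alphabet characters for unknown words."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        match = None
        for j in range(len(text), i, -1):   # longest candidate first
            if text[i:j] in LEXICON:
                match = text[i:j]
                break
        if match:
            tokens.append((match, "known"))
            i += len(match)
        else:
            j = i
            while j < len(text) and text[j] in ALPHABET:
                j += 1
            if j == i:
                j = i + 1   # a letter outside the alphabet becomes its own token
            tokens.append((text[i:j], "unknown"))
            i = j
    return tokens

print(tokenise_lmlr("balašević take out"))
# -> [('bala', 'unknown'), ('š', 'unknown'), ('evi', 'unknown'),
#     ('ć', 'unknown'), ('take out', 'known')]
```

Note how the multiword "take out" comes through as one unit, while the unknown name is shattered around the characters missing from the alphabet, exactly the ^Bala$š^evi$ć problem discussed below.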
This has some advantages:
- It means that multiwords can be included in the dictionary and tokenised as a single unit.
- It combines two parts of the pipeline that have complementary information: tokenisation and morphological analysis.
However, it has some disadvantages:
- Not having a standard, fixed tokenisation scheme makes it harder for us to use other resources, and harder for other people to use ours.
- Ambiguous tokenisation paths need to be encoded in the lexicon (using +; see the sketch after this list).
- When this approach was designed, apertium-separable did not exist, and there was no other way of merging analyses outside of transfer.
- Unicode support was not as widespread as it is now.
- Tokens that are alphabetic but contain characters outside the <alphabet> get split, e.g. Balašević becomes ^Bala$š^evi$ć.
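As an illustration of the + encoding, the snippet below shows how a contraction such as Spanish "del" carries two joined lexical units in a single analysis in the stream format. The splitter function is a toy for illustration, not an Apertium tool, and the exact tags may differ between language pairs:

```python
# One token, two words: the Spanish contraction "del" analyses as the
# preposition "de" joined with the article "el" via '+'.

analysis = "^del/de<pr>+el<det><def><m><sg>$"

def split_joined(lexical_unit):
    """Split a stream-format lexical unit into its surface form and the
    '+'-joined readings of its analysis."""
    surface, reading = lexical_unit.strip("^$").split("/", 1)
    return surface, reading.split("+")

print(split_joined(analysis))
# -> ('del', ['de<pr>', 'el<det><def><m><sg>'])
```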
There should be several stages to tokenisation (a sketch follows the list):
- separating whitespace from non-whitespace, so that we never again get e.g. ^Bala$š^evi$ć
- morphologically analysing tokens, including those which only appear as part of a multiword
  - contractions (1 token, >1 word)
  - compounds (1 token, >1 word)
- retokenisation for multiwords (>1 token, 1 word)
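Here is a minimal sketch of how these stages could compose. All names, data structures, and analyses below are toy assumptions for illustration, not an existing design:

```python
# Stage boundaries follow the list above: whitespace separation, then
# morphological analysis (where contractions and compounds expand one
# token into several words), then retokenisation of multiwords.

ANALYSES = {
    "don't": ["do<vblex>", "not<adv>"],   # contraction: 1 token, >1 word
    "take": ["take<vblex>"],
    "out": ["out<adv>"],
}
MULTIWORDS = {("take", "out"): "take out<vblex>"}   # >1 token, 1 word

def stage1_whitespace(text):
    """Separate whitespace from non-whitespace; nothing else is split here,
    so letters outside any alphabet stay inside their token."""
    return text.split()

def stage2_analyse(tokens):
    """Analyse each token; contractions (and, analogously, compounds)
    expand into more than one word."""
    return [(tok, ANALYSES.get(tok, [tok + "<unknown>"])) for tok in tokens]

def stage3_retokenise(analysed):
    """Merge adjacent tokens that together form a single multiword."""
    out, i = [], 0
    while i < len(analysed):
        pair = tuple(tok for tok, _ in analysed[i:i + 2])
        if pair in MULTIWORDS:
            out.append((" ".join(pair), [MULTIWORDS[pair]]))
            i += 2
        else:
            out.append(analysed[i])
            i += 1
    return out

tokens = stage1_whitespace("don't take out")
print(stage3_retokenise(stage2_analyse(tokens)))
# -> [("don't", ['do<vblex>', 'not<adv>']), ('take out', ['take out<vblex>'])]
```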
We should also be able to deal with languages that don't write spaces between words.
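For such scripts, one baseline is dictionary-driven maximal match over the character stream, as sketched below with a toy lexicon. A real implementation would come from the analyser itself and would need to search over ambiguous segmentations rather than committing greedily:

```python
# Greedy longest-match segmentation of a spaceless string; the lexicon
# here is a toy assumption.

LEXICON = {"我", "喜欢", "吃", "苹果"}

def segment(text):
    """Take the longest lexicon match at each position; an unknown
    character falls through as a token of its own."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in LEXICON:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

print(segment("我喜欢吃苹果"))
# -> ['我', '喜欢', '吃', '苹果']
```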