Rethinking tokenisation in the pipeline
At the moment, tokenisation is done longest-match left-to-right using the morphological analyser and an alphabet defined for breaking up unknown words.

This has some advantages:
* It means that multiwords can be included in the dictionary and tokenised as a single unit.
* It combines two parts of the pipeline that have complementary information: tokenisation and morphological analysis.

However, it has some disadvantages:
* Not having a standard fixed tokenisation scheme makes using other resources harder, and makes it harder for other people to use our resources.
* Ambiguous paths need to be encoded in the lexicon (using <tt>+</tt>).
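The sketch below is a rough Python illustration of this longest-match left-to-right behaviour, not the actual implementation: a plain set of surface forms stands in for the morphological analyser (which really walks a transducer), a character set stands in for <tt><alphabet></tt>, and all names are invented for the example.

<pre>
def tokenise_lmlr(text, lexicon, alphabet):
    """Tokenise longest-match left-to-right; return (token, known) pairs."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest known form starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon:
                tokens.append((text[i:j], True))
                i = j
                break
        else:
            # No known form: take a maximal run of alphabet characters
            # (an unknown word), or failing that a single character.
            j = i
            while j < len(text) and text[j] in alphabet:
                j += 1
            j = max(j, i + 1)
            tokens.append((text[i:j], False))
            i = j
    return tokens

lexicon = {"a", "lot", "a lot", "of", "cats"}   # "a lot" is a multiword entry
alphabet = set("abcdefghijklmnopqrstuvwxyz")
print(tokenise_lmlr("a lot of cats", lexicon, alphabet))
# [('a lot', True), (' ', False), ('of', True), (' ', False), ('cats', True)]
</pre>

Note how the multiword "a lot" comes out as a single unit because the lexicon contains it as one entry, and how tokenisation and lookup happen in a single pass over the text, which is the pair of advantages listed above. (In the real pipeline, whitespace becomes blanks between tokens rather than tokens of its own.)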
Historically,
* We did not have [[apertium-separable]] or any other way of merging analyses outside of transfer.
* Unicode support was not as widely used.
== Current issues ==

* Splitting tokens that are alphabetic but contain characters not in the <tt><alphabet></tt>, e.g. <tt>^Bala$š^evi$ć</tt> (see the sketch below).
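The following self-contained snippet illustrates the issue, using an ASCII-only character class as a hypothetical stand-in for a too-narrow <tt><alphabet></tt>; splitting at every out-of-alphabet character is roughly what produces output like <tt>^Bala$š^evi$ć</tt>.

<pre>
import re

# Hypothetical, ASCII-only stand-in for an <alphabet> that lacks š and ć.
narrow = re.compile(r"[a-zA-Z]+|[^a-zA-Z]")
print(narrow.findall("Balašević"))
# ['Bala', 'š', 'evi', 'ć']  -- four fragments instead of one unknown token

# With the missing letters included, the name survives as a single token.
wider = re.compile(r"[a-zA-Zšć]+|[^a-zA-Zšć]")
print(wider.findall("Balašević"))
# ['Balašević']
</pre>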
== Thoughts ==

There should be various stages to tokenisation (see the sketch after this list):
* separating whitespace from non-whitespace, so that we no longer get output like <tt>^Bala$š^evi$ć</tt>
* morphologically analysing tokens, including those which only appear as part of a multiword
** contractions (1 token, >1 word)
** compounds (1 token, >1 word)
* retokenisation for multiwords
** (>1 token, 1 word)
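The sketch below shows one way the stages could fit together; the dictionaries, function names and exact stage boundaries are hypothetical, purely to make the ordering concrete.

<pre>
import re

# Toy resources, purely illustrative.
ANALYSES = {"don't": ["do", "not"]}      # contraction: 1 token -> >1 word
MULTIWORDS = {("a", "lot"): "a lot"}     # multiword:   >1 token -> 1 word

def stage1_whitespace(text):
    # Stage 1: only whitespace separates tokens, so nothing alphabetic
    # can be broken up here.
    return re.findall(r"\S+", text)

def stage2_analyse(tokens):
    # Stage 2: morphological analysis; contractions and compounds expand
    # a single surface token into several words.
    words = []
    for tok in tokens:
        words.extend(ANALYSES.get(tok, [tok]))
    return words

def stage3_retokenise(words):
    # Stage 3: retokenisation; merge runs of words that form one multiword.
    longest = max(len(key) for key in MULTIWORDS)
    out, i = [], 0
    while i < len(words):
        for n in range(longest, 1, -1):
            if tuple(words[i:i + n]) in MULTIWORDS:
                out.append(MULTIWORDS[tuple(words[i:i + n])])
                i += n
                break
        else:
            out.append(words[i])
            i += 1
    return out

print(stage3_retokenise(stage2_analyse(stage1_whitespace("don't worry a lot"))))
# ['do', 'not', 'worry', 'a lot']
</pre>

The point of the sketch is only the ordering of the passes: splitting (contractions, compounds) and merging (multiwords) each happen in their own well-defined place, and whitespace never needs to be revisited once the first stage has run.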
We should also be able to deal with languages that don't write spaces.