Rethinking tokenisation in the pipeline
At the moment, tokenisation is done longest-match left-to-right, using the morphological analyser together with an alphabet defined for breaking up unknown words (a sketch of this behaviour follows the lists below).
This has some advantages:
- It means that multiwords can be included in the dictionary and tokenised as a single unit.
- It combines two parts of the pipeline that have complementary information: tokenisation and morphological analysis.
However, it has some disadvantages:
- Not having a standard fixed tokenisation scheme makes using other resources harder, and makes it harder for other people to use our resources.
- Ambiguous tokenisation paths need to be encoded in the lexicon (using +); see the example right after this list.
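For example, lttoolbox joins several lexical units inside one token with + (produced by <j/> in the dictionary). The analysis below is illustrative only, with the tags abbreviated, for a Spanish imperative with enclitic pronouns:

 ^dímelo/decir<vblex><imp><p2><sg>+me<prn><enc>+lo<prn><enc>$

Every such path has to be spelled out as a dictionary entry, so the tokenisation decision is baked into the lexicon rather than made elsewhere in the pipeline.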
Historically:
- We did not have apertium-separable or any other way of merging analyses outside of transfer.
- Unicode support was not as widely used.
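To make the current behaviour concrete, here is a minimal sketch of longest-match left-to-right tokenisation with alphabet-based splitting of unknown words. It is a toy illustration in Python, not the actual implementation (which in Apertium is part of the finite-state processing in lttoolbox); LEXICON and ALPHABET are hypothetical stand-ins for the analyser's lexicon and the <alphabet> definition.

 # Toy lexicon: note the multiword entry "take out".
 LEXICON = {"take", "took", "out", "take out", "the", "rubbish"}
 ALPHABET = set("abcdefghijklmnopqrstuvwxyz")

 def tokenise(text):
     tokens = []
     i = 0
     while i < len(text):
         if text[i].isspace():
             i += 1
             continue
         # Longest lexicon match starting at position i wins.
         for j in range(len(text), i, -1):
             if text[i:j] in LEXICON:
                 tokens.append(text[i:j])
                 i = j
                 break
         else:
             # Unknown word: consume a maximal run of <alphabet> characters.
             j = i + 1
             while j < len(text) and text[j] in ALPHABET:
                 j += 1
             tokens.append(text[i:j])
             i = j
     return tokens

 print(tokenise("take out the rubbish"))
 # ['take out', 'the', 'rubbish'] -- the multiword beats 'take' by length

Because "take out" is longer than "take", the multiword wins: this is the advantage listed above, and also where the ambiguity problems come from.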
Current issues
- Splitting tokens containing characters that are alphabetic but missing from the <alphabet>, e.g. Balašević comes out as ^Bala$š^evi$ć (see the sketch below).
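A hypothetical Python sketch of this failure mode, assuming an <alphabet> restricted to ASCII letters:

 # If š and ć are missing from the alphabet, the unknown-word
 # splitter breaks "Balašević" apart at every character it
 # does not recognise.
 ALPHABET = set("abcdefghijklmnopqrstuvwxyz") | set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

 def split_unknown(word):
     pieces, current = [], ""
     for ch in word:
         if ch in ALPHABET:
             current += ch
         else:
             if current:
                 pieces.append(current)
             pieces.append(ch)  # non-alphabet character becomes its own piece
             current = ""
     if current:
         pieces.append(current)
     return pieces

 print(split_unknown("Balašević"))
 # ['Bala', 'š', 'evi', 'ć'] -- i.e. ^Bala$š^evi$ć in stream notation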
Thoughts
There should be various stages to tokenisation (a sketch of the staged design follows this list):
- separating whitespace from non-whitespace, so that we never again get e.g. ^Bala$š^evi$ć
- morphologically analysing tokens, including those which only appear as part of a multiword:
  - contractions (1 token, >1 word)
  - compounds (1 token, >1 word)
- retokenisation for multiwords:
  - (>1 token, 1 word)
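A minimal sketch of these stages, assuming toy data structures (ANALYSES, MULTIWORDS) rather than Apertium's real formats: stage 1 splits on Unicode letter properties instead of a hand-maintained alphabet, stage 2 analyses each token (a contraction yields more than one word), and stage 3 retokenises adjacent tokens into a single multiword.

 import re

 ANALYSES = {
     "can't": ["can<vaux>", "not<adv>"],  # contraction: 1 token, 2 words
     "take": ["take<vblex>"],
     "out": ["out<adv>"],
 }
 MULTIWORDS = {("take", "out"): ["take out<vblex>"]}  # 2 tokens, 1 word

 def stage1(text):
     # Unicode-aware: any run of letters is a token, so "Balašević"
     # survives intact instead of being split on š and ć.
     return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?|\d+|\S", text)

 def stage2(tokens):
     # Analyse each token; unknowns get the usual * marker.
     return [(tok, ANALYSES.get(tok, ["*" + tok])) for tok in tokens]

 def stage3(analysed):
     # Retokenise: merge adjacent tokens that form a known multiword.
     out, i = [], 0
     while i < len(analysed):
         pair = tuple(t for t, _ in analysed[i:i + 2])
         if pair in MULTIWORDS:
             out.append((" ".join(pair), MULTIWORDS[pair]))
             i += 2
         else:
             out.append(analysed[i])
             i += 1
     return out

 print(stage3(stage2(stage1("can't take out Balašević"))))
 # [("can't", ['can<vaux>', 'not<adv>']),
 #  ('take out', ['take out<vblex>']),
 #  ('Balašević', ['*Balašević'])]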
 
We should also be able to deal with languages that don't write spaces; longest match over the character stream, as sketched below, is a starting point.
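A toy sketch with a tiny, hypothetical Chinese lexicon; real segmentation would also need ambiguity handling (e.g. weighting alternative paths), but the mechanism is the same longest-match search used above, which never relies on spaces.

 LEXICON = {"我", "我们", "们", "去", "学校", "学"}

 def segment(text):
     tokens, i = [], 0
     while i < len(text):
         # Longest lexicon match starting at i, else a single character.
         for j in range(len(text), i, -1):
             if text[i:j] in LEXICON:
                 tokens.append(text[i:j])
                 i = j
                 break
         else:
             tokens.append(text[i])  # unknown single character
             i += 1
     return tokens

 print(segment("我们去学校"))
 # ['我们', '去', '学校']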

