Difference between revisions of "Rethinking tokenisation in the pipeline"

Revision as of 11:49, 27 June 2020

At the moment, tokenisation is done longest-match, left-to-right, using the morphological analyser together with an alphabet that is defined for breaking up unknown words.
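As a rough illustration of the behaviour described above, here is a toy longest-match, left-to-right tokeniser. It is only a sketch and not the real analyser: the function name `tokenise_lrlm`, the set-based `lexicon` and the `alphabet` argument are all assumptions made for this example. A lexicon entry is accepted only if it ends at a non-alphabet character or at the end of the input, and unknown words are cut at the first character outside the alphabet.

```python
import string


def tokenise_lrlm(text, lexicon, alphabet):
    """Toy longest-match, left-to-right tokeniser (illustrative only)."""
    tokens = []
    i = 0
    max_len = max((len(entry) for entry in lexicon), default=0)
    while i < len(text):
        match = None
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            at_boundary = j == len(text) or text[j] not in alphabet
            if at_boundary and text[i:j] in lexicon:
                match = text[i:j]
                break
        if match is not None:
            tokens.append(("known", match))
            i += len(match)
        elif text[i] in alphabet:
            # Unknown word: consume the run of alphabet characters.
            j = i
            while j < len(text) and text[j] in alphabet:
                j += 1
            tokens.append(("unknown", text[i:j]))
            i = j
        else:
            # Characters outside the alphabet are emitted one by one.
            tokens.append(("other", text[i]))
            i += 1
    return tokens


# Hypothetical data: a tiny lexicon with one multiword and an ASCII-only alphabet.
lexicon = {"in front of", "in", "the", "house"}
alphabet = set(string.ascii_letters)

print(tokenise_lrlm("in front of the house", lexicon, alphabet))
print(tokenise_lrlm("Balašević", lexicon, alphabet))
```

With an ASCII-only alphabet, the second call breaks the name into "Bala", "š", "evi" and "ć", because "š" and "ć" are alphabetic but not listed in the alphabet.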

This has some advantages:

It means that multiwords can be included in the dictionary and tokenised as a single unit.
It combines two parts of the pipeline that have complementary information: tokenisation and morphological analysis.

However it has some disadvantages:

Not having a standard fixed tokenisation scheme makes using other resources harder, and makes it harder for other people to use our resources.
Ambiguous paths need to be encoded in the lexicon (using +).

Historically:

We did not have apertium-separable or any other way of merging analyses outside of transfer.
Unicode support was not as widely used.

Current issues

Tokens are split at characters that are alphabetic but not in the <alphabet>, e.g. ^Bala$š^evi$ć.

Thoughts

There should be several stages to tokenisation:

separating whitespace from non-whitespace, so that we no longer get splits such as ^Bala$š^evi$ć
morphologically analysing tokens, including those which only appear as part of a multiword
- contractions (1 token, >1 word)
- compounds (1 token, >1 word)
retokenisation for multiwords
- (>1 token, 1 word)
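The stages listed above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a proposed implementation: the function names, the `analyses` dictionary (surface form to list of words), the `multiwords` dictionary (tuple of surface forms to a single word) and the `*` prefix for unknown tokens are all hypothetical choices for this example, and punctuation handling is left out.

```python
import re


def split_whitespace(text):
    """Stage 1: separate whitespace from non-whitespace; never cut inside a token."""
    return re.findall(r"\S+", text)


def analyse(token, analyses):
    """Stage 2: look up one token; contractions and compounds give more than one word."""
    return analyses.get(token.lower(), ["*" + token])


def merge_multiwords(units, multiwords):
    """Stage 3: join runs of tokens that form a known multiword, longest run first."""
    out = []
    i = 0
    max_len = max((len(key) for key in multiwords), default=1)
    while i < len(units):
        for n in range(min(max_len, len(units) - i), 1, -1):
            key = tuple(surface.lower() for surface, _ in units[i:i + n])
            if key in multiwords:
                surface = " ".join(s for s, _ in units[i:i + n])
                out.append((surface, [multiwords[key]]))
                i += n
                break
        else:
            # No multiword starts here: keep the unit as it is.
            out.append(units[i])
            i += 1
    return out


def pipeline(text, analyses, multiwords):
    tokens = split_whitespace(text)
    units = [(tok, analyse(tok, analyses)) for tok in tokens]
    return merge_multiwords(units, multiwords)


# Hypothetical toy data for the example.
analyses = {
    "i": ["i"],
    "don't": ["do", "not"],  # contraction: 1 token, >1 word
    "sit": ["sit"],
    "in": ["in"],
    "front": ["front"],
    "of": ["of"],
}
multiwords = {("in", "front", "of"): "in front of"}  # >1 token, 1 word

for unit in pipeline("I don't sit in front of Balašević", analyses, multiwords):
    print(unit)
```

Because the first stage only looks at whitespace, an unknown name such as "Balašević" stays in one piece no matter which characters it contains; it is simply marked as unknown in the second stage.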

We should also be able to deal with languages that don't write spaces.

