Rethinking tokenisation in the pipeline - Revision history

Unhammer at 07:12, 24 February 2023

2023-02-24T07:12:13Z

Francis Tyers at 11:49, 27 June 2020

2020-06-27T11:49:43Z

Francis Tyers: Created page with "At the moment tokenisation is done longest-match left-to-right using the morphological analyser and an alphabet defined for breaking up unknown words. This has some advantage..."

2020-06-27T11:39:52Z


New page

At the moment, tokenisation is done longest-match left-to-right, using the morphological analyser to recognise known words and an alphabet to delimit unknown words.

This has some advantages:
* Multiwords can be included in the dictionary and tokenised as a single unit.
* It combines two parts of the pipeline that have complementary information: tokenisation and morphological analysis.

However, it also has some disadvantages:
* Not having a standard, fixed tokenisation scheme makes it harder for us to use other resources, and harder for other people to use ours.
* Ambiguous paths need to be encoded in the lexicon (using <tt>+</tt>).
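
The longest-match left-to-right behaviour described above can be illustrated with a minimal Python sketch. It is not the actual analyser code: the function names <tt>lrlm_tokenise</tt> and <tt>render</tt>, the set-based lexicon and the simplified <tt>^...$</tt> output (without analyses) are all assumptions made for illustration.

```python
import string


def lrlm_tokenise(text, lexicon, alphabet):
    """Toy longest-match left-to-right tokeniser (illustrative only).

    lexicon:  set of known surface forms; entries may contain spaces (multiwords).
    alphabet: set of characters allowed inside an unknown word.
    Returns (kind, string) pairs with kind in {"known", "unknown", "blank"}.
    """
    max_len = max((len(entry) for entry in lexicon), default=0)
    tokens = []
    i = 0
    while i < len(text):
        # 1. Try the longest lexicon entry that starts at position i.
        match = None
        for j in range(min(len(text), i + max_len), i, -1):
            # Require a boundary after the match, so "in" does not match in "inside".
            at_boundary = j == len(text) or text[j] not in alphabet
            if at_boundary and text[i:j] in lexicon:
                match = text[i:j]
                break
        if match is not None:
            tokens.append(("known", match))
            i += len(match)
        elif text[i] in alphabet:
            # 2. Unknown word: the maximal run of alphabet characters.
            j = i
            while j < len(text) and text[j] in alphabet:
                j += 1
            tokens.append(("unknown", text[i:j]))
            i = j
        else:
            # 3. Any other character is passed through as a blank.
            tokens.append(("blank", text[i]))
            i += 1
    return tokens


def render(tokens):
    """Wrap every non-blank token in ^...$ (simplified: no analyses shown)."""
    return "".join(s if kind == "blank" else "^" + s + "$" for kind, s in tokens)


lexicon = {"in", "in front of"}          # hypothetical dictionary entries
alphabet = set(string.ascii_letters)     # hypothetical alphabet without š and ć
print(render(lrlm_tokenise("in front of Balašević", lexicon, alphabet)))
# ^in front of$ ^Bala$š^evi$ć
```

In this sketch the multiword <tt>in front of</tt> comes out as a single unit, while the unknown word is broken at every character missing from the alphabet. Adding <tt>š</tt> and <tt>ć</tt> to the alphabet yields <tt>^Balašević$</tt> instead.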

Historically:
* We did not have [[apertium-separable]] or any other way of merging analyses outside of transfer.
* Unicode support was not as widely used

== Thoughts ==

Tokenisation should proceed in several stages:
* separating whitespace from non-whitespace, so that we no longer get splits such as <tt>^Bala$š^evi$ć</tt>
* morphologically analysing tokens, including those which only appear as part of a multiword
** contractions (1 token, >1 word)
** compounds (1 token, >1 word)
* retokenisation for multiwords
** (>1 token, 1 word)
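
The stages listed above could be sketched roughly as follows. This is only one possible reading of the proposal, not an existing implementation: the function names, the dictionary-based <tt>analyses</tt> and <tt>multiwords</tt> tables and the <tt>*</tt> marker for unknown tokens are hypothetical, and punctuation and the preservation of blanks are ignored.

```python
import re


def split_whitespace(text):
    """Stage 1: keep only the runs of non-whitespace (blanks are dropped here)."""
    return [t for t in re.findall(r"\S+|\s+", text) if not t.isspace()]


def analyse(token, analyses):
    """Stage 2: one token may map to several words (contractions, compounds).

    Unknown tokens come back as a single word marked with '*'.
    """
    return analyses.get(token, ["*" + token])


def merge_multiwords(units, multiwords):
    """Stage 3: merge adjacent tokens that form a known multiword, longest first."""
    longest = max((len(key) for key in multiwords), default=1)
    out, i = [], 0
    while i < len(units):
        for n in range(min(longest, len(units) - i), 1, -1):
            key = tuple(surface for surface, _ in units[i:i + n])
            if key in multiwords:
                out.append((" ".join(key), [multiwords[key]]))
                i += n
                break
        else:
            # No multiword starts here: keep the token with its own analysis.
            out.append(units[i])
            i += 1
    return out


def tokenise(text, analyses, multiwords):
    units = [(tok, analyse(tok, analyses)) for tok in split_whitespace(text)]
    return merge_multiwords(units, multiwords)


# Hypothetical toy data: a contraction and a three-token multiword.
analyses = {
    "don't": ["do", "not"],
    "stand": ["stand"],
    "in": ["in"], "front": ["front"], "of": ["of"],
}
multiwords = {("in", "front", "of"): "in front of"}

for surface, words in tokenise("don't stand in front of Balašević", analyses, multiwords):
    print(surface, "->", words)
```

Because the first stage only looks at whitespace, the unknown name stays in one piece whatever the alphabet contains. The contraction is one token with two words, and the multiword is three tokens merged into one word.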

← Older revision / Revision as of 07:12, 24 February 2023

@@ Line 28: / Line 28: @@
 We should also be able to deal with languages that don't write spaces.
+[[Category:Tokenisation]]

@@ Line 12: / Line 12: @@
 * We did not have [[apertium-separable]] or any other way of merging analyses outside of transfer.
 * Unicode support was not as widely used
+== Current issues ==
+* Splitting tokens that are alphabetic but not in the <tt><alphabet></tt>, e.g. <tt>^Bala$š^evi$ć</tt>.
 == Thoughts ==
@@ Line 22: / Line 26: @@
 * retokenisation for multiwords
 ** (>1 token, 1 word)
+We should also be able to deal with languages that don't write spaces.