Rethinking tokenisation in the pipeline

At the moment, tokenisation is done longest-match left-to-right, using the morphological analyser together with an alphabet that defines how unknown words are broken up.
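A toy sketch of this scheme in Python (LEXICON and ALPHABET here are stand-ins: the real pipeline consults the analyser's finite-state lexicon and the <alphabet> declared in the dictionary):

  # Toy illustration of longest-match left-to-right tokenisation.
  # LEXICON stands in for the forms known to the morphological analyser;
  # ALPHABET is the set of letters used to chunk unknown words.
  LEXICON = {"take", "out", "take out", "the", "trash", "sang"}
  ALPHABET = set("abcdefghijklmnopqrstuvwxyz")

  def tokenise(text):
      tokens = []
      i = 0
      while i < len(text):
          # 1. Prefer the longest match against the lexicon.
          match = next((text[i:j] for j in range(len(text), i, -1)
                        if text[i:j].lower() in LEXICON), None)
          if match is not None:
              tokens.append(match)
              i += len(match)
          elif text[i].lower() in ALPHABET:
              # 2. Otherwise take a maximal run of alphabet letters
              #    as an unknown word.
              j = i
              while j < len(text) and text[j].lower() in ALPHABET:
                  j += 1
              tokens.append(text[i:j])
              i = j
          else:
              # 3. Everything else (spaces, punctuation, letters missing
              #    from ALPHABET) passes through one character at a time.
              tokens.append(text[i])
              i += 1
      return tokens

  print(tokenise("take out the trash"))
  # prints ['take out', ' ', 'the', ' ', 'trash']   (the multiword is one token)
  print(tokenise("Balašević sang"))
  # prints ['Bala', 'š', 'evi', 'ć', ' ', 'sang']   (letters outside ALPHABET split the name)

The second example shows exactly the kind of splitting listed under current issues below.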

This has some advantages:

  • It means that multiwords can be included in the dictionary and tokenised as a single unit.
  • It combines two parts of the pipeline that have complementary information: tokenisation and morphological analysis.

However, it has some disadvantages:

  • Not having a standard, fixed tokenisation scheme makes it harder for us to use other resources, and harder for other people to use ours.
  • Ambiguous paths need to be encoded in the lexicon (using +); see the example below.
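As an illustration of the + join in the Apertium stream format (the entry and tags shown are only an example), a Spanish contraction like del comes out as a single lexical unit carrying two joined words, and every such split has to be listed explicitly in the dictionary:

  ^del/de<pr>+el<det><def><m><sg>$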

Historically:

  • We did not have apertium-separable or any other way of merging analyses outside of transfer.
  • Unicode support was not as widely used

Current issues

  • Words containing letters that are alphabetic but not listed in the <alphabet> get split into several tokens, e.g. Balašević comes out as ^Bala$š^evi$ć (see the note below).
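The splitting happens because the <alphabet> is an explicit, hand-maintained list of characters; a Unicode-aware test would already treat the offending characters as letters, as this small Python check illustrates (str.isalpha() stands in for such a test):

  # 'š' and 'ć' count as alphabetic in Unicode, even when a hand-written
  # <alphabet> definition does not list them.
  print("Balašević".isalpha())          # True
  print("š".isalpha(), "ć".isalpha())   # True True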

Thoughts

There should be various stages to tokenisation (a rough sketch of how they could fit together in code follows this list):

  • separating whitespace from non-whitespace, so that we no longer get things like ^Bala$š^evi$ć
  • morphologically analysing tokens, including those which only appear as part of a multiword
    • contractions (1 token, >1 word)
    • compounds (1 token, >1 word)
  • retokenisation for multiwords
    • (>1 token, 1 word)
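A rough Python sketch of how these stages could fit together, with invented toy tables (ANALYSES, MULTIWORDS) standing in for the morphological analyser and the multiword lexicon:

  # Stage 2 data: per-token analyses, including splits (1 token, >1 word).
  # All entries are invented for illustration.
  ANALYSES = {
      "didn't": ["did", "not"],      # contraction: 1 token, >1 word
      "bookshop": ["book", "shop"],  # compound: 1 token, >1 word
      "take": ["take"], "out": ["out"], "the": ["the"], "trash": ["trash"],
  }

  # Stage 3 data: multiwords spanning several tokens (>1 token, 1 word).
  MULTIWORDS = {("take", "out"): "take out"}

  def whitespace_tokenise(text):
      """Stage 1: just separate whitespace from non-whitespace."""
      return text.split()

  def analyse(tokens):
      """Stage 2: analyse each token; unknown tokens pass through whole."""
      words = []
      for token in tokens:
          words.extend(ANALYSES.get(token.lower(), [token]))
      return words

  def retokenise(words):
      """Stage 3: merge word sequences that form a single multiword."""
      merged, i = [], 0
      while i < len(words):
          pair = tuple(words[i:i + 2])
          if pair in MULTIWORDS:
              merged.append(MULTIWORDS[pair])
              i += 2
          else:
              merged.append(words[i])
              i += 1
      return merged

  tokens = whitespace_tokenise("Balašević didn't take out the trash")
  print(tokens)   # the name stays whole; no mid-word splits
  print(retokenise(analyse(tokens)))
  # prints ['Balašević', 'did', 'not', 'take out', 'the', 'trash']

Unknown tokens such as Balašević survive stage 1 intact instead of being chopped up by an alphabet check.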

We should also be able to deal with languages that don't write spaces.