Rethinking tokenisation in the pipeline
At the moment, tokenisation is done longest-match left-to-right, using the morphological analyser together with an alphabet defined for breaking up unknown words (a sketch of this behaviour follows the lists below).
This has some advantages:
- It means that multiwords can be included in the dictionary and tokenised as a single unit.
- It combines two parts of the pipeline that have complementary information: tokenisation and morphological analysis.
However, it has some disadvantages:
- Not having a standard fixed tokenisation scheme makes using other resources harder, and makes it harder for other people to use our resources.
- Ambiguous tokenisation paths need to be encoded in the lexicon (using +); see the example right after this list.
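For example, lttoolbox joins several lexical units inside one token with + (produced by <j/> in the dictionary). The analysis below is illustrative only, with the tags abbreviated, for a Spanish imperative with enclitic pronouns:

 ^dímelo/decir<vblex><imp><p2><sg>+me<prn><enc>+lo<prn><enc>$

Every such path has to be spelled out as a dictionary entry, so the tokenisation decision is baked into the lexicon rather than made elsewhere in the pipeline.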
Historically:
- We did not have apertium-separable or any other way of merging analyses outside of transfer.
- Unicode support was not as widely used.
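To make the current behaviour concrete, here is a minimal sketch of longest-match left-to-right tokenisation with alphabet-based splitting of unknown words. It is a toy illustration in Python, not the actual implementation (which in Apertium is part of the finite-state processing in lttoolbox); LEXICON and ALPHABET are hypothetical stand-ins for the analyser's lexicon and the <alphabet> definition.

 # Toy lexicon: note the multiword entry "take out".
 LEXICON = {"take", "took", "out", "take out", "the", "rubbish"}
 ALPHABET = set("abcdefghijklmnopqrstuvwxyz")

 def tokenise(text):
     tokens = []
     i = 0
     while i < len(text):
         if text[i].isspace():
             i += 1
             continue
         # Longest lexicon match starting at position i wins.
         for j in range(len(text), i, -1):
             if text[i:j] in LEXICON:
                 tokens.append(text[i:j])
                 i = j
                 break
         else:
             # Unknown word: consume a maximal run of <alphabet> characters.
             j = i + 1
             while j < len(text) and text[j] in ALPHABET:
                 j += 1
             tokens.append(text[i:j])
             i = j
     return tokens

 print(tokenise("take out the rubbish"))
 # ['take out', 'the', 'rubbish'] -- the multiword beats 'take' by length

Because "take out" is longer than "take", the multiword wins: this is the advantage listed above, and also where the ambiguity problems come from.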
Current issues
- Splitting tokens containing characters that are alphabetic but missing from the <alphabet>, e.g. Balašević comes out as ^Bala$š^evi$ć (see the sketch below).
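A hypothetical Python sketch of this failure mode, assuming an <alphabet> restricted to ASCII letters:

 # If š and ć are missing from the alphabet, the unknown-word
 # splitter breaks "Balašević" apart at every character it
 # does not recognise.
 ALPHABET = set("abcdefghijklmnopqrstuvwxyz") | set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

 def split_unknown(word):
     pieces, current = [], ""
     for ch in word:
         if ch in ALPHABET:
             current += ch
         else:
             if current:
                 pieces.append(current)
             pieces.append(ch)  # non-alphabet character becomes its own piece
             current = ""
     if current:
         pieces.append(current)
     return pieces

 print(split_unknown("Balašević"))
 # ['Bala', 'š', 'evi', 'ć'] -- i.e. ^Bala$š^evi$ć in stream notation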
Thoughts
There should be various stages to tokenisation (a sketch of the staged design follows this list):
- separating whitespace from non-whitespace, so that we never again get e.g. ^Bala$š^evi$ć
- morphologically analysing tokens, including those which only appear as part of a multiword:
  - contractions (1 token, >1 word)
  - compounds (1 token, >1 word)
- retokenisation for multiwords:
  - (>1 token, 1 word)
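A minimal sketch of these stages, assuming toy data structures (ANALYSES, MULTIWORDS) rather than Apertium's real formats: stage 1 splits on Unicode letter properties instead of a hand-maintained alphabet, stage 2 analyses each token (a contraction yields more than one word), and stage 3 retokenises adjacent tokens into a single multiword.

 import re

 ANALYSES = {
     "can't": ["can<vaux>", "not<adv>"],  # contraction: 1 token, 2 words
     "take": ["take<vblex>"],
     "out": ["out<adv>"],
 }
 MULTIWORDS = {("take", "out"): ["take out<vblex>"]}  # 2 tokens, 1 word

 def stage1(text):
     # Unicode-aware: any run of letters is a token, so "Balašević"
     # survives intact instead of being split on š and ć.
     return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?|\d+|\S", text)

 def stage2(tokens):
     # Analyse each token; unknowns get the usual * marker.
     return [(tok, ANALYSES.get(tok, ["*" + tok])) for tok in tokens]

 def stage3(analysed):
     # Retokenise: merge adjacent tokens that form a known multiword.
     out, i = [], 0
     while i < len(analysed):
         pair = tuple(t for t, _ in analysed[i:i + 2])
         if pair in MULTIWORDS:
             out.append((" ".join(pair), MULTIWORDS[pair]))
             i += 2
         else:
             out.append(analysed[i])
             i += 1
     return out

 print(stage3(stage2(stage1("can't take out Balašević"))))
 # [("can't", ['can<vaux>', 'not<adv>']),
 #  ('take out', ['take out<vblex>']),
 #  ('Balašević', ['*Balašević'])]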
 
We should also be able to deal with languages that don't write spaces; longest match over the character stream, as sketched below, is a starting point.
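A toy sketch with a tiny, hypothetical Chinese lexicon; real segmentation would also need ambiguity handling (e.g. weighting alternative paths), but the mechanism is the same longest-match search used above, which never relies on spaces.

 LEXICON = {"我", "我们", "们", "去", "学校", "学"}

 def segment(text):
     tokens, i = [], 0
     while i < len(text):
         # Longest lexicon match starting at i, else a single character.
         for j in range(len(text), i, -1):
             if text[i:j] in LEXICON:
                 tokens.append(text[i:j])
                 i = j
                 break
         else:
             tokens.append(text[i])  # unknown single character
             i += 1
     return tokens

 print(segment("我们去学校"))
 # ['我们', '去', '学校']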

