Difference between revisions of "Ideas for Google Summer of Code/Robust tokenisation"

From Apertium
Apertium has a custom tokenisation algorithm based on the alphabet that the dictionary writer declares in the dictionary file plus, in part, the characters found in the actual dictionary entries. This leads to some hard-to-understand problems in the pipeline, especially when [[HFST]]-based analysers are used. Furthermore, the tokenisation is suboptimal for languages where there are no non-word characters (e.g. whitespace) to separate words. Also, different whitespace, hyphen and zero-width characters are handled inconsistently.


Some examples:

* Names etc. with an accent not in the alphabet or dictionary: ''Müller'' should be one token even if ü does not appear in the dictionary or alphabet
* Compounding strategies
** banana-door may be 1 or 2 tokens depending on the dictionary writer's preferences, and this should not be affected by whether - is the Unicode character HYPHEN-MINUS, HYPHEN or EN DASH. A strategy must also consider the case where - is replaced with ZERO WIDTH JOINER or even NO-BREAK SPACE
** Can we combine compounds with multiwords / words-with-spaces in general? ( https://github.com/apertium/lttoolbox/pull/139 implemented support for a limited form)
** Can we implement "compound-only-right", i.e. have entries that are *only* allowed to end compounds, not appear as words on their own?
** Can we implement more nuanced compound restrictions in lttoolbox without going full flag diacritic, e.g. "ulve+jakt" (with -e- when it's the first word) but "prærie+ulv+jakt" (no -e- when it's a middle part)?
** Can we allow limited reanalysis as a compound, e.g. if "Vertriebsprinzip" is in the dix as one lemma but we [https://www.reddit.com/r/LanguageTechnology/comments/1bo7svi/rebuilding_german_compound_words/ still want to know that it is a compound] of vertrieb+prinzip? (But where do we stop? What about ver+trieb, false analyses etc.?)
* Support for spaceless orthographies / no-space scripts like Japanese and Thai (perhaps using "plugins", e.g. mecab for Japanese)
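
The hyphen item in the list above can be made concrete with a small sketch. This is not lttoolbox code: <code>HYPHEN_LIKE</code>, <code>CANONICAL_HYPHEN</code> and <code>normalise_hyphens</code> are hypothetical names, and mapping every variant to HYPHEN-MINUS is only one possible policy. ZERO WIDTH JOINER and NO-BREAK SPACE are deliberately left untouched, because whether they should join or split a compound is exactly the open design question.

```python
import unicodedata

# The three hyphen variants named in the list, looked up by their
# official Unicode character names (U+002D, U+2010, U+2013).
HYPHEN_LIKE = {
    unicodedata.lookup(name)
    for name in ("HYPHEN-MINUS", "HYPHEN", "EN DASH")
}

# Assumption for this sketch: everything is normalised to HYPHEN-MINUS.
CANONICAL_HYPHEN = "-"


def normalise_hyphens(text):
    """Replace each hyphen-like character with the canonical hyphen."""
    return "".join(
        CANONICAL_HYPHEN if ch in HYPHEN_LIKE else ch for ch in text
    )


def describe_separators(text):
    """List code point and Unicode name of every non-alphanumeric character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if not ch.isalnum()
    ]


if __name__ == "__main__":
    for word in ("banana-door", "banana\u2010door", "banana\u2013door",
                 "banana\u200ddoor", "banana\u00a0door"):
        print(describe_separators(word), repr(normalise_hyphens(word)))
```

Running the normalisation before dictionary lookup would let the dictionary writer decide once whether ''banana-door'' is one token or two, independently of which hyphen the input happens to use.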

I have also found out that some languages abuse the current tokenisation algorithm by defining characters like hyphens as non-word characters, effectively making Apertium treat them like whitespace. This probably (I'm not sure I fully understand this hack) allows haphazard compounding of the banana-aeroplane-combucha-mocca-latte kind, but it has already been problematic when improving the pipeline, e.g. in GSoC 2020. The upgrade path for robust tokenisation must therefore ensure that users of this hack can continue to work without regressions.


==Task==


* Update [[lttoolbox]] to be fully Unicode compliant with regard to alphabetic symbols: we want "*Müller" and not "*Mu *ü *ller", regardless of the alphabet in the .dix file
** More or less fixed in https://github.com/apertium/lttoolbox/issues/81 ?
** Is this still unfixed in HFST?
* Support spaceless orthographies
** See https://github.com/chanlon1/tokenisation
** and [[Tokenisation for spaceless orthographies]]
* Allow dictionary developers more control over tokenisation
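
The first task item can be illustrated with a toy comparison. The sketch below is not the lttoolbox algorithm; it only contrasts a tokeniser whose word characters come from a fixed alphabet with one that asks Unicode whether a character is a letter or combining mark. The function names and the ASCII-only alphabet are assumptions made for the illustration.

```python
import string
import unicodedata


def tokenise(text, is_word_char):
    """Split text into maximal runs of characters accepted by is_word_char."""
    tokens, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def in_ascii_alphabet(ch):
    # Hypothetical dictionary alphabet that lacks "ü".
    return ch in string.ascii_letters


def is_unicode_word_char(ch):
    # General categories L* (letters) and M* (combining marks), so that
    # both precomposed "ü" and "u" + COMBINING DIAERESIS stay inside a word.
    return unicodedata.category(ch)[0] in ("L", "M")


if __name__ == "__main__":
    sentence = "Herr Müller kommt."
    print(tokenise(sentence, in_ascii_alphabet))
    print(tokenise(sentence, is_unicode_word_char))
```

With the fixed alphabet the name is broken at the unknown letter, while the Unicode-based predicate keeps ''Müller'' together in both its precomposed and decomposed spellings.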


The final algorithm should be an improvement upon the current tokenisation, so care needs to be taken that the original ideas of ''inconditionals'' and similar dictionary blocks are preserved. I suggest test-driven development for your plan.


==Coding challenge==


1. Write a program that uses data from Unicode to classify characters in an input stream into alphabetic and non-alphabetic.

e.g.

<pre>
echo "This! Is a tešt тест ** % test." | ./classify-symbols
C T
C h
C i
C s
X !
X
C I
C s

...
</pre>


2. Make a valiant effort to solve an open issue from [https://github.com/apertium/lttoolbox/issues/ lttoolbox], [https://github.com/hfst/hfst/issues/ hfst] or [https://github.com/apertium/apertium-separable/issues/ apertium-separable]. Note: These are challenging (that's why they're open issues), but they are exactly the kind of coding you will be working on for this project.
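
For the first challenge, one possible starting point is sketched below, using the Unicode data shipped with Python's standard <code>unicodedata</code> module. It assumes that "alphabetic" means a general category starting with L; the Unicode Alphabetic property is broader, and the standard library does not expose it directly. The names <code>classify</code> and <code>classify_stream</code> are hypothetical, and the demo reads from an in-memory stream rather than standard input.

```python
import io
import unicodedata


def classify(ch):
    """Return "C" for letters (general category L*), "X" for anything else."""
    return "C" if unicodedata.category(ch).startswith("L") else "X"


def classify_stream(stream):
    """Yield one "<class> <char>" row per character, skipping line breaks."""
    for line in stream:
        for ch in line.rstrip("\n"):
            yield f"{classify(ch)} {ch}"


# Demo on the sample sentence from the challenge text.
demo = io.StringIO("This! Is a tešt тест ** % test.\n")
for row in classify_stream(demo):
    print(row)
```

Passing <code>sys.stdin</code> instead of the <code>io.StringIO</code> object would give the pipe-style behaviour shown in the example above.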

== Further readings ==

* https://github.com/hfst/hfst/blob/master/tools/src/hfst-tokenize.cc
* https://unicode.org/reports/tr29/
* [[Tokenisation_for_spaceless_orthographies]]


[[Category:Google Summer of Code|Robust tokenisation]]
[[Category:Ideas_for_Google Summer of Code|Robust tokenisation]]
[[Category:Tokenisation]]

Latest revision as of 12:31, 27 March 2024
