Orthographic normalisation
Revision as of 21:02, 24 September 2007 by Francis Tyers
A module that performs orthographic normalisation on input streams would be useful. Some languages need more complicated normalisation than others.
- Romanian
Romanian has two characters that should be written with a comma below but are often (probably in over 90% of text "in the wild") written with a cedilla instead. An orthographic normalisation module would convert the legacy cedilla forms into the correct comma-below forms.
- ţ → ț
- ş → ș
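The Romanian mapping above could be sketched as a simple character translation. This is only an illustration, not an existing module: the names `CEDILLA_TO_COMMA` and `normalise_romanian` are hypothetical, and the upper-case pairs are an assumption added here, since only the lower-case letters are listed above.

```python
# Hypothetical sketch: map legacy cedilla letters to comma-below letters.
CEDILLA_TO_COMMA = str.maketrans({
    "\u0163": "\u021B",  # ţ (t with cedilla) -> ț (t with comma below)
    "\u015F": "\u0219",  # ş (s with cedilla) -> ș (s with comma below)
    # Assumption: upper-case forms should be handled the same way.
    "\u0162": "\u021A",  # Ţ -> Ț
    "\u015E": "\u0218",  # Ş -> Ș
})


def normalise_romanian(text: str) -> str:
    """Return text with cedilla forms replaced by comma-below forms."""
    return text.translate(CEDILLA_TO_COMMA)


if __name__ == "__main__":
    import sys

    # Normalise an input stream line by line.
    for line in sys.stdin:
        sys.stdout.write(normalise_romanian(line))
```

Note that ş with a cedilla is the correct letter in Turkish, so a mapping like this should only be applied to text known to be Romanian.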
- Lingala
Precomposed vs. decomposed (base letter plus combining diacritic) Unicode characters.
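One possible way to deal with the precomposed vs. combining-sequence issue is to bring all input into a single Unicode normalisation form. The sketch below uses NFC from Python's standard `unicodedata` module; choosing NFC over NFD is an assumption here, and the function name `normalise_composition` is hypothetical.

```python
import unicodedata


def normalise_composition(text: str) -> str:
    """Return text in Unicode Normalization Form C (composed where possible)."""
    return unicodedata.normalize("NFC", text)


# "e" followed by a combining acute accent becomes the single precomposed "é".
decomposed = "e\u0301"
composed = normalise_composition(decomposed)
print(len(decomposed), len(composed))

# Letters such as "ɛ" (U+025B) have no precomposed accented form, so the
# combining sequence is kept, but it is still put into one canonical form.
open_e_acute = normalise_composition("\u025B\u0301")
print(len(open_e_acute))
```

After this step, two strings that look identical on screen compare equal regardless of how they were typed, which is what a dictionary lookup needs.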