Difference between revisions of "Orthographic normalisation"
Revision as of 13:25, 25 September 2007
A module to do orthographic normalisation on input streams would be nice. Some languages are more complicated to normalise than others.
- Romanian
Romanian has two characters that should be written with a comma below but are often (probably over 90% of text "in the wild") written with a cedilla instead. An orthographic normalisation module would convert the legacy cedilla forms into the correct comma-below forms.
- ţ → ț
- ş → ș
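The two replacements above can be sketched as a simple character translation table. This is only a minimal illustration, not an existing module: the names `CEDILLA_TO_COMMA` and `normalise_romanian` are hypothetical, and the uppercase pairs (Ţ → Ț, Ş → Ș) are an assumption added for completeness. It only handles the precomposed code points, and it should only be applied to text known to be Romanian, since other orthographies (Turkish, for example) use ş legitimately.

```python
# Hypothetical mapping from legacy cedilla forms to comma-below forms.
CEDILLA_TO_COMMA = str.maketrans({
    "\u0163": "\u021B",  # ţ (t with cedilla) -> ț (t with comma below)
    "\u015F": "\u0219",  # ş (s with cedilla) -> ș (s with comma below)
    "\u0162": "\u021A",  # Ţ -> Ț (uppercase pair, assumed for completeness)
    "\u015E": "\u0218",  # Ş -> Ș (uppercase pair, assumed for completeness)
})


def normalise_romanian(text: str) -> str:
    """Replace cedilla forms of s and t with their comma-below forms."""
    return text.translate(CEDILLA_TO_COMMA)


if __name__ == "__main__":
    # Legacy spelling of "țară și" written with cedillas.
    print(normalise_romanian("\u0163ar\u0103 \u015fi"))
```

A translation table keeps the module a single pass over the input, which suits stream processing; characters not in the table pass through unchanged.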
- Lingala
Precomposed characters vs. decomposed sequences (base letter plus combining diacritic) in Unicode.
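One way to make the two representations compare equal is to pick a single Unicode normalisation form and apply it to all input. The sketch below uses Python's standard `unicodedata.normalize` with NFC (composed form); the function name `normalise_composition` is hypothetical, and choosing NFC over NFD is an assumption rather than something this page specifies. Note that letters with no precomposed equivalent stay as a base letter plus combining mark even under NFC.

```python
import unicodedata


def normalise_composition(text: str, form: str = "NFC") -> str:
    """Return text in a single Unicode normalisation form (NFC by default)."""
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    decomposed = "o\u0301"   # "o" followed by COMBINING ACUTE ACCENT
    precomposed = "\u00f3"   # LATIN SMALL LETTER O WITH ACUTE
    # The two strings look identical but differ until normalised.
    print(decomposed == precomposed)
    print(normalise_composition(decomposed) == precomposed)
```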