Difference between revisions of "Orthographic normalisation"
Revision as of 21:02, 24 September 2007
A module to do orthographic normalisation on input streams would be nice. Some normalisations are more complicated than others.
- Romanian
Romanian has two characters that should be written with commas but are often (probably over 90% of text "in the wild") written with cedillas. An orthographic normalisation module would convert the legacy cedilla forms into the correct comma-below forms.
- ţ → ț
- ş → ș
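The Romanian mapping above could be sketched as a simple character translation table. This is only an illustration, not an existing module: the names `CEDILLA_TO_COMMA` and `normalise_romanian` are hypothetical, and the upper-case pairs are an assumption added for completeness (the list above only gives the lower-case ones).

```python
# Hypothetical sketch: map legacy cedilla letters to comma-below letters.
CEDILLA_TO_COMMA = str.maketrans({
    "\u0163": "\u021B",  # ţ (t with cedilla) → ț (t with comma below)
    "\u015F": "\u0219",  # ş (s with cedilla) → ș (s with comma below)
    # Assumption: upper-case forms should be normalised the same way.
    "\u0162": "\u021A",  # Ţ → Ț
    "\u015E": "\u0218",  # Ş → Ș
})


def normalise_romanian(text: str) -> str:
    """Return text with cedilla forms replaced by comma-below forms."""
    return text.translate(CEDILLA_TO_COMMA)
```

Because each substitution is one character for one character, the same table could be applied to a stream chunk by chunk without buffering across chunk boundaries.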
- Lingala
Pre-composed vs. decomposed (base letter plus combining mark) Unicode characters.
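One possible way to handle this case is standard Unicode normalisation, which Python exposes through `unicodedata.normalize`. The sketch below assumes the module would settle on the composed form (NFC); that choice, and the function name `to_composed`, are assumptions for illustration rather than anything decided here.

```python
import unicodedata


def to_composed(text: str) -> str:
    """Return text in Unicode Normalization Form C (composed where possible)."""
    return unicodedata.normalize("NFC", text)


# "e" followed by a combining acute accent (two code points)...
decomposed = "e\u0301"
# ...becomes the single pre-composed code point U+00E9.
composed = to_composed(decomposed)
```

Letters that have no pre-composed form in Unicode stay as a base letter plus combining mark even after NFC, so downstream components would still need to accept such sequences.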