Difference between revisions of "Orthographic normalisation"
Revision as of 21:02, 24 September 2007
A module that performs orthographic normalisation on input streams would be useful. Some normalisations are more complicated than others.
- Romanian
Romanian has two letters, ț and ș, that should be written with a comma below but are often (probably in over 90% of text "in the wild") written with a cedilla instead. An orthographic normalisation module would convert the legacy cedilla forms into the correct comma-below forms.
- ţ → ț
- ş → ș
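The two replacements above can be sketched as a simple character translation table. This is only an illustration: the function name `normalise_romanian` is hypothetical, and the uppercase mappings are an assumption that goes beyond the two lowercase pairs listed here.

```python
# Minimal sketch of the cedilla -> comma-below conversion for Romanian.
# normalise_romanian is a hypothetical name, not an existing module API.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015F": "\u0219",  # ş (s with cedilla) -> ș (s with comma below)
    "\u0163": "\u021B",  # ţ (t with cedilla) -> ț (t with comma below)
    # Assumption: uppercase forms should be converted as well.
    "\u015E": "\u0218",  # Ş -> Ș
    "\u0162": "\u021A",  # Ţ -> Ț
})


def normalise_romanian(text: str) -> str:
    """Replace legacy cedilla letters with their comma-below equivalents."""
    return text.translate(CEDILLA_TO_COMMA)


if __name__ == "__main__":
    print(normalise_romanian("\u015Fi \u0163ar\u0103"))
```

Note that ş with a cedilla is the correct letter in Turkish, so a conversion like this should only be applied to text already known to be Romanian.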
- Lingala
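Precomposed vs. decomposed Unicode characters: the same accented letter can be encoded either as a single code point or as a base letter followed by combining diacritics.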
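One possible way to handle this is to pick a single Unicode normalisation form and apply it to all input. The sketch below uses Python's standard `unicodedata.normalize`; the function name `normalise_composition` and the choice of NFC as the default are assumptions, not something decided on this page.

```python
import unicodedata


def normalise_composition(text: str, form: str = "NFC") -> str:
    """Bring text into one Unicode normalisation form.

    NFC composes base letter + combining mark into a single code point
    where a precomposed character exists; NFD does the reverse.
    """
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
    composed = normalise_composition(decomposed)
    print(len(decomposed), len(composed))
```

Letters for which Unicode has no precomposed form stay as a base letter plus combining marks even under NFC, so the result is consistent but not necessarily one code point per letter.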