Orthographic normalisation
Revision as of 13:29, 5 October 2007
A module that performs orthographic normalisation on input streams would be useful. Some languages need more complicated normalisation than others.
- Romanian
Romanian has two letters, ș and ț, that should be written with a comma below but are often (probably over 90% of text "in the wild") written with a cedilla instead. An orthographic normalisation module would convert the legacy cedilla forms into the comma-below forms:
- ţ → ț
- ş → ș
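The Romanian mapping above can be sketched as a simple character translation table. This is only a minimal illustration, not an existing module: the names `CEDILLA_TO_COMMA` and `normalise_romanian` are hypothetical, and the upper-case pairs are an assumption added here, since the list above only gives the lower-case letters.

```python
# Hypothetical sketch: map legacy cedilla letters to their comma-below forms.
# The upper-case entries are an assumption; the text above lists only ţ and ş.
CEDILLA_TO_COMMA = str.maketrans({
    "\u0163": "\u021B",  # ţ (t with cedilla) -> ț (t with comma below)
    "\u015F": "\u0219",  # ş (s with cedilla) -> ș (s with comma below)
    "\u0162": "\u021A",  # Ţ -> Ț
    "\u015E": "\u0218",  # Ş -> Ș
})


def normalise_romanian(text: str) -> str:
    """Return text with cedilla forms replaced by comma-below forms."""
    return text.translate(CEDILLA_TO_COMMA)
```

A mapping like this should only be applied to text known to be Romanian, because ş is also a regular letter in other orthographies such as Turkish.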
- Lingala
- See also: Unicode issues
When a character has an accent, there is sometimes more than one way of representing it: as a single pre-combined (sometimes referred to as pre-composed) character, or as a base character followed by a combining character. The two forms are different byte sequences in UTF-8, but look the same to the user.
UTF-8        0xC3 0xA1   vs.   0x61 0xCC 0x81
Rendered     á           vs.   á
Code points  U+00E1      vs.   U+0061 U+0301
The best approach is probably to standardise on one variant for analysis and generation, and then normalise all input coming into the analyser to that variant, using a transliterator or something similar.
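One possible way to do this normalisation step is with the standard Unicode normalisation forms, which Python exposes through `unicodedata.normalize`. The sketch below assumes that the pre-composed variant (NFC) is the one chosen as the standard; that choice, and the function name `normalise_input`, are assumptions made for illustration rather than anything this page prescribes.

```python
import unicodedata


def normalise_input(text: str, form: str = "NFC") -> str:
    """Normalise text to one Unicode form before it reaches the analyser.

    NFC composes base + combining sequences into pre-composed characters
    where one exists; NFD does the reverse. NFC is an assumed default here.
    """
    return unicodedata.normalize(form, text)


precomposed = "\u00E1"   # á as a single code point, U+00E1
combining = "a\u0301"    # a followed by U+0301 COMBINING ACUTE ACCENT

# The two strings differ before normalisation and match afterwards.
assert precomposed != combining
assert normalise_input(combining) == precomposed
```

Characters that have no pre-composed form stay as a base character plus combining mark under NFC, so standardising on NFD instead may be simpler if the analyser's alphabet contains many such characters.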