Unicode issues
Latest revision as of 11:46, 24 March 2012
Some issues (potential and otherwise) with Unicode support.
Combining vs. pre-combined characters
When a character has an accent, there is sometimes more than one way of representing it: as a single pre-combined character, or as a base character followed by a combining character. The two look the same to the user, but they are different code point sequences and therefore different bytes in UTF-8.

  UTF-8  0xC3 0xA1  vs.  0x61 0xCC 0x81
         á          vs.  á
         U+00E1     vs.  U+0061 U+0301
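As a quick way to see the difference, the following sketch prints the UTF-8 bytes of both representations using only the Python standard library. The helper name `utf8_bytes` is made up for this illustration.

```python
# Two representations of "a with acute accent".
precombined = "\u00e1"   # single pre-combined code point U+00E1
combining = "a\u0301"    # base letter U+0061 followed by combining acute U+0301


def utf8_bytes(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"0x{b:02X}" for b in text.encode("utf-8"))


print(utf8_bytes(precombined))   # 0xC3 0xA1
print(utf8_bytes(combining))     # 0x61 0xCC 0x81

# A plain string comparison treats the two as different strings.
print(precombined == combining)  # False
```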
The best thing to do is probably to standardise on one variant for analysis and generation, and then normalise all input coming into the analyser, using a transliterator or something similar.
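One possible way to normalise input before it reaches the analyser is sketched below with Python's standard `unicodedata` module. The function name `normalise_input` and the choice of NFC as the default are assumptions made for this example, not something the analyser itself prescribes; a transliterator would work equally well.

```python
import unicodedata


def normalise_input(text: str, form: str = "NFC") -> str:
    """Hypothetical pre-processing step: bring all input to one Unicode form.

    The default form (NFC) is an assumption for this sketch; use whichever
    variant the dictionaries were standardised on.
    """
    return unicodedata.normalize(form, text)


# Both spellings of "a with acute" come out identical after normalisation.
print(normalise_input("a\u0301") == normalise_input("\u00e1"))  # True
```

Whichever form is chosen, the dictionaries and the incoming text need to use the same one, otherwise lookups for visually identical words can fail.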
Unicode normalisation
Unicode defines normalisation algorithms that transform canonically equivalent strings into a single form.
- NFD (Normalisation Form D, Canonical Decomposition)
The strings are decomposed, and the combining characters are reordered into a canonical order.
Here is an example with four equivalent strings: U+0065 U+0301 U+0323, U+0065 U+0323 U+0301, U+00E9 U+0323 and U+1EB9 U+0301. All four should render as 'ẹ́'. If they do not look the same at your end, you are missing a good font or the software you use does not support that part of Unicode.
When normalised to NFD, all four strings have the same form, so they can be compared and used with the same meaning, as intended.
  UTF-8  0x65 0xCC 0x81 0xCC 0xA3  ->  0x65 0xCC 0xA3 0xCC 0x81
         ẹ́                         ->  ẹ́
         U+0065 U+0301 U+0323      ->  U+0065 U+0323 U+0301

  UTF-8  0xC3 0xA9 0xCC 0xA3       ->  0x65 0xCC 0xA3 0xCC 0x81
         ẹ́                         ->  ẹ́
         U+00E9 U+0323             ->  U+0065 U+0323 U+0301

  UTF-8  0xE1 0xBA 0xB9 0xCC 0x81  ->  0x65 0xCC 0xA3 0xCC 0x81
         ẹ́                         ->  ẹ́
         U+1EB9 U+0301             ->  U+0065 U+0323 U+0301
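The NFD results above can be checked with a short script. This is a minimal sketch using Python's standard `unicodedata` module; the variable names are chosen for this example only. It also prints the canonical combining classes, which determine the reordering: the dot below (class 220) sorts before the acute (class 230).

```python
import unicodedata

# The four canonically equivalent spellings from the example.
nfd_variants = [
    "e\u0301\u0323",   # e + combining acute + combining dot below
    "e\u0323\u0301",   # e + combining dot below + combining acute
    "\u00e9\u0323",    # e-acute + combining dot below
    "\u1eb9\u0301",    # e-dot-below + combining acute
]

# Normalising to NFD collapses all four into one sequence.
nfd_forms = {unicodedata.normalize("NFD", s) for s in nfd_variants}
print(nfd_forms == {"e\u0323\u0301"})  # True

# Canonical combining classes: lower values are ordered first.
print(unicodedata.combining("\u0323"))  # 220
print(unicodedata.combining("\u0301"))  # 230
```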
- NFC (Normalisation Form C, Canonical Composition)
The strings are decomposed, the combining characters are reordered, and the result is then composed again.
With the same example, all the strings are again normalised to a single form, but a different one from NFD.
  UTF-8  0x65 0xCC 0xA3 0xCC 0x81  ->  0xE1 0xBA 0xB9 0xCC 0x81
         ẹ́                         ->  ẹ́
         U+0065 U+0323 U+0301      ->  U+1EB9 U+0301

  UTF-8  0x65 0xCC 0x81 0xCC 0xA3  ->  0xE1 0xBA 0xB9 0xCC 0x81
         ẹ́                         ->  ẹ́
         U+0065 U+0301 U+0323      ->  U+1EB9 U+0301

  UTF-8  0xC3 0xA9 0xCC 0xA3       ->  0xE1 0xBA 0xB9 0xCC 0x81
         ẹ́                         ->  ẹ́
         U+00E9 U+0323             ->  U+1EB9 U+0301
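The NFC results can be checked the same way. Again, this is only a sketch with Python's standard `unicodedata` module, with variable names invented for the example. Note that the composed form still contains a combining acute, because Unicode has no single pre-combined code point for e with both a dot below and an acute.

```python
import unicodedata

# The four canonically equivalent spellings from the example.
nfc_variants = [
    "e\u0301\u0323",   # e + combining acute + combining dot below
    "e\u0323\u0301",   # e + combining dot below + combining acute
    "\u00e9\u0323",    # e-acute + combining dot below
    "\u1eb9\u0301",    # e-dot-below + combining acute
]

# Normalising to NFC collapses all four into e-dot-below + combining acute.
nfc_forms = {unicodedata.normalize("NFC", s) for s in nfc_variants}
print(nfc_forms == {"\u1eb9\u0301"})  # True

# UTF-8 bytes of the NFC form, in the notation used above.
nfc_form = unicodedata.normalize("NFC", nfc_variants[0])
print(" ".join(f"0x{b:02X}" for b in nfc_form.encode("utf-8")))
# 0xE1 0xBA 0xB9 0xCC 0x81
```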