Difference between revisions of "Unicode issues"

From Apertium
Latest revision as of 11:46, 24 March 2012

Some issues (potential and otherwise) with Unicode support.

Combining vs. pre-combined characters

When a character has an accent, there is sometimes more than one way of representing it: as a single pre-combined character, or as a base character followed by a combining character. The two are different byte sequences in UTF-8, but look the same to the user.


UTF-8 0xC3 0xA1 vs. 0x61 0xCC 0x81
      á         vs. á
      U+00E1    vs. U+0061 U+0301

The best thing to do is probably standardise on one variant for analysis/generation, and then normalise all input coming into the analyser using a transliterator or something similar.
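
As a minimal sketch of the normalise-on-input idea, the Python snippet below uses the standard library's `unicodedata.normalize`. The helper name `normalise_input` and the choice of NFC as the default are assumptions made for illustration; this page does not prescribe a particular tool or form.

```python
import unicodedata

# The two representations of "á" from the example above.
precombined = "\u00e1"   # U+00E1, UTF-8 0xC3 0xA1
combining = "a\u0301"    # U+0061 U+0301, UTF-8 0x61 0xCC 0x81


def normalise_input(text: str, form: str = "NFC") -> str:
    """Hypothetical helper: bring incoming text to one agreed form
    before it reaches the analyser. NFC is an assumed default."""
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    # Different bytes on disk, and the raw strings do not compare equal.
    print(precombined.encode("utf-8"), combining.encode("utf-8"))
    print(precombined == combining)
    # After normalisation both spellings are the same string.
    print(normalise_input(precombined) == normalise_input(combining))
```

Whichever form is chosen, the dictionaries used for analysis and generation would need to be stored in that same form for lookups to match.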

Unicode normalisation

Unicode defines normalisation algorithms to transform various semantically equivalent strings to a single form.

NFD (Normalisation Form Canonical Decomposition)

Strings are fully decomposed, and the combining characters are then put into a canonical order.

Here is an example with four equivalent strings: 'ẹ́' (U+0065 U+0301 U+0323), 'ẹ́' (U+00E9 U+0323), 'ẹ́' (U+1EB9 U+0301) and 'ẹ́' (U+0065 U+0323 U+0301). If they do not look the same at your end, you are missing a good font or the software you use does not support that part of Unicode.

When normalised to NFD, all four strings end up as the same sequence of code points, so they compare as equal and carry the same meaning.

UTF-8 0x65 0xCC 0x81 0xCC 0xA3 -> 0x65 0xCC 0xA3 0xCC 0x81
      ẹ́                        -> ẹ́
      U+0065 U+0301 U+0323     -> U+0065 U+0323 U+0301
UTF-8 0xC3 0xA9 0xCC 0xA3      -> 0x65 0xCC 0xA3 0xCC 0x81
      ẹ́                        -> ẹ́
      U+00E9 U+0323            -> U+0065 U+0323 U+0301
UTF-8 0xE1 0xBA 0xB9 0xCC 0x81 -> 0x65 0xCC 0xA3 0xCC 0x81
      ẹ́                        -> ẹ́
      U+1EB9 U+0301            -> U+0065 U+0323 U+0301
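
The table above can be checked with a short Python sketch, assuming the standard library's `unicodedata` module. The helper `code_points` is a hypothetical name introduced only to print strings in the U+XXXX notation used on this page.

```python
import unicodedata


def code_points(text: str) -> str:
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# The three non-normalised spellings from the table above.
variants = [
    "e\u0301\u0323",   # e, combining acute, combining dot below
    "\u00e9\u0323",    # pre-combined é, combining dot below
    "\u1eb9\u0301",    # pre-combined ẹ, combining acute
]

# NFD decomposes each string and reorders the combining marks.
nfd_forms = [unicodedata.normalize("NFD", v) for v in variants]

if __name__ == "__main__":
    for before, after in zip(variants, nfd_forms):
        print(f"{code_points(before):<24} -> {code_points(after)}")
```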
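
NFC (Normalisation Form Canonical Composition)

Strings are decomposed, the combining characters are reordered, and the result is then recomposed wherever a pre-combined character exists.

With the same example, all the strings again normalise to a single form, but a different one from NFD: the pre-combined ẹ (U+1EB9) followed by a combining acute accent (U+0301).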

UTF-8 0x65 0xCC 0xA3 0xCC 0x81 -> 0xE1 0xBA 0xB9 0xCC 0x81
      ẹ́                        -> ẹ́
      U+0065 U+0323 U+0301     -> U+1EB9 U+0301
UTF-8 0x65 0xCC 0x81 0xCC 0xA3 -> 0xE1 0xBA 0xB9 0xCC 0x81
      ẹ́                        -> ẹ́
      U+0065 U+0301 U+0323     -> U+1EB9 U+0301
UTF-8 0xC3 0xA9 0xCC 0xA3      -> 0xE1 0xBA 0xB9 0xCC 0x81
      ẹ́                        -> ẹ́
      U+00E9 U+0323            -> U+1EB9 U+0301
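
A comparable sketch for NFC, again assuming Python's standard `unicodedata` module, shows that the three spellings collapse to one string under either form, but that the NFC and NFD results differ from each other. The variable names are illustrative only.

```python
import unicodedata

# The three spellings from the NFC table above.
variants = [
    "e\u0323\u0301",   # e, combining dot below, combining acute
    "e\u0301\u0323",   # e, combining acute, combining dot below
    "\u00e9\u0323",    # pre-combined é, combining dot below
]

# Collect the distinct results of each normalisation form.
nfc_forms = {unicodedata.normalize("NFC", v) for v in variants}
nfd_forms = {unicodedata.normalize("NFD", v) for v in variants}

if __name__ == "__main__":
    for form in nfc_forms:
        print("NFC:", [f"U+{ord(ch):04X}" for ch in form], form.encode("utf-8"))
    for form in nfd_forms:
        print("NFD:", [f"U+{ord(ch):04X}" for ch in form], form.encode("utf-8"))
```

Because the two forms are not interchangeable byte for byte, input and dictionaries would have to agree on one of them.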

Zero-width non-joiner (ZWNJ)

See also

Orthographic normalisation