Difference between revisions of "Unicode issues"

From Apertium

Revision as of 13:28, 5 October 2007

Some issues (potential and otherwise) with Unicode support.

Combining vs. pre-combined characters

When a character has an accent, there is sometimes more than one way of representing it, using either a pre-combined character or a base character followed by combining characters. These differ at the byte level in UTF-8, but look the same to the user.


UTF-8 0xC3 0xA1 vs. 0x61 0xCC 0x81
      á         vs.      á 
      U+00E1    vs. U+0061 U+0301

The best thing to do is probably to standardise on one variant for analysis and generation, and then normalise all input before it reaches the analyser, using a transliterator or a similar tool.
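
As a minimal sketch of that idea, the following Python snippet uses the standard library's unicodedata module to show that the two spellings of á differ in UTF-8 but become identical once normalised. The helper names utf8_bytes and normalise_input are hypothetical, and the choice of NFC as the default variant is an assumption for illustration only, not something Apertium prescribes.

```python
import unicodedata

def utf8_bytes(text):
    """Render the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join(f"0x{b:02X}" for b in text.encode("utf-8"))

def normalise_input(text, form="NFC"):
    """Hypothetical pre-processing step: map every input string to one
    normalisation form before handing it to the analyser."""
    return unicodedata.normalize(form, text)

precombined = "\u00e1"   # á as a single pre-combined code point
combining = "a\u0301"    # a followed by COMBINING ACUTE ACCENT

print(utf8_bytes(precombined))   # 0xC3 0xA1
print(utf8_bytes(combining))     # 0x61 0xCC 0x81
print(precombined == combining)  # False: the raw strings differ

# After normalisation both spellings are the same string.
print(normalise_input(precombined) == normalise_input(combining))  # True
```

Whichever form is chosen, the point is that the same form is applied to all input, so that the dictionary and the text being analysed agree.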

Unicode normalisation

Unicode defines normalisation algorithms to transform various semantically equivalent strings to a single form.

NFD (Normalisation Form Canonical Decomposition)

Strings are decomposed into base characters and combining characters, and the combining characters are reordered into a canonical order.

Here's an example with 4 equivalent strings: 'ẹ́', 'ẹ́', 'ẹ́' and 'ẹ́'. If they do not look the same on your system, you are missing a suitable font or the software you use does not support that part of Unicode.

When normalised to NFD, all 4 strings take the same form, so they can be compared and used as intended, with the same meaning.

UTF-8 0x65 0xCC 0x81 0xCC 0xA3 -> 0x65 0xCC 0xA3 0xCC 0x81
      ẹ́                        -> ẹ́
      U+0065 U+0301 U+0323     -> U+0065 U+0323 U+0301
UTF-8 0xC3 0xA9 0xCC 0xA3      -> 0x65 0xCC 0xA3 0xCC 0x81
      ẹ́                        -> ẹ́
      U+00E9 U+0323            -> U+0065 U+0323 U+0301
UTF-8 0xE1 0xBA 0xB9 0xCC 0x81 -> 0x65 0xCC 0xA3 0xCC 0x81
      ẹ́                        -> ẹ́
      U+1EB9 U+0301            -> U+0065 U+0323 U+0301
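
The table above can be reproduced with Python's standard unicodedata module. This is only a sketch: the variants list and the code_points helper are names made up for the example.

```python
import unicodedata

# The four equivalent spellings of e with dot below and acute accent.
variants = [
    "e\u0301\u0323",  # e + combining acute + combining dot below
    "\u00e9\u0323",   # é + combining dot below
    "\u1eb9\u0301",   # ẹ + combining acute
    "e\u0323\u0301",  # e + combining dot below + combining acute
]

def code_points(text):
    """Format a string as a sequence of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

for v in variants:
    nfd = unicodedata.normalize("NFD", v)
    print(code_points(v), "->", code_points(nfd))

# Every variant collapses to the same decomposed string.
nfd_forms = {unicodedata.normalize("NFD", v) for v in variants}
print(len(nfd_forms))  # 1
```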

NFC (Normalisation Form Canonical Composition)

Strings are decomposed, the combining characters are reordered, and the result is then recomposed into pre-combined characters where these exist.

With the same example, all 4 strings are again normalised to a single form, but a different one from the NFD result.

UTF-8 0x65 0xCC 0xA3 0xCC 0x81 -> 0xE1 0xBA 0xB9 0xCC 0x81
      ẹ́                        -> ẹ́
      U+0065 U+0323 U+0301     -> U+1EB9 U+0301
UTF-8 0x65 0xCC 0x81 0xCC 0xA3 -> 0xE1 0xBA 0xB9 0xCC 0x81
      ẹ́                        -> ẹ́
      U+0065 U+0301 U+0323     -> U+1EB9 U+0301
UTF-8 0xC3 0xA9 0xCC 0xA3      -> 0xE1 0xBA 0xB9 0xCC 0x81
      ẹ́                        -> ẹ́
      U+00E9 U+0323            -> U+1EB9 U+0301
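
The same check can be sketched for NFC, again with Python's standard unicodedata module (the variable names are illustrative only). Note that the result still contains a combining character, because Unicode has no single pre-combined code point for this particular combination.

```python
import unicodedata

nfd = "e\u0323\u0301"                   # the NFD form from the previous section
nfc = unicodedata.normalize("NFC", nfd)

# e + dot below composes to U+1EB9; the acute accent stays separate.
print([f"U+{ord(ch):04X}" for ch in nfc])
# ['U+1EB9', 'U+0301']
print(" ".join(f"0x{b:02X}" for b in nfc.encode("utf-8")))
# 0xE1 0xBA 0xB9 0xCC 0x81

# The other spellings reach the same NFC string.
others = ["e\u0301\u0323", "\u00e9\u0323", "\u1eb9\u0301"]
all_same = all(unicodedata.normalize("NFC", s) == nfc for s in others)
print(all_same)  # True

# Converting back to NFD recovers the decomposed form.
print(unicodedata.normalize("NFD", nfc) == nfd)  # True
```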

Zero-width non-joiner (ZWNJ)