Difference between revisions of "Orthographic normalisation"

Revision as of 13:29, 5 October 2007

A module to do orthographic normalisation on input streams would be nice. Some languages need more complicated normalisation than others.

Romanian

Romanian has two letters that should be written with a comma below but are often (probably in over 90% of text "in the wild") written with a cedilla instead. An orthographic normalisation module would convert the legacy cedilla forms into the correct comma-below forms.

ţ → ț
ş → ș
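
The mapping above could be sketched in Python roughly as follows. This is only an illustration, not an existing module: the names `CEDILLA_TO_COMMA` and `normalise_romanian` are hypothetical, and the upper-case pairs are an assumption added for completeness, since only the lower-case letters are listed above.

```python
# Hypothetical translation table: legacy cedilla letters -> comma-below letters.
CEDILLA_TO_COMMA = str.maketrans({
    "\u0163": "\u021B",  # ţ → ț (listed above)
    "\u015F": "\u0219",  # ş → ș (listed above)
    "\u0162": "\u021A",  # Ţ → Ț (assumed upper-case counterpart)
    "\u015E": "\u0218",  # Ş → Ș (assumed upper-case counterpart)
})


def normalise_romanian(text: str) -> str:
    """Replace cedilla forms of t and s with the comma-below forms."""
    return text.translate(CEDILLA_TO_COMMA)


if __name__ == "__main__":
    legacy = "\u0163ar\u0103 \u015Fi ora\u015F"
    print(normalise_romanian(legacy))
```

A table like this should only be applied to text known to be Romanian, because ş with a cedilla is the correct letter in other orthographies such as Turkish.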

Lingala

See also: Unicode issues

When a character has an accent, there is sometimes more than one way of representing it, using either pre-combined (sometimes referred to as pre-composed) or combining characters. The two representations are different byte sequences when encoded in UTF-8, but look the same to the user.


UTF-8 0xC3 0xA1 vs. 0x61 0xCC 0x81
      á         vs.      á 
      U+00E1    vs. U+0061 U+0301

The best thing to do is probably standardise on one variant for analysis/generation, and then normalise all input coming into the analyser using a transliterator or something similar.
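
One possible way to do this, as a minimal sketch, is to use the Unicode normalisation forms from Python's standard `unicodedata` module. The function name `normalise_input` and the choice of NFC (the composed form) as the standard variant are assumptions made for illustration; a transliterator-based approach would work differently.

```python
import unicodedata


def normalise_input(text: str, form: str = "NFC") -> str:
    """Bring text into a single Unicode normalisation form (NFC assumed here)."""
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    precomposed = "\u00E1"   # á as one code point, U+00E1
    combining = "a\u0301"    # a followed by combining acute accent, U+0061 U+0301

    # The two strings differ as code points and as UTF-8 bytes.
    print(precomposed == combining)
    print(precomposed.encode("utf-8"), combining.encode("utf-8"))

    # After normalisation both have the same representation.
    print(normalise_input(precomposed) == normalise_input(combining))
```

Whichever form is chosen, the analyser's dictionaries would need to be stored in the same form. Note that NFC leaves base + combining mark sequences untouched when no pre-composed code point exists for them.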

