Difference between revisions of "Orthographic normalisation"

Latest revision as of 11:43, 24 March 2012

A module to do orthographic normalisation on input streams would be nice. Some cases are more complicated than others. Certain forms of orthographic normalisation (for example Romanian ș and ş) can be handled with the ACX format, which allows sets of equivalent characters to be defined.
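
As an illustration of the "sets of equivalent characters" idea, here is a minimal Python sketch. It is not ACX syntax and not part of any existing module; the table name `CEDILLA_TO_COMMA` and the function `normalise_romanian` are hypothetical, and the choice of the comma-below letters as the target form is an assumption.

```python
# Hypothetical equivalence table: cedilla forms -> comma-below forms.
# The direction (cedilla -> comma) is an assumption made for this sketch.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015F": "\u0219",  # ş -> ș
    "\u0163": "\u021B",  # ţ -> ț
    "\u015E": "\u0218",  # Ş -> Ș
    "\u0162": "\u021A",  # Ţ -> Ț
})


def normalise_romanian(text: str) -> str:
    """Replace cedilla variants of s and t with the comma-below variants."""
    return text.translate(CEDILLA_TO_COMMA)
```

Because each mapping is one character to one character, a plain translation table is enough here; the cases below need more than that.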

Serbo-Croatian

Serbo-Croatian has three digraphs that can be written either as two separate characters or as a single character. A decision must be made about which form to use in the dictionaries, and input written in the other form then needs to be converted:

dž ←→ ǆ
lj ←→ ǉ
nj ←→ ǌ
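
The conversion in either direction could be sketched in Python roughly as follows. The names `DIGRAPH_TO_SINGLE`, `to_two_chars` and `to_single_char` are hypothetical, and only the lowercase forms listed above are covered; uppercase and titlecase variants would need extra entries.

```python
# Two-character form -> single-character form, lowercase only.
DIGRAPH_TO_SINGLE = {
    "d\u017E": "\u01C6",  # dž -> ǆ
    "lj": "\u01C9",       # lj -> ǉ
    "nj": "\u01CC",       # nj -> ǌ
}
# Inverse table: single-character form -> two-character form.
SINGLE_TO_DIGRAPH = {single: pair for pair, single in DIGRAPH_TO_SINGLE.items()}


def to_two_chars(text: str) -> str:
    """Expand each single-character digraph into its two-character spelling."""
    for single, pair in SINGLE_TO_DIGRAPH.items():
        text = text.replace(single, pair)
    return text


def to_single_char(text: str) -> str:
    """Collapse each two-character digraph into the single-character form."""
    for pair, single in DIGRAPH_TO_SINGLE.items():
        text = text.replace(pair, single)
    return text
```

Expanding to the two-character form is probably the safer direction. Collapsing is a blind string replacement and would also merge any n + j or l + j sequence that happens not to be a digraph.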

Afrikaans

The indefinite article in Afrikaans is "'n". This can be written in a number of different ways:

'n U+0027 U+006E
‘n U+2018 U+006E
ŉ U+0149
’n U+2019 U+006E

Ideally, all of these should be normalised to a single form.
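
A minimal Python sketch of such a merge is shown below. It assumes the target form is U+0027 U+006E, which is a choice made for this example rather than something the page prescribes; the names `ARTICLE_VARIANTS` and `normalise_afrikaans_article` are hypothetical.

```python
import re

# Matches the article either as the single code point U+0149, or as an
# apostrophe-like character (U+0027, U+2018, U+2019) followed by "n".
# The lookarounds require that it stands alone, not inside a longer word.
ARTICLE_VARIANTS = re.compile(r"(?<!\w)(?:\u0149|['\u2018\u2019]n)(?!\w)")


def normalise_afrikaans_article(text: str) -> str:
    """Rewrite every variant of the indefinite article as U+0027 U+006E."""
    return ARTICLE_VARIANTS.sub("'n", text)
```

The word-boundary checks are there so that an opening quotation mark followed by a word starting with "n" is left alone.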

Lingala

See also: Unicode issues

When a character has an accent, there is sometimes more than one way of representing it: as a single pre-combined character (sometimes referred to as pre-composed), or as a base character followed by a combining character. The two representations are different byte sequences when encoded in UTF-8, but look the same to the user.


UTF-8 0xC3 0xA1 vs. 0x61 0xCC 0x81
      á         vs.      á 
      U+00E1    vs. U+0061 U+0301

The best approach is probably to standardise on one variant for analysis/generation, and then normalise all input coming into the analyser using a transliterator or something similar.
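
One way to do this kind of normalisation, sketched here in Python with the standard `unicodedata` module, is to apply a Unicode normalisation form to all input. Using NFC (the composed form) as the standard variant is an assumption of this sketch, and `normalise_input` is a hypothetical helper name.

```python
import unicodedata

precomposed = "\u00E1"   # á as a single code point
combining = "a\u0301"    # a followed by COMBINING ACUTE ACCENT


def normalise_input(text: str, form: str = "NFC") -> str:
    """Bring text into one Unicode normalisation form (NFC by default)."""
    return unicodedata.normalize(form, text)


# The two spellings differ as strings and as UTF-8 bytes ...
print(precomposed == combining)        # False
print(precomposed.encode("utf-8"))     # b'\xc3\xa1'
print(combining.encode("utf-8"))       # b'a\xcc\x81'
# ... but compare equal once both are normalised to the same form.
print(normalise_input(combining) == normalise_input(precomposed))  # True
```

Not every base letter plus accent has a pre-composed code point. For those combinations NFC leaves the combining sequence in place, so the dictionaries would need to use the same sequence.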

@@ Line 1: / Line 1: @@
-A module to do orthographic normalisation on input streams would be nice. Some are more complicated than others.
+A module to do orthographic normalisation on input streams would be nice. Some are more complicated than others. Certain forms of orthographic normalisation (for example for Romanian ș and ş) can be done with the [[ACX format]] which allows the definition of sets of equivalent characters.
+;Serbo-Croatian
-;Romanian
+There are a couple of special characters in Serbo-Croatian which can be written with two characters or one character, a decision must be made which version to use in the dictionaries, and then forms not like this need to be converted:
-Romanian has two characters that should be written with ''commas'' but are often (probably over 90% of text "in the wild") written with ''cedillas''. An orthographic normalisation module would convert the legacy version into the new version.
-* ţ → ț
+* dž ←→ ǆ
-* ş → ș
+* lj ←→ ǉ
+* nj ←→ ǌ
+;Afrikaans
+The indefinite article in Afrikaans is "'n". This can be written a number of different ways:
+*'n  U+0027 U+006E
+*‘n  U+2018 U+006E
+*ŉ   U+0149
+*’n  U+2019 U+006E
+This ideally needs to be merged into one form.
 ;Lingala
+{{see-also|Unicode issues}}
-Pre-composed vs. composed unicode characters.
+When a character has an accent, sometimes there is more than one way of representing it, using either pre-combined (sometimes referred to as ''pre-composed'') or combining characters. These look different when encoded in UTF-8, but the same to the user.
+<pre>
+UTF-8 0xC3 0xA1 vs. 0x61 0xCC 0x81
+      á         vs.      á
+      U+00E1    vs. U+0061 U+0301
+</pre>
+The best thing to do is probably standardise on one variant for analysis/generation, and then normalise all input coming into the analyser using a transliterator or something similar.
+[[Category:Development]]
+[[Category:Documentation in English]]
