Difference between revisions of "Orthographic normalisation"

From Apertium
Latest revision as of 11:43, 24 March 2012

A module to do orthographic normalisation on input streams would be nice. Some kinds of normalisation are more complicated than others. Certain forms (for example, Romanian ș with a comma below and ş with a cedilla) can be handled with the ACX format, which allows sets of equivalent characters to be defined.
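
As an illustration of the "sets of equivalent characters" idea, here is a minimal Python sketch. It is not ACX itself and not part of Apertium; the names `EQUIVALENTS`, `CANONICAL` and `normalise_romanian` are hypothetical, and the choice of the comma-below letters as the canonical form is an assumption.

```python
# Hypothetical equivalence sets. The first character of each set is assumed
# to be the form used in the dictionaries; the rest are accepted variants.
EQUIVALENTS = [
    "\u0219\u015F",  # ș (comma below), ş (cedilla)
    "\u021B\u0163",  # ț (comma below), ţ (cedilla)
    "\u0218\u015E",  # Ș, Ş
    "\u021A\u0162",  # Ț, Ţ
]

# Map every variant (by code point) to the canonical first member of its set.
CANONICAL = {
    ord(variant): chars[0]
    for chars in EQUIVALENTS
    for variant in chars[1:]
}


def normalise_romanian(text: str) -> str:
    """Replace cedilla variants with the assumed canonical comma-below forms."""
    return text.translate(CANONICAL)
```

Text that already uses the canonical forms passes through unchanged, so the function can be applied to all input regardless of its source.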

Serbo-Croatian

A few letters in Serbo-Croatian can be written either as a single character or as two separate characters. A decision must be made about which version to use in the dictionaries, and input that uses the other form then needs to be converted:

  • ǆ (U+01C6) ←→ dž (U+0064 U+017E)
  • ǉ (U+01C9) ←→ lj (U+006C U+006A)
  • ǌ (U+01CC) ←→ nj (U+006E U+006A)
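
A conversion like this could be sketched as follows in Python, assuming (hypothetically) that the dictionaries use the two-character forms, so that the single-character digraphs are expanded on input. The names `DIGRAPHS` and `expand_digraphs` are made up for this example; the upper-case and title-case code points are included for completeness.

```python
# Single-character digraphs mapped to their two-character spellings.
DIGRAPHS = {
    "\u01C4": "D\u017D",  # Ǆ -> DŽ
    "\u01C5": "D\u017E",  # ǅ -> Dž
    "\u01C6": "d\u017E",  # ǆ -> dž
    "\u01C7": "LJ",       # Ǉ
    "\u01C8": "Lj",       # ǈ
    "\u01C9": "lj",       # ǉ
    "\u01CA": "NJ",       # Ǌ
    "\u01CB": "Nj",       # ǋ
    "\u01CC": "nj",       # ǌ
}

# str.maketrans accepts a dict whose keys are single characters.
_TABLE = str.maketrans(DIGRAPHS)


def expand_digraphs(text: str) -> str:
    """Rewrite single-character digraphs as two separate characters."""
    return text.translate(_TABLE)
```

Going in this direction is a plain character-for-string substitution. The reverse direction would need more care, because a sequence such as n followed by j is not necessarily a digraph in every word.
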
Afrikaans

The indefinite article in Afrikaans is "'n". It can be written in a number of different ways:

  • 'n U+0027 U+006E
  • ‘n U+2018 U+006E
  • ŉ U+0149
  • ’n U+2019 U+006E

Ideally, these should all be merged into one form.
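
One possible way to merge the variants, sketched in Python. The target form (U+0027 U+006E) is an assumption made for the example, as are the names `TARGET` and `normalise_article`. The pattern only matches when the article stands alone, so an opening quotation mark followed by a word beginning with "n" is left untouched.

```python
import re

# Assumed canonical form: apostrophe (U+0027) followed by n (U+006E).
TARGET = "'n"

# Any of the listed spellings, not directly preceded or followed by a
# word character, so that only the stand-alone article is matched.
_VARIANTS = re.compile(r"(?<!\w)(?:[\u0027\u2018\u2019]n|\u0149)(?!\w)")


def normalise_article(text: str) -> str:
    """Rewrite every spelling of the Afrikaans article as TARGET."""
    return _VARIANTS.sub(TARGET, text)
```

For example, `normalise_article("Dit is \u2019n boek")` gives `"Dit is 'n boek"`, while `"\u2018nee\u2019"` is returned unchanged.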

Lingala
See also: Unicode issues

When a character has an accent, there is sometimes more than one way of representing it: as a single pre-combined (also called pre-composed) character, or as a base character followed by a combining character. The two look the same to the user, but they are encoded as different byte sequences in UTF-8.


UTF-8 0xC3 0xA1 vs. 0x61 0xCC 0x81
      á         vs. á
      U+00E1    vs. U+0061 U+0301

The best thing to do is probably to standardise on one variant for analysis/generation, and then normalise all input coming into the analyser using a transliterator or something similar.
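
As a rough illustration, Python's standard `unicodedata` module can convert between the two representations. The sketch below assumes the pre-combined form (Unicode normalisation form NFC) is the variant chosen for the dictionaries; `normalise_input` is a hypothetical helper name, not an existing Apertium component.

```python
import unicodedata


def normalise_input(text: str, form: str = "NFC") -> str:
    """Normalise text to one Unicode form before it reaches the analyser."""
    return unicodedata.normalize(form, text)


precomposed = "\u00E1"  # á as a single pre-combined character
combining = "a\u0301"   # a followed by a combining acute accent

print(precomposed.encode("utf-8"))                # b'\xc3\xa1'
print(combining.encode("utf-8"))                  # b'a\xcc\x81'
print(precomposed == combining)                   # False
print(normalise_input(combining) == precomposed)  # True
```

If the dictionaries used combining characters instead, passing `form="NFD"` would give the decomposed variant.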