Difference between revisions of "Orthographic normalisation"
(Category:Documentation in English) |
|||
(7 intermediate revisions by one other user not shown) | |||
Line 1: | Line 1: | ||
− | A module to do orthographic normalisation on input streams would be nice. Some are more complicated than others. |
+ | A module to do orthographic normalisation on input streams would be nice. Some normalisations are more complicated than others. Certain forms (for example, Romanian ș with a comma below vs. ş with a cedilla) can be handled with the [[ACX format]], which allows the definition of sets of equivalent characters. |
+ | ;Serbo-Croatian |
||
− | ;Romanian |
||
+ | There are three digraphs in Serbo-Croatian (Latin script) which can each be written either as two characters or as a single character. A decision must be made about which version to use in the dictionaries, and forms written the other way then need to be converted: |
||
− | Romanian has two characters that should be written with ''commas'' but are often (probably over 90% of text "in the wild") written with ''cedillas''. An orthographic normalisation module would convert the legacy version into the new version. |
||
− | * |
+ | * dž (U+0064 U+017E) ←→ ǆ (U+01C6) |
− | * |
+ | * lj (U+006C U+006A) ←→ ǉ (U+01C9) |
+ | * nj (U+006E U+006A) ←→ ǌ (U+01CC) |
||
+ | |||
+ | ;Afrikaans |
||
+ | |||
+ | The indefinite article in Afrikaans is "'n". This can be written in a number of different ways: |
||
+ | |||
+ | *'n U+0027 U+006E |
||
+ | *‘n U+2018 U+006E |
||
+ | *ŉ U+0149 |
||
+ | *’n U+2019 U+006E |
||
+ | |||
+ | Ideally, these should all be normalised to a single form. |
||
;Lingala |
;Lingala |
||
+ | {{see-also|Unicode issues}} |
||
− | Pre-composed vs. composed unicode characters. |
||
+ | When a character has an accent, sometimes there is more than one way of representing it, using either pre-combined (sometimes referred to as ''pre-composed'') or combining characters. These look different when encoded in UTF-8, but the same to the user. |
||
+ | |||
+ | <pre> |
||
+ | |||
+ | UTF-8 0xC3 0xA1 vs. 0x61 0xCC 0x81 |
||
+ | á vs. á |
||
+ | U+00E1 vs. U+0061 U+0301 |
||
+ | </pre> |
||
+ | |||
+ | The best thing to do is probably standardise on one variant for analysis/generation, and then normalise all input coming into the analyser using a transliterator or something similar. |
||
+ | |||
+ | |||
+ | [[Category:Development]] |
||
+ | [[Category:Documentation in English]] |
Latest revision as of 11:43, 24 March 2012
A module to do orthographic normalisation on input streams would be nice. Some normalisations are more complicated than others. Certain forms (for example, Romanian ș with a comma below vs. ş with a cedilla) can be handled with the ACX format, which allows the definition of sets of equivalent characters.
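As a rough illustration of what treating such characters as equivalent amounts to, here is a minimal Python sketch. It is not the ACX format itself; the table name `CEDILLA_TO_COMMA`, the function `normalise_romanian`, and the choice of the comma-below forms as canonical are all assumptions made for this example.

```python
# Hypothetical table: cedilla forms mapped to the comma-below forms.
# Which variant is canonical is an assumption; the dictionaries decide that.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015F": "\u0219",  # ş -> ș
    "\u0163": "\u021B",  # ţ -> ț
    "\u015E": "\u0218",  # Ş -> Ș
    "\u0162": "\u021A",  # Ţ -> Ț
})


def normalise_romanian(text: str) -> str:
    """Replace cedilla variants with their comma-below equivalents."""
    return text.translate(CEDILLA_TO_COMMA)
```

Text that already uses the comma-below forms passes through unchanged, so the function can be applied to all input.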
- Serbo-Croatian
There are three digraphs in Serbo-Croatian (Latin script) which can each be written either as two characters or as a single character. A decision must be made about which version to use in the dictionaries, and forms written the other way then need to be converted:
- dž (U+0064 U+017E) ←→ ǆ (U+01C6)
- lj (U+006C U+006A) ←→ ǉ (U+01C9)
- nj (U+006E U+006A) ←→ ǌ (U+01CC)
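One direction of this conversion could be sketched in Python as follows, assuming the dictionaries standardise on the two-character forms. The names `DIGRAPH_TO_PAIR` and `normalise_digraphs` are made up for this example, and the upper-case and title-case code points are included for completeness.

```python
# Hypothetical table: single-character digraphs -> two-character sequences.
DIGRAPH_TO_PAIR = str.maketrans({
    "\u01C4": "D\u017D",  # Ǆ -> DŽ
    "\u01C5": "D\u017E",  # ǅ -> Dž
    "\u01C6": "d\u017E",  # ǆ -> dž
    "\u01C7": "LJ",       # Ǉ -> LJ
    "\u01C8": "Lj",       # ǈ -> Lj
    "\u01C9": "lj",       # ǉ -> lj
    "\u01CA": "NJ",       # Ǌ -> NJ
    "\u01CB": "Nj",       # ǋ -> Nj
    "\u01CC": "nj",       # ǌ -> nj
})


def normalise_digraphs(text: str) -> str:
    """Expand single-character digraphs into their two-character forms."""
    return text.translate(DIGRAPH_TO_PAIR)
```

Expanding to two characters is a safe mechanical substitution. Going the other way is probably harder, since a sequence such as n followed by j is not necessarily the digraph letter in every word.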
- Afrikaans
The indefinite article in Afrikaans is "'n". This can be written in a number of different ways:
- 'n U+0027 U+006E
- ‘n U+2018 U+006E
- ŉ U+0149
- ’n U+2019 U+006E
Ideally, these should all be normalised to a single form.
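A minimal Python sketch of such a merge is given below. It assumes the canonical form is U+0027 U+006E, which is a choice made only for this example, and the names `ARTICLE_VARIANTS` and `normalise_article` are hypothetical. The pattern only matches when the article stands alone, so an opening quotation mark in front of a word beginning with "n" is left untouched.

```python
import re

# The four spellings listed above, matched only as a standalone token:
# no word character directly before or after.
ARTICLE_VARIANTS = re.compile(
    r"(?<!\w)(?:\u0149|[\u0027\u2018\u2019]n)(?!\w)"
)


def normalise_article(text: str, canonical: str = "'n") -> str:
    """Rewrite every variant of the indefinite article as `canonical`."""
    return ARTICLE_VARIANTS.sub(canonical, text)
```

The `canonical` parameter makes it possible to standardise on a different variant, for example `"\u0149"`, if the dictionaries use that one.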
- Lingala
- See also: Unicode issues
When a character has an accent, there is sometimes more than one way of representing it, using either pre-combined (sometimes referred to as pre-composed) or combining characters. The two representations are different byte sequences in UTF-8, but look the same to the user.
 UTF-8 0xC3 0xA1 vs. 0x61 0xCC 0x81
       á         vs. á
       U+00E1    vs. U+0061 U+0301
The best thing to do is probably standardise on one variant for analysis/generation, and then normalise all input coming into the analyser using a transliterator or something similar.
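One way to do this normalisation is with Unicode normalisation forms. The sketch below uses Python's standard `unicodedata` module and assumes that the pre-composed variant (NFC) is the one chosen as standard; the function name `normalise_input` is hypothetical.

```python
import unicodedata

precomposed = "\u00E1"   # á as a single code point
combining = "a\u0301"    # a followed by a combining acute accent

# The two strings render identically but are different sequences.
assert precomposed != combining
assert precomposed.encode("utf-8") == b"\xc3\xa1"
assert combining.encode("utf-8") == b"a\xcc\x81"


def normalise_input(text: str, form: str = "NFC") -> str:
    """Normalise text to one Unicode form before it reaches the analyser."""
    return unicodedata.normalize(form, text)
```

Passing `form="NFD"` instead would standardise on the combining variant. Either choice works as long as the dictionaries and the input use the same one.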