Difference between revisions of "Orthographic normalisation"

From Apertium
(New page: A module to do orthographic normalisation on input streams would be nice. For example ;Romanian * ţ → ț * ş → ș ;Lingala)
 

Revision as of 21:02, 24 September 2007

A module to do orthographic normalisation on input streams would be nice. Some normalisations are more complicated than others.

Romanian

Romanian has two letters, ș and ț, that should be written with a comma below but are often (probably over 90% of text "in the wild") written with a cedilla instead (ş and ţ). An orthographic normalisation module would convert the legacy cedilla forms into the correct comma-below forms:

  • ţ → ț
  • ş → ș
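
The mapping above could be sketched in Python as a plain character translation table. This is only an illustration, not an existing Apertium module: the names `CEDILLA_TO_COMMA` and `normalise_romanian` are hypothetical, and the upper-case pairs are an assumption added for completeness, since the list above gives only the lower-case letters.

```python
# Hypothetical sketch: map legacy cedilla letters to comma-below letters.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş (s with cedilla) -> ș (s with comma below)
    "\u0163": "\u021b",  # ţ (t with cedilla) -> ț (t with comma below)
    # Assumption: upper-case forms should be normalised the same way.
    "\u015e": "\u0218",  # Ş -> Ș
    "\u0162": "\u021a",  # Ţ -> Ț
})


def normalise_romanian(text: str) -> str:
    """Replace cedilla forms of s and t with their comma-below forms."""
    return text.translate(CEDILLA_TO_COMMA)


if __name__ == "__main__":
    print(normalise_romanian("\u015fi \u0163ar\u0103"))
```

The table only covers the pre-composed cedilla letters. Text that encodes the cedilla as a separate combining character (s or t followed by U+0327) would pass through unchanged and would need an extra step.
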
Lingala

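Pre-composed vs. decomposed Unicode characters: the same accented letter can be encoded either as a single code point or as a base letter followed by one or more combining diacritics.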
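
One way to make the two encodings comparable is Unicode normalisation, which Python exposes through the standard `unicodedata` module. The sketch below is a guess at what such a step might look like; the function name `normalise_composition` is hypothetical, and the choice of NFC (composed) as the target form is an assumption, since the page does not say which form is wanted.

```python
import unicodedata


def normalise_composition(text: str, form: str = "NFC") -> str:
    """Return text in one Unicode normalisation form.

    NFC composes base letter + combining mark into a single code point
    where one exists; NFD does the reverse. NFC as default is an assumption.
    """
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    decomposed = "e\u0301"        # e followed by combining acute accent
    composed = normalise_composition(decomposed)
    print(len(decomposed), len(composed))

    # Open e (U+025B) with an acute accent has no pre-composed code point,
    # so it remains a two-code-point sequence even after NFC.
    open_e = normalise_composition("\u025b\u0301")
    print(len(open_e))
```

Letters without a pre-composed code point stay as sequences under NFC, so a module relying on composition alone cannot assume one code point per letter. Normalising to NFD instead would give a uniform decomposed representation, at the cost of longer strings.
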