Language codes

From Apertium

You may need to identify a language code. For example, you may wish to create a dictionary for a language that is not yet in the Apertium repositories. If you invent a new language, or identify a new dialect, you may need to create a language code. You may also wish to develop a language pair whose name needs updating.


The short answer

The Apertium project attempts to use ISO 639-3 three-letter codes for languages. Some pairs still use two-letter codes. These codes should be considered legacy. They will eventually be moved to three-letter codes.

The complete list of codes under this standard is at SIL 639-3 codes.


The long answer

ISO 639-3 attempts to cover all known languages, living or dead, spoken or written. The list has 7000 or more entries. The entries are for individual languages, not language groups. If the language you wish to work on is an established, recognised language, the list will probably have an entry. For example, there is an entry for the historical language 'Middle High German (ca. 1050-1500)'.

Despite the scholarship behind the list, you may wish to develop a language for Apertium which has no entry. Languages gain an entry on the list if there is scholarly recognition and a body of literature. The dialect of the city of London called 'Cockney Rhyming Slang' is a variation on English, mainly substituting nouns. Cockney Rhyming Slang is well-known, but it is (or was) only spoken and has no original literature. So it has no ISO 639-3 code.

To create a code

Can the language reasonably be regarded as a variation on a well-known language? If so, you can use the internet document RFC 5646 (https://tools.ietf.org/html/rfc5646). This allows you to state how the language varies from the parent.

The original document is very long, so here is a simple version. Nowadays, the standard is leaning towards using ISO 639-3 codes to identify the language. You then add extra tags, separated by hyphens, to identify the variant. Tags that are not relevant can be dropped. No tag should be over eight characters long. Here are the tags which will interest an Apertium developer,

langcode-script-region-variant

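The tags appear in that order. A longer example,

zh-cmn-Hans-CN
(Chinese, Mandarin, Simplified script, as used in China; here 'cmn' is an extended language subtag that follows the primary language 'zh')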

Note that the region is upper-case. The standard is case-insensitive, but by convention the region is capitalised. This leads to the common, well-known form (here using the older, shorter language codes),

en_US
English, US variation

Please note that the hyphen in the original documents is often converted to an underscore for use on machines. Apertium expects an underscore.
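
As a minimal sketch of that conversion, the function below replaces the hyphens of an RFC 5646 style tag with underscores. The function name `to_underscore_form` is hypothetical and is not part of any Apertium tool; it only illustrates the separator change described above.

```python
def to_underscore_form(tag: str) -> str:
    """Return the tag with every hyphen replaced by an underscore.

    Hypothetical helper for illustration only. It does not validate
    the tag or change its case.
    """
    return tag.replace("-", "_")


if __name__ == "__main__":
    print(to_underscore_form("en-US"))
    print(to_underscore_form("zh-cmn-Hans-CN"))
```

The sketch leaves the case of each subtag untouched, since the standard is case-insensitive and the capitalised region is only a convention.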

Some examples

A Cockney Rhyming Slang dictionary could be named as,

eng_cockney

If, on some long weekend, you wished to translate the language of the English writer Lewis Carroll into (Unicode) shorthand,

eng_shrthand_carroll

An invented language with very little parentage, such as Klingon, can be represented by the ISO 639-3 special code 'mis' (uncoded languages, formerly labelled 'miscellaneous'). Klingon now has its own code, 'tlh'. Before it gained that code, an Apertium dictionary might have been called,

mis_klingon

How to represent complex codes in a build

How do you represent these codes when naming Apertium folders? The usual method is,

apertium-swe-dan

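For codes with subtags, the method appears to be to join the subtags to the language code with underscores, keeping the hyphen as the separator between the two languages,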

apertium-eng_cockney-eng_shrthand_carroll
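
The naming scheme can be illustrated by splitting a folder name back into its parts. This is a sketch under the assumption that hyphens only separate the `apertium` prefix and the two languages, while underscores separate a language code from its subtags. The helper name `split_pair_name` is hypothetical and not an Apertium API.

```python
from typing import List, Tuple


def split_pair_name(folder: str) -> Tuple[List[str], List[str]]:
    """Split 'apertium-<lang1>-<lang2>' into two lists of code parts.

    Assumes hyphens separate the prefix and the two languages, and
    underscores separate a language code from its subtags.
    """
    pieces = folder.split("-")
    if len(pieces) != 3 or pieces[0] != "apertium":
        raise ValueError(f"not an Apertium pair folder name: {folder!r}")
    _, first, second = pieces
    return first.split("_"), second.split("_")


if __name__ == "__main__":
    print(split_pair_name("apertium-swe-dan"))
    print(split_pair_name("apertium-eng_cockney-eng_shrthand_carroll"))
```

Because the hyphen is reserved for separating the two languages, a subtag containing a hyphen would make the folder name ambiguous, which is one reason to prefer the underscore inside a code.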

Summary

For use with Apertium tools, dictionary names should now start with a three-letter code (matching the regex [a-zA-Z]{3}). Variation 'subtags' can be added. Each subtag can be at most eight characters long. The language code and subtags must be separated by underscores.
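
The rules in this summary can be checked with a single regular expression. The sketch below is an illustration, not an official Apertium validator. It assumes that subtags consist of ASCII letters and digits, which the page does not state, and the function name `is_valid_code` is hypothetical.

```python
import re

# Three letters, then zero or more underscore-separated subtags of
# one to eight characters. The subtag alphabet (ASCII letters and
# digits) is an assumption made for this sketch.
CODE_PATTERN = re.compile(r"[a-zA-Z]{3}(?:_[a-zA-Z0-9]{1,8})*")


def is_valid_code(code: str) -> bool:
    """Return True if the whole string follows the summary rules."""
    return CODE_PATTERN.fullmatch(code) is not None


if __name__ == "__main__":
    for candidate in ["eng_cockney", "eng_shrthand_carroll", "en_US", "eng-cockney"]:
        print(candidate, is_valid_code(candidate))
```

Legacy two-letter codes such as en_US are rejected by this pattern on purpose, since the summary asks for a three-letter code at the start.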