We've been told that the Apertium format for dictionaries is rather counterintuitive, which is fair enough if you're not used to thinking of dictionaries in a particular way. This page hopes to be a basic introduction to how they work and how you can get started reading and writing them!
This page assumes you are comfortable with HTML and XML, and assumes you can distinguish an element from an attribute and can recognise character data. If you want a quick recap, this should help:
- <element attribute="value">character data</element>
If that doesn't make sense, you should probably read up some more on XML.
On a global level, the most basic dictionary needs three sections. We're going to define, step by step, a dictionary that will analyse and generate the English word "beer" and its plural form, "beers". The first section defines the alphabet that is used with the dictionary. This is fairly self-explanatory; it will look something like:
<alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
The second section defines the grammatical symbols of the language you are working with. This is normally where people say: "Hang on. What are grammatical symbols?" Well, they're pretty much ways of describing words, and the different forms that words can take. I assume you know what the parts of speech are — for example, nouns (house, beer, boat, cat, ...) — and that you can distinguish them from adjectives (red, good, transparent, ...) and verbs (eat, multiply, write, ...). The way we specify these is as follows:
<sdefs>
  <sdef n="noun"/>
  <sdef n="verb"/>
  <sdef n="adjective"/>
</sdefs>
People often complain about the brevity of the tags, and typically even the values are abbreviated, so noun becomes "n", verb becomes "vb", adjective becomes "adj", etc. (see list of symbols for some common abbreviations). The brevity serves a purpose, however: when you're writing or copying, you want the tags to get in the way as little as possible. For reference, <sdef> means "symbol definition", and <sdefs> is simply this in the plural.
After we've specified the alphabet and symbols, we need to specify the actual words — the important part of the dictionary. To hold the words we use a section. There can be more than one section in a dictionary, and there is more than one type of section. We will not go into the details here, but traditionally, the largest section is called "main" and is of the "standard" type.
<section id="main" type="standard"> </section>
The next step is to add an entry. This is slightly more involved, so please read on.
The monolingual dictionaries in Apertium are morphological. This means that they not only hold words but also the ways that they inflect and what it means when they inflect. In Apertium we use the morphological dictionaries for two tasks:
- Analysis — retrieving all of the possible lexical units from the surface form of a word.
- Generation — producing the surface form of a word from the lexical unit.
Okay, now to explain lexical unit and surface form. Remember the example of "beer" and "beers"? We know that "beer" is a noun; we know that it is in the singular; we also know that the only difference between "beer" and "beers" is that "beers" is in the plural. Summarising this knowledge below, we find the following two facts:
- beer — is a singular noun;
- beers — is the plural form of the noun "beer".
What we mean by lexical unit is the combination of the lemma, e.g. "beer", and the grammatical symbols. The surface form of a word is the word as you read it. In Apertium style these would be represented something like the following:
Surface form    Lexical unit
beer            beer<noun><singular>
beers           beer<noun><plural>
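As a rough illustration of the table above, here is a small Python sketch that builds the lexical unit string from a lemma and a list of grammatical symbols. The function name lexical_unit and the forms list are made up for this example and are not part of Apertium.

```python
def lexical_unit(lemma, symbols):
    """Render a lemma plus grammatical symbols as lemma<sym1><sym2>..."""
    return lemma + "".join(f"<{s}>" for s in symbols)

# (surface form, lemma, symbols) for the two forms discussed above.
# This list is illustrative data, not an Apertium file format.
forms = [
    ("beer", "beer", ["noun", "singular"]),
    ("beers", "beer", ["noun", "plural"]),
]

for surface, lemma, symbols in forms:
    print(surface, "->", lexical_unit(lemma, symbols))
```

Note that both surface forms share the lemma "beer"; only the symbols differ.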
In order to convert between these two forms, we need to define them as a pair. Pairs of surface forms and lexical units in Apertium are indicated by the <p> element. This is rather intuitive, so long as you know the abbreviation. These pair elements may contain a "left side" (<l>) and a "right side" (<r>). The left side almost always contains the surface form of the word, while the right side contains the lexical unit. Our first entry (<e>) might look something like the following:
<e>
  <p>
    <l>beer</l>
    <r>beer<s n="noun"/><s n="singular"/></r>
  </p>
</e>
Now, roughly, you need as many of these entries as there are surface forms in the language; however, the astute among you will have realised that creating entries for all the words in the language is an impossible task. The next section will show how this can be avoided, but in the meantime we now have enough information to compile our first dictionary:
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="noun"/>
    <sdef n="singular"/>
    <sdef n="plural"/>
  </sdefs>
  <section id="main" type="standard">
    <e>
      <p>
        <l>beer</l>
        <r>beer<s n="noun"/><s n="singular"/></r>
      </p>
    </e>
    <e>
      <p>
        <l>beers</l>
        <r>beer<s n="noun"/><s n="plural"/></r>
      </p>
    </e>
  </section>
</dictionary>
The entries above will enable us to retrieve the lexical units for "beer" and "beers", and to generate these two surface forms from the same lexical units.
The dictionary is functional but is intended for teaching purposes; actual dictionary files look somewhat different, because defining each word completely separately from other words which follow the same rules is rather inefficient.
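To make the two tasks a little more concrete, here is a minimal Python sketch that treats the two entries as plain (left, right) pairs and reads them in both directions. This is only a model of the idea: the names PAIRS, build_analyser and build_generator are invented for this example, and the real tools work on a compiled binary rather than on Python dictionaries.

```python
# Each pair is (left side, right side) = (surface form, lexical unit),
# copied by hand from the two <e> entries above.
PAIRS = [
    ("beer", "beer<noun><singular>"),
    ("beers", "beer<noun><plural>"),
]

def build_analyser(pairs):
    """Left to right: surface form -> list of lexical units."""
    table = {}
    for surface, lexical_unit in pairs:
        # A surface form can have several analyses, so keep a list.
        table.setdefault(surface, []).append(lexical_unit)
    return table

def build_generator(pairs):
    """Right to left: lexical unit -> surface form."""
    return {lexical_unit: surface for surface, lexical_unit in pairs}

analyser = build_analyser(PAIRS)
generator = build_generator(PAIRS)

print(analyser["beers"])
print(generator["beer<noun><plural>"])
```

The analyser keeps a list per surface form because, as noted at the end of this page, one surface form may have more than one analysis.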
- See also: lttoolbox
Save this into a file called dictionary.dix, then we'll compile the dictionary into a binary form using the tool lt-comp. The command takes three arguments: the first is the "direction", followed by the input file and the output file. The "direction" option is important.
If we specify the direction as "lr" (left → right), we get an analyser (that is, a dictionary that takes surface forms and outputs lexical units). If we specify the reverse ("rl", right → left), we get a generator, which takes lexical units and outputs surface forms. We might as well generate both:
$ lt-comp lr dictionary.dix analyser.bin
main@standard 7 6
$ lt-comp rl dictionary.dix generator.bin
main@standard 7 6
We can now use the dictionary to analyse the noun "beers":
$ echo "beers" | lt-proc analyser.bin
^beers/beer<noun><plural>$
The analysis gives us the surface form, followed by the lexical unit. If we want to generate the surface form from the lexical unit, we just do:
$ echo "^beer<noun><plural>$" | lt-proc -g generator.bin
beers
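If you want to pick apart the analyser's output in a script, something like the following Python sketch may help. It assumes a single token of the shape shown above: a "^", the surface form, one or more analyses separated by "/", and a closing "$". The function name parse_analysis is made up for this example, and the sketch does not handle escaped characters or unknown words.

```python
import re

def parse_analysis(token):
    """Split one '^surface/analysis$' token into surface form and analyses.

    Returns (surface, [(lemma, [symbols...]), ...]).
    """
    if not (token.startswith("^") and token.endswith("$")):
        raise ValueError("expected a token delimited by ^ and $")
    # The first field is the surface form; the rest are analyses.
    surface, *analyses = token[1:-1].split("/")
    parsed = []
    for analysis in analyses:
        lemma = analysis.split("<", 1)[0]              # text before the first symbol
        symbols = re.findall(r"<([^<>]+)>", analysis)  # names inside <...>
        parsed.append((lemma, symbols))
    return surface, parsed

print(parse_analysis("^beers/beer<noun><plural>$"))
```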
Great! We have a dictionary, and we can analyse and generate the two forms of the word "beer". But what happens when we want to add more words, say "school" or "computer"? Well, one thing we could do is add four more entries in the main section (one for each of "school", "schools", "computer" and "computers"). This would be pretty inefficient, though. Instead, we can generalise a rule, which in this case is "add -s to make the plural", using a paradigm, which is literally "an example serving as a model or pattern".
In order to define paradigms, we typically take a word that can serve as an example for how other words inflect. In this case, we can say, "the words school and computer inflect like beer".
Paradigms go in a section called <pardefs> (paradigm definitions), below the <sdefs> and above the main section. They are defined in <pardef> (paradigm definition) elements. Each paradigm definition must have an attribute "n", which contains a unique name. This name can be anything, but it conventionally takes the form of:
<lemma>__<part of speech>, (e.g.
In order to make the lexical units for beer, beers, computer, computers, etc., we need to distinguish between the part of the surface form that doesn't change (the identical part), and the part that does change. In the example already given, it is quite straightforward that the identical part is always the singular form. However, this might not always be the case (e.g. "wolf, wolves" or "tooth, teeth").
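One way to see where the identical part ends is to take the longest common prefix of the forms. The Python sketch below does this with os.path.commonprefix from the standard library, which compares strings character by character. The helper name split_forms is invented for this example, and a common prefix is only a starting point for deciding how to divide a word, not a rule that Apertium applies for you.

```python
import os.path

def split_forms(forms):
    """Return (identical part, [changing part of each form])."""
    stem = os.path.commonprefix(forms)
    return stem, [form[len(stem):] for form in forms]

for forms in (["beer", "beers"], ["wolf", "wolves"], ["tooth", "teeth"]):
    print(forms, "->", split_forms(forms))
```

For "beer, beers" the identical part is the whole singular form, while for "wolf, wolves" it is only "wol", and for "tooth, teeth" it shrinks to "t".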
You probably guessed already what the paradigm definition is going to look like, so here it is:
<pardef n="beer__n">
  <e>
    <p>
      <l/>
      <r><s n="noun"/><s n="singular"/></r>
    </p>
  </e>
  <e>
    <p>
      <l>s</l>
      <r><s n="noun"/><s n="plural"/></r>
    </p>
  </e>
</pardef>
The only thing that has changed between these two entries and the first ones we made is that the identical part has been removed from both sides of the pair.
The paradigm definition goes into its own part of the dictionary, enclosed in <pardefs> tags; for example:
<pardefs> ... </pardefs>
We can see where this fits in with the rest of the dictionary below:
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    ...
  </sdefs>
  <pardefs>
    ...
  </pardefs>
  <section id="main" type="standard">
    <e lm="beer"><i>beer</i><par n="beer__n"/></e>
    <e lm="school"><i>school</i><par n="beer__n"/></e>
    <e lm="computer"><i>computer</i><par n="beer__n"/></e>
    <e lm="house"><i>house</i><par n="beer__n"/></e>
  </section>
</dictionary>
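To check your understanding of how the entries and the paradigm combine, here is a Python sketch that reads a complete version of this dictionary and lists every pair it describes. The XML string fills in the <sdefs> and <pardefs> parts with the definitions shown earlier on this page. The helper names flatten and expand_entries are made up for this example. The sketch assumes that every entry has exactly one <i> and one <par>, and that entries are written on one line so no stray whitespace ends up inside them. It uses only Python's standard xml.etree.ElementTree module and is not how lt-comp itself works.

```python
import xml.etree.ElementTree as ET

DIX = """<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="noun"/>
    <sdef n="singular"/>
    <sdef n="plural"/>
  </sdefs>
  <pardefs>
    <pardef n="beer__n">
      <e><p><l/><r><s n="noun"/><s n="singular"/></r></p></e>
      <e><p><l>s</l><r><s n="noun"/><s n="plural"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="beer"><i>beer</i><par n="beer__n"/></e>
    <e lm="school"><i>school</i><par n="beer__n"/></e>
    <e lm="computer"><i>computer</i><par n="beer__n"/></e>
    <e lm="house"><i>house</i><par n="beer__n"/></e>
  </section>
</dictionary>"""

def flatten(node):
    """Literal text of an element, with each <s n="x"/> rendered as <x>."""
    if node is None:
        return ""
    out = node.text or ""
    for child in node:
        if child.tag == "s":
            out += "<" + child.get("n") + ">"
        out += child.tail or ""
    return out

def expand_entries(dix):
    """Return every (surface form, lexical unit) pair the dictionary describes."""
    root = ET.fromstring(dix)
    # Collect each paradigm as a list of (left, right) endings.
    paradigms = {}
    for pardef in root.iter("pardef"):
        paradigms[pardef.get("n")] = [
            (flatten(p.find("l")), flatten(p.find("r"))) for p in pardef.iter("p")
        ]
    pairs = []
    for entry in root.find("section").findall("e"):
        stem = flatten(entry.find("i"))  # identical part, used on both sides
        for left, right in paradigms[entry.find("par").get("n")]:
            pairs.append((stem + left, stem + right))
    return pairs

for surface, lexical_unit in expand_entries(DIX):
    print(surface, lexical_unit)
```

Four one-line entries plus one paradigm stand in for eight separate surface forms, which is the saving the paradigm was introduced for.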
- Building dictionaries
- Contributing to an existing pair
- See Alphabet for how the alphabet affects blanks and tokenisation of unknown words.
- In other linguistic literature these are sometimes referred to as "features", or "categories" and "sub-categories".
- A part of speech (or lexical category, word class, lexical class, etc.) is a linguistic category of words, which is generally defined by the syntactic or morphological behaviour of the word in question. Common linguistic categories include noun and verb, among others. There are open word classes, which constantly acquire new members, and closed word classes, which acquire new members only infrequently, if at all.
- A morphological dictionary models the rules that govern the internal structure of words in a language. For example, speakers of English realise that the words "dog" and "dogs" are related, that "dogs" is to "dog" as "cats" is to "cat". The rules understood by the speaker reflect specific patterns and regularities in the way in which words are formed from smaller units and how those smaller units interact.
- The lemma (or citation form, base form, head word) is the canonical form of a word. It is the form of the word that is typically used in paper dictionaries.
- Surface forms can be ambiguous, but lexical units cannot. A surface form may have many analyses; for example, "run" can be a verb (They run on weekends) or a noun (I'm going for a run).
- See Dictionaries for more complete information on the format