Difference between revisions of "Monodix basics"

From Apertium

Revision as of 18:45, 6 December 2007

We've been told that the Apertium format for dictionaries is rather counter-intuitive, which is fair enough if you're not used to thinking of dictionaries in a particular way. This page hopes to be a basic introduction to how they work and how you can get started reading them, and hopefully writing them!

This page assumes you are comfortable with HTML and XML: that you can tell an element from an attribute and know what character data is. If you want a quick recap, this should help:

<element attribute="value">character data</element>

If that doesn't make any sense, you should probably read up some more on XML.

Introduction

At the top level, the most basic dictionary needs three sections. We're going to define, step by step, a dictionary that will analyse and generate the English word "beer" and its plural form, "beers". The first section defines the alphabet that is used with the dictionary. This is fairly self-explanatory and will look something like:

  <alphabet>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>

The second section defines the grammatical symbols[1] of the language you are working with. This is normally where people say, "Hang on... what are grammatical symbols?" Well, they are ways of describing words and the different forms that words can take. I assume you know what the parts of speech[2] are, for example nouns (house, beer, boat, cat, ...), and that you can distinguish them from adjectives (red, good, transparent, ...) and verbs (eat, multiply, write, ...). The way we specify these is as follows:

  <sdefs>
    <sdef n="noun"/>
    <sdef n="verb"/>
    <sdef n="adjective"/>
  </sdefs>

People often complain about the brevity of the tags, and typically even the values are abbreviated, so noun becomes "n", verb becomes "vb" and adjective becomes "adj". The brevity serves a purpose, however: when you're writing or copying entries, you want the tags to get in the way as little as possible. For reference, <sdef> means "symbol definition", and <sdefs> is simply the plural.
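
With the abbreviated values mentioned above, the same three symbol definitions might be written as follows. This is only a sketch: the exact abbreviations may differ from one dictionary to another.

```xml
<sdefs>
  <sdef n="n"/>   <!-- noun -->
  <sdef n="vb"/>  <!-- verb -->
  <sdef n="adj"/> <!-- adjective -->
</sdefs>
```

The rest of this page keeps the long names ("noun", "verb", "adjective") for readability.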

After we've specified the alphabet and symbols, we need to specify the actual words, the important part of the dictionary! To hold the words we use a section. There can be more than one section in a dictionary, and there is more than one type of section. We will not go into the details here, but traditionally the largest section is called "main" and is of the "standard" type.

  <section id="main" type="standard">

  </section>
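
Putting the three parts together, a minimal dictionary file might be laid out as in the sketch below. The XML declaration and the <dictionary> root element are not shown elsewhere on this page; they are assumed here as the wrapper around the three parts. The section is still empty at this point.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a minimal dictionary file. The <dictionary> root element
     is an assumption of this example, not taken from the text above. -->
<dictionary>
  <!-- 1. The alphabet used by the dictionary -->
  <alphabet>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>

  <!-- 2. The grammatical symbols -->
  <sdefs>
    <sdef n="noun"/>
    <sdef n="verb"/>
    <sdef n="adjective"/>
  </sdefs>

  <!-- 3. The section that will hold the entries -->
  <section id="main" type="standard">
  </section>
</dictionary>
```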

The next step is to add an entry. This is slightly more involved, so please, read on...

Entries

The monolingual dictionaries in Apertium are morphological[3] dictionaries. This means that they hold not only words, but also how those words inflect and what each inflection means. Within the dictionary we can distinguish two main processes:

  1. Analysis — retrieving a lexical unit from the surface form of a word.
  2. Generation — producing the surface form of a word from the lexical unit.

OK, now to explain "lexical unit" and "surface form". Remember the example of "beer" and "beers"? We know that "beer" is a noun and that it is in the singular. We also know that the only difference between "beer" and "beers" is that "beers" is in the plural. Summarising this knowledge, we have the following two facts:

  1. beer — is a singular noun,
  2. beers — is the plural form of the noun "beer".

What we mean by lexical unit is the combination of the lemma[4] (e.g. "beer") and the grammatical symbols. In Apertium style these would be represented something like the following:

  Surface form    Lexical unit
  beer            beer<noun><singular>
  beers           beer<noun><plural>

Pairs of surface forms and lexical units in Apertium are indicated by the <p> element. This is rather intuitive, so long as you know the abbreviation! These pair elements may contain a "left side" (<l>) and a "right side" (<r>). The left side almost always contains the surface form of the word, while the right side contains the lexical unit. So, our first entry (<e>) might look something like the following:

    <e>
      <p>
        <l>beer</l>
        <r>beer<s n="noun"/><s n="singular"/></r>
      </p>
    </e>
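
By the same pattern, an entry for the plural form "beers" could be sketched as follows. It assumes that symbols named "singular" and "plural" have been added to <sdefs> next to "noun"; those two definitions do not appear in the symbol list shown earlier, so they are included here as an assumption.

```xml
<!-- Assumed additions to the symbol definitions, not in the earlier list:
       <sdef n="singular"/>
       <sdef n="plural"/>
-->
<section id="main" type="standard">
  <!-- Surface form "beer" paired with the lexical unit beer<noun><singular> -->
  <e>
    <p>
      <l>beer</l>
      <r>beer<s n="noun"/><s n="singular"/></r>
    </p>
  </e>
  <!-- Surface form "beers" paired with the lexical unit beer<noun><plural> -->
  <e>
    <p>
      <l>beers</l>
      <r>beer<s n="noun"/><s n="plural"/></r>
    </p>
  </e>
</section>
```

Both entries share the lemma "beer" on the right side; only the surface form on the left and the number symbol differ.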

Notes

  1. In other linguistic literature these are sometimes referred to as "features", or "categories" and "sub-categories".
  2. A part of speech (or lexical category, word class, lexical class, etc.) is a linguistic category of words, which is generally defined by the syntactic or morphological behaviour of the word in question. Common linguistic categories include noun and verb, among others. There are open word classes, which constantly acquire new members, and closed word classes, which acquire new members infrequently if at all.
  3. A morphological dictionary models the rules that govern the internal structure of words in a language. For example, speakers of English realise that the words "dog" and "dogs" are related, that "dogs" is to "dog" as "cats" is to "cat". The rules understood by the speaker reflect specific patterns and regularities in the way in which words are formed from smaller units and how those smaller units interact.
  4. The lemma (or citation form, base form, head word) is the canonical form of a word. It is the form of the word that is typically used in paper dictionaries.