Morphological dictionary

From Apertium
Revision as of 19:35, 13 January 2009 by Senio

We have been saying that the Apertium dictionary format is a little counter-intuitive, which is true enough if you are not used to thinking about dictionaries in a certain way. This page aims to be a "basic" introduction to how the dictionaries work and to get you ready to read and write them.

This page assumes that you are comfortable looking at HTML and XML, and therefore that you can tell an element from an attribute and know what the content of a tag is. If you need a quick refresher, here is one:

<tag attribute="value">content</tag>

If this doesn't make sense, you should read a little more about XML.
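To make the three terms concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The element and attribute names are purely illustrative and have nothing to do with the Apertium format.

```python
import xml.etree.ElementTree as ET

# Parse a one-element XML fragment (illustrative names only).
element = ET.fromstring('<tag attribute="value">content</tag>')

print(element.tag)     # the element name
print(element.attrib)  # the attributes, as a dictionary
print(element.text)    # the content between the opening and closing tags
```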

Introduction

So, in general terms, the most basic dictionary needs three sections. We will go step by step through defining a dictionary that analyses and generates the English word "beer" and its plural form "beers". The first section defines the alphabet to be used with the dictionary. This is fairly self-explanatory. It looks like this:

  <alphabet>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>

The second section defines the grammatical symbols[1] of the language you are working with. This is where people usually say: but... what are grammatical symbols? Well, there are many ways of describing words and the different forms they can take, so I will also assume that you know what the "parts of speech"[2] are. For example: nouns (house, beer, boat, cat...), which we can tell apart from adjectives (red, good, transparent...) and from verbs (eat, multiply, write...). These categories are specified as follows:

  <sdefs>
    <sdef n="noun"/>
    <sdef n="verb"/>
    <sdef n="adjective"/>
  </sdefs>

People often complain about how short the tag names are (<sdef>, for example), but the values are usually abbreviated as well, so nouns are written "n", verbs "vb", adjectives "adj", and so on (see the list of symbols for the most common abbreviations). The brevity serves a purpose, however: typing these tags in as little time as possible, because a dictionary has a great many entries. For reference, <sdef> stands for "symbol definition", and <sdefs> is simply the same thing in the plural.

Having specified the alphabet and the symbols, we need to specify the most important thing: the words of the dictionary! Words are grouped into a section. A dictionary can have more than one section, and sections can be of different types. We won't go into details here; we will just say that the largest section is called "main" and is of type "standard".

  <section id="main" type="standard">

  </section>

The next step is to add the first entry. This is a little more involved, but not to worry...

Entries

The monolingual dictionaries in Apertium are morphological[3] dictionaries. This means that they hold not only words, but also how those words inflect and what each inflection means. In Apertium we use the morphological dictionaries for two tasks:

  1. Analysis — retrieving all of the possible lexical units from the surface form of a word.
  2. Generation — producing the surface form of a word from the lexical unit.

Ok, now to explain lexical unit and surface form. Remember the example of "beer" and "beers"? We know that "beer" is a noun, we also know that it is in the singular, we also know that the only difference between "beer" and "beers" is that "beers" is in the plural. So, summarising this knowledge below, we find the following two facts:

  1. beer — is a singular noun,
  2. beers — is the plural form of the noun "beer".

What we mean by lexical unit is the combination of the lemma[4], e.g. "beer" and the grammatical symbols. The surface form of a word is the word as you read it.[5] In Apertium style these would be represented something like the following:

  Surface form    Lexical unit
  beer            beer<noun><singular>
  beers           beer<noun><plural>
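The two tasks can be pictured as looking up the same list of pairs in opposite directions. The following is only a sketch of that idea in Python, with hypothetical function names; it is not how lt-comp or lt-proc are implemented.

```python
# The two rows of the table above, as (surface form, lexical unit) pairs.
PAIRS = [
    ("beer", "beer<noun><singular>"),
    ("beers", "beer<noun><plural>"),
]


def analyse(surface):
    """Analysis: return every lexical unit listed for a surface form."""
    return [lu for sf, lu in PAIRS if sf == surface]


def generate(lexical_unit):
    """Generation: return the surface form for a lexical unit, or None."""
    for sf, lu in PAIRS:
        if lu == lexical_unit:
            return sf
    return None


print(analyse("beers"))                # ['beer<noun><plural>']
print(generate("beer<noun><plural>"))  # beers
```

Analysis returns a list because one surface form may have several lexical units, whereas a lexical unit is unambiguous.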

In order to convert between these two forms, we need to define them as a pair. Pairs of surface forms and lexical units in Apertium are indicated by the <p> element. This is rather intuitive, so long as you know the abbreviation! These pair elements may contain a "left side" (<l>) and a "right side" (<r>). The left side almost always contains the surface form of the word, while the right side contains the lexical unit. So, our first entry (<e>) might look something like the following:

    <e>
      <p>
        <l>beer</l>
        <r>beer<s n="noun"/><s n="singular"/></r>
      </p>
    </e>

Now, roughly, you need as many of these entries as there are surface forms in the language. The astute among you will have realised, however, that creating entries for all the words in the language is an impossible task. The next section will show how this can be avoided, but in the meantime we have enough information to compile our first dictionary:

<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="noun"/>
    <sdef n="singular"/>
    <sdef n="plural"/>
  </sdefs>

  <section id="main" type="standard">
    <e>
      <p>
        <l>beer</l>
        <r>beer<s n="noun"/><s n="singular"/></r>
      </p>
    </e>
    <e>
      <p>
        <l>beers</l>
        <r>beer<s n="noun"/><s n="plural"/></r>
      </p>
    </e>
  </section>
</dictionary>

The entries above will enable us to retrieve the lexical units for "beer" and "beers", and generate these two surface forms from the same lexical units.

The dictionary is functional, but it is intended for teaching purposes. Actual dictionary files look somewhat different, because defining each word completely separately from other words that follow the same rules is rather inefficient.
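Before compiling, it can help to see the dictionary as plain data. The sketch below reads the same dictionary with Python's standard xml.etree.ElementTree module, lists the pairs it defines, and checks that every symbol used in an entry is declared in <sdefs>. The helper name side_to_text is hypothetical, and the check is just a sanity test written for this page, not part of the Apertium tools.

```python
import xml.etree.ElementTree as ET

# The teaching dictionary from above, with entries written on one line each.
DIX = """<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="noun"/>
    <sdef n="singular"/>
    <sdef n="plural"/>
  </sdefs>
  <section id="main" type="standard">
    <e><p><l>beer</l><r>beer<s n="noun"/><s n="singular"/></r></p></e>
    <e><p><l>beers</l><r>beer<s n="noun"/><s n="plural"/></r></p></e>
  </section>
</dictionary>"""


def side_to_text(side):
    """Flatten an <l> or <r> element, writing each <s n="x"/> as <x>."""
    text = side.text or ""
    for child in side:
        if child.tag == "s":
            text += "<" + child.get("n") + ">"
        text += child.tail or ""
    return text


root = ET.fromstring(DIX)
declared = {sdef.get("n") for sdef in root.iter("sdef")}
used = {s.get("n") for s in root.iter("s")}
pairs = [(side_to_text(p.find("l")), side_to_text(p.find("r")))
         for p in root.iter("p")]

print(pairs)
print(sorted(used - declared))  # symbols used but never declared
```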

Compilation

See also: lttoolbox

Save this into a file called dictionary.dix; we will then compile the dictionary into a binary form[6] using the tool lt-comp. The command takes three arguments: the direction, the input file and the output file. The "direction" option is important.

If we specify the direction as "lr" (left → right), we get an analyser, that is, a dictionary that takes surface forms and outputs lexical units. If we specify the reverse ("rl", right → left), we get a generator, which takes lexical units and outputs surface forms. We might as well generate both:

$ lt-comp lr dictionary.dix analyser.bin
main@standard 7 6

$ lt-comp rl dictionary.dix generator.bin
main@standard 7 6

We can now use the dictionary to analyse the noun "beers":

$ echo "beers" | lt-proc analyser.bin
^beers/beer<noun><plural>$

The analysis gives us the surface form, followed by the lexical unit. To generate the surface form from the lexical unit, we just do:

$ echo "^beer<noun><plural>$" | lt-proc -g generator.bin
beers
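If you want to post-process the analyser output shown above, the unit between "^" and "$" can be split apart with a few lines of Python. This is a minimal sketch with a hypothetical function name. It assumes that when a surface form has several analyses they are separated by further "/" characters, and it ignores escaped characters, neither of which is shown on this page.

```python
def parse_analysis(unit):
    """Split '^surface/lu1/lu2$' into (surface, [lu1, lu2])."""
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("expected a unit of the form ^...$")
    # Assumption: every "/" separates the surface form or one analysis.
    surface, *analyses = unit[1:-1].split("/")
    return surface, analyses


print(parse_analysis("^beers/beer<noun><plural>$"))
# ('beers', ['beer<noun><plural>'])
```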

Paradigms

So, great, we have a dictionary and we can analyse and generate the two forms of the word "beer". But what happens when we want to add more words, say "school" or "computer"? Well, one thing we could do is just add four more entries in the main section (one for each of "school", "schools", "computer" and "computers"). On the other hand, this would be pretty inefficient. Instead, we can generalise a rule, which in this case is "add -s to make the plural", using a paradigm, which is literally "an example serving as a model or pattern".

In order to define paradigms, we typically take a word that can serve as an example for how other words inflect. In this case, we can say, "the words school and computer inflect like beer".

Paradigms go in a section called <pardefs> (paradigm definitions), below the <sdefs> and above the main section. They are defined in <pardef> (paradigm definition) elements. Each paradigm definition must have an attribute "n", which contains a unique name. This name can be anything, but conventionally takes the form of:

<lemma>__<part of speech>, (e.g. beer__n)

In order to make the lexical units for beer, beers, computer, computers, etc... we need to distinguish between the part of the surface form that doesn't change (the identical part), and the part that does change. In the example already given, it is quite straightforward that the identical part is always the singular form. However, this might not always be the case (e.g. "wolf, wolves" or "tooth, teeth").
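One simple way to find a candidate for the identical part is to take the longest common prefix of the forms. The sketch below does this with os.path.commonprefix from the Python standard library, which compares strings character by character. The function name split_forms is hypothetical, and this is only a heuristic: in practice the dictionary author chooses where to split.

```python
import os.path


def split_forms(forms):
    """Return (identical part, [ending of each form])."""
    stem = os.path.commonprefix(forms)
    return stem, [form[len(stem):] for form in forms]


print(split_forms(["beer", "beers"]))    # ('beer', ['', 's'])
print(split_forms(["wolf", "wolves"]))   # ('wol', ['f', 'ves'])
print(split_forms(["tooth", "teeth"]))   # ('t', ['ooth', 'eeth'])
```

For "beer" the identical part is the whole singular form, but for "wolf" and "tooth" it is shorter than the singular, so the endings have to carry more of the word.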

You probably guessed already what the paradigm definition is going to look like, so here it is:

    <pardef n="beer__n">
      <e>
        <p>
          <l/>
          <r><s n="noun"/><s n="singular"/></r>
        </p>
      </e>
      <e>
        <p>
          <l>s</l>
          <r><s n="noun"/><s n="plural"/></r>
        </p>
      </e>
    </pardef>

The only thing that has changed between these two entries and the first ones we made is that the identical part has been removed from both sides of the pair.

The paradigm definition goes into its own part of the dictionary, enclosed in <pardefs> tags, for example:

  <pardefs>

    ...  

  </pardefs>

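We can see where this fits in with the rest of the dictionary below. Each entry in the main section now gives the lemma in the "lm" attribute, the identical part in an <i> element, and the name of the paradigm to apply in a <par> element: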

<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>

   ...

  </sdefs>
  <pardefs>

    ...  

  </pardefs>
  <section id="main" type="standard">
    <e lm="beer"><i>beer</i><par n="beer__n"/></e>
    <e lm="school"><i>school</i><par n="beer__n"/></e>
    <e lm="computer"><i>computer</i><par n="beer__n"/></e>
    <e lm="house"><i>house</i><par n="beer__n"/></e>
  </section>
</dictionary>
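To see what the paradigm buys us, the sketch below expands the four main-section entries by hand in Python: each identical part is combined with every ending of the beer__n paradigm. The data structures and the function name expand are hypothetical, chosen for this illustration, and say nothing about how lt-comp builds its binary form.

```python
# Endings from the beer__n paradigm: (surface ending, symbols added).
PARADIGMS = {
    "beer__n": [("", "<noun><singular>"), ("s", "<noun><plural>")],
}

# Main-section entries: (identical part from <i>, paradigm name from <par>).
ENTRIES = [
    ("beer", "beer__n"),
    ("school", "beer__n"),
    ("computer", "beer__n"),
    ("house", "beer__n"),
]


def expand(entries, paradigms):
    """Return every (surface form, lexical unit) pair the entries define."""
    pairs = []
    for stem, paradigm_name in entries:
        for ending, symbols in paradigms[paradigm_name]:
            # The identical part is prepended to both sides of the pair.
            pairs.append((stem + ending, stem + symbols))
    return pairs


pairs = expand(ENTRIES, PARADIGMS)
for surface, lexical_unit in pairs:
    print(surface, lexical_unit)
```

Four one-line entries plus one paradigm yield eight pairs, where the first dictionary needed one full entry per surface form.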

Notes

  1. In other linguistics texts these are often referred to as "classes" or "categories" and "sub-classes".
  2. A part of speech (or lexical category, word class, lexical class, etc.) is a linguistic category of words, generally defined by the morphological or syntactic behaviour of the word in question. Common linguistic categories include verbs and nouns, among others. There are also open word classes, which readily acquire new members, and closed word classes, which acquire new members very rarely.
  3. A morphological dictionary models the rules that govern the internal structure of words in a language. For example, speakers of English realise that the words "dog" and "dogs" are related, that "dogs" is to "dog" as "cats" is to "cat". The rules understood by the speaker reflect specific patterns and regularities in the way in which words are formed from smaller units and how those smaller units interact.
  4. The lemma (or citation form, base form, head word) is the canonical form of a word. It is the form of the word that is typically used in paper dictionaries.
  5. Surface forms can be ambiguous, but lexical units cannot. A surface form may have many analyses, for example "run" can be a verb (They run on weekends), or a noun (I'm going for a run).
  6. See Dictionaries for more complete information on the format.