Difference between revisions of "Bases sur les dictionnaires unilingues"

Revision as of 23:32, 17 September 2011

It has been said that the Apertium dictionary format is not intuitive, which is fair enough if you are not used to thinking about dictionaries in a particular way. This page aims to be a basic introduction to how they work and how you can start reading and writing them!

This page assumes that you are comfortable with HTML and XML, that you can tell an element from an attribute, and that you know what character data is. If you want a quick refresher, this should help:

<element attribute="value">character data</element>

If this means nothing to you, you should probably read a little more about XML first.

Introduction

At the top level, the most basic dictionary needs three sections. Step by step, we will define a dictionary that analyses and generates the English word "beer" and its plural form, "beers".

The first section defines the alphabet used by the dictionary. It is fairly self-explanatory and looks something like this:

  <alphabet>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>

The second section defines the grammatical symbols^[1] of the language you are working on. This is usually where people say: hang on, what are grammatical symbols? Well, there are many ways of describing words and the different forms that words can take, so I will assume that you know what a part of speech^[2] is. For example, nouns (house, beer, boat, cat, ...) can be distinguished from adjectives (red, good, transparent, ...) and verbs (eat, multiply, write, ...). They are specified as follows:

  <sdefs>
    <sdef n="noun"/>
    <sdef n="verb"/>
    <sdef n="adjective"/>
  </sdefs>

People often complain about how terse these symbols are, and typically even the values are abbreviated, so noun becomes "n", verb becomes "vb", adjective becomes "adj", and so on (see List of symbols for some common abbreviations). The terseness does have a point, though: when you are writing or copying entries, you want the symbols to stay as short as possible. For example, <sdef> means "symbol definition", and <sdefs> is simply its plural.

After specifying the alphabet and the symbols, we need to specify the words, which is the most important part of the dictionary! Words are held in a section. A dictionary can have more than one section, and there is more than one type of section. We won't go into the details here, but traditionally the largest section is called "main" and has type "standard".
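As a sketch of the abbreviated style, the three definitions above could be written with the short values mentioned in the text. The values "n", "vb" and "adj" are simply the ones quoted above; a real language pair may use different tags, so treat them as assumptions and check that pair's symbol list.

```xml
<sdefs>
  <!-- short forms of "noun", "verb" and "adjective" -->
  <sdef n="n"/>
  <sdef n="vb"/>
  <sdef n="adj"/>
</sdefs>
```

The rest of this page keeps the long names ("noun", "singular", "plural") so that the examples stay readable.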

  <section id="main" type="standard">

  </section>

The next step is to add an entry. This is a little more complicated, so please keep reading...

Entries

The monolingual dictionaries in Apertium are morphological^[3] dictionaries: they hold not only words, but also how those words inflect and what it means when they inflect. In Apertium we use the morphological dictionaries for two tasks:

Analysis: retrieving all of the possible lexical units from the surface form of a word.
Generation: producing the surface form of a word from the lexical unit.

Ok, now to explain lexical unit and surface form. Remember the example of "beer" and "beers"? We know that "beer" is a noun, we also know that it is in the singular, we also know that the only difference between "beer" and "beers" is that "beers" is in the plural. So, summarising this knowledge below, we find the following two facts:

beer — is a singular noun,
beers — is the plural form of the noun "beer".

What we mean by lexical unit is the combination of the lemma^[4], e.g. "beer" and the grammatical symbols. The surface form of a word is the word as you read it.^[5] In Apertium style these would be represented something like the following:

Surface form	Lexical unit
beer	beer<noun><singular>
beers	beer<noun><plural>

In order to convert between these two forms, we need to define them as a pair. Pairs of surface forms and lexical units in Apertium are indicated by the <p> element. This is rather intuitive, so long as you know the abbreviation! These pair elements may contain a "left side" (<l>) and a "right side" (<r>). The left side almost always contains the surface form of the word, while the right side contains the lexical unit. So, our first entry (<e>) might look something like the following:

    <e>
      <p>
        <l>beer</l>
        <r>beer<s n="noun"/><s n="singular"/></r>
      </p>
    </e>

Now, roughly, you need as many of these entries as there are surface forms in the language. The astute among you will have realised that creating entries for all the words in a language is an impossible task. The next section will show how this can be avoided, but in the meantime we have enough information to compile our first dictionary:

<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="noun"/>
    <sdef n="singular"/>
    <sdef n="plural"/>
  </sdefs>

  <section id="main" type="standard">
    <e>
      <p>
        <l>beer</l>
        <r>beer<s n="noun"/><s n="singular"/></r>
      </p>
    </e>
    <e>
      <p>
        <l>beers</l>
        <r>beer<s n="noun"/><s n="plural"/></r>
      </p>
    </e>
  </section>
</dictionary>

The entries above will enable us to retrieve the lexical units for "beer" and "beers", and generate these two surface forms from the same lexical units.

The dictionary is functional, but it is intended for teaching purposes. Actual dictionary files look somewhat different, because defining each word completely separately from other words that follow the same rules is rather inefficient.
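One further point about entries: nothing stops two of them from sharing a surface form, which is how an ambiguous word gets more than one analysis. As a minimal sketch, assuming a "verb" symbol has also been declared in <sdefs> (it is not in the dictionary above), the word "run" could be given one entry per reading:

```xml
<!-- two entries with the same left side, one per analysis -->
<e>
  <p>
    <l>run</l>
    <r>run<s n="noun"/><s n="singular"/></r>
  </p>
</e>
<e>
  <p>
    <l>run</l>
    <r>run<s n="verb"/></r>
  </p>
</e>
```

The verb reading carries no further symbols here purely to keep the sketch short. With both entries in the main section, analysing "run" would be expected to return both lexical units.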

Compilation

See also: lttoolbox

Save this into a file called dictionary.dix, then we'll compile the dictionary into a binary form^[6] using the tool lt-comp. The command takes three arguments: the direction, then the input file, then the output file. The direction is important.

If we specify the direction as "lr" (left → right), we get an analyser, that is, a dictionary that takes surface forms and outputs lexical units. If we specify the reverse ("rl", right → left), we get a generator, which takes lexical units and outputs surface forms. We might as well generate both:

$ lt-comp lr dictionary.dix analyser.bin
main@standard 7 6

$ lt-comp rl dictionary.dix generator.bin
main@standard 7 6

We can now use the dictionary to analyse the noun "beers":

$ echo "beers" | lt-proc analyser.bin
^beers/beer<noun><plural>$

The analysis gives us the surface form, followed by the lexical unit. Say we want to generate the surface form from the lexical unit, we just do:

$ echo "^beer<noun><plural>$" | lt-proc -g generator.bin
beers

Paradigms

So, great, we have a dictionary and we can analyse and generate the two forms of the word "beer". But what happens when we want to add more words, say "school" or "computer"? Well, one thing we could do is just add four more entries in the main section (one for each of "school", "schools", "computer" and "computers"). On the other hand, this would be pretty inefficient. Instead, we can generalise a rule, which in this case is "add -s to make the plural", using a paradigm, which is literally "an example serving as a model or pattern".

In order to define paradigms, we typically take a word that can serve as an example for how other words inflect. In this case, we can say, "the words school and computer inflect like beer".

Paradigms go in a section called <pardefs> (paradigm definitions), below the <sdefs> and above the main section. They are defined in <pardef> (paradigm definition) elements. Each paradigm definition must have an attribute "n", which contains a unique name. This name can be anything, but conventionally takes the form of:

<lemma>__<part of speech>, (e.g. beer__n)

In order to make the lexical units for beer, beers, computer, computers, etc... we need to distinguish between the part of the surface form that doesn't change (the identical part), and the part that does change. In the example already given, it is quite straightforward that the identical part is always the singular form. However, this might not always be the case (e.g. "wolf, wolves" or "tooth, teeth").

You probably guessed already what the paradigm definition is going to look like, so here it is:

    <pardef n="beer__n">
      <e>
        <p>
          <l/>
          <r><s n="noun"/><s n="singular"/></r>
        </p>
      </e>
      <e>
        <p>
          <l>s</l>
          <r><s n="noun"/><s n="plural"/></r>
        </p>
      </e>
    </pardef>

The only thing that has changed between these two entries and the first ones we made is that the identical part has been removed from both sides of the pair.
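When the stem itself changes, the identical part is shorter than the lemma and the paradigm has to carry the rest. The following is a sketch for "wolf, wolves", assuming the identical part is "wol". The paradigm name wol/f__n is illustrative; it follows a common convention of marking the boundary between the identical part and the changing part with a slash.

```xml
<pardef n="wol/f__n">
  <e>
    <p>
      <!-- singular: surface ending "f", lemma ending "f" -->
      <l>f</l>
      <r>f<s n="noun"/><s n="singular"/></r>
    </p>
  </e>
  <e>
    <p>
      <!-- plural: surface ending "ves", lemma ending still "f" -->
      <l>ves</l>
      <r>f<s n="noun"/><s n="plural"/></r>
    </p>
  </e>
</pardef>
```

Note that the right side repeats the "f" so that the lexical unit still spells out the full lemma "wolf" in both forms.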

The paradigm definition goes into its own part of the dictionary, enclosed in <pardefs> tags, for example:

  <pardefs>

    ...  

  </pardefs>

We can see where this fits in with the rest of the dictionary below:

<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>

   ...

  </sdefs>
  <pardefs>

    ...  

  </pardefs>
  <section id="main" type="standard">
    <e lm="beer"><i>beer</i><par n="beer__n"/></e>
    <e lm="school"><i>school</i><par n="beer__n"/></e>
    <e lm="computer"><i>computer</i><par n="beer__n"/></e>
    <e lm="house"><i>house</i><par n="beer__n"/></e>
  </section>
</dictionary>
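For reference, here is the same outline with the elided parts filled in from the fragments shown earlier on this page. Nothing new is introduced; it simply assembles the symbol definitions, the beer__n paradigm and the main section into one file that can be saved as dictionary.dix.

```xml
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="noun"/>
    <sdef n="singular"/>
    <sdef n="plural"/>
  </sdefs>
  <pardefs>
    <pardef n="beer__n">
      <e>
        <p>
          <l/>
          <r><s n="noun"/><s n="singular"/></r>
        </p>
      </e>
      <e>
        <p>
          <l>s</l>
          <r><s n="noun"/><s n="plural"/></r>
        </p>
      </e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="beer"><i>beer</i><par n="beer__n"/></e>
    <e lm="school"><i>school</i><par n="beer__n"/></e>
    <e lm="computer"><i>computer</i><par n="beer__n"/></e>
    <e lm="house"><i>house</i><par n="beer__n"/></e>
  </section>
</dictionary>
```

In each main-section entry, <i> holds the identical part, which appears unchanged on both sides of the pair, <par> names the paradigm that supplies the endings and symbols, and the lm attribute records the lemma. The file compiles with lt-comp in the same way as the first dictionary.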

Notes

[1] In other linguistic literature these are sometimes called "features", or "categories" and "sub-categories".
[2] A part of speech (or lexical category, word class, lexical class, etc.) is a linguistic category of words, generally defined by the syntactic or morphological behaviour of the word in question. Common linguistic categories include noun and verb, among others. There are open word classes, which constantly acquire new members, and closed word classes, which acquire new members rarely, if at all.
[3] A morphological dictionary models the rules that govern the internal structure of words in a language. For example, French speakers realise that the words "chien" and "chiens" are related, that "chiens" is to "chien" as "chats" is to "chat". The rules understood by the speaker reflect specific patterns and regularities in the way in which words are formed from smaller units and how those smaller units interact.
[4] The lemma (or citation form, base form, head word) is the canonical form of a word. It is the form of the word that is typically used in paper dictionaries.
[5] Surface forms can be ambiguous, but lexical units cannot. A surface form may have many analyses, for example "run" can be a verb (They run on weekends), or a noun (I'm going for a run).
[6] See Dictionaries for more complete information on the format.


