Talk:Morphological dictionary

From Apertium
Jump to navigation Jump to search

Introduction

One of the ? that there are to develop in the design and implmentation of lexical processing systems is the construction of efficient lexical processors from linguistic data.

In particular, the lexical processors that are described here have been used for lexical transforms ? like morphological analysis, morphological generation and word-for-word translation of lexical forms.

The morphological analysis of a word is getting from its superficial (surface) form, all of the lexical forms (meaning the lemma and morphological attributes) that are given in a dictionary.

The morphological generation is the reverse process, from a lexical form, generating the surface form. Word for word translation of lexical forms consists of making correspondances between a lexical form in one language, and a lexical form in another language. This final operation of crucial for constructing machine translation systems.

Words may have one (as in the case of invariant words) or more forms. Variations in words receive various names depending on their nature. They can be derivational, which is when a word combines with another or with morphemes that modify its sense (e.g. president and vicepresident etc.); inflections, which are grammatical modifications which occur in nouns, adjectives and verbs in Indo-European like languages (e.g. go, goes, etc.); agglutination which is affixes which are added together that affect the whole phrase from a grammatical point of view, this is found in languages like Turkish and Basque (e.g. urdin, urdina, urdinarena, that in Spanish correspond to azul, el azul, el del azul respectively); or every other type of orthographic variation that can occur in every language.

The regularities observed in the processing can be grouped, for convienience, in the construction of morphological dictionaries (as much for analysis as generation), to avoid having to write all the forms for each word. From the point of view of the management of the dictionaries, it is interesting to store the inflection of words in inflectional paradigms identified by a side and the lemmas that inflect for another. This allows us to add a word by giving the lemma and choosing from previously defined inflectional paradigms, or defining a new paradigm for adding further words with the same inflection. On the other hand if an error is identified in one of the inflectional paradigms, it is only necessary to correct it in one place.

In the same way, some derivational mechanism can be treated in a similar manner, but only when they are systematic in some lemmas: for example, the formation of superlatives from adjectives in languages like Catalan or Spanish, the composition of certain lemmas with determined prefixes (like ex-, vice-, or suffixes etc.) and other cases that can be treated in the same way as the inflection for these phenomenona can benefit from the same advantages as in the previous case.

In this paper we denominate the grouping of transformation rules between parts of words -- to manage the phenomena which have been explained -- like definition of paradigms, without reducing ourselves to exclusively treating inflection.

The format of the dictionaries is defined in a specification which uses XML, for the interoperability, as much for the advantanges which are presented by explicit relations between elements, because it allows us to express the encoding of characters of all the data in an explicit way, and also for the large amount of effective tools which exist for processing and transforming data in XML format.

Finally we see that it is possible to exploit the division of entries in the dictionary between lemma and paradigm to effectively construct minimised letter transducers. These minimised letter transducers are designed for the efficient processing of natural language. In Garrido et al. (1999), a compiler is presented with these characteristics, but that didn't completely take advantage of the factorisation that is permitted by the paradigms to increase the speed of construction. In Daciuk et al. (2002), Carrasco and Forcada (2002), and Garrido-Alenda et al. (2002) methods of incrementally constructing minimised letter transducers are presented as an alternative model to that which is presented in this article.

XML Format for the dictionaries

A format based on XML has been designed to store the dictionary information. The DTD (document type definition, one of the ways of specifying an XML format) of this format includes sections for specifying the characters that are considered part of the alphabet -- in this sense, that can form part of a word-- to define the symbols which have morphological sense, definition of paradigms and identification of regular expressions ? like numbers or internet addresses. At the moment to send this article, we do not include any reference to this DTD because it is still under development.

Figure 1 shows an example of the definition of a paradigm and its use in the dictionary. Each paradigm has entries (<e> elements), and in this case, every entry consists of a pair (<p>) which a left part (<l>) and a right part (<r>). Between these elements, text or morphological symbols (<s>) can be included. The entries of the dictionary are defined in the same way, the tag for identity <i> is an abbreviated form of a pair where the left side and the right side are identical. The paradigm of a the word is expressed in this case by the paradigm reference <par>.

It is possible to define cyclic paradigms by only indicating between an attribute of the paradigm. It is also possible to notice that all the paradigms can be defined as cyclic, only excepting those which do not accept an empty string, ya que se puede dar el caso de que the output will be infinite for a given entry (loop without consuming the entry). Detecting whether a paradigm has been incorrectly defined cyclically is a job for the compiler which constructs the letter transducer.

Obtaining paradigms

The lexical forms which correspond with the surface forms of the entries in these dictionaries are composed of the lemma and an ordered list of morphological tags. The first of the tags that is specified is treated as the part of speech tag, while the rest of the tags are called the lexical subcategory tags.

Paradigms that are used for constructing the dictionaries that are like those which are presented in this paper can be obtained by the following procedures:

  • Manually. A linguist decides how to form the paradigms to unify all the surface forms and their corresponding lexical forms. This can be necessary for the convienience of the linguist.
  • Automatically. A program can calculate the paradigms-suffix unifying all the entries which have the same lemma and the same part-of-speech tag in a single paradigm definition. A similar form can be produced with paradigms-prefix or following every other criteria.
  • Automatically and manually. Sometimes it is necessary to combine both previously given techniques for get the desired results.

Construction of letter transducers

Preliminary definitions

In this article, we denote Failed to parse (syntax error): {\displaystyle ∊} as the empty string, and as an empty symbol.

We define two alphabets, Σ, or the input alphabet and Γ, or the output alphabet. We call a string transduction as a pair (s : t) so that s ∈ Σ* is the input string and t ∈ Γ* is the output string. For this relation with an empty string, we distinguish the transduction ( : ) of null transduction. the transductions of the form ( : s) or insertions and the transductions of the form (s: ) or deletions. The null transduction is a special case of insertion or deletion. Transductions may be concatentated, ( : ) · (x : y) = (sx : ty).

Paradigms

We call paradigm (and denote with P) a set of transforms without any restriction on the type of their content. We say that the paradigm is cyclable if it doesn't include any insertions.

The paradigms can be concatentated with transforms o between in the following way:


Paradigms considered as letter transducers have an initial state iP and a set of final states Fp. When we talk of the final state of a paradigm, we refer to a unique state fP which can be created in every moment such that all the qFP are unified by means of null transitions (θ : θ).

We define the output (of a paradigm) as the concatentation of paradigms and transforms in every proportion or order that defines a sub-set of transforms of a given paradigm.

Dictionaries

A dictionary of transforms is defined (for example to incorporate linguistic knowledge) using paradigms which describe inflection. A dictionary of transforms is presented as a set D = (E,P,PC) which which:

  • E is the set of the inputs of the dictionary of transforms. The set of inputs of the dictionary can be seen as a big paradigm, which the restriction that it cannot contain any insertions. In the dictionary the paradigms serve to represent the existing linguistic knowledge in the truths observed by the lexical processor.
  • P is a set of the definitions of the content of the paradigms which is used in the input of the dictionary and in other paradigms.
  • PC is a sub-set of P, of cyclic paradigms. The restriction is given that it is only possible to define cyclic paradigms from paradigms that can be cyclic.

A dictionary D can be represented by means of a letter transducer that has as its input alphabet ∑ ⋃ {θ} and as its output alphabet Γ ⋃ {θ}; L = (∑ ⋃ {θ}) X (Γ ⋃ {θ}) is defined. The set of states of the transducer is Q, the initial state is qIQ, the set of final states FQ and the transfer function δ : Q X L → 2Q, and therefore it is indeterministic with respect of the input as respect of the new alphabet L.

Construction of minimised letter transducers

From a string transform, a sequence of letter transforms, S(s : t) of longitude N = max(|s|,|t|) can be constructed which is defined as it follows string element 1 ≤ iN:

This is to emphasise that, by construction, it is ensured that no (s : t) can exist which are equal to ( : ), this is crucial for the consistency of the method of construction that can subsequently be seen.

The method of construction uses two procedures, the procedure of assembly which is seen in equation 4, and that of minimisation, which is produced by a conventional finite state automata minimisation algorithm (van de Snepscheur, 1993) which consists of invertir, determinizar, volver a invertir y volver a determinizar, taking as the alphabet of the automata that there is minimised the Cartesian product L and for null transition (θ : θ).

In figure 2, a simple example of assembly is provided. Transform by transform is introduced joined in equation 4 and the transducer in the form of prefix acceptor, or trie, that is to say the way in that there is only one node for every common prefix of the set of transforms which form the dictionary. With the suffixes of the transforms (which are not separated), new states are created.

At the point in which the reference to a paradigm is made, a copy of the paradigm is created and is connected to the dictionary input that is inserting in the transducer by means of a null transform (θ : θ).

Every one of the paradigms, in as much as can be seen with small dictionaries are constructed by this same procedure and their time and they have been minimised by reducing the size of the problem in the construction of a large dictionary. In figures 3 and 4, the state of the paradigms used in figure 3 is shown after the minimisation.

With cyclic paradigms (see figures 5 and 6) a similar form is operated. In these examples it is shown how to construct a paradigm in the dictionary and with a letter transducer. The paradigm constructed can be used in input of the dictionary with certain nouns in the Basque language which end in a consonant.

Avanzar construction of letter transducers by means of paradigms

It is a fact that in the languages of Europe, modifications of words take place at the beginning or the end of the word. This fact can be exploited to improve the speed of construction of the minimised transducer.