Talk:Morphological dictionary

From Apertium
Jump to navigation Jump to search

Construction and efficient minimisation of letter transducers from dictionaries with paradigms

Introduction

In the design and implementation of lexical processing systems, one of the tasks is the construction of efficient lexical processors from linguistic data.

The lexical processors that are described here have been used for lexical transformation, for example morphological analysis and generation, along with word-for-word translation of lexical forms.

Morphological analysis of a word involves retrieving from a dictionary the lexical form (the lemma and morphological attributes) of the word from the surface (superficial) form.

The reverse process if morphological generation. This involves generating the surface form of a word from a lexical form. Word-for-word translation of lexical forms involves drawing correspondences between a lexical form in one language, and the translation in another. This final operation is crucial in constructing machine translation systems.

Words may have one form (as in the case of invariant words), or more forms. The variation of words receives various names depending on their nature. For example, they may be derivational, which is when a word combines with another word, or with morphemes which then modify its sense (e.g. president and vicepresident etc.); inflectional, which are grammatical modifications that occur in nouns, adjectives and verbs in languages like those in the Indo-European family (e.g. go, goes, etc.); agglutinative, which is when affixes are added together and affect the whole grammatical phrase. Agglutination is found in languages like Turkish and Basque (e.g. urdin, urdina, urdinarena, that in Spanish correspond to azul, el azul, el del azul respectively); or every other type of orthographic variation that can occur in a language.

For convienience, the regularities observed in processing these phenomena can be grouped in the construction of morphological dictionaries in order to avoid having to write all the forms of each word. This can be as useful for analysis as for generation. From the viewpoint of dictionary management, it is useful to store word inflections as pairs of lemmata, and inflection paradigms. This allows us to add a word by simply giving the lemma, and then choosing from a previously defined inflectional paradigms. If a suitable paradigm does not exist, it is possible to define a new one. A further benefit is that if an error is identified in one of the inflectional paradigms, it is only necessary to correct it in one place.

In the same way, some derivational mechanism can be treated in a similar manner, but only when they are systematic in some lemmas: for example, the formation of superlatives from adjectives in languages like Catalan or Spanish, the composition of certain lemmas with determined prefixes (like ex-, vice-, or suffixes etc.) and other similar cases can benefit from the same advantages as described previously.

In this paper we denominate the grouping of transformation rules between parts of words —to manage the phenomena which have been explained— like definition of paradigms, without reducing ourselves to treating inflection exclusively.

The format of the dictionaries is defined in a specification which uses XML, for the interoperability, as much for the advantanges which are presented by explicit relations between elements, because it allows us to express the encoding of characters of all the data in an explicit way, and also for the large amount of effective tools which exist for processing and transforming data in XML format.

Finally we see that it is possible to exploit the division of entries in the dictionary between lemma and paradigm to effectively construct minimised letter transducers. These minimised letter transducers are designed for the efficient processing of natural language. In Garrido et al. (1999), a compiler is presented with these characteristics, but that didn't completely take advantage of the factorisation that is permitted by the paradigms to increase the speed of construction. In Daciuk et al. (2002), Carrasco and Forcada (2002), and Garrido-Alenda et al. (2002) methods of incrementally constructing minimised letter transducers are presented as an alternative model to that which is presented in this article.

XML Format for the dictionaries

An XML based format has been designed to store the dictionary information. The DTD (document type definition, one of the ways of specifying an XML format) of this format includes: sections for specifying the characters that are considered part of the alphabet —that is those characters which can form part of a word— sections to define the symbols which have morphological meaning, sections for the definition of paradigms and identification of regular expressions such as numbers or internet addresses. We do not include any reference to this DTD because it is still under development.

Figure 1. shows an example of the definition of a paradigm and its use in the dictionary. Each paradigm has entries (<e> elements), and in this example, every all of the entries consist of a pair (<p>) which has a left part (<l>) and a right part (<r>). Between these elements, text or morphological symbols (<s>) can be included. The entries of the dictionary are defined in the same way, the tag for identity <i> is an abbreviated form of a pair where the left side and the right side are identical. The paradigm of a the word is expressed in this example by the paradigm reference <par>.

It is also possible to see that not all paradigms can be defined as cyclic, but only those that do not accept an empty string, given that there could be a case where the output for a given entry will be infinite (loop without consuming all the input). Detecting whether a paradigm has been incorrectly defined cyclically is a job for the compiler which constructs the letter transducer.

Obtaining paradigms

In these dictionaries, the lexical forms which correspond with the surface forms of the entries are made up of a lemma, and an ordered list of morphological tags. The first of the specified tags is treated as the part of speech tag, while the rest of the tags are referred to as lexical subcategory tags.

Paradigms used for constructing the dictionaries like those which are presented in this paper can be obtained by the following methods:

  • Manually. A linguist decides how to form the paradigms in order to unify all the surface forms and their corresponding lexical forms. This can be necessary for the convenience of the linguist.
  • Automatically. A program can calculate the suffix paradigms unifying all the entries which have the same lemma and the same part-of-speech tag in a single paradigm definition. A similar form can be produced with prefix paradigms or following every other criteria.
  • Automatically and manually. It is sometimes necessary to combine both techniques in order to get the desired results.

Construction of letter transducers

Preliminary definitions

In this article, we denote Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \varepsilon\,} as the empty string, and Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \theta\,} as an empty symbol.

We define two alphabets, Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \Sigma\,} , or the input alphabet and Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \Gamma \,} , or the output alphabet. We call a string transduction a pair Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (s : t)} so that Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle s \in \Sigma*} is the input string and Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle t \in \Gamma*} is the output string. For this relation with an empty string, we distinguish the transform Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (\varepsilon : \varepsilon)\,} or null transform, the transductions of the form Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (\varepsilon : s)\,} or insertions and the transductions of the form Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (s : \varepsilon)\,} or deletions. The null transduction is a special case of insertion or deletion. Transductions may be concatentated, Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (s : t) \cdot (x : y) = (sx : ty)\,} .

Paradigms

We call paradigm (and denote with Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle P\,} ) a set of transforms without any restriction on the type of their content. We say that the paradigm is cyclable if it doesn't include any insertions.

The paradigms can be concatentated with transforms o between in the following way:


When considered as letter transducers, paradigms have an initial state Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle iP\,} and a set of final states Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Fp\,} . When we talk of the final state of a paradigm, we refer to a unique state Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle fP\,} which can be created in every moment such that all the Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle q \in FP\,} are unified by means of null transitions Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (\theta : \theta)\,} .

We define the output (of a paradigm) as the concatentation of paradigms and transforms in every proportion or order that defines a sub-set of transforms of a given paradigm.

Dictionaries

A dictionary of transforms is defined (for example to incorporate linguistic knowledge) using paradigms which describe inflection. A dictionary of transforms is presented as a set Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle D = (E,P,P^C)\,} in which:

  • Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle E} is the set of the inputs of the dictionary of transforms. The set of inputs of the dictionary can be seen as a big paradigm, which the restriction that it cannot contain any insertions. In the dictionary the paradigms serve to represent the existing linguistic knowledge in the truths observed by the lexical processor.
  • Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle P} is a set of the definitions of the content of the paradigms which is used in the input of the dictionary and in other paradigms.
  • Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle P^C} is a sub-set of P, of cyclic paradigms. The restriction is given that it is only possible to define cyclic paradigms from paradigms that can be cyclic.

A dictionary Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle D} can be represented by means of a letter transducer that has as its input alphabet Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \Sigma \cup {\theta}} and as its output alphabet Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \Gamma \cup {\theta}} ; Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle L = (\Sigma \cup {\theta}) \times (\Gamma \cup {\theta})} is defined. The set of states of the transducer is Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Q} , the initial state is Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle qI \in Q} , the set of final states Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle F \subset Q} and the transfer function Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \delta : Q \times L \rightarrow 2^Q} , and therefore it is indeterministic with respect of the input as respect of the new alphabet Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle L\,} .

Construction of minimised letter transducers

From a string transform, a sequence of letter transforms, Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle S(s : t)\,} of longitude Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle N = max(|s|,|t|)\,} can be constructed which is defined as it follows string element Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle 1 \le i \le N} :

This is to emphasise that, by construction, it is ensured that no Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (s : t)\,} can exist which are equal to Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (\varepsilon : \varepsilon)} , this is crucial for the consistency of the method of construction that can subsequently be seen.

The method of construction uses two procedures, the procedure of assembly which is seen in equation 4, and that of minimisation, which is produced by a conventional finite state automata minimisation algorithm (van de Snepscheur, 1993) which consists of invertir, determinizar, volver a invertir y volver a determinizar, taking as the alphabet of the automata that there is minimised the Cartesian product Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle L\,} and for null transition Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (\theta : \theta)\,} .

In figure 2, a simple example of assembly is provided. Transform by transform is introduced joined in equation 4 and the transducer in the form of prefix acceptor, or trie, that is to say the way in that there is only one node for every common prefix of the set of transforms which form the dictionary. With the suffixes of the transforms (which are not separated), new states are created.

At the point in which the reference to a paradigm is made, a copy of the paradigm is created and is connected to the dictionary input that is inserting in the transducer by means of a null transform Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (\theta : \theta)\,} .

Every one of the paradigms, in as much as can be seen with small dictionaries are constructed by this same procedure and their time and they have been minimised by reducing the size of the problem in the construction of a large dictionary. In figures 3 and 4, the state of the paradigms used in figure 3 is shown after the minimisation.

With cyclic paradigms (see figures 5 and 6) a similar form is operated. In these examples it is shown how to construct a paradigm in the dictionary and with a letter transducer. The paradigm constructed can be used in input of the dictionary with certain nouns in the Basque language which end in a consonant.

Advanced construction of letter transducers by means of paradigms

It is a fact that in the languages of Europe, modifications of words take place at the beginning or the end of the word. This fact can be exploited to improve the speed of construction of the minimised transducer.

With the first improvement, the paradigms can be minimised when they are defined in a way which allows the handling of small paradigms in the construction process. The paradigms in the dictionaries of the languages we have worked on usually they have some a few hundred states. The minimisation of these paradigms is a very quick process.

If we presume that an input can present a reference to a paradigm in every point of the paradigm, we can think of copying at this point the transducer in order that the definition of the paradigm can be calculated. The procedure that is presented here assumes that it is not always necessary to copy, if not that in determined occasions it is possible to reuse a paradigm that has already been copied. In particular two or three entries that are shared between a paradigm like suffixes can be reused in the same copy of these paradigm. the same goes for prefixes.

Nevertheless, in general, it is not possible to reuse paradigms if they are found in inner positions of different entries, because new suffixes (prefixes) may be produced for existing entries. This would make the information that is introduced into the transducer not consistent with the dictionary, and thus the transducer which would be generated would be incorrect (it would produce transforms which are not found in the language which defines the dictionaries).

We define Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle P_{i}^{[n]}} as the Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n} th copy of the paradigm Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle i} during the construction of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle D} . In the same way, the Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n} th copy of the initial state of the paradigm Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle i_{P_{i}}^{[n]}} is defined and the Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n} th copy of the final state of the paradigm Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle f_{P_{i}}^{[n]}} .

We can say that an entry in the dictionary that starts with Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle P_{i}} can reuse Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle P_{i}^{[n]}} as a prefix if the initial state Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle i_{P_{i}}^{[n]}} has a single input transition. What is more, this input transform must necessarily be connected to Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle qI\,} and must be null Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (\theta : \theta)\,} . The reuse of the copy of this paradigm is taken to the end by means of connecting Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle f_{P_{i}}^{[n]}} with the remaining suffix in the entry by means of a null transition.

Analogously, a dictionary entry which ends by Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Pi\,} can reuse the copy of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle P_{i}^{[n]}} as a suffix if there is a single output transition of the final state of this copy of the paradigm, Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle f_{P_{i}}^{[n]}} , which is also null and which connects this copy with a final state from Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle D} , that is to say, that Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle q \in F} . The reuse of this paradigm will consist by means of joining the remaining prefix of the entry with Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle i_{P_{i}}^{[n]}} by means of a null transition.

Results

To check the performance of the system three available dictionaries of Catalan, Spanish and Portuguese have been used, all of which have a similar number of entries (between 30,000 and 35,000 lemmata). These correspond in each case to between a million and a half — three million surface forms. Tests were made to see how the previous minimisation of the paradigms, and, additionally in this improvement the reuse of paradigms affected the speed of construction of the letter transducer.

The table in figure 7 shows the number of states that are generated in the construction of the transducer after minimisation, the comparison with the minimum, and how the application of the two improvements influenced consecutively to obtain a non-minimised transducer of the order of only one quarter times bigger than the minimum in place of the fifty or one hundred times. In practice the performance of this method of construction reduces the time of construction from hours to a matter of seconds. Furthermore, the amount of memory that is needed during the construction varies in each case, but it is between 10 and 20 times less than is needed if the copies of the paradigms are not reused in this way.

By means of the procedure described here, data and language processors for the project "Open-source machine translation for the languages of the Spanish state" have been developed and analysers and generators have been produced which can process from 40,000 words per second on a PC. The time of construction of the letter transducers is around fifteen seconds for dictionaries of around 35,000 entries that with the paradigms developed have corresponded with four million different words.

In terms of the size of memory consumed, the dictionaries currently occupy around four megabytes of XML documents. If we expand the transforms of the dictionaries —the non-cyclical part —, these occupy around 400 megabytes in any case. After being compiled, the resulting transducer occupies around 600 kilobytes on disc and in execution around 10 megabytes.

Conclusion

We have seen how the managements of dictionaries by means of paradigms is a technique that constitutes a coherent form of dictionary management for languages of different natures and which allows the rapid generation of efficient lexical processors.