Construction and efficient minimisation of letter transducers from dictionaries with paradigms
- 1 Introduction
- 2 XML Format for the dictionaries
- 3 Obtaining paradigms
- 4 Construction of letter transducers
- 5 Results
- 6 Conclusion
In the design and implementation of lexical processing systems, one of the tasks is the construction of efficient lexical processors from linguistic data.
The lexical processors that are described here have been used for lexical transformation, for example morphological analysis and generation, along with word-for-word translation of lexical forms.
Morphological analysis of a word involves retrieving from a dictionary the lexical form (the lemma and morphological attributes) of the word from the surface (superficial) form.
The reverse process is morphological generation. This involves generating the surface form of a word from a lexical form. Word-for-word translation of lexical forms involves drawing correspondences between a lexical form in one language, and the translation in another. This final operation is crucial in constructing machine translation systems.
Words may have one form (as in the case of invariant words), or more forms. The variation of words receives various names depending on their nature. For example, they may be derivational, which is when a word combines with another word, or with morphemes which then modify its sense (e.g. president and vicepresident etc.); inflectional, which are grammatical modifications that occur in nouns, adjectives and verbs in languages like those in the Indo-European family (e.g. go, goes, etc.); agglutinative, which is when affixes are added together and affect the whole grammatical phrase. Agglutination is found in languages like Turkish and Basque (e.g. urdin, urdina, urdinarena, that in Spanish correspond to azul, el azul, el del azul respectively); or every other type of orthographic variation that can occur in a language.
For convienience, the regularities observed in processing these phenomena can be grouped in the construction of morphological dictionaries in order to avoid having to write all the forms of each word. This can be as useful for analysis as for generation. From the viewpoint of dictionary management, it is useful to store word inflections as pairs of lemmata, and inflection paradigms. This allows us to add a word by simply giving the lemma, and then choosing from a previously defined inflectional paradigms. If a suitable paradigm does not exist, it is possible to define a new one. A further benefit is that if an error is identified in one of the inflectional paradigms, it is only necessary to correct it in one place.
In the same way, some derivational mechanism can be treated in a similar manner, but only when they are systematic in some lemmas: for example, the formation of superlatives from adjectives in languages like Catalan or Spanish, the composition of certain lemmas with determined prefixes (like ex-, vice-, or suffixes etc.) and other similar cases can benefit from the same advantages as described previously.
In this paper we denominate the grouping of transformation rules between parts of words —to manage the phenomena which have been explained— like definition of paradigms, without reducing ourselves to treating inflection exclusively.
The format of the dictionaries is defined in a specification which uses XML, for the interoperability, as much for the advantanges which are presented by explicit relations between elements, because it allows us to express the encoding of characters of all the data in an explicit way, and also for the large amount of effective tools which exist for processing and transforming data in XML format.
Finally we see that it is possible to exploit the division of entries in the dictionary between lemma and paradigm to effectively construct minimised letter transducers. These minimised letter transducers are designed for the efficient processing of natural language. In Garrido et al. (1999), a compiler is presented with these characteristics, but that didn't completely take advantage of the factorisation that is permitted by the paradigms to increase the speed of construction. In Daciuk et al. (2002), Carrasco and Forcada (2002), and Garrido-Alenda et al. (2002) methods of incrementally constructing minimised letter transducers are presented as an alternative model to that which is presented in this article.
XML Format for the dictionaries
An XML based format has been designed to store the dictionary information. The DTD (document type definition, one of the ways of specifying an XML format) of this format includes: sections for specifying the characters that are considered part of the alphabet —that is those characters which can form part of a word— sections to define the symbols which have morphological meaning, sections for the definition of paradigms and identification of regular expressions such as numbers or internet addresses. We do not include any reference to this DTD because it is still under development.
Figure 1. shows an example of the definition of a paradigm and its use in the dictionary. Each paradigm has entries (
<e> elements), and in this example, every all of the entries consist of a pair (
<p>) which has a left part (
<l>) and a right part (
<r>). Between these elements, text or morphological symbols (
<s>) can be included. The entries of the dictionary are defined in the same way, the tag for identity
<i> is an abbreviated form of a pair where the left side and the right side are identical. The paradigm of a the word is expressed in this example by the paradigm reference
It is also possible to see that not all paradigms can be defined as cyclic, but only those that do not accept an empty string, given that there could be a case where the output for a given entry will be infinite (loop without consuming all the input). Detecting whether a paradigm has been incorrectly defined cyclically is a job for the compiler which constructs the letter transducer.
In these dictionaries, the lexical forms which correspond with the surface forms of the entries are made up of a lemma, and an ordered list of morphological tags. The first of the specified tags is treated as the part of speech tag, while the rest of the tags are referred to as lexical subcategory tags.
Paradigms used for constructing the dictionaries like those which are presented in this paper can be obtained by the following methods:
- Manually. A linguist decides how to form the paradigms in order to unify all the surface forms and their corresponding lexical forms. This can be necessary for the convenience of the linguist.
- Automatically. A program can calculate the suffix paradigms unifying all the entries which have the same lemma and the same part-of-speech tag in a single paradigm definition. A similar form can be produced with prefix paradigms or following every other criteria.
- Automatically and manually. It is sometimes necessary to combine both techniques in order to get the desired results.
Construction of letter transducers
In this article, we denote as the empty string, and as an empty symbol.
We define two alphabets, , or the input alphabet and , or the output alphabet. We call a string transduction a pair so that is the input string and is the output string. For this relation with an empty string, we distinguish the transform or null transform, the transductions of the form or insertions and the transductions of the form or deletions. The null transduction is a special case of insertion or deletion. Transductions may be concatentated, .
We call paradigm (and denote with ) a set of transforms without any restriction on the type of their content. We say that the paradigm is cyclable if it doesn't include any insertions.
The paradigms can be concatentated with transforms o between in the following way:
When considered as letter transducers, paradigms have an initial state and a set of final states . When we talk of the final state of a paradigm, we refer to a unique state which can be created in every moment such that all the are unified by means of null transitions .
We define the output (of a paradigm) as the concatentation of paradigms and transforms in every proportion or order that defines a sub-set of transforms of a given paradigm.
A dictionary of transforms is defined (for example to incorporate linguistic knowledge) using paradigms which describe inflection. A dictionary of transforms is presented as a set in which:
- is the set of the inputs of the dictionary of transforms. The set of inputs of the dictionary can be seen as a big paradigm, which the restriction that it cannot contain any insertions. In the dictionary the paradigms serve to represent the existing linguistic knowledge in the truths observed by the lexical processor.
- is a set of the definitions of the content of the paradigms which is used in the input of the dictionary and in other paradigms.
- is a sub-set of P, of cyclic paradigms. The restriction is given that it is only possible to define cyclic paradigms from paradigms that can be cyclic.
A dictionary can be represented by means of a letter transducer that has as its input alphabet and as its output alphabet ; is defined. The set of states of the transducer is , the initial state is , the set of final states and the transfer function , and therefore it is indeterministic with respect of the input as respect of the new alphabet .
Construction of minimised letter transducers
From a string transform, a sequence of letter transforms, of longitude can be constructed which is defined as it follows string element :
This is to emphasise that, by construction, it is ensured that no can exist which are equal to , this is crucial for the consistency of the method of construction that can subsequently be seen.
The method of construction uses two procedures, the construction procedure which is seen in equation 4, and that of minimisation, which is produced by a conventional finite state automata minimisation algorithm (van de Snepscheut, 1993) which consists of reverse, determinise, then reverse again and determinise again. It takes as the alphabet of the automata that is minimised the Cartesian product and for null transition .
In figure 2, a simple example of construction is provided. Transform by transform is introduced joined in equation 4 and the transducer in the form of prefix acceptor, or trie, that is to say the way in that there is only one node for every common prefix of the set of transforms which form the dictionary. With the suffixes of the transforms (which are not separated), new states are created.
At the point in which the reference to a paradigm is made, a copy of the paradigm is created and is connected to the dictionary input that is inserting in the transducer by means of a null transform .
Every one of the paradigms, in as much as can be seen with small dictionaries are constructed by this same procedure and their time and they have been minimised by reducing the size of the problem in the construction of a large dictionary. In figures 3 and 4, the state of the paradigms used in figure 3 is shown after the minimisation.
With cyclic paradigms (see figures 5 and 6) a similar form is operated. In these examples it is shown how to construct a paradigm in the dictionary and with a letter transducer. The paradigm constructed can be used in input of the dictionary with certain nouns in the Basque language which end in a consonant.
Advanced construction of letter transducers by means of paradigms
It is a fact that in the languages of Europe, modifications of words take place at the beginning or the end of the word. This fact can be exploited to improve the speed of construction of the minimised transducer.
With the first improvement, the paradigms can be minimised when they are defined in a way which allows the handling of small paradigms in the construction process. The paradigms in the dictionaries of the languages we have worked on usually they have some a few hundred states. The minimisation of these paradigms is a very quick process.
If we presume that an input can present a reference to a paradigm in every point of the paradigm, we can think of copying at this point the transducer in order that the definition of the paradigm can be calculated. The procedure that is presented here assumes that it is not always necessary to copy, if not that in determined occasions it is possible to reuse a paradigm that has already been copied. In particular two or three entries that are shared between a paradigm like suffixes can be reused in the same copy of these paradigm. the same goes for prefixes.
Nevertheless, in general, it is not possible to reuse paradigms if they are found in inner positions of different entries, because new suffixes (prefixes) may be produced for existing entries. This would make the information that is introduced into the transducer not consistent with the dictionary, and thus the transducer which would be generated would be incorrect (it would produce transforms which are not found in the language which defines the dictionaries).
We define as the th copy of the paradigm during the construction of . In the same way, the th copy of the initial state of the paradigm is defined and the th copy of the final state of the paradigm .
We can say that an entry in the dictionary that starts with can reuse as a prefix if the initial state has a single input transition. What is more, this input transform must necessarily be connected to and must be null . The reuse of the copy of this paradigm is taken to the end by means of connecting with the remaining suffix in the entry by means of a null transition.
Analogously, a dictionary entry which ends by can reuse the copy of as a suffix if there is a single output transition of the final state of this copy of the paradigm, , which is also null and which connects this copy with a final state from , that is to say, that . The reuse of this paradigm will consist by means of joining the remaining prefix of the entry with by means of a null transition.
To check the performance of the system three available dictionaries of Catalan, Spanish and Portuguese have been used, all of which have a similar number of entries (between 30,000 and 35,000 lemmata). These correspond in each case to between a million and a half — three million surface forms. Tests were made to see how the previous minimisation of the paradigms, and, additionally in this improvement the reuse of paradigms affected the speed of construction of the letter transducer.
The table in figure 7 shows the number of states that are generated in the construction of the transducer after minimisation, the comparison with the minimum, and how the application of the two improvements influenced consecutively to obtain a non-minimised transducer of the order of only one quarter times bigger than the minimum in place of the fifty or one hundred times. In practice the performance of this method of construction reduces the time of construction from hours to a matter of seconds. Furthermore, the amount of memory that is needed during the construction varies in each case, but it is between 10 and 20 times less than is needed if the copies of the paradigms are not reused in this way.
By means of the procedure described here, data and language processors for the project "Open-source machine translation for the languages of the Spanish state" have been developed and analysers and generators have been produced which can process from 40,000 words per second on a PC. The time of construction of the letter transducers is around fifteen seconds for dictionaries of around 35,000 entries that with the paradigms developed have corresponded with four million different words.
In terms of the size of memory consumed, the dictionaries currently occupy around four megabytes of XML documents. If we expand the transforms of the dictionaries —the non-cyclical part —, these occupy around 400 megabytes each. After being compiled, the resulting transducer occupies around 600 kilobytes on disc and in execution around 10 megabytes.
We have seen how the managements of dictionaries by means of paradigms is a technique that constitutes a coherent form of dictionary management for languages of different natures and which allows the rapid generation of efficient lexical processors.