Difference between revisions of "Talk:Morphological dictionary"
| Line 204: | Line 204: | ||
| By means of the procedure described here, data and language processors for the project "Open-source machine translation for the languages of the Spanish state" have been developed and analysers and generators have been produced which can process from 40,000 words per second on a PC, and the time of construction of the letter transducers is around fifteen seconds for dictionaries of around 35,000 entries that with the paradigms developed have corresponded with four million different words. | By means of the procedure described here, data and language processors for the project "Open-source machine translation for the languages of the Spanish state" have been developed and analysers and generators have been produced which can process from 40,000 words per second on a PC, and the time of construction of the letter transducers is around fifteen seconds for dictionaries of around 35,000 entries that with the paradigms developed have corresponded with four million different words. | ||
| In terms of the size of memory consumed, the dictionaries currently occupy around four megabytes of XML documents. If we expand the transforms of the dictionaries -- the non-cyclical part --, these occupy around 400 megabytes in any case. Compiled once, the resulting transducer occupies around 600 kilobytes on disc and in execution around 10 megabytes. | |||
| ==Conclusion== | |||
| We have seen how the managements of dictionaries by means of paradigms is a technique that constitutes a coherent form of dictionary management for languages of different natures and which allows the rapid generation of efficient lexical processors. | |||
Revision as of 19:05, 22 September 2007
Contents
Introduction
One of the ? that there are to develop in the design and implmentation of lexical processing systems is the construction of efficient lexical processors from linguistic data.
In particular, the lexical processors that are described here have been used for lexical transforms ? like morphological analysis, morphological generation and word-for-word translation of lexical forms.
The morphological analysis of a word is getting from its superficial (surface) form, all of the lexical forms (meaning the lemma and morphological attributes) that are given in a dictionary.
The morphological generation is the reverse process, from a lexical form, generating the surface form. Word for word translation of lexical forms consists of making correspondances between a lexical form in one language, and a lexical form in another language. This final operation of crucial for constructing machine translation systems.
Words may have one (as in the case of invariant words) or more forms. Variations in words receive various names depending on their nature. They can be derivational, which is when a word combines with another or with morphemes that modify its sense (e.g. president and vicepresident etc.); inflections, which are grammatical modifications which occur in nouns, adjectives and verbs in Indo-European like languages (e.g. go, goes, etc.); agglutination which is affixes which are added together that affect the whole phrase from a grammatical point of view, this is found in languages like Turkish and Basque (e.g. urdin, urdina, urdinarena, that in Spanish correspond to azul, el azul, el del azul respectively); or every other type of orthographic variation that can occur in every language.
The regularities observed in the processing can be grouped, for convienience, in the construction of morphological dictionaries (as much for analysis as generation), to avoid having to write all the forms for each word. From the point of view of the management of the dictionaries, it is interesting to store the inflection of words in inflectional paradigms identified by a side and the lemmas that inflect for another. This allows us to add a word by giving the lemma and choosing from previously defined inflectional paradigms, or defining a new paradigm for adding further words with the same inflection. On the other hand if an error is identified in one of the inflectional paradigms, it is only necessary to correct it in one place.
In the same way, some derivational mechanism can be treated in a similar manner, but only when they are systematic in some lemmas: for example, the formation of superlatives from adjectives in languages like Catalan or Spanish, the composition of certain lemmas with determined prefixes (like ex-, vice-, or suffixes etc.) and other cases that can be treated in the same way as the inflection for these phenomenona can benefit from the same advantages as in the previous case.
In this paper we denominate the grouping of transformation rules between parts of words -- to manage the phenomena which have been explained -- like definition of paradigms, without reducing ourselves to exclusively treating inflection.
The format of the dictionaries is defined in a specification which uses XML, for the interoperability, as much for the advantanges which are presented by explicit relations between elements, because it allows us to express the encoding of characters of all the data in an explicit way, and also for the large amount of effective tools which exist for processing and transforming data in XML format.
Finally we see that it is possible to exploit the division of entries in the dictionary between lemma and paradigm to effectively construct minimised letter transducers. These minimised letter transducers are designed for the efficient processing of natural language. In Garrido et al. (1999), a compiler is presented with these characteristics, but that didn't completely take advantage of the factorisation that is permitted by the paradigms to increase the speed of construction. In Daciuk et al. (2002), Carrasco and Forcada (2002), and Garrido-Alenda et al. (2002) methods of incrementally constructing minimised letter transducers are presented as an alternative model to that which is presented in this article.
XML Format for the dictionaries
A format based on XML has been designed to store the dictionary information. The DTD (document type definition, one of the ways of specifying an XML format) of this format includes sections for specifying the characters that are considered part of the alphabet -- in this sense, that can form part of a word-- to define the symbols which have morphological sense, definition of paradigms and identification of regular expressions ? like numbers or internet addresses. At the moment to send this article, we do not include any reference to this DTD because it is still under development.
Figure 1 shows an example of the definition of a paradigm and its use in the            
dictionary. Each paradigm has entries (<e> elements), and
in this case, every entry consists of a pair (<p>) which a left
part (<l>) and a right part (<r>). Between these
elements, text or morphological symbols (<s>) can be included. The
entries of the dictionary are defined in the same way, the tag for identity 
<i> is an abbreviated form of a pair where the left side and the right
side are identical. The paradigm of a the word is expressed in this case by
the paradigm reference <par>.
It is possible to define cyclic paradigms by only indicating between an attribute of the paradigm. It is also possible to notice that all the paradigms can be defined as cyclic, only excepting those which do not accept an empty string, ya que se puede dar el caso de que the output will be infinite for a given entry (loop without consuming the entry). Detecting whether a paradigm has been incorrectly defined cyclically is a job for the compiler which constructs the letter transducer.
Obtaining paradigms
The lexical forms which correspond with the surface forms of the entries in these dictionaries are composed of the lemma and an ordered list of morphological tags. The first of the tags that is specified is treated as the part of speech tag, while the rest of the tags are called the lexical subcategory tags.
Paradigms that are used for constructing the dictionaries that are like those which are presented in this paper can be obtained by the following procedures:
- Manually. A linguist decides how to form the paradigms to unify all the surface forms and their corresponding lexical forms. This can be necessary for the convienience of the linguist.
- Automatically. A program can calculate the paradigms-suffix unifying all the entries which have the same lemma and the same part-of-speech tag in a single paradigm definition. A similar form can be produced with paradigms-prefix or following every other criteria.
- Automatically and manually. Sometimes it is necessary to combine both previously given techniques for get the desired results.
Construction of letter transducers
Preliminary definitions
In this article, we denote Failed to parse (syntax error): {\displaystyle ∊} as the empty string, and as an empty symbol.
We define two alphabets, Σ, or the input alphabet and Γ, or the output alphabet. We call a string transduction as a pair (s : t) so that s ∈ Σ* is the input string and t ∈ Γ* is the output string. For this relation with an empty string, we distinguish the transduction (∊ : ∊) of null transduction. the transductions of the form (∊ : s) or insertions and the transductions of the form (s: ∊) or deletions. The null transduction is a special case of insertion or deletion. Transductions may be concatentated, ( : ) · (x : y) = (sx : ty).
Paradigms
We call paradigm (and denote with P) a set of transforms without any restriction on the type of their content. We say that the paradigm is cyclable if it doesn't include any insertions.
The paradigms can be concatentated with transforms o between in the following way:
Paradigms considered as letter transducers have an initial state iP and a set of final
states Fp. When we talk of the final state of a paradigm, we refer to a unique state fP
which can be created in every moment such that all the q ∈ FP are unified by means of
null transitions (θ : θ).
We define the output (of a paradigm) as the concatentation of paradigms and transforms in every proportion or order that defines a sub-set of transforms of a given paradigm.
Dictionaries
A dictionary of transforms is defined (for example to incorporate linguistic knowledge) using paradigms which describe inflection. A dictionary of transforms is presented as a set D = (E,P,PC) which which:
- E is the set of the inputs of the dictionary of transforms. The set of inputs of the dictionary can be seen as a big paradigm, which the restriction that it cannot contain any insertions. In the dictionary the paradigms serve to represent the existing linguistic knowledge in the truths observed by the lexical processor.
- P is a set of the definitions of the content of the paradigms which is used in the input of the dictionary and in other paradigms.
- PC is a sub-set of P, of cyclic paradigms. The restriction is given that it is only possible to define cyclic paradigms from paradigms that can be cyclic.
A dictionary D can be represented by means of a letter transducer that has as its input alphabet ∑ ⋃ {θ} and as its output alphabet Γ ⋃ {θ}; L = (∑ ⋃ {θ}) X (Γ ⋃ {θ}) is defined. The set of states of the transducer is Q, the initial state is qI ∈ Q, the set of final states F ⊂ Q and the transfer function δ : Q X L → 2Q, and therefore it is indeterministic with respect of the input as respect of the new alphabet L.
Construction of minimised letter transducers
From a string transform, a sequence of letter transforms, S(s : t) of longitude N = max(|s|,|t|) can be constructed which is defined as it follows string element 1 ≤ i ≤ N:
This is to emphasise that, by construction, it is ensured that no (s : t) can exist which are equal to (∊ : ∊), this is crucial for the consistency of the method of construction that can subsequently be seen.
The method of construction uses two procedures, the procedure of assembly which is seen in equation 4, and that of minimisation, which is produced by a conventional finite state automata minimisation algorithm (van de Snepscheur, 1993) which consists of invertir, determinizar, volver a invertir y volver a determinizar, taking as the alphabet of the automata that there is minimised the Cartesian product L and for null transition (θ : θ).
In figure 2, a simple example of assembly is provided. Transform by transform is introduced joined in equation 4 and the transducer in the form of prefix acceptor, or trie, that is to say the way in that there is only one node for every common prefix of the set of transforms which form the dictionary. With the suffixes of the transforms (which are not separated), new states are created.
At the point in which the reference to a paradigm is made, a copy of the paradigm is created and is connected to the dictionary input that is inserting in the transducer by means of a null transform (θ : θ).
Every one of the paradigms, in as much as can be seen with small dictionaries are constructed by this same procedure and their time and they have been minimised by reducing the size of the problem in the construction of a large dictionary. In figures 3 and 4, the state of the paradigms used in figure 3 is shown after the minimisation.
With cyclic paradigms (see figures 5 and 6) a similar form is operated. In these examples it is shown how to construct a paradigm in the dictionary and with a letter transducer. The paradigm constructed can be used in input of the dictionary with certain nouns in the Basque language which end in a consonant.
Avanzar construction of letter transducers by means of paradigms
It is a fact that in the languages of Europe, modifications of words take place at the beginning or the end of the word. This fact can be exploited to improve the speed of construction of the minimised transducer.
With the first improvement, the paradigms can be minimised when they are defined that which allows the handling of small paradigms in the construction process. With the paradigms of the dictionaries of the languages on which we have worked, usually they have some a few hundred states. The minimisation of these paradigms is a very quick process.
If we suppose that an input can present a reference to a paradigm in every point of the paradigm, we can think of copying at this point the transducer in order that the definition of the paradigm can be calculated. The procedure that is presented here assumes that it is not always necessary to copy, if not that in determined occasions it is possible to reuse a paradigm that has already been copied. In particular two or three inputs/entries that are shared between a paradigm like suffixes can be reused in the same copy of these paradigm and the same occurs when it happens with a prefix.
Nevertheless, in general it is not possible to reuse paradigms if they are not found in intermediate positions of different entries/inputs, since new suffixes (prefixes) can be introduced to existing entries, which makes that the information that is introduced in the transducer is not consistent with the dictionary and the transducer which has been generated will be incorrect (it would produce transforms which are not found in the language which defines the dictionaries).
We define Pi[n] as the nth copy of the paradigm i during the construction of D. In the same way, the nth copy of the initial state of the paradigm iPi[n] is defined and the nth copy of the final state of the paradigm fPi[n].
We can say that an entry/input in the dictionary that starts with Pi can reuse Pi[n] as a prefix if the initial state iPi[n] has a single input transition. What is more, this input transform must necessarily be connected to qI and must be null (θ : θ). The reuse of the copy of this paradigm is taken to the end by means of connecting fPi[n] with the remaining suffix in the entry by means of a null transition.
Analogously, a dictionary entry which ends by Pi can reuse the copy of Pi[n] as a suffix if there is a single output transition of the final state of this copy of the paradigm, fPi[n] , which is also null and which connects this copy with a final state from D, that is to say, that q ∈ F. The reuse of this paradigm will consist by means of joining the remaining prefix of the entry with iPi[n] by means of a null transition.
Results
To verify the rendimiento of the system three available dictionaries of Catalan, Spanish and Portuguese have been used, all of which have a similar number of entries (between 30,000 and 35,000 lemmata), which correspond in each case to between a million and a half and three million surface forms. Tests were made to see how the previous minimisation of the paradigms, and, additionally in this improvement the reuse of paradigms affected the speed of construction of the letter transducer.
In the table in figure 7, we can see the number of states that are generated in the construction of the transducor after minimisation and the comparison with the minimum and how the application of the two improvements imfluenced consecutively to obtain a non-minimised transducer of the order of only one quarter times bigger than the minimum in place of the fifty or one hundred times. In practice the rendimiento of this method of construction reduces the time of construction of the transducer minimally from hours to a matter of seconds. What is more the amount of memory that is needed during the construction varies in each case, but it is between 10 and 20 times less than is needed if the copies of the paradigms are not reused in this way.
By means of the procedure described here, data and language processors for the project "Open-source machine translation for the languages of the Spanish state" have been developed and analysers and generators have been produced which can process from 40,000 words per second on a PC, and the time of construction of the letter transducers is around fifteen seconds for dictionaries of around 35,000 entries that with the paradigms developed have corresponded with four million different words.
In terms of the size of memory consumed, the dictionaries currently occupy around four megabytes of XML documents. If we expand the transforms of the dictionaries -- the non-cyclical part --, these occupy around 400 megabytes in any case. Compiled once, the resulting transducer occupies around 600 kilobytes on disc and in execution around 10 megabytes.
Conclusion
We have seen how the managements of dictionaries by means of paradigms is a technique that constitutes a coherent form of dictionary management for languages of different natures and which allows the rapid generation of efficient lexical processors.

