Difference between revisions of "Talk:Morphological dictionary"
Revision as of 17:13, 22 September 2007
One of the tasks to be addressed in the design and implementation of lexical processing systems is the construction of efficient lexical processors from linguistic data.
In particular, the lexical processors described here have been used for lexical transformations such as morphological analysis, morphological generation and word-for-word translation of lexical forms.
Morphological analysis of a word consists of obtaining, from its surface form, all of the lexical forms (that is, the lemma and its morphological attributes) that the dictionary gives for it.
Morphological generation is the reverse process: from a lexical form, it produces the surface form. Word-for-word translation of lexical forms consists of establishing correspondences between a lexical form in one language and a lexical form in another language. This last operation is crucial for constructing machine translation systems.
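As a minimal sketch of these two directions, analysis and generation can be pictured as two lookups over the same set of pairs. The sample words and the `<n><pl>` tag notation below are illustrative assumptions, not data from the dictionaries described here.

```python
from collections import defaultdict

# Hypothetical toy data: (surface form, lexical form) pairs.
# The words and the tag notation are assumptions made for illustration.
PAIRS = [
    ("houses", "house<n><pl>"),
    ("house", "house<n><sg>"),
    ("house", "house<vblex><inf>"),
]

def build_analyser(pairs):
    """Map each surface form to every lexical form listed for it."""
    table = defaultdict(list)
    for surface, lexical in pairs:
        table[surface].append(lexical)
    return dict(table)

def build_generator(pairs):
    """Map each lexical form back to its surface form (the reverse direction)."""
    return {lexical: surface for surface, lexical in pairs}

analyser = build_analyser(PAIRS)
generator = build_generator(PAIRS)

print(analyser["house"])          # all analyses of an ambiguous surface form
print(generator["house<n><pl>"])  # the surface form of one lexical form
```

A plain table like this only illustrates the relation; the letter transducers discussed later are a far more compact way of storing it.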
Words may have one form (as in the case of invariant words) or several. Variations in words receive different names depending on their nature. They can be derivations, in which a word combines with another word or with morphemes that modify its sense (e.g. president and vice-president); inflections, which are grammatical modifications that occur in nouns, adjectives and verbs in Indo-European and similar languages (e.g. go, goes); agglutination, in which affixes are added one after another and affect the whole phrase from a grammatical point of view, as found in languages like Turkish and Basque (e.g. urdin, urdina, urdinarena, which correspond to azul, el azul, el del azul in Spanish, respectively); or any other type of orthographic variation that can occur in a language.
The regularities observed in these phenomena can, for convenience, be grouped together when constructing morphological dictionaries (for analysis as well as for generation), to avoid having to write out all the forms of each word. From the point of view of dictionary management, it is useful to store the inflection of words as identified inflectional paradigms on the one hand and the lemmas that inflect according to them on the other. This allows us to add a word by giving its lemma and choosing one of the previously defined inflectional paradigms, or by defining a new paradigm that further words with the same inflection can reuse. Moreover, if an error is found in one of the inflectional paradigms, it only needs to be corrected in one place.
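The saving that paradigms give can be sketched as follows: each word is stored once with a reference to a paradigm, and the full list of forms is derived from it. The paradigm name, stems and tags are hypothetical.

```python
# Hypothetical paradigm: each rule pairs a surface suffix with the
# morphological tags it contributes. Names and tags are assumptions.
PARADIGMS = {
    "regular_noun": [("", ["n", "sg"]), ("s", ["n", "pl"])],
}

# The dictionary proper: a stem plus a reference to a paradigm.
ENTRIES = [("house", "regular_noun"), ("cat", "regular_noun")]

def expand(entries, paradigms):
    """Yield (surface form, lemma, tags) for every entry/paradigm combination."""
    for stem, paradigm_name in entries:
        for suffix, tags in paradigms[paradigm_name]:
            yield stem + suffix, stem, tags

forms = list(expand(ENTRIES, PARADIGMS))
for surface, lemma, tags in forms:
    print(surface, lemma, tags)
```

Adding a new regular noun then takes one line in `ENTRIES`, and correcting the plural rule in `PARADIGMS` fixes every word that uses it.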
In the same way, some derivational mechanisms can be treated in a similar manner, but only when they apply systematically to some lemmas: for example, the formation of superlatives from adjectives in languages like Catalan or Spanish, or the composition of certain lemmas with particular prefixes or suffixes (like ex- or vice-). These and other cases can be treated in the same way as inflection, and benefit from the same advantages as in the previous case.
In this paper we refer to the grouping of transformation rules between parts of words, used to handle the phenomena explained above, as the definition of paradigms, without restricting ourselves to inflection.
The format of the dictionaries is defined in a specification that uses XML. XML was chosen for interoperability, for the advantages of having explicit relations between elements, because it lets us state the character encoding of all the data explicitly, and for the large number of effective tools that exist for processing and transforming data in XML format.
Finally, we see that it is possible to exploit the division of dictionary entries into lemma and paradigm to construct minimised letter transducers efficiently. These minimised letter transducers are designed for the efficient processing of natural language. Garrido et al. (1999) present a compiler with these characteristics, but one that did not take full advantage of the factorisation permitted by the paradigms to increase the speed of construction. Daciuk et al. (2002), Carrasco and Forcada (2002), and Garrido-Alenda et al. (2002) present methods for incrementally constructing minimised letter transducers, as an alternative model to the one presented in this article.
XML Format for the dictionaries
A format based on XML has been designed to store the dictionary information. The DTD (document type definition, one of the ways of specifying an XML format) of this format includes sections for specifying the characters that are considered part of the alphabet (that is, those that can form part of a word), for defining the symbols which have morphological meaning, for defining paradigms, and for identifying regular expressions such as numbers or internet addresses. At the time of submitting this article, we do not include any reference to this DTD because it is still under development.
Figure 1 shows an example of the definition of a paradigm and its use in the dictionary. Each paradigm has entries (<e> elements), and in this case every entry consists of a pair (<p>) which has a left part (<l>) and a right part (<r>). Within these elements, text or morphological symbols (<s>) can be included. The entries of the dictionary are defined in the same way; the identity tag <i> is an abbreviated form of a pair in which the left side and the right side are identical. The paradigm of the word is expressed in this case by the paradigm reference <par>.
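To make the element structure concrete, here is a small sketch that parses a hypothetical dictionary fragment and expands it into (surface form, lexical form) pairs. Only the elements e, p, l, r, s, i and par are taken from the description above; the wrapper elements `<dictionary>`, `<pardef>` and `<section>` and the attributes `n` and `lm` are assumptions for illustration, since the DTD is still under development.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. <dictionary>, <pardef>, <section> and the attributes
# n and lm are assumptions; e, p, l, r, s, i and par come from the text.
XML = """
<dictionary>
  <pardef n="house__n">
    <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
    <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
  </pardef>
  <section>
    <e lm="house"><i>house</i><par n="house__n"/></e>
  </section>
</dictionary>
"""

def side_to_text(element):
    """Flatten an <l>, <r> or <i> element, rendering <s n="x"/> as <x>."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "s":
            parts.append("<" + child.get("n") + ">")
        parts.append(child.tail or "")
    return "".join(parts)

root = ET.fromstring(XML)

# Each paradigm becomes a list of (left side, right side) string pairs.
paradigms = {
    pardef.get("n"): [
        (side_to_text(e.find("p/l")), side_to_text(e.find("p/r")))
        for e in pardef.findall("e")
    ]
    for pardef in root.findall("pardef")
}

pairs = []
for entry in root.findall("section/e"):
    stem = side_to_text(entry.find("i"))  # <i>: left and right sides are identical
    for left, right in paradigms[entry.find("par").get("n")]:
        pairs.append((stem + left, stem + right))

print(pairs)
```

The sketch handles only entries of the shape `<i>` followed by `<par>`; entries built from explicit `<p>` pairs would need the same treatment as the paradigm entries.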
It is possible to define cyclic paradigms simply by indicating this in an attribute of the paradigm. Note also that all paradigms can be defined as cyclic except those that accept the empty string, since in that case the output may be infinite for a given input (a loop that never consumes the input). Detecting whether a paradigm has been incorrectly defined as cyclic is a job for the compiler which constructs the letter transducer.
Obtaining paradigms
The lexical forms that correspond to the surface forms of the entries in these dictionaries are composed of the lemma and an ordered list of morphological tags. The first tag specified is treated as the part-of-speech tag, while the remaining tags are called lexical subcategory tags.
Paradigms for constructing dictionaries like those presented in this paper can be obtained by the following procedures:
- Manually. A linguist decides how to form the paradigms that unify all the surface forms and their corresponding lexical forms. This can be necessary for the convenience of the linguist.
- Automatically. A program can calculate suffix paradigms by unifying all the entries which have the same lemma and the same part-of-speech tag into a single paradigm definition. Prefix paradigms, or paradigms following any other criterion, can be produced in a similar way.
- Automatically and manually. Sometimes it is necessary to combine the two previous techniques to get the desired results.
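The automatic procedure can be sketched roughly as follows, assuming the input is a flat list of (surface form, lemma, tags) triples. Taking the longest common prefix as the stem is one possible criterion chosen here for illustration, not necessarily the one used by the authors; the sample data is hypothetical.

```python
import os
from collections import defaultdict

# Hypothetical expanded entries: (surface form, lemma, tags).
# The first tag is treated as the part of speech.
FORMS = [
    ("house", "house", ("n", "sg")),
    ("houses", "house", ("n", "pl")),
    ("cat", "cat", ("n", "sg")),
    ("cats", "cat", ("n", "pl")),
    ("city", "city", ("n", "sg")),
    ("cities", "city", ("n", "pl")),
]

def suffix_paradigms(forms):
    """Group forms by (lemma, part of speech) and factor out the common stem."""
    groups = defaultdict(list)
    for surface, lemma, tags in forms:
        groups[(lemma, tags[0])].append((surface, tags))

    paradigms = defaultdict(list)  # frozenset of rules -> stems that use it
    for (lemma, _pos), members in groups.items():
        # Longest prefix shared by the lemma and all of its surface forms.
        stem = os.path.commonprefix([lemma] + [s for s, _ in members])
        rules = frozenset(
            (surface[len(stem):], lemma[len(stem):], tags)
            for surface, tags in members
        )
        paradigms[rules].append(stem)
    return paradigms

result = suffix_paradigms(FORMS)
for rules, stems in result.items():
    print(sorted(stems), sorted(rules))
```

Here "house" and "cat" end up sharing one paradigm, while "city" (stem "cit") gets its own. A linguist could then rename or merge the generated paradigms by hand, which corresponds to the combined procedure.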
Construction of letter transducers
In this article, ∊ denotes the empty string and θ denotes the empty symbol.
We define two alphabets: Σ, the input alphabet, and Γ, the output alphabet. A string transduction is a pair (s : t) where s ∈ Σ* is the input string and t ∈ Γ* is the output string. With respect to the empty string, we distinguish the transduction (∊ : ∊), or null transduction; transductions of the form (∊ : t), or insertions; and transductions of the form (s : ∊), or deletions. The null transduction is a special case of an insertion or a deletion. Transductions may be concatenated: (s : t) · (x : y) = (sx : ty).
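These definitions translate directly into a few lines of code. In the sketch below a transduction is modelled as a pair of Python strings, which is a simplification made for illustration: multi-character tags such as `<n>` are written out as text rather than treated as single symbols of Γ.

```python
EPSILON = ""  # the empty string

def concat(a, b):
    """Concatenate two transductions: (s : t) · (x : y) = (sx : ty)."""
    (s, t), (x, y) = a, b
    return (s + x, t + y)

def kind(transduction):
    """Classify a transduction with respect to the empty string."""
    s, t = transduction
    if s == EPSILON and t == EPSILON:
        # Reported separately, although it is also a special case
        # of an insertion or a deletion.
        return "null"
    if s == EPSILON:
        return "insertion"
    if t == EPSILON:
        return "deletion"
    return "other"

stem = ("house", "house")
plural = ("s", "<n><pl>")
print(concat(stem, plural))
print(kind((EPSILON, "<n>")), kind(("s", EPSILON)), kind((EPSILON, EPSILON)))
```

Concatenating with the null transduction leaves any transduction unchanged, which is why it acts as the identity element for this operation.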