Difference between revisions of "Morphological dictionary"

From Apertium
Jump to navigation Jump to search
Line 59: Line 59:
</pre>
</pre>
It is worth noting that a finite-state acceptor can accept any string that can be defined by a regular expression. For example, if we want to accept the expression <code>be*r</code>, that is "b" followed by one or more "e" followed by "r" (e.g. ber, beer, beeer, ...), we could do it with a finite-state acceptor.
It is worth noting that a finite-state acceptor can accept any string that can be defined by a regular expression. For example, if we want to accept the expression <code>be*r</code>, that is "b" followed by one or more "e" followed by "r" (e.g. ber, beer, beeer, ...), we could do it with a finite-state acceptor.


:''fill in more formal stuff here''


===Transducers===


==Terminology==
==Terminology==

Revision as of 14:43, 18 August 2008

This page intends to describe how and why the morphological dictionaries in Apertium work the way they do. Morphological dictionaries in Apertium (or more properly lttoolbox) are based on finite-state transducer technology, in this way they can also be referred to as lexical transducers. The task of a morphological dictionary is to model the rules that govern the internal structure of words in a language.

For example, speakers of English realise that the words "dog" and "dogs" are related, that "dogs" is to "dog" as "cats" is to "cat". The rules understood by the speaker reflect specific patterns and regularities in the way in which words are formed from smaller units and how those smaller units interact.

Finite-state transducers are not the only way to model these rules, it is also possible to write the rules in scripting languages such as perl or python, or as a lexer (examples include the Buckwalter morphological analyser for Arabic or IceMorphy for Icelandic). There are however a number of downsides to this method:

  • The analysers created are not reversible, that is, you cannot use the same model to analyse and generate.
  • As the rule content may be both imperative and declarative, programs can be more complicated to understand and edit by non-experts.

In contrast, finite-state transducers are: reversible, the same description can be used for both analysis and generation; declarative, in that a description of the morphological rules is written as separate from the algorithm which processes them. Note that analysers may also be described as decorated tries, or finite-state acceptors (for example hunmorph), this may be declarative, but non-reversible (i.e. not applicable to generation).

A finite-state acceptor for the string "beer".

Finite-state automata

To start with it is worth defining what is a finite-state automaton, and how the two main types differ. This will not be an exhaustive description, just an overview so that the difference can be distinguished for the purposes of this article. To begin with, some terminology, if you are familiar with graphs (as in the data structure) this might help. A finite-state automaton can be visualised as a graph, with the nodes representing states, and the arcs representing transitions.

Acceptors

A finite-state acceptor (or recogniser), as seen to the right, is an automaton which accepts or rejects input strings. So, taking the example at the right, we can see the automaton has:

  • a start state, in formal definitions this is usually labelled
  • a number of intermediate states, often denoted by
  • two final states, denoted by
  • a number of transitions

We can crudely model this in a programming language such as python in order to get an idea of the behaviour of this automata.

states = ['b', 'e', 'e', 'r']; # Set of states
current_state = 0;             # Set current state to start state 

c = sys.stdin.read(1);

while c:                       # Input loop
        if current_state == len(states): # If we've reached the final state
                sys.stdout.write('Yes');
                sys.exit(1);
        elif c == states[current_state]: # If the input matches the value of the current state
                current_state = current_state + 1;
        else:                            # If the input does not match the current state and we're not final
                sys.stdout.write('No');
                sys.exit(1);

        c = sys.stdin.read(1);

When the input on stdin is "beer", output yes, otherwise output no,

This finite-state acceptor will accept any string defined by the regular expression be*r, i.e. br, ber, beer, beeer, beeeer, ...
$ python fsa.py
beer
Yes

$ python fsa.py
bee
No

It is worth noting that a finite-state acceptor can accept any string that can be defined by a regular expression. For example, if we want to accept the expression be*r, that is "b" followed by one or more "e" followed by "r" (e.g. ber, beer, beeer, ...), we could do it with a finite-state acceptor.


fill in more formal stuff here


Transducers

Terminology

  • string
  • alphabet
  • empty string

Specification format

Sections

Standard
Inconditional
Postblank

Transformation

A representation of the transducer for the noun "beer" in English.

Minimisation

Usage

Tokenisation

See also

Further reading

  • Ortiz-Rojas, S., Forcada, M. L., and Ramírez-Sánchez, G. (2005) "Construcción y minimizacion eficiente de transductores de letras a partir de diccionarios con paradigmas". Procesamiento del Lenguaje Natural, 35, 51–57.
  • A. Garrido-Alenda, M.L. Forcada, (2002) "Comparing nondeterministic and quasideterministic finite-state transducers built from morphological dictionaries", Procesamiento del Lenguaje Natural, (XVIII Congreso de la Sociedad Española de Procesamiento del Lenguaje Natural, Valladolid, Spain, 11-13.09.2002)
  • R.C. Carrasco, M.L. Forcada, (2002) "Incremental construction and maintenance of minimal finite-state automata", Computational Linguistics, 28:2, 207-216
  • Alicia Garrido-Alenda, Mikel L. Forcada, Rafael C. Carrasco, (2002) "Incremental construction and maintenance of morphological analysers based on augmented letter transducers", in Proceedings of TMI 2002 (Theoretical and Methodological Issues in Machine Translation, Keihanna/Kyoto, Japan, March 2002), p. 53-62