Transducers as flag diacritics and their topology learning

Revision as of 10:41, 8 April 2011 by Jmcejuela (talk | contribs)

This proposal stems from (1) a project suggested by Jimmy O'Regan in an IRC conversation, as a better and desired alternative to the listed GSoC 2011 idea Flag diacritics in lttoolbox, using second-level transducers instead, and (2) my own expanded proposal, developed together with Hasan Ibne Akram (my thesis advisor), to study topology-learned transducers and use them as these second-level transducers.


Description

The HFST platform uses flag diacritics to filter out illegal compounds and stop their computation, thus providing better handling of languages with infix inflection. The Apertium project aims for the same appropriate handling of such languages and, as documented in its original idea, planned to use flag diacritics as well.

Jimmy O'Regan suggests, however, that a better approach would be to use a second level of cascaded transducers to process the same continuations. Depending on the second transducer's decision, the input is either rejected (pruning the states of the otherwise continued computation and emitting an epsilon symbol, as flag diacritics defined by constraint rules do) or accepted, in which case the second transducer's own output is emitted.
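As a rough illustration of the rejection/acceptance decision described above, here is a minimal sketch of a second-level validator. All names (ContinuationValidator, the tag symbols) are hypothetical and not part of lttoolbox; a real implementation would operate on the compiled FST representation.

```java
import java.util.*;

// Hypothetical sketch: a second-level automaton that validates compound
// continuations produced by the first-level analysis.
public class ContinuationValidator {
    // transitions: state -> (input symbol -> next state)
    private final Map<Integer, Map<String, Integer>> trans = new HashMap<>();
    private final Set<Integer> accepting = new HashSet<>();

    public void addTransition(int from, String sym, int to) {
        trans.computeIfAbsent(from, k -> new HashMap<>()).put(sym, to);
    }

    public void setAccepting(int state) { accepting.add(state); }

    // Returns true if the continuation is legal; the analyser would
    // prune the search path otherwise.
    public boolean validate(List<String> continuation) {
        int state = 0;
        for (String sym : continuation) {
            Map<String, Integer> out = trans.get(state);
            if (out == null || !out.containsKey(sym)) return false; // prune
            state = out.get(sym);
        }
        return accepting.contains(state);
    }

    public static void main(String[] args) {
        ContinuationValidator v = new ContinuationValidator();
        // Toy rule: a noun stem may take exactly one compound continuation.
        v.addTransition(0, "<n>", 1);
        v.addTransition(1, "<cmp>", 2);
        v.setAccepting(2);
        System.out.println(v.validate(Arrays.asList("<n>", "<cmp>")));          // legal
        System.out.println(v.validate(Arrays.asList("<n>", "<cmp>", "<cmp>"))); // illegal
    }
}
```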

The state-pruning module is already implemented in lttoolbox-java by Jacob Nordfalk. My work will therefore consist of (1) the design and implementation of this second-level layer of transducers (by nature cascaded to n levels, if that proves sensible), (2) the consequent changes to the FST compilation code and to the pipeline feeding this additional level of transducers, (3) verification and validation of the implementation, within a limited range, on a sample language, and (4) documentation of the development.


Then, more exclusively for my master's thesis and with a broader research and scientific scope, I propose to study transducer topology learning to construct these n-level transducers, working with learning corpora and mostly using the OSTIA state-merging algorithm.
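OSTIA starts from a prefix-tree transducer built from (input, output) training pairs and then merges states; the onward normalisation and the merging loop itself are omitted here. A minimal sketch of that first phase, with hypothetical class names and toy morphological pairs:

```java
import java.util.*;

// Illustrative sketch of OSTIA's starting point: a prefix-tree transducer
// built from (input, output) training pairs. The onward normalisation and
// state-merging steps that follow in OSTIA are not shown.
public class PrefixTreeTransducer {
    static class State {
        Map<Character, State> next = new HashMap<>();
        String output = null; // output attached when an input ends here
    }

    private final State root = new State();

    // Insert one training pair, creating one state per input prefix.
    public void addPair(String input, String output) {
        State s = root;
        for (char c : input.toCharArray())
            s = s.next.computeIfAbsent(c, k -> new State());
        s.output = output;
    }

    // Look up the output for an input seen during training.
    public String transduce(String input) {
        State s = root;
        for (char c : input.toCharArray()) {
            s = s.next.get(c);
            if (s == null) return null; // unseen input
        }
        return s.output;
    }

    public static void main(String[] args) {
        PrefixTreeTransducer t = new PrefixTreeTransducer();
        t.addPair("walk", "walk<vblex><inf>");
        t.addPair("walks", "walk<vblex><pri><p3><sg>");
        System.out.println(t.transduce("walks"));
    }
}
```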


The programming language will be mostly Java, with C++ used as required to interface with the FST compilation process.


Contribution

  • The implementation of a module for the flexible handling of languages with infix inflection, and possibly other types, that makes it possible to avoid the potentially infinite enumeration of non-regular rules in a dictionary when working with more complex forms of inflection.
  • The novelty, as far as I know, of the use of transducers to tackle such a problem.
  • Altogether, a better support for languages with more complex forms of inflection.


Timeline

Original Conversation with Jimmy O'Regan

  • [16:04] <jimregan> Jacob and I designed an alternative to flag diacritics on the back of some paper plates last year
  • [16:04] <jimregan> using a second transducer
  • [16:04] <jimregan> (the original motivation behind flag diacritics - which are an ugly hack - was to avoid using a second transducer)
  • [16:05] <jimregan> we use a fairly large number of transducers, in almost every phase of the translation pipeline
  • [16:05] <jimregan> so we thought it wouldn't be unreasonable to have a second transducer to use in state pruning, rather than just balancing special symbols
  • [16:07] <jimregan> the second transducer would basically take the same form as translation rules in apertium-transfer, but with the ability to fully or partially lexicalise (possibly based on a subset of regexes)
  • [16:07] <jimregan> if the initial analysis involves a continuation, pass it to the second for validation
  • [16:07] <jmcejuela> aha
  • [16:07] <jimregan> prune it if it doesn't match
  • [16:08] <jimregan> take the symbols of the second transduction for output if it does
  • [16:08] <jimregan> Jacob was pretty pationate about implementing it himself, but I think he'd be interested in mentoring it
  • [16:10] <jimregan> I just remembered how to summarise it: two level morphology as cascading transducers
  • [16:11] <jimregan> but, as it's cascading, two level could potentially be made n-level
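The cascading idea from the log (pass the first transducer's analysis to a second one, prune if it does not match, otherwise take the second transduction's symbols as output) can be sketched as a simple composition of lookups. The maps and names below are toy stand-ins for real transducers, not Apertium code:

```java
import java.util.*;
import java.util.function.Function;

// Sketch of two-level morphology as cascading transducers: the second
// level either rejects the first level's analysis (null = pruned path)
// or supplies the final output symbols.
public class Cascade {
    public static String run(Function<String, String> first,
                             Function<String, String> second,
                             String surface) {
        String analysis = first.apply(surface);
        if (analysis == null) return null;       // no analysis at all
        return second.apply(analysis);           // null means pruned
    }

    public static void main(String[] args) {
        // Toy first-level analysis and second-level validation tables.
        Map<String, String> level1 = Map.of("bookshop", "book<n>+shop<n>");
        Map<String, String> level2 = Map.of("book<n>+shop<n>", "book<n><cmp>shop<n><sg>");
        System.out.println(run(level1::get, level2::get, "bookshop"));
    }
}
```

Since each level is just a function from strings to strings (or null), extending two levels to n levels is a matter of folding the same step over a list of transducers.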