Transducers as flag diacritics and their topology learning

From Apertium
Revision as of 11:53, 8 April 2011 by Jmcejuela (talk | contribs)

This proposal stems from (1) a project suggested by Jimmy O'Regan in an IRC conversation as a better, desired alternative to the listed GSoC 2011 idea Flag diacritics in lttoolbox, using second-level transducers instead, and (2) my own expanded proposal to study topology-learned transducers and use them as such second-level transducers.


Short Description

The implementation of a module to handle languages with infix inflection (and possibly other forms), stopping useless and invalid continuation computations according to defined constraints, as flag diacritics do (for example in the HFST platform), but using instead the novel approach of a second (or n-th) level of transducers. This approach is desirable because transducers may provide greater power, expressiveness, and flexibility.


Description

The HFST platform uses flag diacritics to stop the computation of illegal compounds and remove them, thus providing better handling of languages with infix inflection. The Apertium project aims for appropriate handling of such languages and, as documented in its original idea, planned to use flag diacritics as well.
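To make the mechanism concrete, the following is a minimal, self-contained sketch of how a unification flag @U.FEATURE.VALUE@ filters paths: the flag passes if its feature is unset or already bound to the same value, and blocks the path otherwise, which is how inconsistent prefix/suffix combinations are pruned. The class and the CLASS feature are illustrative only, not HFST's actual implementation or API.

```java
import java.util.*;

/** Minimal sketch of flag-diacritic filtering in the style of HFST
 *  lexicons (names and structure are illustrative, not HFST's API).
 *  A unification flag @U.FEATURE.VALUE@ passes only if FEATURE is
 *  unset or already bound to VALUE; otherwise the path is blocked. */
public class FlagDemo {
    /** Returns true if the symbol may be traversed; updates bindings. */
    static boolean step(String symbol, Map<String, String> bindings) {
        if (!symbol.startsWith("@U.") || !symbol.endsWith("@")) {
            return true;                      // ordinary symbol: always passes
        }
        String[] parts = symbol.substring(3, symbol.length() - 1).split("\\.");
        String feature = parts[0], value = parts[1];
        String bound = bindings.get(feature);
        if (bound == null) {                  // feature unset: bind it
            bindings.put(feature, value);
            return true;
        }
        return bound.equals(value);           // must unify with earlier value
    }

    /** Checks whether a whole path of symbols survives flag filtering. */
    static boolean accepts(List<String> path) {
        Map<String, String> bindings = new HashMap<>();
        for (String s : path) {
            if (!step(s, bindings)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Consistent: prefix and suffix agree on CLASS=a.
        System.out.println(accepts(List.of("@U.CLASS.a@", "stem", "@U.CLASS.a@")));  // true
        // Inconsistent: prefix binds CLASS=a, suffix demands CLASS=b -> pruned.
        System.out.println(accepts(List.of("@U.CLASS.a@", "stem", "@U.CLASS.b@"))); // false
    }
}
```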

Jimmy O'Regan suggests, however, that a better approach would be a second level of cascaded transducers that processes the same continuations. According to the second transducer's decision, the input is either rejected, with the states of the otherwise continued computation pruned and an epsilon symbol emitted, just as flag diacritics defined by constraint rules do, or accepted, with the transducer's own computation emitted as output.
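The cascaded idea above can be sketched as follows. The Transducer class here is purely illustrative and does not reflect lttoolbox's actual API: the second-level transducer either rejects an input (so the caller would prune that state) or returns its own output symbols, which replace the special-symbol balancing that flag diacritics perform.

```java
import java.util.*;

/** Sketch of the cascaded second-level transducer idea: a hypothetical
 *  validator transducer that either rejects a continuation (the caller
 *  then prunes that state) or returns its own output symbols.
 *  This Transducer class is illustrative, not lttoolbox's API. */
public class CascadeDemo {
    static class Transducer {
        // transitions: state -> input symbol -> next state / output symbol
        Map<Integer, Map<String, Integer>> next = new HashMap<>();
        Map<Integer, Map<String, String>> out = new HashMap<>();
        Set<Integer> finals = new HashSet<>();

        void addTransition(int from, String in, String output, int to) {
            next.computeIfAbsent(from, k -> new HashMap<>()).put(in, to);
            out.computeIfAbsent(from, k -> new HashMap<>()).put(in, output);
        }

        /** Returns the transduction of the input, or null if rejected. */
        List<String> transduce(List<String> input) {
            int state = 0;
            List<String> result = new ArrayList<>();
            for (String sym : input) {
                Map<String, Integer> arcs = next.get(state);
                if (arcs == null || !arcs.containsKey(sym)) return null; // reject -> prune
                result.add(out.get(state).get(sym));
                state = arcs.get(sym);
            }
            return finals.contains(state) ? result : null;
        }
    }

    public static void main(String[] args) {
        // Second-level transducer accepting only a stem followed by suffix "s".
        Transducer validator = new Transducer();
        validator.addTransition(0, "stem", "STEM", 1);
        validator.addTransition(1, "s", "PL", 2);
        validator.finals.add(2);

        System.out.println(validator.transduce(List.of("stem", "s"))); // [STEM, PL]
        System.out.println(validator.transduce(List.of("stem", "x"))); // null -> prune
    }
}
```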

The state-pruning module is already implemented in lttoolbox-java by Jacob Nordfalk; my work will therefore consist of (1) the design and implementation of such a second-level layer of transducers (by nature cascadable to n levels if considered sensible), (2) the consequent changes in the FST compilation code and the pipeline to this other level of transducers, (3) verification and validation of the implementation, within a limited range, with a sample language, and (4) the documentation of this development.


Then, and more exclusively for my master's thesis, with a broader research and scientific scope, I propose to study transducer topology learning to construct such n-level transducers, working with learning corpora and mostly using the OSTIA state-merging algorithm.
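To illustrate the starting point of OSTIA-style learning, the following sketch builds a prefix-tree transducer from (input, output) training pairs; the full algorithm would then make the tree onward (pushing outputs toward the root) and merge states in lexicographic order while checking output consistency, steps omitted here for brevity. The class names and sample data are illustrative only.

```java
import java.util.*;

/** Sketch of the first phase of OSTIA-style learning: building a
 *  prefix-tree transducer from (input, output) training pairs.
 *  Onwarding and state merging, the core of OSTIA, are omitted. */
public class PrefixTreeDemo {
    static class Node {
        Map<Character, Node> children = new TreeMap<>();
        String output;            // output attached where a sample ends
    }

    static Node build(Map<String, String> samples) {
        Node root = new Node();
        for (Map.Entry<String, String> e : samples.entrySet()) {
            Node cur = root;
            for (char c : e.getKey().toCharArray()) {
                cur = cur.children.computeIfAbsent(c, k -> new Node());
            }
            cur.output = e.getValue();   // mark the accepting state's output
        }
        return root;
    }

    static int countStates(Node n) {
        int total = 1;
        for (Node child : n.children.values()) total += countStates(child);
        return total;
    }

    public static void main(String[] args) {
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("walk", "walk<v>");
        samples.put("walks", "walk<v><3sg>");
        samples.put("walked", "walk<v><past>");
        Node tree = build(samples);
        // The shared prefix "walk" is stored once; the tree branches after it:
        // root + w,a,l,k + s + e,d = 8 states.
        System.out.println(countStates(tree)); // 8
    }
}
```

State merging would then generalise this tree into a smaller transducer that also transduces unseen inputs, which is what makes the approach attractive for learning the topology of the second-level transducers.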


The programming language will be mostly Java, with C++ used where required.


Contribution

  • The implementation of a module for the flexible handling of languages with infix inflection (and possibly other types) that makes it possible to avoid the potentially infinite enumeration of non-regular rules in a dictionary when working with more complex forms of inflection.
  • The novelty, as far as I know, of the use of transducers to tackle such a problem.
  • Altogether, a better support for languages with more complex forms of inflection.


Work Plan: Timeline

I mostly consider here the exact and specific work for the GSoC project, not the thesis's. The following is an estimate:


  • Pre-stage: refine project, discuss details with my mentor, advisor, and community.


  • Start
  • Week 1-2: get to know Apertium's project, procedures, and community; learn the work and code standards; review C++; begin studying Apertium's architecture and code
  • Week 3: continue studying Apertium's architecture and code
  • Week 4-5: study lttoolbox-java
  • Deliverable #1: a written description of Apertium's architecture, possibly contributed to Apertium's wiki documentation.


  • Week 6-7: Study the problem to solve; study flag diacritics approach; study how to solve the problem
  • Deliverable #2: formal description of the problem to solve


  • Week 8-10: code implementation; parallel testing & documentation
  • Week 11: code implementation; start of more formal test verification
  • Deliverable #3: published code & documentation


  • Week 12: formal test verification


  • Project Completed: Deliverable #4: verified, merged, and runnable module code & documentation


Original Conversation with Jimmy O'Regan

  • [16:04] <jimregan> Jacob and I designed an alternative to flag diacritics on the back of some paper plates last year
  • [16:04] <jimregan> using a second transducer
  • [16:04] <jimregan> (the original motivation behind flag diacritics - which are an ugly hack - was to avoid using a second transducer)
  • [16:05] <jimregan> we use a fairly large number of transducers, in almost every phase of the translation pipeline
  • [16:05] <jimregan> so we thought it wouldn't be unreasonable to have a second transducer to use in state pruning, rather than just balancing special symbols
  • [16:07] <jimregan> the second transducer would basically take the same form as translation rules in apertium-transfer, but with the ability to fully or partially lexicalise (possibly based on a subset of regexes)
  • [16:07] <jimregan> if the initial analysis involves a continuation, pass it to the second for validation
  • [16:07] <jmcejuela> aha
  • [16:07] <jimregan> prune it if it doesn't match
  • [16:08] <jimregan> take the symbols of the second transduction for output if it does
  • [16:08] <jimregan> Jacob was pretty passionate about implementing it himself, but I think he'd be interested in mentoring it
  • [16:10] <jimregan> I just remembered how to summarise it: two level morphology as cascading transducers
  • [16:11] <jimregan> but, as it's cascading, two level could potentially be made n-level