Transducers as flag diacritics and their topology learning

This proposal stems from (1) the suggested project by Jimmy O'Regan in an irc conversation, as a better an desired alternative to their listed idea for GSoC2011 Flag diacritics in lttoolbox using instead second level transducers, and (2) from my own expanded proposal to study and use topology-learned transducers to be placed as such 2nd level transducers.

Short Description

The implementation of a module to handle languages with infix inflection (and possibly other forms), to stop, according to defined constraints, useless and invalid continuation computations as flag diacritics do, for example in the HFST platform, but using instead the novel approach of a second/n level. This is approach is desired as transducers may provide greater power, expressiveness, and flexibility.

Description

The HFST platform uses flag diacritics to remove and stop the computation of illegal compounds, thus providing a better handling of languages with infix inflection. The Apertium project aims for this appropriate handling for such languages and, as documented in their original idea, planned to used flag diacritics as well.

Jimmy O'Regan suggest, however, that a better approach would be to use a second level of cascaded transducers to process the same continuations and, according to the transducers' decision, either reject the input, prune the states of the otherwise continued computation, and finally emit an epsilon symbol as flag diacritics defined by constraint rules do, or process the input and output the transducer's computation.

The pruning of states module is already implemented in lttoolbox-java by Jacob Nordfalk, therefore my work will consist of the (1) design and implementation of such second-level, by nature cascaded n-level if considered sensible, layer of transducers, (2) the consequent changes in the FST compiling processing code and the pipeline to this other level of transducers, (3) with a limited range, verify and validate such implementation with a sample language, (4) the documentation of such development.

Then, and more exclusively for my master thesis, with a higher research and scientific scope, I propose to study topology transducer learning to construct such n-level transducers, working with some learning corpora, and mostly using the OSTIA state-merging algorithm.

The programming language to use will be mostly Java, using C++ if required.

Contribution

The implementation of a module for the flexible handling of languages with infix inflection, and possibly other types, that makes possible to avoid the possibly infinite definition of not regular rules in a dictionary to work with more complex forms of inflection.
The novelty, as far as I know, of the use of transducers to tackle such a problem.
Altogether, a better support for languages with more complex forms of inflection.

Work Plan: Timeline

I mostly consider here the exact and specific work for the GSoC project, not the thesis's. The following is an estimation:

Community Bonding Period: April 25 - May 23:
know well Apertium's project and internals; get to know the community; know work & code standards; review C++; study Apertium's architecture and code
study the lttoolbox-java
study problem to solve; study flag diacritics approach; design plan solution for the problem

Start
Week 1-3: code implementation; parallel testing & documentation
Week 4: code implementation; start of more formal testing
Deliverable #1: module ready for the lttoolbox-java & documentation

Verification on target language
Week 5: apply module to some specific language and validate correctness
Deliverable #2: positive formal verification results

Topology Learning
Week 6: research state-of-the-art methods and algorithms; design solution; ultimate corpus for which to apply the learning method
Week 7-10: implement, review, test, and document the topology learning algorithm
Week 11: verify algorithm
Deliverable #3: topology learning algorithm module ready & documentation

Set & Run
Week 12: (1) Deliver transducer/flag diacritics module into official program; (2) deliver topology learning module into official program

Project Completed: Deliverable #4: verified & delivered code & documentation

Original Conversation with Jimmy O'Regan

 * [16: 04] <jimregan> Jacob and I designed an alternative to flag
 diacritics on the back of some paper plates last year
 * [16:04]  <jimregan> using a second transducer
 * [16:04]  <jimregan> (the original motivation behind flag diacritics -
 which are an ugly hack - was to avoid using a second transducer)
 * [16:05]  <jimregan> we use a fairly large number of transducers, in
 almost every phase of the translation pipeline
 * [16:05]  <jimregan> so we thought it wouldn't be unreasonable to have
 a second transducer to use in state pruning, rather than just
 balancing special symbols
 * [16:07]  <jimregan> the second transducer would basically take the
 same form as translation rules in apertium-transfer, but with the
 ability to fully or partially lexicalise (possibly based on a subset
 of regexes)
 * [16:07]  <jimregan> if the initial analysis involves a continuation,
 pass it to the second for validation
 * [16:07]  <jmcejuela> aha
 * [16:07]  <jimregan> prune it if it doesn't match
 * [16:08]  <jimregan> take the symbols of the second transduction for
 output if it does
 * [16:08]  <jimregan> Jacob was pretty pationate about implementing it
 himself, but I think he'd be interested in mentoring it
 * [16:10]  <jimregan> I just remembered how to summarise it: two level
 morphology as cascading transducers
 * [16:11]  <jimregan> but, as it's cascading, two level could
 potentially be made n-level

Transducers as flag diacritics and their topology learning

Contents

Short Description

Description

Contribution

Work Plan: Timeline

Original Conversation with Jimmy O'Regan

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools