Transducers as flag diacritics and their topology learning
This proposal stems from (1) the suggested project by Jimmy O'Regan in an irc conversation, as a better an desired alternative to their listed idea for GSoC2011 Flag diacritics in lttoolbox using instead second level transducers, and (2) from my own expanded proposal to study and use topology-learned transducers to be placed as such 2nd level transducers.
Contents
Short Description
The implementation of a module to handle languages with infix inflection (and possibly other forms), to stop, according to defined constraints, useless and invalid continuation computations as flag diacritics do, for example in the HFST platform, but using instead the novel approach of a second/n level. This is approach is desired as transducers may provide greater power, expressiveness, and flexibility.
Description
The HFST platform uses flag diacritics to remove and stop the computation of illegal compounds, thus providing a better handling of languages with infix inflection. The Apertium project aims for this appropriate handling for such languages and, as documented in their original idea, planned to used flag diacritics as well.
Jimmy O'Regan suggest, however, that a better approach would be to use a second level of cascaded transducers to process the same continuations and, according to the transducers' decision, either reject the input, prune the states of the otherwise continued computation, and finally emit an epsilon symbol as flag diacritics defined by constraint rules do, or process the input and output the transducer's computation.
The pruning of states module is already implemented in lttoolbox-java by Jacob Nordfalk, therefore my work will consist of the (1) design and implementation of such second-level, by nature cascaded n-level if considered sensible, layer of transducers, (2) the consequent changes in the FST compiling processing code and the pipeline to this other level of transducers, (3) with a limited range, verify and validate such implementation with a sample language, (4) the documentation of such development.
Then, and more exclusively for my master thesis, with a higher research and scientific scope, I propose to study topology transducer learning to construct such n-level transducers, working with some learning corpora, and mostly using the OSTIA state-merging algorithm.
The programming language to use will be mostly Java, using C++ if required.
Contribution
- The implementation of a module for the flexible handling of languages with infix inflection, and possibly other types, that makes possible to avoid the possibly infinite definition of not regular rules in a dictionary to work with more complex forms of inflection.
- The novelty, as far as I know, of the use of transducers to tackle such a problem.
- Altogether, a better support for languages with more complex forms of inflection.
Work Plan: Timeline
I mostly consider here the exact and specific work for the GSoC project, not the thesis's. The following is an estimation:
- Community Bonding Period: April 25 - May 23:
- know well Apertium's project and internals; get to know the community; know work & code standards; review C++; study Apertium's architecture and code
- study the lttoolbox-java
- study problem to solve; study flag diacritics approach; design plan solution for the problem
- Start
- Week 1-3: code implementation; parallel testing & documentation
- Week 4: code implementation; start of more formal testing
- Deliverable #1: module ready for the lttoolbox-java & documentation
- Verification on target language
- Week 5: apply module to some specific language and validate correctness
- Deliverable #2: positive formal verification results
- Topology Learning
- Week 6: research state-of-the-art methods and algorithms; design solution; ultimate corpus for which to apply the learning method
- Week 7-10: implement, review, test, and document the topology learning algorithm
- Week 11: verify algorithm
- Deliverable #3: topology learning algorithm module ready & documentation
- Set & Run
- Week 12: (1) Deliver transducer/flag diacritics module into official program; (2) deliver topology learning module into official program
- Project Completed: Deliverable #4: verified & delivered code & documentation
Original Conversation with Jimmy O'Regan
* [16: 04] <jimregan> Jacob and I designed an alternative to flag diacritics on the back of some paper plates last year * [16:04] <jimregan> using a second transducer * [16:04] <jimregan> (the original motivation behind flag diacritics - which are an ugly hack - was to avoid using a second transducer) * [16:05] <jimregan> we use a fairly large number of transducers, in almost every phase of the translation pipeline * [16:05] <jimregan> so we thought it wouldn't be unreasonable to have a second transducer to use in state pruning, rather than just balancing special symbols * [16:07] <jimregan> the second transducer would basically take the same form as translation rules in apertium-transfer, but with the ability to fully or partially lexicalise (possibly based on a subset of regexes) * [16:07] <jimregan> if the initial analysis involves a continuation, pass it to the second for validation * [16:07] <jmcejuela> aha * [16:07] <jimregan> prune it if it doesn't match * [16:08] <jimregan> take the symbols of the second transduction for output if it does * [16:08] <jimregan> Jacob was pretty pationate about implementing it himself, but I think he'd be interested in mentoring it * [16:10] <jimregan> I just remembered how to summarise it: two level morphology as cascading transducers * [16:11] <jimregan> but, as it's cascading, two level could potentially be made n-level