Transducers as flag diacritics and their topology learning

This proposal stems from (1) the project suggested by Jimmy O'Regan in an IRC conversation, as a better and desired alternative to the listed GSoC 2011 idea Flag diacritics in lttoolbox, using second-level transducers instead, and (2) my own expanded proposal to study and use topology-learned transducers as such second-level transducers.


Short Description

The implementation of a module to handle languages with infix inflection (and possibly other forms) that stops, according to defined constraints, useless and invalid continuation computations, as flag diacritics do (for example in the HFST platform), but using the novel approach of a second, or n-th, level of transducers instead. This approach is desirable because transducers may provide greater power, expressiveness, and flexibility.


Description

The HFST platform uses flag diacritics to stop and remove the computation of illegal compounds, thus providing better handling of languages with infix inflection. The Apertium project aims for the same appropriate handling of such languages and, as documented in its original idea, planned to use flag diacritics as well.
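
To make the mechanism concrete, the following is a minimal sketch in Java (hypothetical class and feature names; this is not HFST code) of the unification check that a @U.FEATURE.VALUE@ flag diacritic performs along an analysis path: a path survives only if all of its diacritics agree on each feature's value.

 import java.util.HashMap;
 import java.util.Map;
 
 /** Sketch of the check behind @U.FEATURE.VALUE@ flag diacritics:
  *  an analysis path is valid only if all of its diacritics agree
  *  on the value of each feature. Names are hypothetical. */
 public class FlagDiacriticCheck {
 
     /** True if the symbol path contains no conflicting @U.F.V@
      *  diacritics; ordinary symbols are ignored. */
     public static boolean pathIsValid(String[] symbols) {
         Map<String, String> features = new HashMap<>();
         for (String s : symbols) {
             if (s.startsWith("@U.") && s.endsWith("@")) {
                 String[] parts = s.substring(3, s.length() - 1).split("\\.");
                 if (parts.length != 2) continue;  // ignore malformed symbols
                 String feature = parts[0], value = parts[1];
                 String current = features.get(feature);
                 if (current == null) {
                     features.put(feature, value); // first mention: record it
                 } else if (!current.equals(value)) {
                     return false;                 // conflict: prune this path
                 }
             }
         }
         return true;
     }
 
     public static void main(String[] args) {
         // a hypothetical compound whose parts disagree on CASE is pruned
         System.out.println(pathIsValid(new String[] {
                 "talo", "@U.CASE.NOM@", "ssa", "@U.CASE.INE@" })); // false
         System.out.println(pathIsValid(new String[] {
                 "talo", "@U.CASE.INE@", "ssa", "@U.CASE.INE@" })); // true
     }
 }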

Jimmy O'Regan suggests, however, that a better approach would be to use a second level of cascaded transducers to process the same continuations. According to the second transducer's decision, the input is either rejected, pruning the states of the otherwise continued computation and emitting an epsilon symbol, just as flag diacritics defined by constraint rules do, or accepted, in which case the transducer's own computation is output.
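
A minimal sketch of this control flow, assuming a hypothetical Transducer interface (this is not the actual lttoolbox-java API):

 import java.util.ArrayList;
 import java.util.List;
 
 /** Sketch of the proposed cascade: a second-level transducer
  *  validates the continuations proposed by the first one. */
 interface Transducer {
     /** All analyses of the input; empty if the input is rejected. */
     List<String> analyses(String input);
 }
 
 public class CascadedAnalyser {
     private final Transducer first;   // morphological transducer
     private final Transducer second;  // constraint/validation transducer
 
     public CascadedAnalyser(Transducer first, Transducer second) {
         this.first = first;
         this.second = second;
     }
 
     /** Each candidate from the first transducer is passed to the
      *  second: if the second rejects it, the path is pruned, as a
      *  failing flag diacritic would do; if it accepts, the symbols
      *  of the second transduction are emitted instead. */
     public List<String> analyse(String surfaceForm) {
         List<String> out = new ArrayList<>();
         for (String candidate : first.analyses(surfaceForm)) {
             out.addAll(second.analyses(candidate)); // empty == pruned
         }
         return out;
     }
 }

Because the second level exposes the same interface as the first, the same scheme extends naturally from two levels to n levels by chaining analysers.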

The state-pruning module is already implemented in lttoolbox-java by Jacob Nordfalk; my work will therefore consist of (1) the design and implementation of such a second-level layer of transducers, cascadable by nature to n levels if considered sensible, (2) the consequent changes to the FST compilation code and to the pipeline so as to feed this additional level of transducers, (3) the verification and validation, within a limited scope, of the implementation on a sample language, and (4) the documentation of the development.


Then, more exclusively for my master's thesis and with a broader research scope, I propose to study transducer topology learning to construct such n-level transducers, working with learning corpora and mostly using the OSTIA state-merging algorithm.
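
As an illustration of the machinery involved, here is a hedged sketch of OSTIA's starting point: building the prefix-tree transducer from input/output training pairs and making it onward, i.e. pushing each state's longest common output prefix towards the initial state. All names are hypothetical, and the state-merging loop of the full algorithm is omitted.

 import java.util.ArrayList;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
 
 /** Sketch of OSTIA's first phase: an onward prefix-tree transducer.
  *  The state-merging loop of the full algorithm is omitted. */
 public class OnwardPTT {
     static class State {
         Map<Character, State> next = new HashMap<>();   // transitions
         Map<Character, String> out = new HashMap<>();   // edge outputs
         String finalOut = null;                         // output if final
     }
 
     final State root = new State();
 
     /** Insert a training pair; the whole output is stored at the leaf. */
     void add(String input, String output) {
         State s = root;
         for (char c : input.toCharArray()) {
             s.out.putIfAbsent(c, "");
             s = s.next.computeIfAbsent(c, k -> new State());
         }
         s.finalOut = output;
     }
 
     /** Make the tree onward: recursively hoist each state's longest
      *  common prefix of outgoing and final outputs onto the edge that
      *  enters it. The value returned at the root is a common prefix
      *  of every output and can be emitted unconditionally. */
     String makeOnward(State s) {
         List<String> outs = new ArrayList<>();
         for (char c : s.next.keySet()) {
             String pushed = makeOnward(s.next.get(c));
             s.out.put(c, s.out.get(c) + pushed);
             outs.add(s.out.get(c));
         }
         if (s.finalOut != null) outs.add(s.finalOut);
         String lcp = longestCommonPrefix(outs);
         for (char c : s.out.keySet())
             s.out.put(c, s.out.get(c).substring(lcp.length()));
         if (s.finalOut != null)
             s.finalOut = s.finalOut.substring(lcp.length());
         return lcp;
     }
 
     static String longestCommonPrefix(List<String> xs) {
         if (xs.isEmpty()) return "";
         String p = xs.get(0);
         for (String x : xs)
             while (!x.startsWith(p)) p = p.substring(0, p.length() - 1);
         return p;
     }
 }

Full OSTIA then repeatedly tries to merge states of this onward tree, keeping a merge only when the outputs can be reconciled; a merge that would force two different outputs for the same input is rejected.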


The programming language will be mostly Java, with C++ used where required.


Contribution

  • The implementation of a module for the flexible handling of languages with infix inflection, and possibly other types, that makes it possible to avoid the potentially unbounded enumeration of irregular rules in a dictionary when working with more complex forms of inflection.
  • The novelty, as far as I know, of the use of transducers to tackle such a problem.
  • Altogether, a better support for languages with more complex forms of inflection.


Work Plan: Timeline

I mostly consider here the exact and specific work for the GSoC project, not for the thesis. The following is an estimate:


  • Community Bonding Period: April 25 - May 23:
  • get to know the Apertium project and its internals well; get to know the community; learn the work & code standards; review C++; study Apertium's architecture and code
  • study lttoolbox-java
  • study problem to solve; study flag diacritics approach; design plan solution for the problem


  • Start
  • Week 1-3: code implementation; parallel testing & documentation
  • Week 4: code implementation; start of more formal testing
  • Deliverable #1: module ready for the lttoolbox-java & documentation


  • Verification on target language
  • Week 5: apply module to some specific language and validate correctness
  • Deliverable #2: positive formal verification results


  • Topology Learning
  • Week 6: research state-of-the-art methods and algorithms; design the solution; finalize the corpus on which to apply the learning method
  • Week 7-10: implement, review, test, and document the topology learning algorithm
  • Week 11: verify algorithm
  • Deliverable #3: topology learning algorithm module ready & documentation


  • Set & Run
  • Week 12: (1) deliver the transducer/flag-diacritics module into the official program; (2) deliver the topology learning module into the official program


  • Project Completed: Deliverable #4: verified & delivered code & documentation


Original Conversation with Jimmy O'Regan

 * [16:04] <jimregan> Jacob and I designed an alternative to flag diacritics on the back of some paper plates last year
 * [16:04] <jimregan> using a second transducer
 * [16:04] <jimregan> (the original motivation behind flag diacritics - which are an ugly hack - was to avoid using a second transducer)
 * [16:05] <jimregan> we use a fairly large number of transducers, in almost every phase of the translation pipeline
 * [16:05] <jimregan> so we thought it wouldn't be unreasonable to have a second transducer to use in state pruning, rather than just balancing special symbols
 * [16:07] <jimregan> the second transducer would basically take the same form as translation rules in apertium-transfer, but with the ability to fully or partially lexicalise (possibly based on a subset of regexes)
 * [16:07] <jimregan> if the initial analysis involves a continuation, pass it to the second for validation
 * [16:07] <jmcejuela> aha
 * [16:07] <jimregan> prune it if it doesn't match
 * [16:08] <jimregan> take the symbols of the second transduction for output if it does
 * [16:08] <jimregan> Jacob was pretty passionate about implementing it himself, but I think he'd be interested in mentoring it
 * [16:10] <jimregan> I just remembered how to summarise it: two level morphology as cascading transducers
 * [16:11] <jimregan> but, as it's cascading, two level could potentially be made n-level