Difference between revisions of "Apertium Turkic/TODO"

From Apertium

Revision as of 21:28, 15 January 2014

This is a general to-do list for the Apertium Turkic working group.

Website

This section outlines what's left to get http://turkic.apertium.com/ up and running.

software infrastructure

optional: spell checker and language detection

what to include

make the following pairs available to the site:

  • pairs: kaz-tat, tur-kir, kaz-kir, tat-bak, kaz-kaa, tuk-tur?, tur-uzb?, kaz-eng?
  • transducers: kaz, tat, kir, tur, bak, chv, kum, nog, kaa, uzb?, tuk?

prettifying

future

  • consider including the web concordancer on the site (and consider what corpora to provide search access to...)

Things that need to be figured out

Issues introduced by new build process

  • How can we do single-category testvoc now?
    • Since Turkic languages have very few paradigms, we can pick one representative stem per paradigm and run testvoc only on the part of the source-language transducer that begins with that stem. Instructions to come.
  • How can we make vanilla transducers (without MT-specific "wrong" POSes)?
    • The problem is that "! Use/xxx-yyy" lines can no longer simply be grepped out to build the vanilla transducer, since the xxx-yyy transducers need them. That is, pairs no longer copy the lexc file; they copy the full transducer (compiled without any trimming) and then trim it directly against the bidix.
  • How can we count trimmed stems?
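For reference, the old "grep out the Use lines" step mentioned above could be sketched roughly as follows. This is only an illustration, not the actual build rule: the marker pattern assumes the comment looks like "! Use/xxx-yyy" somewhere on the entry line, and the sample entries, the continuation class N1 and the function name split_vanilla are all made up for the example.

```python
import re

# Assumed marker shape: a lexc comment such as "! Use/xxx-yyy" on an entry line.
USE_MARKER = re.compile(r"!\s*Use/(\S+)")


def split_vanilla(lexc_text):
    """Split lexc source into (vanilla_lines, pair_specific_lines).

    Lines carrying a "! Use/..." comment are treated as pair-specific;
    every other line is kept for the vanilla transducer.
    """
    vanilla, pair_specific = [], []
    for line in lexc_text.splitlines():
        if USE_MARKER.search(line):
            pair_specific.append(line)
        else:
            vanilla.append(line)
    return vanilla, pair_specific


# Hypothetical lexc fragment; the entries are invented for illustration.
SAMPLE = """\
LEXICON Nouns
alma:alma N1 ; ! "apple"
bar:bar N1 ; ! "exists" ! Use/xxx-yyy
"""

vanilla, pair_specific = split_vanilla(SAMPLE)
print(len(vanilla), "vanilla lines,", len(pair_specific), "pair-specific lines")
```

Under the new build process this filtering alone is not enough, because the pairs start from the compiled transducer rather than from the lexc file, so the open question above still stands.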