Apertium Turkic/TODO
Revision as of 11:58, 16 January 2014 by Unhammer (talk | contribs) (→Issues introduced by new build process)
This is a general to-do list for the Apertium Turkic working group.
Website
This section outlines what's left to get http://turkic.apertium.com/ up and running.
software infrastructure
- Get apertium-apy working stably
- merge simple-html and html-tools so that simple-html can be automatically extracted from html-tools
- apache forwarding for html-tools (unnecessary!)
- init scripts and cron testers for apertium-html-tools, gateway, and apertium-apy
- find some way to have it retry restarting if it fails because the port is still reserved by the OS
optional: spell checker and language detection stuff
- spell checking mode in apertium-apy
- integrate spell checker interface into html-tools
- get language detection interface working
- language detection mode in apertium-apy (prototype done)
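For the "retry restarting if the port is still reserved" item above, one possible approach is to retry the bind a few times before giving up. This is only a minimal sketch in Python using the standard `socket` module; the function name `bind_with_retry` and its parameters are hypothetical, and it does not reflect how apertium-apy actually opens its listening socket.

```python
import errno
import socket
import time


def bind_with_retry(port, host="", attempts=5, delay=1.0):
    """Try to bind a listening socket, retrying while the address is in use.

    Hypothetical helper for illustration; not part of apertium-apy.
    """
    last_error = None
    for _ in range(attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Ask the OS to allow rebinding while old connections linger.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
            sock.listen(5)
            return sock
        except OSError as err:
            sock.close()
            # Only retry on "address already in use"; re-raise anything else.
            if err.errno != errno.EADDRINUSE:
                raise
            last_error = err
            time.sleep(delay)
    raise last_error
```

The same idea could instead live in the init script (sleep and relaunch while the process exits with a bind error); which layer is the better fit is still open.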
what to include
make the following pairs available to the site:
- pairs: kaz-tat, tur-kir, kaz-kir, tat-bak, kaz-kaa, tuk-tur?, tur-uzb?, kaz-eng?
- transducers: kaz, tat, kir, tur, bak, chv, kum, nog, kaa, uzb?, tuk?
prettifying
- localised language names in analysis, generation, and spell-check modes
- get a working theme together
- make sandbox mode disabled unless an appropriate switch is passed to apertium-html-tools
- add a note (localised to various languages) along the lines of "Found a mistake? Help us fix it!" with link to Apertium Turkic
future
- consider including the web concordancer on the site (and consider what corpora to provide search access to...)
Things that need to be figured out
- How can we count lexc stems effectively?
- JNW's bash script can be generalised (and rewritten in Python), and it'll come close
- see The Right Way to count lexc stems
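As a rough illustration of what a Python stem counter might look like (this is not JNW's script and not "The Right Way"; it is a naive sketch), the code below counts unique upper-side forms in a caller-supplied set of stem lexicons. The function names and the idea of passing `stem_lexicons` explicitly are assumptions made for the example. It ignores stems containing escaped spaces or escaped colons, so it will only come close on real files.

```python
import re

LEXICON = re.compile(r"^LEXICON\s+(\S+)")
# form, continuation class, terminating semicolon
ENTRY = re.compile(r"^(\S+)\s+(\S+)\s*;")


def strip_comment(line):
    """Drop everything after an unescaped '!' (lexc escapes with '%')."""
    out = []
    escaped = False
    for ch in line:
        if escaped:
            out.append(ch)
            escaped = False
        elif ch == "%":
            out.append(ch)
            escaped = True
        elif ch == "!":
            break
        else:
            out.append(ch)
    return "".join(out).strip()


def count_stems(lexc_text, stem_lexicons):
    """Count unique upper-side forms in the named lexicons (naive)."""
    stems = set()
    current = None
    for raw in lexc_text.splitlines():
        line = strip_comment(raw)
        if not line:
            continue
        match = LEXICON.match(line)
        if match:
            current = match.group(1)
            continue
        if current not in stem_lexicons:
            continue
        match = ENTRY.match(line)
        if match:
            # Keep only the upper (lemma) side of "upper:lower".
            stems.add(match.group(1).split(":", 1)[0])
    return len(stems)
```

Calling `count_stems(text, {"Nouns", "Verbs"})` on a file's contents would count each lemma once, even if it appears on several lines, and would skip commented-out entries.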
Issues introduced by new build process
- How can we do single-category testvoc now?
- Since Turkic languages have very few paradigms, we can just use a representative stem for each paradigm and do a testvoc on that prefix of the source-language transducer. Instructions to come.
- How can we make vanilla transducers (without MT-specific "wrong" POSes)?
- The problem is that "! Use/xxx-yyy" lines can't just be grepped out in the vanilla transducer anymore, since those are needed for the xxx-yyy transducers. That is, we're no longer just copying the lexc file, but copying the full transducer (no trimming before compilation), and trimming the transducer directly (based on the bidix) for use in pairs. Ideas: /Use/MT
- How can we count trimmed stems?
- Counting unique stems on each side of the bidix should give us the equivalent.
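The bidix-based count suggested above could be sketched roughly as follows. This assumes the usual bidix layout of `<e>` entries containing `<p><l>…</l><r>…</r></p>` pairs or `<i>` identity elements, and it treats a "stem" as the lemma plus its first `<s n="…"/>` tag; that definition, and the function names, are assumptions for illustration only. Regex entries and other special cases are ignored.

```python
import xml.etree.ElementTree as ET


def side_key(node):
    """Return (lemma, first tag) for an <l>, <r> or <i> element."""
    parts = [node.text or ""]
    for child in node:
        if child.tag == "b":
            # <b/> marks a blank inside a multiword lemma.
            parts.append(" ")
        parts.append(child.tail or "")
    first_tag = node.find("s")
    tag = first_tag.get("n") if first_tag is not None else None
    return ("".join(parts).strip(), tag)


def count_bidix_stems(bidix_xml):
    """Count unique stems on the left and right side of a bidix string."""
    root = ET.fromstring(bidix_xml)
    left, right = set(), set()
    for entry in root.iter("e"):
        for pair in entry.findall("p"):
            l_node, r_node = pair.find("l"), pair.find("r")
            if l_node is not None:
                left.add(side_key(l_node))
            if r_node is not None:
                right.add(side_key(r_node))
        for ident in entry.findall("i"):
            # Identity entries contribute the same stem to both sides.
            key = side_key(ident)
            left.add(key)
            right.add(key)
    return len(left), len(right)
```

The function returns a `(left_count, right_count)` tuple, so one source stem with two translations counts once on the left and twice on the right.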