Difference between revisions of "User:Firespeaker/Apertium-turkic talk outline"

Revision as of 08:09, 1 October 2012

Sketch for the talk "Writing Turkic-language morphological transducers using HFST (for MT)", to be given on October 2nd.

Morphological transducers: what and why

  • slide 1: definition, example (sample input/output)
  • slide 2: use in RBMT, specifically apertium
  • slide 3: other uses: spell checkers, ...?
  • also mention: why we use orthography, not some transcription
    • spell-checkers
    • accessibility by native speakers
    • no need for pre/post-processing
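
For slide 1, the sample input/output could look something like the sketch below. It is only a toy stand-in: a real transducer is a compiled finite-state network, not a lookup table. The Turkish forms and the Apertium-style tags (`<n>`, `<pl>`, `<loc>`, `<nom>`) are illustrative assumptions, as are the function names `analyse` and `generate`.

```python
# Toy stand-in for a morphological transducer: a finite relation between
# surface forms (orthography) and lexical forms (lemma + tags).
# The pairs and tag names are illustrative assumptions, not real project data.
PAIRS = [
    ("ev", "ev<n><nom>"),
    ("evler", "ev<n><pl><nom>"),
    ("evde", "ev<n><loc>"),
    ("evlerde", "ev<n><pl><loc>"),
]


def analyse(surface):
    """Surface form -> list of lexical analyses (empty if unknown)."""
    return [lex for surf, lex in PAIRS if surf == surface]


def generate(lexical):
    """Lexical form -> list of surface forms (empty if unknown)."""
    return [surf for surf, lex in PAIRS if lex == lexical]


if __name__ == "__main__":
    print(analyse("evlerde"))
    print(generate("ev<n><pl><nom>"))
```

The point for the slide is that the same relation is read in both directions: analysis on the source-language side of an MT system, generation on the target-language side.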

Turkic languages

Geographical/demographic overview of Turkic languages

  • slides 4, 5?
    • a map, numbers of speakers, wikipedia presence

Morphological and phonological properties encountered in Turkic languages

(these are all to be taken as "challenges for morphological transducers")

  • slide 5: Agglutination
  • slide 6: Vowel harmony
  • slide 7: Consonantal processes
  • slide 8: "buffer" segments
  • slide 9: phonology of numerals and acronyms
  • slide 10: Cyrillic orthographical issues
  • something on morpho-syntactic issues that have come up a lot
    • no suffix attaches to "any word", "any part of speech", or even, e.g., "all nouns"; suffixes often recur only in very specific environments, almost as if we had dozens of POSes
      • we don't want to overanalyse (or overgenerate)
        • disambig issues
        • testvoc issues
    • Adjective classes (e.g., whether used as <attr>/<subst>/<advl>, +comparative, etc.)
    • Non-finite verb forms
    •  ?
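
To make the vowel harmony and consonantal-process slides concrete, a minimal sketch of what the transducer has to get right, using Turkish as the example language. The archiphoneme symbols `A` and `D`, the function name `realise`, and the simplified rules (two-way backness harmony, voicing assimilation after voiceless consonants) are assumptions for illustration; this is plain Python, not how the rules are actually written in twol.

```python
# Simplified Turkish phonology, assumed for illustration only.
BACK = set("aıou")
FRONT = set("eiöü")
VOICELESS = set("pçtkfhsş")


def realise(stem, suffix):
    """Attach a suffix, resolving archiphonemes left to right.

    'A' becomes a/e by backness of the nearest preceding vowel;
    'D' becomes t after a voiceless consonant, d otherwise.
    Assumes the stem contains at least one vowel.
    """
    out = stem
    for ch in suffix:
        if ch == "A":
            last_vowel = next(c for c in reversed(out) if c in BACK | FRONT)
            out += "a" if last_vowel in BACK else "e"
        elif ch == "D":
            out += "t" if out[-1] in VOICELESS else "d"
        else:
            out += ch
    return out


if __name__ == "__main__":
    for stem, suffix in [("ev", "lAr"), ("kız", "lArDA"), ("kitap", "DA")]:
        print(stem, "+", suffix, "->", realise(stem, suffix))
```

Because each archiphoneme is resolved against what has already been emitted, harmony propagates through stacked suffixes, which is the interaction with agglutination that the slides are meant to show.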

Developing a morphological transducer

  • Important resources to start with:
    • a corpus
    • some grammars and dictionaries
    • linguistic knowledge of the language (if you want to get into it deeply)
    • native speakers!
      • ability to work with informants
      • patience!
      • cf. Chuvash (i.e., the native speakers hopefully agree on forms)

HFST and how we use it

  • slide: HFST: what and who
  • slide: our purposes: using two two-level systems together for a three-level system (?):
    • slide: overview of lexc and why it was chosen
    • slide: overview of twol and why it was chosen

Examples: how morphophonological issues above are dealt with

  • bing
  • bang
  • bam

State of affairs now with apertium-turkic