User:Firespeaker/Apertium-turkic talk outline

Latest revision as of 08:27, 1 October 2012

Sketch for a talk on "Writing Turkic-language morphological transducers using HFST (for MT)" (http://cl.indiana.edu/wiki/Fall2012ClingDing), to be given on October 2nd.

Abstract

This talk outlines the development of Free/open-source morphological transducers for Turkic languages using HFST (the Helsinki Finite State Toolkit). It reviews the morphological, phonological, and orthographic challenges encountered in Turkic languages and presents working solutions. It also covers the reasons for developing morphological transducers, how they can benefit the communities that use these languages, and the current development status of various Turkic morphological transducers.

Morphological transducers: what and why

  • slide 1: definition, example (sample input/output)
  • slide 2: use in RBMT (rule-based machine translation), specifically Apertium
    • Apertium as MT for lesser-used/marginalised languages
  • slide 3: other uses: spell checkers, ...?
  • also mention: why we use orthography, not some transcription
    • spell-checkers
    • accessibility by native speakers (as devs and as end-users)
    • no need for pre/post-processing
    • effect: working in the standard orthography makes development harder, but it makes the result useful to the communities that use the language
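
As a possible illustration for slide 1, the sketch below treats a morphological transducer as a relation between surface forms and analyses that can be read in both directions. It is only a toy: the word list, the tag names, and the function names `analyse` and `generate` are assumptions made for this example, and a real transducer encodes the relation as a finite-state network rather than as a list of pairs.

```python
# Toy stand-in for a morphological transducer: a small hand-written relation
# between surface forms and analyses. Entries and tags are illustrative only.
PAIRS = [
    ("ev", "ev<n><nom>"),
    ("evler", "ev<n><pl><nom>"),
    ("evlerde", "ev<n><pl><loc>"),
    ("kitaplar", "kitap<n><pl><nom>"),
]


def analyse(surface):
    """Return every analysis paired with the given surface form."""
    return sorted(a for s, a in PAIRS if s == surface)


def generate(analysis):
    """Return every surface form paired with the given analysis."""
    return sorted(s for s, a in PAIRS if a == analysis)


if __name__ == "__main__":
    print(analyse("evler"))
    print(generate("ev<n><pl><loc>"))
```

The same relation serves analysis (surface form in, lemma and tags out) and generation (lemma and tags in, surface form out), which is the property that makes a transducer usable on both sides of an MT pipeline.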

Turkic languages

Geographical/demographic overview of Turkic languages

  • slides 4, 5?
    • a map, numbers of speakers, Wikipedia presence

Morphological and phonological properties encountered in Turkic languages

(these are all to be taken as "challenges for morphological transducers")

  • slide 5: Agglutination
  • slide 6: Vowel harmony
  • slide 7: Consonantal processes
  • slide 8: "buffer" segments
  • slide 9: phonology of numerals and acronyms
  • slide 10: Cyrillic orthographical issues
  • something on morpho-syntactic issues that have come up a lot
    • no suffix can attach to "any word", "any part of speech", or even, e.g., "all nouns"; suffixes often recur only in very specific environments, so it is almost as if we had dozens of parts of speech
      • We don't want to overanalyse (or overgenerate)
        • disambiguation issues
        • testvoc issues
    • Adjective classes (e.g., whether used as <attr>/<subst>/<advl>, +comparative, etc.)
    • Non-finite verb forms
    •  ?
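
Two of the properties listed above, vowel harmony and "buffer" segments, can be sketched with standard Turkish examples. The code below is a minimal sketch and not how the transducers themselves are written: the function names are hypothetical, only lowercase native-pattern stems are handled, and exceptions and stem-final consonant alternations (e.g. kitap / kitabı) are deliberately ignored.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
VOWELS = BACK_VOWELS | FRONT_VOWELS

# Four-way harmony: the high suffix vowel agrees with the last stem vowel
# in both backness and rounding.
HIGH_VOWEL_FOR = {
    "a": "ı", "ı": "ı",
    "e": "i", "i": "i",
    "o": "u", "u": "u",
    "ö": "ü", "ü": "ü",
}


def last_vowel(stem):
    """Return the last vowel of the stem, which drives harmony."""
    for ch in reversed(stem):
        if ch in VOWELS:
            return ch
    raise ValueError(f"no vowel in stem: {stem!r}")


def plural(stem):
    """Two-way harmony: -lar after back vowels, -ler after front vowels."""
    return stem + ("lar" if last_vowel(stem) in BACK_VOWELS else "ler")


def accusative(stem):
    """Four-way harmony, with buffer 'y' after a vowel-final stem."""
    buffer = "y" if stem[-1] in VOWELS else ""
    return stem + buffer + HIGH_VOWEL_FOR[last_vowel(stem)]


if __name__ == "__main__":
    for stem in ["ev", "okul", "göz", "araba"]:
        print(stem, plural(stem), accusative(stem))
```

Even this small sketch shows why these properties are a challenge: the shape of each suffix depends on the stem it attaches to, so suffixes cannot be stored as fixed strings.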

Developing a morphological transducer

  • Important resources to start with:
    • a corpus
    • some grammars and dictionaries
    • linguistic knowledge of the language (if you want to get into it deeply)
    • native speakers!
      • ability to work with informants
      • patience!
      • cf. Chuvash (i.e., the native speakers hopefully agree on forms)

HFST and how we use it

  • slide: HFST: what and who
  • slide: our purposes: using two two-level systems together for a three-level system (?):
    • slide: overview of lexc and why it was chosen
    • slide: overview of twol and why it was chosen
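
The "three-level" idea can be pictured as two mappings chained together: a lexicon stage from lemma and tags to an intermediate morphophonemic form, and a phonology stage from that form to the surface form. The Python below is only an analogy and is not lexc or twol syntax. The archiphoneme notation `{A}` and `{D}`, the `>` morpheme boundary, the suffix table, and the function names are assumptions made for this sketch. Real twol rules are also parallel constraints, not the left-to-right procedure used here.

```python
import re

BACK_VOWELS = set("aıou")
VOWELS = BACK_VOWELS | set("eiöü")
VOICELESS = set("çfhkpsşt")

# Stage 1 (lexicon-like): tags map to suffixes in morphophonemic form.
SUFFIXES = {"<n>": "", "<nom>": "", "<pl>": ">l{A}r", "<loc>": ">{D}{A}"}


def to_morphophonemic(analysis):
    """'kitap<n><pl><loc>' -> 'kitap>l{A}r>{D}{A}' (intermediate level)."""
    lemma, *tags = re.findall(r"<[^>]+>|[^<]+", analysis)
    return lemma + "".join(SUFFIXES[tag] for tag in tags)


def to_surface(form):
    """Stage 2 (phonology-like): resolve archiphonemes, drop boundaries."""
    out = []
    i = 0
    while i < len(form):
        if form.startswith("{A}", i):
            # Low vowel: harmonise with the last vowel already produced.
            prev = next((c for c in reversed(out) if c in VOWELS), "e")
            out.append("a" if prev in BACK_VOWELS else "e")
            i += 3
        elif form.startswith("{D}", i):
            # Dental stop: voiceless after a voiceless consonant.
            out.append("t" if out and out[-1] in VOICELESS else "d")
            i += 3
        elif form[i] == ">":
            i += 1  # morpheme boundary has no surface realisation
        else:
            out.append(form[i])
            i += 1
    return "".join(out)


def generate(analysis):
    """Chain the two stages: analysis -> intermediate -> surface."""
    return to_surface(to_morphophonemic(analysis))


if __name__ == "__main__":
    print(to_morphophonemic("kitap<n><pl><loc>"))
    print(generate("kitap<n><pl><loc>"))
```

Keeping the two stages separate means the lexicon only has to state which suffixes follow which stems, while harmony and assimilation are stated once in the phonology stage and apply to every suffix.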

Examples: how morphophonological issues above are dealt with

  • bing
  • bang
  • bam

State of affairs now with apertium-turkic

  • Turkic languages
  • mailing list
  • future work
    • disambiguation
    • more pairs
    • more languages