User:Firespeaker/Templatic bidix


I have an idea that I think would make translations better (through more explicit mappings between languages, as well as arbitrary structure mapping) and development easier. It would work by offloading disambiguation and "syntax" to the bidix, which would accept "translation templates" instead of "words".

This would create a few issues:

  • The user would then have to know the languages in depth to even begin working on a bidix. But isn't this already the ideal case?
  • It would require an addition to, or a rewrite of, bidix (maybe best to fork it and release it as something different)
  • There would need to be some way to rank preferences between different possible mappings
    • Tokenisation / longest-match
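One way the longest-match preference could work is sketched below. This is only an illustration under my own assumptions: the `MAPPINGS` table, the `translate_longest_match` function, and the representation of analyses as plain strings are all hypothetical, and the `@` prefix is just a marker used here for tokens with no mapping.

```python
# Hypothetical mapping table: a tuple of source tokens -> target string.
# The multi-token entry competes with the two single-token entries.
MAPPINGS = {
    ("аркылуу<post>",): "via<prep>",
    ("хип-хоп<n>",): "hip-hop<n>",
    ("хип-хоп<n>", "аркылуу<post>"): "through<prep> hip-hop<n>",
}


def translate_longest_match(tokens, mappings):
    """Greedy left-to-right lookup that prefers the longest matching source sequence."""
    max_len = max(len(source) for source in mappings)
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in mappings:
                output.append(mappings[key])
                i += n
                break
        else:
            # No mapping starts here: pass the token through, marked with "@".
            output.append("@" + tokens[i])
            i += 1
    return output


result = translate_longest_match(["хип-хоп<n>", "аркылуу<post>"], MAPPINGS)
```

Here the two-token entry wins over the two single-token entries, so `result` holds one target string rather than two. A real system would probably need something richer than length alone (explicit weights, for example), but length is the simplest ranking to start from.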

Test cases

English/Turkic translations mostly

a long example

  • Хип-хоптун алгачкы хореографы, америкалык өнөрпоз жергиликтүү бийчилер менен жолугушуп, хип-хоп аркылуу ич ара араздашууну жөнгө салуу тажрыйбасын көрсөтүп, маданияттын бул түрү аркылуу жаштарды туура жолго салып, ак жолтой келечек курса болот деген көз карашын жайылтууда.
  • The first hip-hop choreographer, an American specialist, met with local dancers, presented his experience in settling internal disagreements through hip-hop, and advanced his stance that through this sort of culture you can set youth on the right path and build a bright future.

mappings needed

  • [1]<n><gen> [2]<det> [3]<n><px3sp> = the [2]<det> [1]<n> [3]<n>
  • хип-хоп<n> = hip-hop<n>
  • алгачкы<det> = first<det>
  • хореограф<n> = choreographer<n>
  • америкалык<adj> = American<adj>
  • өнөрпоз<n> = specialist<n>
  • жергиликтүү<adj> = local<adj>
  • бийчи<n> = dancer<n>
  • <pl> = <pl> (a fall-back default?)
  • ( [1 <n>|(<np>.*)] ~ [2 <n>|(<np>.*)] менен ) жолук<v><coop>[3 _tags_] = [1] meet<v>[3] with<prep> [2]
  • [1 <n>] аркылуу<post> = via<prep> [1]<n>
    • [1 <n>] аркылуу<post> = through<prep> [1]<n>
  • араздашууну жөнгө салуу тажрыйбасын көрсөтүп, маданияттын бул түрү аркылуу жаштарды туура жолго салып, ак жолтой келечек курса болот деген көз карашын жайылтууда.
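As a rough sketch of how a mapping with a numbered slot might be applied, the code below handles the template `[1 <n>] аркылуу<post> = through<prep> [1]<n>` from the list above. Everything about the representation is an assumption made for illustration: the `TEMPLATE` dictionary, the `("slot", index, tag)` tuples, the `LEXICON` table, and the function names are hypothetical, not an existing Apertium format or API.

```python
# Hypothetical word-level entries, taken from the mappings listed above.
LEXICON = {
    "хип-хоп<n>": "hip-hop<n>",
    "бийчи<n>": "dancer<n>",
}

# Source side: ("slot", 1, "<n>") matches any token ending in that tag;
# a plain string must match the token exactly.
# Target side: ("slot", 1) is replaced by the translation of whatever slot 1 bound.
TEMPLATE = {
    "source": [("slot", 1, "<n>"), "аркылуу<post>"],
    "target": ["through<prep>", ("slot", 1)],
}


def match_template(template, tokens):
    """Return {slot index: source token} if the tokens fit the source side, else None."""
    source = template["source"]
    if len(source) != len(tokens):
        return None
    bindings = {}
    for pattern, token in zip(source, tokens):
        if isinstance(pattern, tuple):
            _, index, tag = pattern
            if not token.endswith(tag):
                return None
            bindings[index] = token
        elif pattern != token:
            return None
    return bindings


def apply_template(template, tokens, lexicon):
    """Build the target side, translating each bound slot through the lexicon."""
    bindings = match_template(template, tokens)
    if bindings is None:
        return None
    output = []
    for item in template["target"]:
        if isinstance(item, tuple):
            bound = bindings[item[1]]
            # Tokens missing from the lexicon are passed through, marked with "@".
            output.append(lexicon.get(bound, "@" + bound))
        else:
            output.append(item)
    return output


translated = apply_template(TEMPLATE, ["хип-хоп<n>", "аркылуу<post>"], LEXICON)
```

The postposition follows the noun on the source side and the preposition precedes it on the target side, so the reordering falls out of the template itself rather than from a separate transfer rule. This sketch ignores tag variables such as `[3 _tags_]` and the `~` gap notation.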

GSoC task

Templatic bidix (Hard)

How? (required skills)

Python, XML, C++

What? (description)

Design a format similar to bidix (declarative XML establishing language 1 <> language 2 correspondences) that allows the use of templates, as well as the back-end to process it (i.e., it should compile into an FST). It should deal with discontiguous multiwords and complex multiwords, allowing them to be easily translated, and should provide some mechanism (some sort of ranking) to deal with multiple matching sets of templates for a given translation (similar to CG). It should essentially allow one to bypass transfer rules and disambiguation and produce similar (if not better) accuracy in translation.
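To give a feel for what matching a discontiguous multiword involves, here is a minimal sketch. It assumes analyses are plain strings and that the pattern elements must appear in order with any number of intervening tokens; the function name `match_discontiguous` and the `GAP<adv>` placeholder token are hypothetical, and a real back-end would compile this into an FST rather than scan lists.

```python
def match_discontiguous(pattern, tokens):
    """Return the token indices matching each pattern item in order, allowing gaps.

    Returns None if some pattern item cannot be found after the previous one.
    """
    indices = []
    position = 0
    for item in pattern:
        # Skip intervening tokens until the next pattern item is found.
        while position < len(tokens) and tokens[position] != item:
            position += 1
        if position == len(tokens):
            return None
        indices.append(position)
        position += 1
    return indices


# "менен ... жолук<v><coop>" with a placeholder token sitting in between.
pattern = ["менен", "жолук<v><coop>"]
tokens = ["бийчи<n><pl>", "менен", "GAP<adv>", "жолук<v><coop>"]
found = match_discontiguous(pattern, tokens)
```

In this example `found` is `[1, 3]`: the two parts of the multiword are located even though another token separates them. The skipped tokens would still need their own mappings, and an unrestricted gap like this would overmatch in practice, which is part of why some ranking mechanism is needed.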

Why? (rationale)

A templatic bidix forces the designer of a language pair to be more explicit, and also consolidates pair development in one place. Furthermore, there are several types of phenomena that such a system could deal with and that are currently highly problematic.

Who? (mentors)

Jonathan