Ideas for Google Summer of Code/Template-based bilingual dictionary

From Apertium
Revision as of 18:57, 29 January 2014 by Francis Tyers

Example of use

Argument structure mapping

One of the primary uses of a templatic bidix (bilingual dictionary) will be correspondences where the argument structure differs between the two languages. Consider the following example:

  • (es) A mi hermana le gustan los gatos → My sister likes cats

An entry (template) for gustar<>like should take into account the following facts:

  • Word order
    • The constituents of gustar can appear in various orders
  • Person/number agreement
    • The verb gustar needs to agree with its subject, which in turn is the object of like
    • The verb like needs to agree with its subject, which in turn is the indirect object of gustar
  • Other words
    • Intervening words such as adjectives and adverbs should be ignored by the template and translated by their own entries in the bidix
  • Tense
    • The tense of the two verbs should correspond
    • This should be fairly simple if there is a list of tags that are transferred unchanged unless the template says otherwise
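
The role swap and agreement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a proposed format or implementation: the `Phrase` class, the `like_form` and `gustar_to_like` functions, and the tense labels "present" and "past" are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Phrase:
    """Hypothetical, already-translated constituent with agreement features."""
    text: str    # target-language text, e.g. "My sister"
    person: str  # "p1", "p2" or "p3"
    number: str  # "sg" or "pl"


def like_form(subject: Phrase, tense: str) -> str:
    """Inflect English 'like' so that it agrees with its subject."""
    if tense == "past":
        return "liked"
    if tense == "present" and subject.person == "p3" and subject.number == "sg":
        return "likes"
    return "like"


def gustar_to_like(gustar_subject: Phrase, gustar_indirect_object: Phrase,
                   tense: str) -> str:
    """Swap the roles: the indirect object of gustar becomes the subject of
    like, and the subject of gustar becomes the object of like."""
    verb = like_form(gustar_indirect_object, tense)  # agreement follows the new subject
    return f"{gustar_indirect_object.text} {verb} {gustar_subject.text}"


cats = Phrase("cats", "p3", "pl")            # subject of "gustan"
sister = Phrase("My sister", "p3", "sg")     # "A mi hermana"
print(gustar_to_like(cats, sister, "present"))  # My sister likes cats
```

Word order on the Spanish side is deliberately ignored here: the function receives the constituents by role, which is the information a template would have to recover regardless of the order in which they appear.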

Discontiguous multiwords

Another use of a templatic bidix is discontiguous multiwords. Turkic languages have many noun+verb phrasal verbs, in particular ones where the possessor of the noun (and not the subject of the verb) translates as the subject of the verb in English. For example:

  • (kir) Жиним келди → I got angry
^жиним/жин<n><px1sg><nom>$ ^келди/кел<v><iv><ifi><p3sp>/кел<vaux><ifi><p3sp>$
^I/prpers<prn><subj><p1><mf><sg>$ ^got/get<vblex><past>$ ^angry/angry<adj>$
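
To make the two analysis lines above easier to work with, the following minimal Python sketch splits such a stream into surface forms, lemmas and tags. It assumes the simplest case of the Apertium stream format (`^surface/analysis/analysis$`) and ignores escaped characters and superblanks; the function name `parse_lexical_units` is an assumption made for this example.

```python
import re

LU_RE = re.compile(r"\^(.*?)\$")   # one lexical unit: ^ ... $
TAG_RE = re.compile(r"<([^<>]+)>")  # one tag: <...>


def parse_lexical_units(stream):
    """Return a list of (surface, [(lemma, [tags]), ...]) tuples.

    Simplification: escaped characters and superblanks are not handled.
    """
    units = []
    for body in LU_RE.findall(stream):
        surface, *analyses = body.split("/")
        parsed = []
        for analysis in analyses:
            lemma = analysis.split("<", 1)[0]   # text before the first tag
            tags = TAG_RE.findall(analysis)     # all tags, in order
            parsed.append((lemma, tags))
        units.append((surface, parsed))
    return units


line = "^жиним/жин<n><px1sg><nom>$ ^келди/кел<v><iv><ifi><p3sp>/кел<vaux><ifi><p3sp>$"
for surface, analyses in parse_lexical_units(line):
    print(surface, analyses)
```

Note that келди comes back with two analyses (main verb and auxiliary), so a template matcher would either run after disambiguation or have to consider both.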

In this case, the template for "жин кел" <> "get mad" should match both of the patterns above, and should also include the following:

  • specification for the possession tag on жин and a directive that it should map onto the subject of "get mad" (and have "get" agree in person/number with it)
  • specification that the tense of "кел" should be transferred to "get"
  • identification of an optional possessor pronoun (менин жиним келди) or noun (аялымдын жини келди) in genitive modifying "жин" which can become the subject of "get"

In addition,

  • intervening adverbs ("Жиним аябай/жаман келди" <> "I got really mad") should be transferred correctly (should happen by default?)
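
The requirements above can be illustrated with a toy matcher in Python. It is a sketch under several assumptions, not a proposed design: the input is taken to be already disambiguated (one analysis per word), the three lookup tables are hypothetical and contain only what the examples on this page show, adverbs are assumed to carry an `adv` tag, and the optional genitive possessor is not handled.

```python
from typing import List, Optional, Tuple

Unit = Tuple[str, List[str]]  # (lemma, tags), one disambiguated analysis per word

# Hypothetical tables, limited to the forms that appear in the examples.
POSSESSOR_TO_SUBJECT = {"px1sg": "I"}            # possession tag on жин -> subject of "get"
GET_FORMS = {"ifi": "got"}                       # tense tag on кел -> form of "get"
ADVERBS = {"аябай": "really", "жаман": "really"}  # stands in for ordinary bidix entries


def translate_jin_kel(units: List[Unit]) -> Optional[str]:
    """Match "жин ... кел" with optional intervening adverbs; None if no match."""
    for i, (lemma, tags) in enumerate(units):
        if lemma == "жин" and "n" in tags:
            break
    else:
        return None  # no жин in the input

    px = next((t for t in tags if t in POSSESSOR_TO_SUBJECT), None)
    if px is None:
        return None  # possession tag missing or not covered by the table

    adverbs = []
    for lemma2, tags2 in units[i + 1:]:
        if lemma2 == "кел" and "v" in tags2:
            tense = next((t for t in tags2 if t in GET_FORMS), None)
            if tense is None:
                return None
            words = [POSSESSOR_TO_SUBJECT[px], GET_FORMS[tense], *adverbs, "mad"]
            return " ".join(words)
        if "adv" in tags2 and lemma2 in ADVERBS:
            adverbs.append(ADVERBS[lemma2])  # translated by its own entry
        else:
            return None  # something other than an adverb intervenes
    return None  # жин found, but no кел after it


plain = [("жин", ["n", "px1sg", "nom"]), ("кел", ["v", "iv", "ifi", "p3sp"])]
with_adverb = [plain[0], ("аябай", ["adv"]), plain[1]]
print(translate_jin_kel(plain))        # I got mad
print(translate_jin_kel(with_adverb))  # I got really mad
```

A real template format would express the same information declaratively; the hard-coded tables here only make explicit which tags drive the subject, the tense and the skipped material.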


A project to create a templatic bidix format involves the following major steps:

  • Design an XML format, based on the current bidix format, for matching templates
  • Write a tool (based on the current bidix processor) to compile this XML format into an FST or CFG
  • Add support for discontiguous multiwords to an existing language pair (Kazakh-Tatar)
  • Work out how to deal with formatting (superblanks)

Coding Challenge

  • Install an existing language pair where one of the languages has discontiguous multiwords (Kazakh-Tatar)
  • Modify the bidix processor so that it restarts lookup at the <j/> (join) symbol
    • Provide a way to test whether this worked
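
The following toy model in Python shows what "restarting lookup" at the join could mean, and what a test for it might check. It is not the bidix processor: the dictionary entries are placeholders, the functions `lookup_one` and `lookup_with_restart` are hypothetical, and the sketch assumes that the join appears as `+` between analyses in the stream and that unknown entries are marked with `@`.

```python
# Placeholder bidix entries (hypothetical): source lemma+tag -> target lemma+tag.
BIDIX = {"src1<n>": "trg1<n>", "src2<v>": "trg2<v>"}


def lookup_one(analysis, bidix):
    """Longest-prefix lookup; any remaining text is copied through unchanged."""
    for end in range(len(analysis), 0, -1):
        if analysis[:end] in bidix:
            return bidix[analysis[:end]] + analysis[end:]
    return "@" + analysis  # not found: mark as unknown


def lookup_with_restart(analysis, bidix, join="+"):
    """Restart the lookup from the beginning after every join symbol."""
    return join.join(lookup_one(part, bidix) for part in analysis.split(join))


joined = "src1<n><sg>+src2<v><past>"
print(lookup_one(joined, BIDIX))           # trg1<n><sg>+src2<v><past>
print(lookup_with_restart(joined, BIDIX))  # trg1<n><sg>+trg2<v><past>
```

Without the restart, everything after the first match is copied through untranslated. A test for the real modification could compare outputs in the same way: feed a joined analysis to the processor and check that every part, not just the first, comes back translated.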

Frequently asked questions

  • none yet, ask us something! :)

See also