Ideas for Google Summer of Code/Template-based bilingual dictionary
Revision as of 13:45, 21 March 2013 by Francis Tyers (talk | contribs) (→Frequently asked questions)
Example of use
Argument structure mapping
One of the primary uses of a templatic bidix will be correspondences where the argument structure is different. Note the following example:
- (es) A mi hermana le gustan los gatos → My sister likes cats
An entry (template) for gustar<>like should take into account the following facts:
- Word order
- The constituents of gustar can appear in various orders
- Person/number agreement
- The verb gustar needs to agree with its subject, which in turn is the object of like
- The verb like needs to agree with its subject, which is in turn the object of gustar
- Other words
- Intervening words like adjectives and adverbs should be ignored and translated by another entry in the bidix
- Tense
- The tense of the verbs should correspond
- This should be fairly simple given a list of tags that are transferred unchanged unless a template modifies them
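The agreement and tense points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a proposed implementation: the function name `like_tags`, the contents of `CARRY_OVER_TAGS`, and the tag sets are hypothetical, and the defaults (third person, singular) are a simplification for nouns that carry no person tag.

```python
# Tags assumed (for this sketch only) to transfer unchanged between the two verbs.
CARRY_OVER_TAGS = {"pri", "pii", "ifi", "fti", "cni"}
PERSON_TAGS = {"p1", "p2", "p3"}
NUMBER_TAGS = {"sg", "pl"}


def like_tags(gustar_tags, experiencer_tags):
    """Return the tags for 'like', given the tags of 'gustar' and of its object."""
    # Tense comes from gustar via the carry-over list.
    tense = [t for t in gustar_tags if t in CARRY_OVER_TAGS]
    # Agreement comes from the object of gustar, which is the subject of like.
    # Nouns carry no person tag, so default to third person.
    person = next((t for t in experiencer_tags if t in PERSON_TAGS), "p3")
    number = next((t for t in experiencer_tags if t in NUMBER_TAGS), "sg")
    return ["vblex", *tense, person, number]


# "A mi hermana le gustan los gatos": gustar agrees with "gatos" (p3 pl),
# while "like" must agree with "hermana" (sg).
result = like_tags(["vblex", "pri", "p3", "pl"], ["n", "f", "sg"])
print("like<" + "><".join(result) + ">")
```

Note that the person and number tags of gustar itself are deliberately dropped: they belong to the Spanish subject, which becomes the English object.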
Discontiguous multiwords
Another use of a templatic bidix is discontiguous multiwords. There are lots of noun+verb phrasal verbs in Turkic languages, especially ones where the possessor of the noun (and not the subject of the verb) translates as the subject of the verb in English. E.g.,
- (kir) Жиним келди → I got angry
- ^жиним/жин<n><px1sg><nom>$ ^келди/кел<v><iv><ifi><p3sp>/кел<vaux><ifi><p3sp>$
- ^I/prpers<prn><subj><p1><mf><sg>$ ^got/get<vblex><past>$ ^angry/angry<adj>$
In this case, the template for "жин кел" <> "get mad" should match both patterns, and should also include the following:
- specification for the possession tag on жин and a directive that it should map onto the subject of "get mad" (and have "get" agree in person/number with it)
- specification that the tense of "кел" should be transferred to "get"
- identification of an optional possessor pronoun (менин жиним келди) or noun (аялымдын жини келди) in genitive modifying "жин" which can become the subject of "get"
In addition,
- intervening adverbs ("Жиним аябай/жаман келди" <> "I got really mad") should be transferred correctly (should happen by default?)
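The matching described above might look roughly like the following Python sketch. It assumes the input is already split into lexical units in the stream format shown earlier. Everything named here is hypothetical: the function names, the `POSSESSOR_TO_SUBJECT` table (only `px1sg` appears in the example; the other entries are assumed by analogy), the `CARRY_OVER_TAGS` list, and the `<adv>` analysis given to the intervening adverb. Genitive possessors are not handled.

```python
import re

# Possessive tag on the noun -> (person, number) of the English subject.
# Only px1sg is attested in the example; the rest are assumed by analogy.
POSSESSOR_TO_SUBJECT = {
    "px1sg": ("p1", "sg"),
    "px2sg": ("p2", "sg"),
    "px1pl": ("p1", "pl"),
}
CARRY_OVER_TAGS = {"ifi"}  # hypothetical list of tense tags to transfer


def parse_analysis(analysis):
    """Split 'lemma<t1><t2>' into ('lemma', ['t1', 't2'])."""
    lemma = analysis.split("<", 1)[0]
    return lemma, re.findall(r"<([^>]+)>", analysis)


def parse_unit(unit):
    """Split '^surface/analysis1/analysis2$' into surface and parsed analyses."""
    surface, *analyses = unit.strip("^$").split("/")
    return surface, [parse_analysis(a) for a in analyses]


def match_jin_kel(units):
    """Return subject agreement and tense for the English side, or None."""
    subject = None
    for unit in units:
        _, analyses = parse_unit(unit)
        for lemma, tags in analyses:
            if subject is None and lemma == "жин" and "n" in tags:
                px = next((t for t in tags if t in POSSESSOR_TO_SUBJECT), None)
                if px is not None:
                    subject = POSSESSOR_TO_SUBJECT[px]
                    break
            elif subject is not None and lemma == "кел" and "v" in tags:
                tense = [t for t in tags if t in CARRY_OVER_TAGS]
                return {"subject": subject, "tense": tense}
        # Any other unit (e.g. an adverb) is skipped and left to other entries.
    return None


units = [
    "^жиним/жин<n><px1sg><nom>$",
    "^аябай/аябай<adv>$",  # assumed analysis for the intervening adverb
    "^келди/кел<v><iv><ifi><p3sp>/кел<vaux><ifi><p3sp>$",
]
print(match_jin_kel(units))
```

Because the loop simply skips units that are neither the noun nor the verb, intervening adverbs do not block the match, which is the "by default" behaviour the last point asks about.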
Tasks
A project to create a templatic bidix format involves the following major steps:
- Design an XML format, based on the current bidix format, for matching templates
- Write a tool (based on the current bidix processor) to compile this XML format into an FST
- Add support for discontiguous multiwords to an existing language pair, Kazakh-Tatar
- Work out how to deal with formatting (superblanks)
Coding Challenge
- Install an existing language pair in which one of the languages has discontiguous multiwords (Kazakh-Tatar)
- Modify bidix so that it restarts lookup on the <j/> symbol
- A way to test whether this worked
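As a toy model of the behaviour to aim for, and not a description of how lttoolbox works internally, the following Python sketch restarts lookup after each join symbol when the whole unit has no entry. It assumes that `<j/>` appears as `+` in the stream and that unknown parts are marked with `@`. The dictionary entries and the name `lookup_with_restart` are invented for illustration.

```python
JOIN = "+"  # assumption: how <j/> appears in the stream

# Toy bilingual dictionary; the entries are placeholders, not real data.
TOY_BIDIX = {
    "foo<n>": "FOO<n>",
    "bar<v>": "BAR<v>",
}


def lookup_with_restart(unit, bidix=TOY_BIDIX):
    """Look up the whole unit; failing that, look up each joined part separately."""
    if unit in bidix:
        return bidix[unit]
    parts = unit.split(JOIN)
    # Restart lookup after every join symbol; mark unknown parts with '@'.
    return JOIN.join(bidix.get(part, "@" + part) for part in parts)


print(lookup_with_restart("foo<n>+bar<v>"))
print(lookup_with_restart("foo<n>+baz<v>"))
```

A test along these lines would feed the modified processor a joined unit whose parts are in the dictionary only as separate entries, and check that each part comes back translated rather than the whole unit being reported as unknown.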
Frequently asked questions
- none yet, ask us something! :)