Popcorndude (talk | contribs)
Latest revision as of 19:57, 12 April 2021
Tasks for GCI: Crossing Dictionaries
Many of our tasks are 'task families': the process is the same, and only the languages involved are different. Crossing dictionaries is one such task. There is other information on the wiki pertaining to crossing dictionaries, but I would like to keep this document as self-contained as possible. If you have a question that isn't answered here, ask on the Talk page, and I will update the page to answer your question.
Firstly, and most importantly, you are not required to know all three languages involved in the crossing. Any knowledge you may have will be helpful, but the intermediate language is only important in a few ways, and ultimately, only the two languages in the expected output are really important.
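What is dictionary crossing?
Dictionary crossing, sometimes called triangulation, combines two bilingual dictionaries that share a language. For each word in the first dictionary, its translation into the shared (intermediate) language is used as the lookup key in the second dictionary, and whatever that lookup returns becomes the translation in the new dictionary.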
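The lookup itself can be sketched in a few lines of Python. This is only an illustration of the idea, not how dixtools is implemented: the function name `cross` and the toy word lists (including the 'cat' entry) are made up for this example, and real dictionaries also carry tags and direction restrictions that this sketch ignores.

```python
def cross(a_to_b, b_to_c):
    """Triangulate A-B and B-C word lists into A-C, using B as the lookup key."""
    a_to_c = {}
    for a_word, b_words in a_to_b.items():
        for b_word in b_words:
            # Words with no entry in the second dictionary are silently dropped.
            for c_word in b_to_c.get(b_word, []):
                a_to_c.setdefault(a_word, []).append(c_word)
    return a_to_c


# Toy data (hypothetical): English-Spanish and Spanish-Romanian.
en_es = {"dog": ["perro"], "cat": ["gato"]}
es_ro = {"perro": ["câine"]}

en_ro = cross(en_es, es_ro)
print(en_ro)  # {'dog': ['câine']}
```

Note that 'cat' disappears from the result because its Spanish translation has no entry in the second list. Crossing can only produce entries that both dictionaries cover.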
Let's say that we want to use English-Spanish and Spanish-Romanian (which Apertium has) to create a dictionary for English-Romanian (which Apertium does not have).
As an example, given the English-Spanish entries:
<e r="LR"><p><l>dog<s n="n"/></l><r>perro<s n="n"/><s n="GD"/></r></p></e>
<e r="RL"><p><l>dog<s n="n"/></l><r>perro<s n="n"/></r></p></e>
and the Spanish-Romanian entry:
<e><p><l>perro<s n="n"/></l><r>câine<s n="n"/></r></p></e>
we would ideally like to see the output:
<e r="LR"><p><l>dog<s n="n"/></l><r>câine<s n="n"/><s n="GD"/></r></p></e>
<e r="RL"><p><l>dog<s n="n"/></l><r>câine<s n="n"/></r></p></e>
(Now, you would think that 'dog' would be a simple example, but even at this stage we already run into some transfer details. We really do encourage anyone who is interested in taking on tasks with us to first take on a task around the New Language Pair HOWTO, which will give you some of the practical knowledge needed to perform our other tasks.)
We won't get this ideal output by default. Unless instructed otherwise, dixtools will discard all direction restrictions, which are important in this case. This is why a crossing model is important - it allows us to look for specific patterns in both source dictionaries, and to specify the output when those patterns are matched.
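To make the effect of the direction restrictions concrete, here is a rough Python sketch that crosses the example entries above with and without the `r` attribute. It is an illustration only and does not describe dixtools internals. The `<section>` wrapper and the helper names (`side`, `render`, `cross`) are assumptions of this sketch, as is the choice to carry the first dictionary's right-hand tags over to the new lemma, and any restriction on the second dictionary's entries is ignored.

```python
import xml.etree.ElementTree as ET

EN_ES = """<section>
<e r="LR"><p><l>dog<s n="n"/></l><r>perro<s n="n"/><s n="GD"/></r></p></e>
<e r="RL"><p><l>dog<s n="n"/></l><r>perro<s n="n"/></r></p></e>
</section>"""

ES_RO = """<section>
<e><p><l>perro<s n="n"/></l><r>câine<s n="n"/></r></p></e>
</section>"""


def side(entry, name):
    """Return (lemma, tags) for the <l> or <r> side of an <e> entry."""
    node = entry.find("p/" + name)
    return node.text or "", tuple(s.get("n") for s in node.findall("s"))


def render(restriction, left, right):
    """Serialise one crossed entry back into dictionary XML."""
    def fmt(lemma, tags):
        return lemma + "".join('<s n="%s"/>' % t for t in tags)
    attr = ' r="%s"' % restriction if restriction else ""
    return "<e%s><p><l>%s</l><r>%s</r></p></e>" % (attr, fmt(*left), fmt(*right))


def cross(ab_xml, bc_xml, keep_restrictions):
    out = []
    bc_entries = list(ET.fromstring(bc_xml).iter("e"))
    for e1 in ET.fromstring(ab_xml).iter("e"):
        left1, right1 = side(e1, "l"), side(e1, "r")
        for e2 in bc_entries:
            left2, right2 = side(e2, "l"), side(e2, "r")
            # Match on the pivot lemma and its first tag (the part of speech).
            if right1[0] != left2[0] or right1[1][:1] != left2[1][:1]:
                continue
            restriction = e1.get("r") if keep_restrictions else None
            # Assumption: reuse the first dictionary's right-hand tags.
            out.append(render(restriction, left1, (right2[0], right1[1])))
    return out


for line in cross(EN_ES, ES_RO, keep_restrictions=True):
    print(line)
for line in cross(EN_ES, ES_RO, keep_restrictions=False):
    print(line)
```

With the restrictions kept, the two printed entries match the ideal output. With them dropped, the same two entries come out unrestricted, so 'dog' now has two competing translations (with and without GD) in the same direction.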
By default, we use a simple "catch all" rule in crossdics, which gives a dictionary that is basically useless. More importantly, though, crossdics generates sets of patterns, sorted by frequency, to which we can add an action. By using these generated models, and by focussing on the most frequent patterns first, we can get a useful dictionary much more quickly than would otherwise be possible.
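The reason for working from the most frequent patterns first can be sketched with a simple count. The pattern shape used here (restriction in the first dictionary, restriction in the second, part of speech) and the sample data are hypothetical and chosen only for illustration. The patterns that crossdics actually generates are richer than this.

```python
from collections import Counter

# Hypothetical crossed pairs:
# (restriction in A-B, restriction in B-C, part of speech).
crossed = [
    ("LR", None, "n"),
    ("RL", None, "n"),
    ("LR", None, "n"),
    (None, None, "vblex"),
    ("LR", "RL", "adj"),
    (None, None, "vblex"),
    ("LR", None, "n"),
]

# Count how often each pattern occurs and list the most frequent first.
ranked = Counter(crossed).most_common()
for pattern, count in ranked:
    print(count, pattern)
```

In this toy data, writing an action for the first pattern alone would cover three of the seven crossed pairs, which is why handling patterns in frequency order pays off quickly.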