Difference between revisions of "Tasks for GCI: Crossing Dictionaries"

From Apertium

Revision as of 01:01, 14 November 2010

Tasks for GCI: Crossing Dictionaries

Many of our tasks are 'task families': the process is the same, and only the languages involved differ. Crossing dictionaries is one such task. There is other information on the wiki about crossing dictionaries, but I would like to keep this document as self-contained as possible. If you have a question that isn't answered here, ask on the Talk page, and I will update the page to answer it.

Firstly, and most importantly, you are not required to know all three languages involved in the crossing. Any knowledge you may have will be helpful, but the intermediate language is only important in a few ways, and ultimately, only the two languages in the expected output are really important.

What is dictionary crossing?

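Dictionary crossing, sometimes called triangulation, involves taking each word of one language in a bilingual dictionary and using its translation in that dictionary as the lookup key in a second bilingual dictionary. The two dictionaries must share a language (the intermediate language), and the result pairs the two languages they do not share.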
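As a rough illustration of the idea only (this is not how dixtools is implemented), triangulation over two plain word lists can be sketched in Python. The toy dictionaries and the `cross` function are assumptions made up for this sketch; real Apertium dictionaries also carry tags and direction restrictions, which are ignored here.

```python
# Toy bilingual dictionaries; illustrative data, not taken from Apertium.
a_to_b = {"dog": ["perro"], "cat": ["gato"], "house": ["casa"]}
b_to_c = {"perro": ["câine"], "gato": ["pisică"]}


def cross(first, second):
    """Pair each source word with every translation reachable via the pivot."""
    crossed = {}
    for source_word, pivots in first.items():
        for pivot in pivots:
            # The pivot-language translation is the lookup key in the
            # second dictionary; words with no match there are dropped.
            for target_word in second.get(pivot, []):
                crossed.setdefault(source_word, []).append(target_word)
    return crossed


a_to_c = cross(a_to_b, b_to_c)
print(a_to_c)
```

Note that "house" does not make it into the crossed dictionary, because "casa" has no entry in the second word list: a crossing can only be as complete as the overlap between its two sources.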

Let's say that we want to use English-Spanish and Spanish-Romanian (which Apertium has) to create a dictionary for English-Romanian (which Apertium does not have).

As an example, given the English-Spanish entries:

    <e r="LR"><p><l>dog<s n="n"/></l><r>perro<s n="n"/><s n="GD"/></r></p></e>
    <e r="RL"><p><l>dog<s n="n"/></l><r>perro<s n="n"/></r></p></e>

and the Spanish-Romanian entry:

      <e>
        <p>
          <l>perro<s n="n"/></l>
          <r>câine<s n="n"/></r>
        </p>
      </e>

we would ideally like to see the output:

    <e r="LR"><p><l>dog<s n="n"/></l><r>câine<s n="n"/><s n="GD"/></r></p></e>
    <e r="RL"><p><l>dog<s n="n"/></l><r>câine<s n="n"/></r></p></e>

(Now, you would think that 'dog' would be a simple example, but even at this stage we already have to deal with some transfer details. We really do encourage anyone who is interested in taking on tasks with us to start with a task based on the New Language Pair HOWTO, which will give you some of the practical knowledge needed for our other tasks.)

We won't get this ideal output by default. Unless instructed otherwise, dixtools will discard all direction restrictions (the r="LR" and r="RL" attributes), which are important in this case. This is why a crossing model is important: it allows us to look for specific patterns in both source dictionaries and to specify the output when those patterns are matched.
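To make the "pattern plus action" idea concrete, here is a minimal Python sketch that crosses the 'dog' entries above with and without keeping the restrictions. It does not use the real crossing model format or any dixtools code; the `cross` function, its `keep_restrictions` flag, and the single hard-coded pattern and action are all hypothetical, chosen only so that the result lines up with the ideal output shown earlier.

```python
import xml.etree.ElementTree as ET

EN_ES = """<section>
<e r="LR"><p><l>dog<s n="n"/></l><r>perro<s n="n"/><s n="GD"/></r></p></e>
<e r="RL"><p><l>dog<s n="n"/></l><r>perro<s n="n"/></r></p></e>
</section>"""
ES_RO = """<section>
<e><p><l>perro<s n="n"/></l><r>câine<s n="n"/></r></p></e>
</section>"""


def side(element):
    """Return (lemma, tags) for an <l> or <r> element."""
    return element.text, [s.get("n") for s in element.iter("s")]


def entries(xml_text):
    """Yield (restriction, left side, right side) for every <e> entry."""
    for e in ET.fromstring(xml_text).iter("e"):
        yield e.get("r"), side(e.find("p/l")), side(e.find("p/r"))


def cross(first, second, keep_restrictions):
    crossed = []
    for r1, left1, (pivot1, tags1) in entries(first):
        for r2, (pivot2, tags2), (lemma, out_tags) in entries(second):
            # Hypothetical pattern: same pivot lemma, same part of speech,
            # and no restriction on the second entry.
            if pivot1 != pivot2 or tags1[0] != tags2[0] or r2 is not None:
                continue
            # Hypothetical action: carry over tags that only the first
            # dictionary had on the pivot side (here "GD").
            extra = [t for t in tags1 if t not in tags2]
            restriction = r1 if keep_restrictions else None
            crossed.append((restriction, left1[0], lemma, out_tags + extra))
    return crossed


naive = cross(EN_ES, ES_RO, keep_restrictions=False)
modelled = cross(EN_ES, ES_RO, keep_restrictions=True)
print(naive)
print(modelled)
```

Without the restrictions, the two crossed entries for 'dog' can no longer be told apart by direction, which is why discarding them matters in this case.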

By default, we use a simple "catch all" rule in crossdics, which gives a dictionary that is basically useless. More importantly, though, crossdics generates sets of patterns, sorted by frequency, to which we can add an action. By using these generated models and focussing on the most frequent patterns first, we can get a useful dictionary much more quickly than would otherwise be possible.
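The reason for working through the most frequent patterns first can be shown with a small Python sketch. The candidate data, the tuple layout, and the `coverage` helper are invented for illustration and do not reflect the actual output format of crossdics.

```python
from collections import Counter

# Hypothetical candidate pairs found while crossing; each item is
# (restriction 1, pivot tags 1, restriction 2, pivot tags 2).
candidates = [
    (None, ("n",), None, ("n",)),
    ("LR", ("n", "GD"), None, ("n",)),
    (None, ("n",), None, ("n",)),
    ("RL", ("n",), None, ("n",)),
    (None, ("adj",), None, ("adj",)),
    (None, ("n",), None, ("n",)),
    ("LR", ("n", "GD"), None, ("n",)),
]

# Count how often each pattern occurs among the candidates.
pattern_counts = Counter(candidates)


def coverage(actions, pattern_counts):
    """Fraction of candidate pairs whose pattern already has an action."""
    covered = sum(n for p, n in pattern_counts.items() if p in actions)
    return covered / sum(pattern_counts.values())


# Add one (placeholder) action at a time, most frequent pattern first,
# and watch how quickly the handled share of the dictionary grows.
actions = {}
for pattern, count in pattern_counts.most_common():
    actions[pattern] = "handled"
    print(count, pattern, f"coverage so far: {coverage(actions, pattern_counts):.0%}")
```

In this toy data, writing actions for just the two most frequent patterns already covers five of the seven candidate pairs; everything not yet covered would fall through to the catch-all rule.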