Difference between revisions of "Bilingual dictionary discovery"

From Apertium

Revision as of 08:11, 13 July 2014

This page describes a way of discovering new bilingual or multilingual dictionaries.

We already have apertium-dixtools for crossing dictionaries. But what if you want to make a pair for which no direct crossing is available, improve the accuracy of a crossing, or maximise the number of correspondences you can get?

We can try using multiple input dictionaries.

Let's say you want to make a Chuvash--Tatar dictionary, and you have:

  • Chuvash--Russian
  • Chuvash--Turkish
  • Turkish--Russian
  • Turkish--Tatar
  • Russian--Tatar

You could make a graph out of these dictionaries where each node is a word in a language, and each arc is a language pair. For example: http://i.imgur.com/SFOsRMv.png

Pruned bilingual dictionary graphs.
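
The graph construction described above can be sketched as follows. This is only a minimal illustration: the language codes, the `DICTIONARIES` layout and the function name `build_graph` are assumptions made for the example, and the word pairs are toy entries rather than data taken from the real dictionaries.

```python
from collections import defaultdict

# Toy entries, assumed for illustration only. Real input would be read
# from the five bilingual dictionaries listed above.
DICTIONARIES = {
    ("chv", "rus"): [("кил", "дом")],
    ("chv", "tur"): [("кил", "ev")],
    ("tur", "rus"): [("ev", "дом")],
    ("tur", "tat"): [("ev", "йорт")],
    ("rus", "tat"): [("дом", "йорт")],
}


def build_graph(dictionaries):
    """Return an adjacency map: (language, word) -> set of (language, word).

    Every dictionary entry becomes one arc, stored in both directions,
    so a node's neighbours are all the words it is listed against.
    """
    graph = defaultdict(set)
    for (left_lang, right_lang), entries in dictionaries.items():
        for left_word, right_word in entries:
            left = (left_lang, left_word)
            right = (right_lang, right_word)
            graph[left].add(right)
            graph[right].add(left)
    return graph


graph = build_graph(DICTIONARIES)
for node, neighbours in sorted(graph.items()):
    print(node, "->", sorted(neighbours))
```

Keying each node on the (language, word) pair keeps identical spellings in different languages apart.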

You could then cluster the words using some "strongly-connected subgraph"[1] algorithm, and assume that the words within a strongly-connected subgraph are translations of each other. This means that you could get кил--йорт without having any direct correspondence between the two words.
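
One possible way to sketch the clustering step is shown below. It assumes that arcs are stored in both directions, in which case strongly-connected subgraphs coincide with ordinary connected components, so a plain breadth-first search is used as a stand-in for whichever algorithm is eventually chosen. The `GRAPH` literal, the language codes and the function names are hypothetical.

```python
from collections import deque

# Hypothetical adjacency map with arcs stored in both directions.
GRAPH = {
    ("chv", "кил"): {("rus", "дом"), ("tur", "ev")},
    ("rus", "дом"): {("chv", "кил"), ("tur", "ev"), ("tat", "йорт")},
    ("tur", "ev"): {("chv", "кил"), ("rus", "дом"), ("tat", "йорт")},
    ("tat", "йорт"): {("rus", "дом"), ("tur", "ev")},
}


def connected_components(graph):
    """Group nodes into components using breadth-first search."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in graph.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def candidate_pairs(components, source_lang, target_lang):
    """Pair up source and target words that share a component."""
    pairs = set()
    for component in components:
        sources = [word for lang, word in component if lang == source_lang]
        targets = [word for lang, word in component if lang == target_lang]
        pairs.update((s, t) for s in sources for t in targets)
    return pairs


pairs = candidate_pairs(connected_components(GRAPH), "chv", "tat")
print(pairs)
```

In this toy graph the Chuvash and Tatar nodes are never directly linked, yet they end up in the same component and are proposed as a candidate pair.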

Restrictions on sub-graphs:

  • Only one word per input language.
  • Prune words with only a single outgoing arc.
  • Only accept words that are part of a cycle (?)
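
The three restrictions could be checked along the following lines. This is a sketch under several assumptions: arcs are stored in both directions, pruning is done in a single pass rather than repeated until nothing changes, and the cycle test relies on the fact that a connected undirected graph contains a cycle exactly when it has at least as many arcs as nodes. The node labels and function names are placeholders.

```python
# Hypothetical graph: four mutually linked words plus one dangling
# second Russian word ("e") that hangs off the Chuvash node.
GRAPH = {
    ("chv", "a"): {("rus", "b"), ("tur", "c"), ("rus", "e")},
    ("rus", "b"): {("chv", "a"), ("tur", "c"), ("tat", "d")},
    ("tur", "c"): {("chv", "a"), ("rus", "b"), ("tat", "d")},
    ("tat", "d"): {("rus", "b"), ("tur", "c")},
    ("rus", "e"): {("chv", "a")},
}


def prune_single_arc_words(graph):
    """Drop every node with fewer than two arcs (one pass only)."""
    keep = {node for node, nbrs in graph.items() if len(nbrs) >= 2}
    return {node: {n for n in graph[node] if n in keep} for node in keep}


def one_word_per_language(component):
    """True if no language contributes more than one word."""
    langs = [lang for lang, _ in component]
    return len(langs) == len(set(langs))


def has_cycle(component, graph):
    """True if a connected component has at least as many arcs as nodes."""
    # Each undirected arc is counted once from each end, hence the // 2.
    arc_ends = sum(len(graph[node] & component) for node in component)
    return arc_ends // 2 >= len(component)


pruned = prune_single_arc_words(GRAPH)
component = set(pruned)
print(one_word_per_language(component), has_cycle(component, pruned))
```

Before pruning, the component holds two Russian words and fails the first restriction; after the dangling word is removed it passes all three.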

Some ideas:

  • Weighting
    • Each outgoing arc of a word gets a weight of 1/(number of outgoing arcs)?
  • Using more monolingual data, e.g. each word gets an SL concordance/context vector.
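
The weighting idea could look roughly like this. Giving each outgoing arc a weight of one over the number of outgoing arcs follows the suggestion above; scoring a path as the product of its arc weights is an additional assumption made only for this sketch, as are the node labels and function names.

```python
# Hypothetical graph: "a" is ambiguous (two arcs), "b" is not.
GRAPH = {
    ("chv", "a"): {("rus", "b"), ("tur", "c")},
    ("rus", "b"): {("chv", "a")},
    ("tur", "c"): {("chv", "a"), ("tat", "d")},
    ("tat", "d"): {("tur", "c")},
}


def arc_weights(graph):
    """Give each outgoing arc of a node the weight 1 / (number of arcs)."""
    weights = {}
    for node, neighbours in graph.items():
        for neighbour in neighbours:
            weights[(node, neighbour)] = 1.0 / len(neighbours)
    return weights


def path_score(path, weights):
    """Assumed scoring: multiply the arc weights along the path."""
    score = 1.0
    for source, target in zip(path, path[1:]):
        score *= weights[(source, target)]
    return score


weights = arc_weights(GRAPH)
path = [("rus", "b"), ("chv", "a"), ("tur", "c")]
print(path_score(path, weights))
```

Under this scheme a word with many translations spreads its weight thinly, so paths through highly ambiguous words score lower than paths through words with a single translation.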

Notes

Further reading