Converting a bilingual dictionary to Grammatical Framework

Latest revision as of 11:52, 26 September 2016

This page describes a process to convert an Apertium bilingual dictionary into a format appropriate for use by Grammatical Framework.

Things you'll need

  • A bilingual dictionary in Apertium format, e.g. apertium-en-ca.en-ca.dix (https://svn.code.sf.net/p/apertium/svn/trunk/apertium-en-ca/apertium-en-ca.en-ca.dix)
  • A model GF Dictionary, e.g. grammaticalframework/lib/src/translator/DictionarySpa.gf

Process

First, find the bilingual dictionary you want and convert it to text format:

$ lt-expand apertium-en-ca.en-ca.dix | sed 's/:[><]:/:/g' | sort -u > /tmp/en-ca.txt
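For orientation: the expanded dictionary has one entry per line, with the source and target sides separated by a colon. Entries restricted to one translation direction use :>: or :<: instead, which is what the sed step normalises. The following is a minimal sketch of that step; the sample entries are made up for illustration and are not real lt-expand output, and the function names are hypothetical.

```shell
# Hypothetical sample entries standing in for lt-expand output.
sample_entries() {
  printf '%s\n' \
    'language<n>:llengua<n><f>' \
    'tongue<n>:>:llengua<n><f>' \
    'language<n>:<:idioma<n><m>' \
    'language<n>:llengua<n><f>'
}

# Same normalisation as in the pipeline above:
# drop the direction markers, then sort and deduplicate.
normalise_entries() {
  sed 's/:[><]:/:/g' | sort -u
}

# Prints three distinct entries, none of which contains :>: or :<:
sample_entries | normalise_entries
```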

The GF wide-coverage translator has a lexicon that you should work from. It is an interlingua system in which everything goes via English, so it is best to start from the English dictionary, which will be the most complete. You can get a list of words as follows:

$ cat DictionaryEng.gf  | grep '^lin' | cut -f2 -d' '  | sort -u > /tmp/en.txt

Make a list of categories in the GF lexicon:

$ cat /tmp/en.txt | sed 's/_[A-Za-z0-9]\+$/\t&/g' | cut -f2 | sort -u | sed 's/^_//g' > /tmp/en-cats.txt
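The split that the sed and cut steps perform can also be written with shell parameter expansion, which may be easier to read. This is only a sketch: the function names are hypothetical, and ice_cream_N is an illustrative identifier rather than one taken from DictionaryEng.gf.

```shell
# Split a GF identifier such as "language_N" into its category and lemma.
# The category is whatever follows the last underscore.
gf_category() { printf '%s\n' "${1##*_}"; }

# The lemma is everything before the last underscore,
# with the remaining underscores turned into spaces.
gf_lemma() { printf '%s\n' "${1%_*}" | tr '_' ' '; }

gf_category language_N     # N
gf_lemma    language_N     # language
gf_lemma    ice_cream_N    # ice cream
```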

Now edit the file /tmp/en-cats.txt and add the equivalent Apertium tag to each line, separated from the category by a colon, e.g.

$ cat /tmp/en-cats.txt 
A:<adj>
A2:<adj>
AdA:<adv>
AdN:<adv>
...

Note: It is highly unlikely that there will be a one-to-one correspondence. Fill in the tags as best you can and leave blank the ones you don't know.
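To see which categories are still unmapped, something like the following sketch may help. It assumes the "Category:<tag>" layout shown above, with untagged lines written as either "Category" or "Category:". The function name and the sample mapping file are hypothetical.

```shell
# List GF categories that still lack an Apertium tag.
unmapped_categories() {
  grep -v ':<[^>]*>' "$1" | cut -f1 -d':'
}

# Demonstration on a small, made-up mapping file.
demo=$(mktemp)
printf '%s\n' 'A:<adj>' 'AdV' 'N:<n>' 'V2:' > "$demo"
unmapped_categories "$demo"   # prints AdV and V2, one per line
rm -f "$demo"
```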

Next we pick out the translations of the words along with their categories:

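$ for i in `cat /tmp/en.txt`; do 
  # ord: English lemma, with underscores turned back into spaces
  ord=`echo $i | sed 's/_[A-Za-z0-9]\+$/\t&/g' | cut -f1 | sed 's/_/ /g'`; 
  # kategg: GF category (the suffix after the last underscore)
  kategg=`echo $i |  sed 's/_[A-Za-z0-9]\+$/\t&/g'  | cut -f2 | sed 's/^_//g'`; 
  # katega: Apertium tag mapped to that category in en-cats.txt
  katega=`cat /tmp/en-cats.txt | grep "^$kategg:" | cut -f2 -d':'`; 
  # trads: all translations of lemma + tag, joined with colons
  trads=`cat /tmp/en-ca.txt | grep "^$ord$katega" | cut -f2 -d':' | tr '\n' ':' | sed 's/:$//g'`; 
  # one tab-separated record per GF entry
  echo -e $ord"\t"$kategg"\t"$katega"\t"$trads >> /tmp/en-ca-list.txt; 
done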
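One thing to watch in the loop above: when a category was left unmapped, $katega is empty and the grep pattern becomes a bare prefix match on the lemma, which can pull in unrelated entries. The sketch below shows one way to guard against that. It uses awk for a literal prefix match rather than the original grep; the function name and sample data are hypothetical.

```shell
# lookup_translations LEMMA TAG FILE
# Prints the target sides of all entries in FILE whose source side starts
# with LEMMA followed by TAG, joined with ":". Prints an empty line when
# TAG is empty, so an unmapped category cannot match unrelated entries.
lookup_translations() {
  if [ -z "$2" ]; then echo; return 0; fi
  awk -F: -v key="$1$2" '
    index($0, key) == 1 { out = out sep $2; sep = ":" }
    END { print out }
  ' "$3"
}

# Demonstration on made-up entries.
demo=$(mktemp)
printf '%s\n' \
  'language<n>:idioma<n><m>' \
  'language<n>:llengua<n><f>' \
  'languid<adj>:lànguid<adj>' > "$demo"
lookup_translations language '<n>' "$demo"   # idioma<n><m>:llengua<n><f>
lookup_translations language ''    "$demo"   # (empty line)
rm -f "$demo"
```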

Then all that is left is to convert the lines to GF format. How to do this depends greatly on the language involved. This example uses Catalan, so we want to be able to at least add genders for the nouns. We can take the Spanish DictionarySpa.gf file as a model:

lin south_N = mkN "sur" masculine ;
lin space_N = mkN "espacio" ;
lin spain_PN = mkPN "España" ;
lin win_V = mkV "ganar" ;
lin write_V = mkV "escribir" | mkV "apuntar" ;

To convert:

$ cat /tmp/en-ca-list.txt  | sed 's/#/ /g' | cut -f2,4 | grep -P -v '\t$' | sed 's/:/ | /g' |\
  sed "s/\([A-Za-z'· -]\+\)\(<[a-z]\+>\)/\2\"\1\"/g"  | sed 's/<n>"/mkN "/g' | sed 's/<vblex>"/mkV "/g' |\
  sed 's/<pr>"/mkPrep "/g' | sed 's/<ij>"/mkInterj "/g' | sed 's/<adj>"/mkA "/g' | sed 's/<\(preadv\|adv\)>"/ mkAdv "/g' |\
  sed 's/<cnj\(adv\|coo\|sub\)>"/mkConj "/g' | sed 's/" /"/g' | sed 's/|/ | /g' | sed 's/<\(sg\|pl\|pron\|sp\)>//g' |\
  sed 's/=/= /g' | sed 's/<\(GD\|mf\)>/ -- &/g' | sed 's/<f>/ feminine/g'  | sed 's/<m>/ masculine/g'  |\
  sed 's/--.*/; &/g' | sed 's/$/ ;/g' | sed 's/> ;/>/g' | sed 's/<sp>//g' > /tmp/en-ca.right

$ cat /tmp/en-ca-list.txt | cut -f1,2,4 | grep -P -v '\t$' | cut -f1 | sed 's/ /_/g' | sed 's/^/lin /g' > /tmp/en-ca.left 

$ paste /tmp/en-ca.left /tmp/en-ca.right | sed 's/\t/_/1' | sed 's/\t/ = /g' > /tmp/en-ca.gf

That is a lot of sed. Writing a small program in another programming language to do the conversion would work just as well.

Now you need to wrap that in some GF, placing the generated lin lines between the braces:

concrete DictionaryCat of Dictionary = CatCat
** open ParadigmsCat, MorphoCat, IrregCat, (L=LexiconCat), (S=SyntaxCat), (E = ExtraCat), Prelude in {



}
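The wrapping can be done by hand in an editor, or with a few lines of shell. The sketch below simply writes the header shown above, then the generated lines, then the closing brace. The function name is hypothetical.

```shell
# wrap_dictionary LIN_FILE OUT_FILE
# Writes the module header, the generated lin lines from LIN_FILE and the
# closing brace to OUT_FILE.
wrap_dictionary() {
  {
    echo 'concrete DictionaryCat of Dictionary = CatCat'
    echo '** open ParadigmsCat, MorphoCat, IrregCat, (L=LexiconCat), (S=SyntaxCat), (E = ExtraCat), Prelude in {'
    echo
    cat "$1"
    echo
    echo '}'
  } > "$2"
}

# Example, assuming the file produced by the paste step above:
# wrap_dictionary /tmp/en-ca.gf DictionaryCat.gf
```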

After saving the file, try compiling it. You will probably get many syntax errors; these can be fixed manually by deleting the offending lines:

$ gf DictionaryCat.gf 

...

DictionaryCat.gf:394:178:
   syntax error
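The first number in the error message is the line number, so the offending line can be removed from the shell instead of an editor. This is a small sketch with a hypothetical helper name. Recompile after each deletion, because the numbers of all later lines shift by one.

```shell
# delete_line FILE LINE_NUMBER
# Removes a single line from FILE, going through a temporary copy so that
# it does not depend on the non-portable sed -i option.
delete_line() {
  sed "${2}d" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# For the error shown above:
# delete_line DictionaryCat.gf 394
```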

Congratulations, you have just imported a dictionary from Apertium into GF!

Testing

You can test it by compiling:

$ gf DictionaryCat.gf 
  write file /home/fran/source/grammaticalframework/lib/src/translator/DictionaryCat.gfo
linking ... OK

Languages: DictionaryCat

Then you will get a prompt:

Dictionary> lin -all language_N
llengua
llengües

0 msec

If you get "llengua" and "llengües" as the linearisations (lin) of language_N, then it worked!