Difference between revisions of "Extracting bilingual dictionaries with Giza++"
Revision as of 13:39, 7 February 2012
This is a placeholder for documentation on how to use a parallel corpus and Giza++ to extract bilingual dictionaries for Apertium.
Get your corpus
Let's take for example the forvaltningsordbok Norwegian--North Sámi corpus. It will have two files:
forvaltningsordbok.nob: A list of sentences in Norwegian
forvaltningsordbok.sme: Translations of the previous sentences in North Sámi
Check that the two files have the same number of lines:
$ wc -l forvaltningsordbok.nob forvaltningsordbok.sme
 161837 forvaltningsordbok.nob
 161837 forvaltningsordbok.sme
 323674 total
If the files are not the same length, then you need to go back and check your sentence alignment.
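The length check above can also be scripted so that it fails loudly before any later step runs. The following is only a minimal sketch: the function name same_length is made up for illustration and is not part of Giza++ or Apertium.

```shell
#!/bin/sh
# Hypothetical helper (not part of Giza++ or Apertium): compare the line
# counts of the two sides of a sentence-aligned corpus.
# Prints both counts; returns 0 if they match, non-zero otherwise.
same_length() {
    # Read from stdin so wc prints only the number; strip the padding
    # that some wc implementations add.
    src_lines=$(wc -l < "$1" | tr -d ' ')
    trg_lines=$(wc -l < "$2" | tr -d ' ')
    echo "$1: $src_lines lines, $2: $trg_lines lines"
    [ "$src_lines" -eq "$trg_lines" ]
}

# Example call (assumes the two corpus files are in the current directory):
#   same_length forvaltningsordbok.nob forvaltningsordbok.sme \
#       || echo "Line counts differ: check the sentence alignment"
```

A matching line count is necessary but not sufficient: it does not prove that line N of one file is actually the translation of line N of the other.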
Process corpus
Align corpus
Extract lexicon
Prune with relative frequency
While waiting: This page will shorten the waiting time