Difference between revisions of "Extracting bilingual dictionaries with Giza++"

Revision as of 13:39, 7 February 2012

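Let's take for example the forvaltningsordbok Norwegian--North Sámi corpus. It will have two files:
* forvaltningsordbok.nob: A list of sentences in Norwegian
* forvaltningsordbok.sme: Translations of the previous sentences in North Sámi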

Check that both files have the same number of lines:

$ wc -l forvaltningsordbok.nob forvaltningsordbok.sme 
  161837 forvaltningsordbok.nob
  161837 forvaltningsordbok.sme
  323674 total

If the line counts differ, go back and check your sentence alignment.
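
The line-count check can also be scripted, so that a mismatch stops a processing pipeline before Giza++ is run. The following is a minimal sketch, assuming a POSIX shell; the function name <code>same_length</code> is hypothetical and is not part of Giza++ or of the original instructions.

```shell
# Hypothetical helper (not part of Giza++): succeeds only if both files
# are readable and contain the same number of lines.
same_length() {
    # Return status 2 if either file is missing or unreadable.
    [ -r "$1" ] && [ -r "$2" ] || return 2
    # Reading from stdin makes wc print only the count, without the filename;
    # tr strips the padding that some wc implementations add.
    left=$(wc -l < "$1" | tr -d ' ')
    right=$(wc -l < "$2" | tr -d ' ')
    echo "$1: $left lines, $2: $right lines"
    [ "$left" -eq "$right" ]
}
```

Calling <code>same_length forvaltningsordbok.nob forvaltningsordbok.sme</code> prints both counts and returns a non-zero status when they differ, so it can guard later steps with <code>&&</code>.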

@@ Line 4: / Line 4: @@
 ==Get your corpus==
+Let's take for example the <code>forvaltningsordbok</code> Norwegian--North Sámi corpus. It will have two files:
+* <code>forvaltningsordbok.nob</code>: A list of sentences in Norwegian
+* <code>forvaltningsordbok.sme</code>: Translations of the previous sentences in North Sámi
+Check that both files have the same number of lines:
+<pre>
+$ wc -l forvaltningsordbok.nob forvaltningsordbok.sme
+  161837 forvaltningsordbok.nob
+  161837 forvaltningsordbok.sme
+  323674 total
+</pre>
+If the line counts differ, go back and check your sentence alignment.
 ==Process corpus==