Difference between revisions of "Ideas for Google Summer of Code/Dictionary induction from parallel corpora"

From Apertium

Revision as of 16:10, 8 March 2024

Coding Challenge

Write a script that reads two parallel corpora, runs each one through its monolingual tagger, word-aligns the tagged sentences, and then prints the aligned word pairs, each word given as its lemma and morphological tags.

$ cat eng.txt
The cat ate the fish.
$ cat spa.txt
El gato comió el pez.
$ alignment-script apertium-eng/ eng.txt apertium-spa/ spa.txt
the<det><def><mf><sp> - el<det><def><m><sg>
cat<n><sg> - gato<n><m><sg>
...
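
The pairing step of such a script could be sketched roughly as follows. This is only an illustration under stated assumptions, not the expected solution: it assumes the tagger output has already been produced as Apertium stream-format lines of the form `^lemma<tags>$`, and that the word aligner emits Pharaoh-style links (`0-0 1-1`) indexed over those lexical units. The function names `lexical_units`, `parse_alignment` and `paired_words` are hypothetical, and the sample data is hand-written from the two pairs shown above. Running the taggers and the aligner themselves is left out.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format, e.g. ^cat<n><sg>$
# (assumption: tagger output carries only lemma and tags, no surface form).
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def lexical_units(tagged_line):
    """Return the analyses found in one tagged sentence, in order."""
    return LEXICAL_UNIT.findall(tagged_line)


def parse_alignment(line):
    """Parse Pharaoh-style links such as '0-0 1-1' into (src, trg) index pairs."""
    return [tuple(int(i) for i in link.split("-")) for link in line.split()]


def paired_words(src_tagged, trg_tagged, alignments):
    """Count how often each (source analysis, target analysis) pair is aligned."""
    counts = Counter()
    for src, trg, links in zip(src_tagged, trg_tagged, alignments):
        s_units, t_units = lexical_units(src), lexical_units(trg)
        for i, j in parse_alignment(links):
            # Ignore links that point outside the sentence.
            if i < len(s_units) and j < len(t_units):
                counts[(s_units[i], t_units[j])] += 1
    return counts


if __name__ == "__main__":
    # Hand-written sample data covering only the two pairs from the example.
    eng = ["^the<det><def><mf><sp>$ ^cat<n><sg>$"]
    spa = ["^el<det><def><m><sg>$ ^gato<n><m><sg>$"]
    links = ["0-0 1-1"]
    for (src, trg), _count in paired_words(eng, spa, links).items():
        print(f"{src} - {trg}")
```

Counting the pairs rather than printing them directly leaves room to filter out rare, probably spurious alignments once the script is run on a larger corpus.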