Ideas for Google Summer of Code/Dictionary induction from parallel corpora
Coding Challenge
Write a script that reads two parallel corpora, runs the appropriate monolingual tagger on each, aligns the tagged text with a word aligner (eflomal is pretty straightforward if you don't know where to begin), and then prints a list of paired words.
$ cat eng.txt
The cat ate the fish.
$ cat spa.txt
El gato comió el pez.
$ alignment-script apertium-eng/ eng.txt apertium-spa/ spa.txt
the<det><def><mf><sp> - el<det><def><m><sg>
cat<n><sg> - gato<n><m><sg>
...
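The final pairing step could be sketched roughly as below. This is only an illustration, not a reference solution. It assumes that tagging has already produced one whitespace-separated tagged token per word, and that the aligner writes its links in the Pharaoh-style `i-j` format (source index, target index) that many word aligners emit. The function names `parse_links`, `pair_words` and `induce` are hypothetical, and the `0-0 1-1` alignment is made up for the example.

```python
from collections import Counter


def parse_links(line):
    """Parse a Pharaoh-style alignment line such as '0-0 1-1' into index pairs."""
    return [tuple(int(i) for i in link.split("-")) for link in line.split()]


def pair_words(src_tokens, trg_tokens, links):
    """Return (source, target) token pairs for every link that is in range."""
    pairs = []
    for s, t in links:
        # Skip links that point outside either sentence (e.g. after a
        # tokenisation mismatch between the tagger and the aligner).
        if 0 <= s < len(src_tokens) and 0 <= t < len(trg_tokens):
            pairs.append((src_tokens[s], trg_tokens[t]))
    return pairs


def induce(sentences):
    """Count aligned pairs over (source, target, links) sentence triples."""
    counts = Counter()
    for src, trg, links in sentences:
        counts.update(pair_words(src.split(), trg.split(), parse_links(links)))
    return counts


if __name__ == "__main__":
    # Assumption: tagged output with one token per word, as in the example above.
    eng = "the<det><def><mf><sp> cat<n><sg>"
    spa = "el<det><def><m><sg> gato<n><m><sg>"
    # Assumption: a hand-written alignment standing in for real aligner output.
    for (src, trg), _count in induce([(eng, spa, "0-0 1-1")]).most_common():
        print(f"{src} - {trg}")
```

Counting the pairs rather than printing them straight away makes it easy to filter out rare, probably noisy alignments later. Multiword analyses that contain spaces would need a more careful tokeniser than `str.split`.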