Difference between revisions of "Ideas for Google Summer of Code/Dictionary induction from parallel corpora"
Popcorndude (talk | contribs)
Revision as of 16:10, 8 March 2024
Coding Challenge
Write a script that reads two parallel corpora, applies the monolingual taggers and a word-aligner, and then prints a list of paired words.
 $ cat eng.txt
 The cat ate the fish.
 $ cat spa.txt
 El gato comió el pez.
 $ alignment-script apertium-eng/ eng.txt apertium-spa/ spa.txt
 the<det><def><mf><sp> - el<det><def><m><sg>
 cat<n><sg> - gato<n><m><sg>
 ...
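The final pairing step of such a script might look roughly like the sketch below. It is only an illustration under stated assumptions: tagging and word alignment are taken to have been done already by external tools, the alignment is assumed to arrive as `i-j` index pairs (one line per sentence), and the function names `parse_links`, `count_pairs` and `format_pairs` are hypothetical, not part of Apertium or any aligner. The sample input uses only the first two tokens of the example sentence, with an assumed alignment.

```python
from collections import Counter


def parse_links(line):
    """Turn an 'i-j' alignment line such as '0-0 1-1' into (i, j) index pairs."""
    links = []
    for link in line.split():
        src, tgt = link.split("-")
        links.append((int(src), int(tgt)))
    return links


def count_pairs(sentences):
    """Count aligned (source, target) token pairs over all sentence triples."""
    counts = Counter()
    for src_tokens, tgt_tokens, link_line in sentences:
        for i, j in parse_links(link_line):
            counts[(src_tokens[i], tgt_tokens[j])] += 1
    return counts


def format_pairs(counts):
    """Render pairs as 'source - target' lines, most frequent first."""
    return [f"{src} - {tgt}" for (src, tgt), _ in counts.most_common()]


# Hypothetical input: only the first two tokens of the example sentence,
# already tagged, with an assumed alignment of positions 0-0 and 1-1.
example = [
    (
        ["the<det><def><mf><sp>", "cat<n><sg>"],
        ["el<det><def><m><sg>", "gato<n><m><sg>"],
        "0-0 1-1",
    ),
]

for line in format_pairs(count_pairs(example)):
    print(line)
```

Counting pairs over the whole corpus, rather than printing each link as it is seen, makes it possible to rank or filter candidate dictionary entries by frequency later on.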