Difference between revisions of "Corpora"
Revision as of 23:55, 23 March 2008
Lists of corpora under free licences (public domain, CC-BY-SA, GPL, etc.)
Corpora
- EuroParl — http://www.statmt.org/europarl/ — EU12 languages up to 44 million words per language
  - Use this for English--<something> pairs (the alignments for non-English pairs are odd)
- JRC-Acquis — http://langtech.jrc.it/JRC-Acquis.html — EU22 languages
  - Use this for any language pair (<anything>--<anything>)
- Southeast European Times — http://xixona.dlsi.ua.es/~fran/setimes/ — English, Turkish, Bulgarian, Macedonian, Serbo-Croatian, Albanian, Greek, Romanian — paragraph-aligned, approx. 9,000 paragraphs, 90,000 to 120,000 words.
- South African Government Services — http://xixona.dlsi.ua.es/~fran/services-gov-za-en_ZA-af_ZA.txt — English-Afrikaans — sentence-aligned, approx. 2,500 sentences, 49,375 words.
- IJS-ELAN — http://nl.ijs.si/elan/ — English-Slovenian
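
As a rough illustration of what "sentence aligned" means in practice, the sketch below pairs up two sides of a parallel corpus line by line and counts segments and words. It assumes the corpus comes as two plain-text sources with one segment per line, where line N on one side translates line N on the other. That layout is an assumption and will not hold for every corpus listed above, so check each corpus's own documentation. The names `read_aligned` and `count_words` and the sample lines are made up for this example.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def read_aligned(src_lines: Iterable[str],
                 tgt_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from two line-aligned sequences.

    Assumes line N of one side is the translation of line N of the other.
    """
    for src, tgt in zip_longest(src_lines, tgt_lines):
        if src is None or tgt is None:
            # One side ran out first, so the sides are not line-aligned.
            raise ValueError("the two sides have different numbers of lines")
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:  # skip pairs where either side is empty
            yield src, tgt


def count_words(pairs: Iterable[Tuple[str, str]]) -> Tuple[int, int, int]:
    """Return (segments, source words, target words), splitting on whitespace."""
    segments = src_words = tgt_words = 0
    for src, tgt in pairs:
        segments += 1
        src_words += len(src.split())
        tgt_words += len(tgt.split())
    return segments, src_words, tgt_words


# Invented sample lines, not taken from any of the corpora listed above.
english = ["Resumption of the session\n", "\n", "Thank you .\n"]
afrikaans = ["Hervatting van die sitting\n", "\n", "Dankie .\n"]

stats = count_words(read_aligned(english, afrikaans))
print(stats)
```

With real files, the two lists can be replaced by open file objects (opened with the encoding the corpus uses), since both functions only iterate over lines.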