Difference between revisions of "Corpora"
(Tatoeba Project)
(→Corpora: change tatoeba.fr to tatoeba.org as .fr is redirected to .org)
Revision as of 19:29, 30 January 2010
Lists of corpora under free licences (public domain, CC-BY-SA, GPL, etc.).
You might also want to use Wikipedia as a corpus; see Tagger_training#Creating_a_corpus or Building_dictionaries#Wikipedia_dumps, and the cleanup script at Calculating_coverage.
Corpora
- EuroParl — http://www.statmt.org/europarl/ — EU12 languages up to 44 million words per language
  - Use this for English--<something> pairs (alignments between two non-English languages can be unreliable)
- JRC-Acquis — http://langtech.jrc.it/JRC-Acquis.html — EU22 languages
  - Use this for any language pair (<anything>--<anything>)
- Southeast European Times — http://xixona.dlsi.ua.es/~fran/setimes/ — English, Turkish, Bulgarian, Macedonian, Serbo-Croatian, Albanian, Greek, Romanian — approx. 9,000 aligned paragraphs, 90,000–120,000 words.
- South African Government Services — http://xixona.dlsi.ua.es/~fran/services-gov-za-en_ZA-af_ZA.txt — English-Afrikaans — approx. 2,500 aligned sentences, 49,375 words.
- IJS-ELAN — http://nl.ijs.si/elan/ — English-Slovenian
- OPUS — http://urd.let.rug.nl/tiedeman/OPUS/index.php — Open Source multilingual corpora
- Open-Tran — http://www.open-tran.eu — single point of access to translations of open-source software in many languages (downloadable as SQLite databases)
- Tatoeba Project — http://tatoeba.org/ — Database of example sentences translated into several languages.
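
Several entries above give corpus sizes in words. As a rough sketch of how such a figure might be checked on a downloaded plain-text corpus, the Python below counts lines and whitespace-separated tokens. The function names `count_words` and `count_words_in_file` are hypothetical, and whitespace tokenisation is an assumption; the figures in the list may have been computed differently.

```python
import io


def count_words(lines):
    """Return (line_count, word_count) for an iterable of text lines.

    Assumption: a "word" is any whitespace-separated token. This is not
    necessarily how the sizes quoted in the list were obtained.
    """
    n_lines = 0
    n_words = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())
    return n_lines, n_words


def count_words_in_file(path, encoding="utf-8"):
    """Count lines and words in a plain-text file (hypothetical helper)."""
    with open(path, encoding=encoding) as handle:
        return count_words(handle)


if __name__ == "__main__":
    # Tiny in-memory sample standing in for a downloaded corpus file.
    sample = io.StringIO("This is a sentence.\nDit is 'n sin.\n")
    print(count_words(sample))
```

For a sentence-aligned file the line count gives an approximate number of aligned units, provided the file holds one unit per line; that layout is also an assumption and should be checked against the actual download.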
Corpus tools
- Corpus Catcher — http://translate.sourceforge.net/wiki/corpuscatcher/index - Bootstrap corpora from the web
- BootCaT — http://sslmit.unibo.it/~baroni/bootcat.html - Simple Utilities to Bootstrap Corpora and Terms from the Web
- Bitextor — http://sourceforge.net/projects/bitextor/ - Bootstrap bilingual corpora from the web