Difference between revisions of "Corpora"

From Apertium
(Category:Documentation in English)
Line 8: Line 8:
::Use this if you want to do English--<something> (funny alignments for non-English pairs)
* JRC-Acquis &mdash; http://langtech.jrc.it/JRC-Acquis.html &mdash; EU22 languages
::Use this if you want to do <anything>--<anything>
::Use this if you want to do <anything>--<anything> and there is ''nothing better available''.
* Southeast European Times &mdash; http://xixona.dlsi.ua.es/~fran/setimes/ &mdash; English,Turkish,Bulgarian,Macedonian,Serbo-Croatian,Albanian,Greek,Romanian &mdash; 9,000 approx. paragraph aligned, 90,000&mdash;120,000 words.
* South African Government Services &mdash; http://xixona.dlsi.ua.es/~fran/services-gov-za-en_ZA-af_ZA.txt &mdash; English&mdash;Afrikaans &mdash; 2,500 approx. sentence aligned, 49,375 words.

Revision as of 11:24, 30 November 2013

Lists of corpora under free licences (public domain, CC-BY-SA, GPL, etc.).

You might also want to use Wikipedia as a corpus; see Tagger_training#Creating_a_corpus or Building_dictionaries#Wikipedia_dumps, and the cleanup script at Calculating_coverage.

Corpora

Use this if you want to do English--<something> (funny alignments for non-English pairs)
JRC-Acquis (EU22 languages): use this if you want to do <anything>--<anything> and there is nothing better available.
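
The corpus entries in the diff above quote sizes in aligned units (paragraphs or sentences) and words. A minimal sketch for checking such figures on a downloaded plain-text corpus follows. It assumes one aligned unit per line and whitespace tokenisation; both are assumptions, since the file layouts are not described on this page, and `corpus_stats` is a hypothetical helper name.

```python
from typing import Dict, Iterable


def corpus_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count non-empty lines (units) and whitespace-separated tokens (words).

    Assumption: each non-empty line is one aligned unit. Real corpora may
    use a different layout (e.g. tab-separated language pairs).
    """
    n_units = 0
    n_words = 0
    for line in lines:
        tokens = line.split()  # split on any run of whitespace
        if not tokens:
            continue  # skip blank lines
        n_units += 1
        n_words += len(tokens)
    return {"units": n_units, "words": n_words}


if __name__ == "__main__":
    # Tiny in-memory sample instead of a real corpus file.
    sample = ["This is a sentence .", "", "Dit is 'n sin ."]
    print(corpus_stats(sample))
```

To run it on a real file, pass an open file object, for example `corpus_stats(open(path, encoding="utf-8"))`, where `path` points at the downloaded corpus. Counts obtained this way will differ from the quoted figures if the original counts used a different tokeniser.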

Corpus tools