Corpora
Latest revision as of 20:36, 25 January 2020
Lists of corpora under free licences (public domain, CC-BY-SA, GPL, etc.).
You might also want to use Wikipedia as a corpus; see Tagger_training#Creating_a_corpus or Building_dictionaries#Wikipedia_dumps, and the cleanup scripts at Wikipedia dumps.
Corpora
- EuroParl — http://www.statmt.org/europarl/ — EU12 languages up to 44 million words per language
- Use this for English--<something> pairs; alignments between two non-English languages can be unreliable.
- Southeast European Times — http://opus.lingfil.uu.se/SETIMES.php — English,Turkish,Bulgarian,Macedonian,Serbo-Croatian,Albanian,Greek,Romanian — 9,000 approx. paragraph aligned, 90,000—120,000 words.
- South African Government Services — http://xixona.dlsi.ua.es/~fran/services-gov-za-en_ZA-af_ZA.txt — English—Afrikaans — 2,500 approx. sentence aligned, 49,375 words.
- IJS-ELAN — http://nl.ijs.si/elan/ — English-Slovenian
- OPUS — http://urd.let.rug.nl/tiedeman/OPUS/index.php — Open Source multilingual corpora
- Open-Tran — http://www.open-tran.eu — single point of access to translations of open-source software in many languages (downloadable as SQLite databases)
- Tatoeba Project — http://tatoeba.org/ — Database of example sentences translated into several languages.
- Translated.by — http://translatedby.com/ (Various licenses)
- Heidelberg Named Entity Resource — http://heiner.cl.uni-heidelberg.de
Corpus tools
- Corpus Catcher — http://translate.sourceforge.net/wiki/corpuscatcher/index - Bootstrap corpora from the web
- BootCaT — http://sslmit.unibo.it/~baroni/bootcat.html - Simple Utilities to Bootstrap Corpora and Terms from the Web
- Bitextor — http://sourceforge.net/projects/bitextor/ - Bootstrap bilingual corpora from the web
Format
We use a standard format for corpora, defined by either an XSD (XML Schema Definition) or DTD (Document Type Definition). The RFERL scraper generates corpora that follow this format.
To use the XSD (recommended), the root <corpus> element must have these attributes set:
xmlns="http://apertium.org/xml/corpus/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://apertium.org/xml/corpus/0.9 http://apertium.org/xml/corpus/0.9/corpus.xsd"
A complete example follows:
<?xml version="1.0"?>
<corpus xmlns="http://apertium.org/xml/corpus/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://apertium.org/xml/corpus/0.9 http://apertium.org/xml/corpus/0.9/corpus.xsd" name="rferl" language="tuk">
<entry date="2011-12-29" timestamp="2012-04-22T01:56:42.813713" title="Rus hem türkmen saýlawlary: meňzeşlikler we tapawutlar " id="rferl.24437444" source="http://www.azathabar.com/content/article/24437444.html">Prezident saýlawlary Türkmenistanda 12-nji fewralda, Orsýet Federasiýasynda 4-nji martda geçirilýär. Türkmenistanda şu günler prezidentlige kandidatlary hödürlemek kampaniýasy dowam edýär. etc.</entry>
<entry date="2011-12-19" timestamp="2012-04-22T01:56:45.102281" title="Aşgabadyň ýaşaýjylary Täze ýyla taýýarlanýar" id="rferl.24426930" source="http://www.azathabar.com/content/article/24426930.html">Aşgabat täze 2012-nji ýyly garşylamaga bir aýa golaý wagt öňünden taýýarlyk görüp başlady. Şäheriň köçelerinde, seýilgählerde, edara-kärhanalaryň öňünde gögerip oturan arça agaçlary bilen birlikde, boýy 12-15 metre ýetýän emeli arçalar hem dürli oýnawaçlara bürendiler. etc.</entry>
</corpus>
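A corpus in this format can be read back with a namespace-aware XML parser. The following is a minimal sketch using only Python's standard library; the function name `read_entries` and the shortened `SAMPLE` document (with example.org URLs) are made up for illustration and are not part of the format.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the format description above.
CORPUS_NS = "http://apertium.org/xml/corpus/0.9"

# Shortened, made-up sample document (illustration only).
SAMPLE = """<?xml version="1.0"?>
<corpus xmlns="http://apertium.org/xml/corpus/0.9" name="rferl" language="tuk">
<entry id="rferl.1" source="http://example.org/1" title="First">Text one.</entry>
<entry id="rferl.2" source="http://example.org/2">Text two.</entry>
</corpus>"""


def read_entries(xml_text):
    """Return (corpus attributes, list of entry dicts) for a corpus document."""
    root = ET.fromstring(xml_text)
    corpus_attrs = {"name": root.get("name"), "language": root.get("language")}
    entries = []
    # Elements live in the corpus namespace; attributes are unqualified.
    for entry in root.findall(f"{{{CORPUS_NS}}}entry"):
        entries.append({
            "id": entry.get("id"),
            "source": entry.get("source"),
            "title": entry.get("title"),  # optional, None when absent
            "text": entry.text or "",
        })
    return corpus_attrs, entries
```

Calling `read_entries(SAMPLE)` returns the corpus attributes and one dictionary per `<entry>`; optional attributes that are missing come back as `None`.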
On the server hosting the XSD, corpus.xsd should be in /xml/corpus/0.9/.
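The root attributes listed above can also be produced programmatically. The sketch below is one possible way to do it with Python's standard library, not the method used by the RFERL scraper; `build_corpus` and the shape of its `entries` argument are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Values taken from the attribute list above.
CORPUS_NS = "http://apertium.org/xml/corpus/0.9"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
SCHEMA_LOCATION = CORPUS_NS + " " + CORPUS_NS + "/corpus.xsd"


def build_corpus(name, language, entries):
    """Serialise entries (dicts with 'text' plus attribute keys) as a corpus.

    Hypothetical helper: the argument layout is an assumption of this sketch.
    """
    # Note: register_namespace changes module-wide serialisation settings.
    ET.register_namespace("", CORPUS_NS)      # emitted as xmlns="..."
    ET.register_namespace("xsi", XSI_NS)      # emitted as xmlns:xsi="..."
    root = ET.Element(f"{{{CORPUS_NS}}}corpus", {
        f"{{{XSI_NS}}}schemaLocation": SCHEMA_LOCATION,
        "name": name,
        "language": language,
    })
    for item in entries:
        # Everything except 'text' becomes an attribute of <entry>.
        attrs = {key: value for key, value in item.items() if key != "text"}
        child = ET.SubElement(root, f"{{{CORPUS_NS}}}entry", attrs)
        child.text = item["text"]
    return ET.tostring(root, encoding="unicode")
```

The returned string has no XML declaration; prepend one or write the tree with `ET.ElementTree(root).write(..., xml_declaration=True)` if a declaration is wanted.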
XSD
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://apertium.org/xml/corpus/0.9" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="corpus">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="entry" maxOccurs="unbounded" minOccurs="0">
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base="xs:string">
                <xs:attribute type="xs:date" name="date" use="optional"/>
                <xs:attribute type="xs:dateTime" name="timestamp" use="optional"/>
                <xs:attribute type="xs:string" name="title" use="optional"/>
                <xs:attribute type="xs:string" name="author" use="optional"/>
                <xs:attribute type="xs:string" name="id" use="required"/>
                <xs:attribute type="xs:anyURI" name="source" use="required"/>
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute type="xs:string" name="name" use="required"/>
      <xs:attribute type="xs:string" name="language" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
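Python's standard library does not include an XSD validator, so full validation needs an external tool. As a lightweight stand-in, the sketch below checks only the attributes the schema marks as required; it does not check attribute types such as xs:date or xs:anyURI. The function name `missing_required` is an assumption of this example.

```python
import xml.etree.ElementTree as ET

CORPUS_NS = "http://apertium.org/xml/corpus/0.9"
# Attributes declared with use="required" in the schema above.
REQUIRED_CORPUS_ATTRS = ("name", "language")
REQUIRED_ENTRY_ATTRS = ("id", "source")


def missing_required(xml_text):
    """Return a list of problems found; an empty list means none were found.

    This is a partial check, not XSD validation.
    """
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{CORPUS_NS}}}corpus":
        problems.append(f"unexpected root element: {root.tag}")
        return problems
    for attr in REQUIRED_CORPUS_ATTRS:
        if attr not in root.attrib:
            problems.append(f"corpus: missing attribute '{attr}'")
    entries = root.findall(f"{{{CORPUS_NS}}}entry")
    for index, entry in enumerate(entries, start=1):
        for attr in REQUIRED_ENTRY_ATTRS:
            if attr not in entry.attrib:
                problems.append(f"entry {index}: missing attribute '{attr}'")
    return problems
```

An empty result only means the required attributes are present; a document can still fail real schema validation, for example because of a malformed timestamp.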
DTD
<!ELEMENT corpus ( entry+ ) >
<!ATTLIST corpus language CDATA #REQUIRED >
<!ATTLIST corpus name NMTOKEN #REQUIRED >
<!ATTLIST corpus xmlns CDATA #REQUIRED >
<!ELEMENT entry ( #PCDATA ) >
<!ATTLIST entry date CDATA #IMPLIED >
<!ATTLIST entry id ID #REQUIRED >
<!ATTLIST entry source CDATA #REQUIRED >
<!ATTLIST entry timestamp CDATA #IMPLIED >
<!ATTLIST entry title CDATA #IMPLIED >
<!ATTLIST entry author CDATA #IMPLIED >
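The DTD declares the entry id attribute with type ID, which under XML validity rules means each value must be unique within the document. A validating parser enforces this; the sketch below is a rough standard-library check for duplicates only, and the function name `duplicate_ids` is made up for this example.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def duplicate_ids(xml_text):
    """Return the sorted id values used by more than one <entry>."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        element.get("id")
        for element in root.iter()
        # Compare local names so the check works with or without a namespace.
        if element.tag.rsplit("}", 1)[-1] == "entry"
        and element.get("id") is not None
    )
    return sorted(value for value, count in counts.items() if count > 1)
```

An empty list means no id is repeated; it does not confirm that the ids are valid XML names or that the document is valid against the DTD in other respects.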