Difference between revisions of "Corpora"
Revision as of 07:31, 9 January 2015
Lists of corpora under free licences (public domain, CC-BY-SA, GPL, etc.).
You might also want to use Wikipedia as a corpus; see Tagger_training#Creating_a_corpus or Building_dictionaries#Wikipedia_dumps, and the cleanup script at Calculating_coverage.
Corpora
- EuroParl — http://www.statmt.org/europarl/ — EU12 languages up to 44 million words per language
- Use this if you want to do English--<something> (funny alignments for non-English pairs)
- JRC-Acquis — http://langtech.jrc.it/JRC-Acquis.html — EU22 languages
- Use this if you want to do <anything>--<anything> and there is nothing better available.
- Southeast European Times — http://xixona.dlsi.ua.es/~fran/setimes/ — English, Turkish, Bulgarian, Macedonian, Serbo-Croatian, Albanian, Greek, Romanian — approx. 9,000 aligned paragraphs, 90,000 to 120,000 words.
- South African Government Services — http://xixona.dlsi.ua.es/~fran/services-gov-za-en_ZA-af_ZA.txt — English—Afrikaans — 2,500 approx. sentence aligned, 49,375 words.
- IJS-ELAN — http://nl.ijs.si/elan/ — English-Slovenian
- OPUS — http://urd.let.rug.nl/tiedeman/OPUS/index.php — Open Source multilingual corpora
- Open-Tran — http://www.open-tran.eu — single point of access to translations of open-source software in many languages (downloadable as SQLite databases)
- Tatoeba Project — http://tatoeba.org/ — Database of example sentences translated into several languages.
Corpus tools
- Corpus Catcher — http://translate.sourceforge.net/wiki/corpuscatcher/index - Bootstrap corpora from the web
- BootCaT — http://sslmit.unibo.it/~baroni/bootcat.html - Simple Utilities to Bootstrap Corpora and Terms from the Web
- Bitextor — http://sourceforge.net/projects/bitextor/ - Bootstrap bilingual corpora from the web
Format
We use a standard format for corpora, defined by either an XSD (XML Schema Definition) or DTD (Document Type Definition).
To use the XSD (recommended), the root <corpus> element must have these attributes set:
xmlns="http://apertium.org/xml/corpus/0.9"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://apertium.org/xml/corpus/0.9 http://apertium.org/xml/corpus/0.9/corpus.xsd"
XSD
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://apertium.org/xml/corpus/0.9" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="corpus">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="entry" maxOccurs="unbounded" minOccurs="0">
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base="xs:string">
                <xs:attribute type="xs:date" name="date" use="optional"/>
                <xs:attribute type="xs:dateTime" name="timestamp" use="optional"/>
                <xs:attribute type="xs:string" name="title" use="optional"/>
                <xs:attribute type="xs:string" name="id" use="required"/>
                <xs:attribute type="xs:anyURI" name="source" use="required"/>
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute type="xs:string" name="name" use="required"/>
      <xs:attribute type="xs:string" name="language" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
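To make the schema above more concrete, here is a minimal sketch of a corpus document and a quick check of the attributes the XSD marks as required. The corpus name, entry ids, example.org URLs and entry text are invented for illustration, and check_corpus is a hypothetical helper, not part of any Apertium tool. It uses only the Python standard library, so it checks attribute presence and does not perform real XSD validation.

```python
import xml.etree.ElementTree as ET

NS = "http://apertium.org/xml/corpus/0.9"

# Hypothetical sample document: names, ids, URLs and text are made up.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<corpus xmlns="http://apertium.org/xml/corpus/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://apertium.org/xml/corpus/0.9 http://apertium.org/xml/corpus/0.9/corpus.xsd"
        name="example-news" language="en">
  <entry id="e1" source="http://example.org/articles/1" date="2015-01-09" title="First article">Text of the first article.</entry>
  <entry id="e2" source="http://example.org/articles/2">Text of the second article.</entry>
</corpus>
"""


def check_corpus(xml_text):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    # Parse from bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag != "{%s}corpus" % NS:
        problems.append("root element is not <corpus> in the corpus namespace")
    # Attributes the XSD declares with use="required" on <corpus>.
    for attr in ("name", "language"):
        if attr not in root.attrib:
            problems.append("<corpus> is missing required attribute %r" % attr)
    # Attributes the XSD declares with use="required" on each <entry>.
    for i, entry in enumerate(root.findall("{%s}entry" % NS), start=1):
        for attr in ("id", "source"):
            if attr not in entry.attrib:
                problems.append(
                    "entry %d is missing required attribute %r" % (i, attr)
                )
    return problems


if __name__ == "__main__":
    for problem in check_corpus(SAMPLE):
        print(problem)
```

For full validation against corpus.xsd, a schema-aware validator would be needed; this sketch only catches missing required attributes and a wrong root element.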
DTD
<!ELEMENT corpus ( entry+ ) >
<!ATTLIST corpus language CDATA #REQUIRED >
<!ATTLIST corpus name NMTOKEN #REQUIRED >
<!ATTLIST corpus xmlns CDATA #REQUIRED >
<!ELEMENT entry ( #PCDATA ) >
<!ATTLIST entry date CDATA #IMPLIED >
<!ATTLIST entry id ID #REQUIRED >
<!ATTLIST entry source CDATA #REQUIRED >
<!ATTLIST entry timestamp CDATA #IMPLIED >
<!ATTLIST entry title CDATA #IMPLIED >
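The DTD is stricter than the XSD in two places: entry+ requires at least one entry (the XSD allows zero), and id is declared as type ID, so its values must be unique within a document (the XSD only says xs:string). The sketch below checks just those two points with the Python standard library. check_dtd_constraints is a hypothetical helper written for this page, and it is not a substitute for a validating parser.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = "http://apertium.org/xml/corpus/0.9"


def check_dtd_constraints(xml_text):
    """Check two DTD-only rules: at least one <entry>, and unique entry ids."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = root.findall("{%s}entry" % NS)
    problems = []
    # <!ELEMENT corpus ( entry+ ) > means one or more entries.
    if not entries:
        problems.append("DTD requires at least one <entry>")
    # <!ATTLIST entry id ID #REQUIRED > means id values may not repeat.
    counts = Counter(e.get("id") for e in entries if e.get("id") is not None)
    for entry_id, n in sorted(counts.items()):
        if n > 1:
            problems.append("id %r is used %d times" % (entry_id, n))
    return problems


if __name__ == "__main__":
    # Invented example with a repeated id.
    doc = (
        '<corpus xmlns="http://apertium.org/xml/corpus/0.9" name="x" language="en">'
        '<entry id="a" source="http://example.org/1">one</entry>'
        '<entry id="a" source="http://example.org/2">two</entry>'
        "</corpus>"
    )
    print(check_dtd_constraints(doc))
```

A document that passes the XSD can therefore still fail the DTD, so it is safest to give every corpus at least one entry and distinct ids.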