Difference between revisions of "Corpora"

From Apertium
Line 1: Line 1:
  +
{{TOCD}}
 
Lists of corpora under free licences (public domain, CC-BY-SA, GPL, etc.).
 
Lists of corpora under free licences (public domain, CC-BY-SA, GPL, etc.).
   
Line 21: Line 22:
 
* BootCaT — http://sslmit.unibo.it/~baroni/bootcat.html - Simple Utilities to Bootstrap Corpora and Terms from the Web
 
* BootCaT — http://sslmit.unibo.it/~baroni/bootcat.html - Simple Utilities to Bootstrap Corpora and Terms from the Web
 
* Bitextor — http://sourceforge.net/projects/bitextor/ - Bootstrap bilingual corpora from the web
 
* Bitextor — http://sourceforge.net/projects/bitextor/ - Bootstrap bilingual corpora from the web
  +
  +
== Format ==
  +
  +
We use a standard format for corpora, defined by either an XSD (XML Schema Definition) or DTD (Document Type Definition).
  +
  +
To use the XSD (recommended), the root <code><corpus></code> element must have these attributes set:
  +
  +
<code>xmlns="http://apertium.org/xml/corpus/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://apertium.org/xml/corpus/0.9 http://apertium.org/xml/corpus/0.9/corpus.xsd"</code>.
  +
  +
=== XSD ===
  +
<code>
  +
<pre>
  +
<?xml version="1.0" encoding="UTF-8"?>
  +
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://apertium.org/xml/corpus/0.9" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  +
<xs:element name="corpus">
  +
<xs:complexType>
  +
<xs:sequence>
  +
<xs:element name="entry" maxOccurs="unbounded" minOccurs="0">
  +
<xs:complexType>
  +
<xs:simpleContent>
  +
<xs:extension base="xs:string">
  +
<xs:attribute type="xs:date" name="date" use="optional"/>
  +
<xs:attribute type="xs:dateTime" name="timestamp" use="optional"/>
  +
<xs:attribute type="xs:string" name="title" use="optional"/>
  +
<xs:attribute type="xs:string" name="id" use="required"/>
  +
<xs:attribute type="xs:anyURI" name="source" use="required"/>
  +
</xs:extension>
  +
</xs:simpleContent>
  +
</xs:complexType>
  +
</xs:element>
  +
</xs:sequence>
  +
<xs:attribute type="xs:string" name="name" use="required"/>
  +
<xs:attribute type="xs:string" name="language" use="required"/>
  +
</xs:complexType>
  +
</xs:element>
  +
</xs:schema>
  +
</pre>
  +
</code>
  +
  +
=== DTD ===
  +
<code>
  +
<pre>
  +
<!ELEMENT corpus ( entry+ ) >
  +
<!ATTLIST corpus language CDATA #REQUIRED >
  +
<!ATTLIST corpus name NMTOKEN #REQUIRED >
  +
<!ATTLIST corpus xmlns CDATA #REQUIRED >
  +
  +
<!ELEMENT entry ( #PCDATA ) >
  +
<!ATTLIST entry date CDATA #IMPLIED >
  +
<!ATTLIST entry id ID #REQUIRED >
  +
<!ATTLIST entry source CDATA #REQUIRED >
  +
<!ATTLIST entry timestamp CDATA #IMPLIED >
  +
<!ATTLIST entry title CDATA #IMPLIED >
  +
</pre>
  +
</code>
   
 
[[Category:Resources]]
 
[[Category:Resources]]

Revision as of 07:31, 9 January 2015

Lists of corpora under free licences (public domain, CC-BY-SA, GPL, etc.).

You might also want to use Wikipedia as a corpus; see Tagger_training#Creating_a_corpus or Building_dictionaries#Wikipedia_dumps, and the cleanup script at Calculating_coverage.

Corpora

Use this if you want to do English--<something> (funny alignments for non-English pairs)
Use this if you want to do <anything>--<anything> and there is nothing better available.

Corpus tools

Format

We use a standard format for corpora, defined by either an XSD (XML Schema Definition) or DTD (Document Type Definition).

To use the XSD (recommended), the root <corpus> element must have these attributes set:

xmlns="http://apertium.org/xml/corpus/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://apertium.org/xml/corpus/0.9 http://apertium.org/xml/corpus/0.9/corpus.xsd".
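
For illustration, a minimal corpus document carrying these attributes might look like the sample embedded in the Python sketch below. The corpus name, language code, entry ids, titles and source URLs are made-up placeholders. The script only confirms that the sample parses and exposes the expected namespace; it does not validate anything against the XSD.

```python
import xml.etree.ElementTree as ET

NS = "http://apertium.org/xml/corpus/0.9"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical sample: name, language, ids, titles and URLs are placeholders.
SAMPLE = f"""<?xml version="1.0" encoding="UTF-8"?>
<corpus xmlns="{NS}"
        xmlns:xsi="{XSI}"
        xsi:schemaLocation="{NS} {NS}/corpus.xsd"
        name="example-news" language="en">
  <entry id="e1" source="http://example.org/a" date="2015-01-09" title="First">Some text.</entry>
  <entry id="e2" source="http://example.org/b">More text.</entry>
</corpus>
"""

# Parse from bytes so the encoding declaration in the sample is honoured.
root = ET.fromstring(SAMPLE.encode("utf-8"))

# ElementTree spells namespaced names as "{namespace}local".
entries = root.findall(f"{{{NS}}}entry")
print(root.tag)
print(root.get(f"{{{XSI}}}schemaLocation"))
print([entry.get("id") for entry in entries])
```

Because the default namespace is declared on the root, the <entry> children inherit it, which is why the lookup uses the namespaced name rather than plain "entry".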

XSD

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" targetNamespace="http://apertium.org/xml/corpus/0.9" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="corpus">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="entry" maxOccurs="unbounded" minOccurs="0">
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base="xs:string">
                <xs:attribute type="xs:date" name="date" use="optional"/>
                <xs:attribute type="xs:dateTime" name="timestamp" use="optional"/>
                <xs:attribute type="xs:string" name="title" use="optional"/>
                <xs:attribute type="xs:string" name="id" use="required"/>
                <xs:attribute type="xs:anyURI" name="source" use="required"/>
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute type="xs:string" name="name" use="required"/>
      <xs:attribute type="xs:string" name="language" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
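
The required attributes in the schema above can be spot-checked without a schema-aware validator. The following is a rough sketch using only the Python standard library; the function name `missing_required` and the sample documents are invented for this example. It checks presence of the `use="required"` attributes only, not their types (for instance, it does not check that `date` is a valid `xs:date`), so it is no substitute for real XSD validation.

```python
import xml.etree.ElementTree as ET

NS = "http://apertium.org/xml/corpus/0.9"


def missing_required(xml_text):
    """Return a list of required attributes that are absent (hypothetical helper)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    problems = []
    if root.tag != f"{{{NS}}}corpus":
        problems.append(f"unexpected root element {root.tag}")
    # use="required" on <corpus>
    for attr in ("name", "language"):
        if root.get(attr) is None:
            problems.append(f"corpus: missing {attr}")
    # use="required" on each <entry>
    for i, entry in enumerate(root.findall(f"{{{NS}}}entry"), start=1):
        for attr in ("id", "source"):
            if entry.get(attr) is None:
                problems.append(f"entry {i}: missing {attr}")
    return problems


# Placeholder documents for demonstration only.
good = (f'<corpus xmlns="{NS}" name="demo" language="en">'
        '<entry id="a" source="http://example.org/">text</entry></corpus>')
bad = f'<corpus xmlns="{NS}" name="demo"><entry id="a">text</entry></corpus>'

print(missing_required(good))  # []
print(missing_required(bad))   # ['corpus: missing language', 'entry 1: missing source']
```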

DTD

<!ELEMENT corpus ( entry+ ) >
<!ATTLIST corpus language CDATA #REQUIRED >
<!ATTLIST corpus name NMTOKEN #REQUIRED >
<!ATTLIST corpus xmlns CDATA #REQUIRED >
 
<!ELEMENT entry ( #PCDATA ) >
<!ATTLIST entry date CDATA #IMPLIED >
<!ATTLIST entry id ID #REQUIRED >
<!ATTLIST entry source CDATA #REQUIRED >
<!ATTLIST entry timestamp CDATA #IMPLIED >
<!ATTLIST entry title CDATA #IMPLIED >
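
Note that the DTD is stricter than the XSD in a few places: it requires at least one entry (`entry+`, where the XSD allows zero), declares the corpus `name` as an NMTOKEN, and declares the entry `id` as an ID, which must be a valid XML name and unique within the document. The sketch below flags documents that would trip over these differences. The helper name `dtd_problems` is invented for this example, and the two regular expressions are ASCII-only approximations of the XML name rules, so treat the result as a hint rather than as DTD validation.

```python
import re
import xml.etree.ElementTree as ET

NS = "http://apertium.org/xml/corpus/0.9"

# ASCII-only approximations of the XML Name and Nmtoken productions.
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9._:-]*")
NMTOKEN_RE = re.compile(r"[A-Za-z0-9._:-]+")


def dtd_problems(xml_text):
    """Report constraints that the DTD imposes beyond the XSD (hypothetical helper)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = root.findall(f"{{{NS}}}entry")
    problems = []
    if not entries:
        problems.append("corpus has no entry (DTD requires entry+)")
    if not NMTOKEN_RE.fullmatch(root.get("name", "")):
        problems.append("corpus name is not an NMTOKEN")
    seen = set()
    for entry in entries:
        entry_id = entry.get("id", "")
        if not NAME_RE.fullmatch(entry_id):
            problems.append(f"id {entry_id!r} is not a valid XML name")
        elif entry_id in seen:
            problems.append(f"duplicate id {entry_id!r}")
        seen.add(entry_id)
    return problems


# Placeholder document: a name with a space, a numeric id, and a repeated id.
doc = (f'<corpus xmlns="{NS}" name="my corpus" language="en">'
       '<entry id="1" source="http://example.org/1">a</entry>'
       '<entry id="b" source="http://example.org/2">x</entry>'
       '<entry id="b" source="http://example.org/3">y</entry>'
       '</corpus>')
for problem in dtd_problems(doc):
    print(problem)
```

Ids that begin with a letter or underscore and contain no spaces (for example `e1`, `e2`) satisfy both the XSD and the DTD.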