Corpora formats
Background
The Apertium project uses corpora to test coverage and translation quality, among other things. These uses place different requirements on corpora, and several formats for storing them have sprung up. Some examples include:
- a giant plain text file with no markup tags
- a directory with a single file per article
- various combinations of the single-file and directory-based approaches
Jonathan (who's been working on a scraper to build corpora) has talked with Brendan (who's implementing quality testing around corpora) and Fran (who does a number of other things with the corpora, and who developed earlier implementations of what Jonathan and Brendan are working on). The following is Jonathan's idea for a standard corpus format for use by Apertium.
Needs
A corpus should be easily parsed by software that needs to get data from it. There is also metadata that should be stored in the corpus, and there should be some thought given to directory structure too.
Metadata
Besides [article] content, a corpus ideally needs to include the following metadata:
- name/abbreviation of source
- language of content (per article)
- time/date scraped/added (probably optional)
- article title (probably optional, e.g., for non-articles)
- article source/link/url
- unique article identifier (hash of url or site's article id)
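To make the list above concrete, here is a minimal sketch of one article's metadata as a Python record. The class name `ArticleRecord` and all field names are hypothetical and chosen only for illustration; the list above does not fix any names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArticleRecord:
    """One article plus the metadata listed above (all names are illustrative)."""
    source_abbrev: str                  # name/abbreviation of source
    language: str                       # language of content, per article
    url: str                            # article source/link/url
    article_id: str                     # hash of url or the site's own article id
    content: str                        # the article text itself
    title: Optional[str] = None         # probably optional, e.g. for non-articles
    time_scraped: Optional[str] = None  # probably optional


record = ArticleRecord(
    source_abbrev="foo",
    language="en",
    url="http://news.example.com/article/204",
    article_id="204",
    content="Lorem ipsum dolor sit amet.",
)
```

The two fields marked "probably optional" default to `None`, so a record can be created without them.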
Directory structure
Jonathan's scraper defaulted to a single directory per language, with individual files per article, each file being named with the following convention:
- [source abbreviation].[url hash or source's article number].html
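The naming convention above could be sketched as follows. This is not the scraper's actual code: the function name `article_filename` is hypothetical, and SHA-1 is only an assumed choice of hash, since the text says "url hash" without naming an algorithm.

```python
import hashlib
from typing import Optional


def article_filename(source_abbrev: str, url: str,
                     article_id: Optional[str] = None) -> str:
    """Build '[source abbreviation].[url hash or article number].html'."""
    if article_id is not None:
        # Use the source's own article number when one is available.
        ident = article_id
    else:
        # Otherwise fall back to a hash of the URL (SHA-1 is an assumption).
        ident = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"{source_abbrev}.{ident}.html"


# With a known article number the result is "foo.204.html".
print(article_filename("foo", "http://news.example.com/article/204", "204"))
# Without one, the middle part is a 40-character hex digest of the URL.
print(article_filename("foo", "http://news.example.com/article/204"))
```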
Brendan defaulted to one giant file per corpus, with no metadata.
Fran has implemented both of these schemata, but seems to prefer one directory per source, with one file per article.
The main disadvantage of one-file-per-article is that this can put a lot of pressure on the file system if a corpus gets at all large. The main disadvantage of a giant file per corpus is that it needs internal structure to be useful for certain purposes (i.e., more than just content), and can become difficult to work with manually.
In the end, the directory structure should be arbitrary: tools that read a corpus should not have to rely on any particular layout.
Proposed implementation
Jonathan proposes an XML schema for storing a corpus. The default directory structure will be one directory per language and one file per source, with each file containing one entry per article. However, since the directory structure is arbitrary, the information it would otherwise convey (such as language and source) should also be recorded in the XML itself.
XML schema
<source abbrev="foo" name="Foo Bar's news" (language="en")>
   <entry (language="en") timescraped="[timestamp]" id="204">
      <title>Mr. X said "blah" today</title>
      <source>http://news.example.com/article/204</source>
      <content>
         Lorem ipsum dolor sit amet, consectetur adipiscing elit. In accumsan fringilla felis ac vehicula. Proin orci lacus, tincidunt et lobortis vitae, egestas a urna. Suspendisse id mi ut metus tempus cursus. Vivamus blandit, neque eget aliquam mollis, diam urna tempus eros, quis feugiat velit mi sit amet mi. Curabitur a arcu eu lectus ultrices facilisis. Pellentesque non ipsum eu lacus ultrices scelerisque. Sed id posuere felis. Fusce nisi orci, condimentum eu pellentesque vulputate, faucibus eu est. Sed sed libero suscipit ipsum volutpat euismod. Proin feugiat vehicula ullamcorper. Integer mattis tempor nunc quis tristique. Etiam commodo mollis lacus ullamcorper suscipit. Nullam pellentesque leo non odio mollis et imperdiet dui dictum.
Proin egestas dignissim lectus. Pellentesque tortor elit, tempus sit amet cursus non, porttitor in ipsum. Praesent eleifend imperdiet velit, et gravida magna vehicula ut. Donec vitae nibh augue, quis fringilla purus. Aliquam tincidunt hendrerit metus, ut tristique purus vehicula eget. Nam tincidunt dolor vel ligula vehicula aliquam. Fusce dignissim ullamcorper porta. Nulla venenatis semper felis a bibendum. Vestibulum non massa tellus, et vehicula orci. Pellentesque eleifend consectetur elit vitae semper. Aliquam mauris nisl, gravida sit amet consectetur sit amet, scelerisque ut dolor. Etiam eu vulputate nisl. Aliquam vulputate eleifend magna quis viverra.
Aliquam pharetra sem a nibh lobortis ac dapibus arcu hendrerit. Vestibulum elementum pulvinar tristique. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque rutrum commodo nunc id sodales. Donec consectetur lacus et felis pulvinar id consequat mauris lacinia. In hac habitasse platea dictumst. Etiam est arcu, sollicitudin ut posuere non, faucibus id sem. Nullam ultricies, risus sed rhoncus congue, massa justo euismod quam, nec pharetra purus nibh in erat. Nulla iaculis eros eu nisi egestas congue fermentum ante suscipit. Proin leo elit, pretium quis venenatis id, egestas nec urna. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Quisque congue turpis ac risus scelerisque a consequat sapien aliquet.
      </content>
   </entry>
   <entry>
      ... (etc)
   </entry>
</source>
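As a rough illustration of how software might read a file in this format, here is a sketch using Python's standard `xml.etree.ElementTree`. It assumes the parenthesised `language` attributes in the schema above are optional, with the per-entry value overriding the source-level one; the function name `read_entries` and the dictionary keys are hypothetical, and the sample content is shortened.

```python
import xml.etree.ElementTree as ET

# A shortened instance of the schema above; "[timestamp]" is kept as a placeholder.
SAMPLE = """<source abbrev="foo" name="Foo Bar's news" language="en">
   <entry timescraped="[timestamp]" id="204">
      <title>Mr. X said "blah" today</title>
      <source>http://news.example.com/article/204</source>
      <content>
         Lorem ipsum dolor sit amet, consectetur adipiscing elit.
      </content>
   </entry>
</source>"""


def read_entries(xml_text):
    """Yield one dict per <entry>, filling in source-level defaults."""
    root = ET.fromstring(xml_text)
    default_language = root.get("language")  # may be None if not given
    for entry in root.findall("entry"):
        yield {
            "source_abbrev": root.get("abbrev"),
            # Assumption: a per-entry language overrides the source-level one.
            "language": entry.get("language", default_language),
            "id": entry.get("id"),
            "timescraped": entry.get("timescraped"),
            "title": entry.findtext("title"),
            "url": entry.findtext("source"),
            "content": (entry.findtext("content") or "").strip(),
        }


entries = list(read_entries(SAMPLE))
print(entries[0]["language"], entries[0]["id"], entries[0]["url"])
```

Because the metadata lives in the XML rather than in file or directory names, the same function works wherever the file happens to be stored.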
Extra tools we'd need
- A basic script that would let us search only the content, either behaving like grep itself or like cat (whose output we could then pipe to the real grep)
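A minimal sketch of the cat-like variant, assuming corpus files follow the XML schema proposed above. The function names are hypothetical; the script streams the file with `iterparse` so that a large corpus does not have to be held in memory at once.

```python
import sys
import xml.etree.ElementTree as ET


def iter_content(path_or_file):
    """Yield the text of every <content> element, one article at a time."""
    for _event, elem in ET.iterparse(path_or_file, events=("end",)):
        if elem.tag == "content":
            yield (elem.text or "").strip()
        elif elem.tag == "entry":
            # The entry has been fully read; free its children to save memory.
            elem.clear()


def main(argv):
    # Print the content of every corpus file named on the command line.
    for path in argv[1:]:
        for text in iter_content(path):
            print(text)


if __name__ == "__main__":
    main(sys.argv)
```

Saved under a name of your choosing, the output can be piped to the real grep, so no grep options need to be reimplemented.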

