Difference between revisions of "Writing a scraper"

Revision as of 16:35, 2 January 2013

This page outlines how to develop a scraper for Apertium using our RFERL scraper. The code can be found in our Subversion repository at https://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-tools/scraper.

scrp-*.py

The first thing you'll need is a script that collects the URLs and titles of a batch of articles. This script then loops through those articles and passes each one to the Scraper class you'll write (below) to fill the corpus. Have a look at the various scrp-*.py scripts currently available to get a feel for how they work.
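
The loop described above can be sketched roughly as follows. This is only an illustration: StubScraper, fill_corpus and the dictionary layout are hypothetical names made up for this example, and the real scrp-*.py scripts and the Scraper interface in scrapers.py may look different.

```python
# Hypothetical driver sketch; not the actual scrp-*.py code.

class StubScraper:
    """Stand-in for a Scraper subclass (assumed interface: takes a URL,
    exposes scraped())."""

    def __init__(self, url):
        self.url = url

    def scraped(self):
        # A real scraper would download and clean the page here.
        return "article text for " + self.url


def fill_corpus(articles, scraper_class):
    """Loop over (title, url) pairs and collect the scraped text of each."""
    corpus = []
    for title, url in articles:
        scraper = scraper_class(url)
        corpus.append({"title": title, "url": url, "text": scraper.scraped()})
    return corpus


articles = [
    ("First headline", "http://example.com/news/1.html"),
    ("Second headline", "http://example.com/news/2.html"),
]
corpus = fill_corpus(articles, StubScraper)
```

Passing the scraper class in as a parameter keeps the loop independent of any one site, so the same driver shape could be reused with a different Scraper subclass.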

scrapers.py

You need to define a new class in scrapers.py that inherits from the Scraper class.

Your new class will have two new methods:

  • url_to_aid():
This takes a URL and converts it to a unique "article id". For sites that use some form of unique id for their articles (e.g., http://example.com/news?id=3141592653 or http://example.com/news/3141592653.html), you'll want to extract the id, probably with a simple regex. However, if the id turns out not to be unique, the site doesn't use unique ids, or the id is difficult to extract, it's okay to use a hash of the full URL (which should be unique). Both approaches are implemented in other scrapers in scrapers.py.
  • scraped():
The first thing this function does is fill self.doc with the contents of the page by calling self.get_content(). This is all written for you already, so just call the function once and you're ready for the hard stuff.
The hard stuff consists of getting a cleaned, text-only version of just the article content from the page. First make sure you know which element in the page will consistently contain just the article content, and then extract it with lxml. Next, clean that element with lxml (it can contain scripts and other markup that would otherwise end up in the output), and then get the .text_content() of the element. An example of all this follows:
        self.get_content()
        cleaned = lxml.html.document_fromstring(lxml.html.clean.clean_html(lxml.html.tostring(self.doc.xpath("//div[@align='justify']")[0]).decode('utf-8')))
        cleaned = cleaned.text_content()
        return cleaned.strip()
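
For url_to_aid(), the two strategies described above (extract the id with a regex, otherwise hash the full URL) might be combined along these lines. This is a minimal sketch written as a standalone function rather than a method; the regex only covers the two example URL shapes shown above, and the choice of SHA-1 is an assumption, not necessarily what the existing scrapers use.

```python
import hashlib
import re

# Assumption: ids are runs of digits at the end of the URL, either after
# "id=" in the query string or as the final path segment (optionally
# followed by ".html").
ID_PATTERN = re.compile(r"(?:[?&]id=|/)(\d+)(?:\.html)?$")


def url_to_aid(url):
    """Return the numeric article id if one can be found, else a hash of the URL."""
    match = ID_PATTERN.search(url)
    if match:
        return match.group(1)
    # Fallback: a hash of the full URL, which should be unique per article.
    return hashlib.sha1(url.encode("utf-8")).hexdigest()
```

In a real Scraper subclass this would take self as its first argument, and the pattern would need adjusting to the site's actual URL scheme.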

Issues with newlines

Problem: The characters "& # 1 3 ;" (spaced apart intentionally) appear throughout the scraped content after it is written to the .xml file.
Research: Fetching the page HTML with curl or wget produces final .xml output without the problematic characters, but they reappear when the HTML is downloaded through a Python HTTPConnection. Because the characters are also absent from earlier dumps of the page HTML, it is reasonable to assume that they are introduced in the lxml step: lxml.html.document_fromstring(lxml.html.clean.clean_html(lxml.html.tostring(doc.find_class('zoomMe')[1]).decode('utf-8'))). The characters appear in the XML output directly after this step. That still leaves the discrepancy between the manually downloaded HTML and the Python-downloaded HTML unexplained; it is likely due to curl and wget handling the content differently than Python does. A diff of the two downloads supports this: most (roughly 95%) of the differences are whitespace. The characters are the numeric character reference for "\r", the carriage return. According to a Stack Overflow answer (http://stackoverflow.com/questions/1459170/what-is-13), the cause is Windows-style line endings: lines written with DOS/Windows line endings end in "\r\n", and some XHTML editors treat the "\r" as illegal and convert it to "& # 1 3".
Solution: The simplest solution is to replace each "\r" in the raw HTML with a space right after download, like so: res.read().decode('utf-8').replace('\r',' '). This should have no side effects for two reasons. One, HTML generally ignores conventional whitespace. Two, each "\r" is likely followed by a "\n", so swapping the "\r" for a space only changes a redundant character while otherwise preserving whitespace. This solves the problem because the problematic characters represent "\r". At this point I see no more elegant solution. This could be reported to lxml as a bug.
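
As a small, self-contained illustration of that cleanup step, the sketch below applies the same replacement to a hard-coded byte string. The helper name decode_without_carriage_returns and the sample bytes are made up for this example; in the scraper the bytes would come from res.read().

```python
def decode_without_carriage_returns(raw_bytes, encoding="utf-8"):
    """Decode downloaded bytes and replace every carriage return with a space."""
    return raw_bytes.decode(encoding).replace("\r", " ")


# Sample input with Windows-style line endings (stand-in for a real download).
sample = b"<p>first line\r\nsecond line</p>\r\n"
cleaned_html = decode_without_carriage_returns(sample)
```

Because the replacement happens before the HTML reaches lxml, no carriage return is left for the later serialization step to escape.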

Problem: The character "x" appears throughout the scraped content after it is written to the .xml file.
Research:
Solution:

Problem: Paragraphs are not always created correctly in the scraped content, i.e. break tags are occasionally ignored.
Research:
Solution: