Writing a scraper

This page outlines how to develop a scraper for Apertium using our RFERL scraper. The code can be found in our Subversion repository at https://svn.code.sf.net/p/apertium/svn/trunk/apertium-tools/scraper.

Outline

Get to know the website

Visit the website you plan to scrape and locate the archive section, which usually offers an interface to select a given day and see a list of links to articles published on that day.
- If you can't understand the language the website is written in, ask for help on IRC or use a translator and look for a section marked "Archive". If you're unable to locate an archive, find the sitemap and use it as a starting point.
- Sometimes you'll be able to locate a calendar that links to a page with articles from each date, often an optimal situation.
Familiarize yourself with the structure of the URL and how manipulating it will yield a different set of articles to scrape.
- Try to only scrape pages that will be useful. For example, scraping a picture gallery will yield few words so concentrate on scraping the more densely packed articles such as the news.
- The URL will sometimes contain a date which can be manipulated to yield all the articles published on a certain day. (e.g. http://example.com/archive/news/20121104.html - "20121104" indicates that this URL will show a list of articles published on 4 November 2012)
- Other common configurations include having a sequential number which marks pages of articles chronologically. For example, the latest articles have a URL containing "1" and older ones "2", etc. (e.g. http://example.com/archive/news/343.html - "343" indicates that this will show the 343rd page of the list of news articles)
- Devise a URL template that you can use string substitutions on to construct a list of URLs to lists of article links.
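The URL template idea above can be sketched as follows. This is a minimal illustration, not code from the repository: the example.com templates and the function names archive_urls() and paged_urls() are hypothetical, and the real pattern depends on the site being scraped.

```python
from datetime import date, timedelta

# Hypothetical templates modelled on the example URLs above.
DATE_TEMPLATE = "http://example.com/archive/news/{day}.html"
PAGE_TEMPLATE = "http://example.com/archive/news/{page}.html"

def archive_urls(start, end, template=DATE_TEMPLATE):
    """Return one archive URL per day from start to end, inclusive."""
    urls = []
    current = start
    while current <= end:
        # Format the date as YYYYMMDD, e.g. 20121104
        urls.append(template.format(day=current.strftime("%Y%m%d")))
        current += timedelta(days=1)
    return urls

def paged_urls(first_page, last_page, template=PAGE_TEMPLATE):
    """Return one archive URL per page number, for sites without a calendar."""
    return [template.format(page=n) for n in range(first_page, last_page + 1)]
```

For example, archive_urls(date(2012, 11, 3), date(2012, 11, 5)) yields three URLs, one for each day, which the driver script can then fetch in turn.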

Get a list of articles

Write a driver script named scrp-*.py which, given a range of dates (or other parameters depending on the site's structure, e.g. how many pages of articles to scrape if there is no calendar support), generates a list of articles, for example as tuples containing each article's link, title and publication date.

LXML and BeautifulSoup are two useful tools for scraping HTML.
Use Chrome/Firefox's Developer Console with Inspect Element to find distinguishing characteristics for each article link element. For example, each article link could be wrapped in a div with the class articleLink (it's not always that obvious).
Using LXML offers many choices when extracting the article info from the page, from picking specific CSS classes to arbitrary XPath expressions.
If you find that selecting all the article info requires a more complex CSS selector, use a CSS to XPath converter. For example, consider a situation where the link tag to each article has the articleLink class. A possible CSS selector for this would be .articleLink, which would become descendant-or-self::*[contains(concat(' ', normalize-space(@class), ' '), ' articleLink ')] if converted into an XPath expression. An example using LXML with each expression is demonstrated below.

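import lxml.html

# getPage(), conn and url are assumed to be defined elsewhere in the driver script
rawArticlesHtml = getPage(conn, url, rawContent = True)
articlesHtml = lxml.html.fromstring(rawArticlesHtml)
# Either of the two lines below selects the elements with the articleLink class
articleTags = articlesHtml.xpath("descendant-or-self::*[contains(concat(' ', normalize-space(@class), ' '), ' articleLink ')]") #XPath Method
articleTags = articlesHtml.find_class("articleLink") #CSS Selector Method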

As you populate the article list, writing the list to a file is useful for debugging. Outputting it to the console could fail due to character encoding issues. Look below for a useful helper method that does either depending on the display parameter.

def printArticles(articlesData, fileName, display=False):
	if display:
		for (title, url, date) in articlesData:
			print(title, url, date.isoformat())
	else:
		with open(fileName, 'a', encoding='utf-8') as file:
			for (title, url, date) in articlesData:
				file.write("%s, %s, %s\n" % (title, url, date.isoformat()))

Don't worry too much about accidentally populating the article list with duplicate URLs. The scraper is designed to ignore duplicate articles (as long as you implement the url_to_aid() function correctly later).

Add a `Scraper` class

Add an entry to the feed_sites dictionary in scraper_classes.py which maps from the name of the website to a unique Scraper class.
Define a new class in scrapers.py that inherits from the Scraper class and implements two functions, url_to_aid() and scraped(), each of which has an important specification.

url_to_aid(): This function will take a URL as input and convert it to a unique "article id" (aid).

Many sites will use a unique ID inside their article URLs (e.g., http://example.com/news?id=3141592653 or http://example.com/news/3141592653.html). These are fairly simple to extract using a regex or string splitting.
However, if the ID is for some reason not unique, or the site doesn't use unique IDs, or the ID is difficult to extract, it's okay to make a hash of the full URL (which should be unique...).
There are examples of both of these methods implemented in other scrapers in scrapers.py. Take a look if you get stuck.
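The two approaches can be sketched as below. This is a standalone illustration rather than the code in scrapers.py: the regex (a run of six or more digits) is a hypothetical pattern fitted to the example URLs above, the choice of SHA-1 is an assumption, and in the real Scraper subclass the function is a method whose exact signature may differ.

```python
import hashlib
import re

# Hypothetical pattern: a run of 6+ digits, as in the example URLs above.
# Adjust it to the site; a date embedded in the URL would also match this.
ID_PATTERN = re.compile(r"(\d{6,})")

def url_to_aid(url):
    """Return the numeric ID found in the URL, or a hash of the URL if none is found."""
    match = ID_PATTERN.search(url)
    if match:
        return match.group(1)
    # Fallback: hash the full URL so that each distinct URL gets its own aid
    return hashlib.sha1(url.encode("utf-8")).hexdigest()
```

Whichever method is used, the same URL must always map to the same aid, since that is what lets the scraper skip duplicates.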

scraped(): This function will take as "input" the HTML contents of the article and output a cleaned version of the article's text for inclusion in the XML corpus.

First, fill self.doc with the contents of the page, by calling self.get_content(). This is all written for you already, so just call the function once and you're ready for the hard stuff.
Now, LXML/BeautifulSoup will be very useful for scraping the actual article content from the HTML of the entire page.
Most likely, the article text will be wrapped in some sort of an identifiable container, so follow a similar procedure to that which proved useful when populating the list of articles, and identify this element.
Take the element which contains the article content, extract it from the HTML, and then clean it with LXML (to remove scripts, etc. which shouldn't be in the corpus).
The cleaning procedure below often suffices to remove all the HTML tags, changing break tags and paragraph tags into line breaks as necessary.

self.get_content()
cleaned = lxml.html.document_fromstring(lxml.html.clean.clean_html(lxml.html.tostring(self.doc.xpath("//div[@align='justify']")[0]).decode('utf-8')))
cleaned = cleaned.text_content()
return cleaned.strip()

Sometimes, this won't suffice and you'll have to be able to identify the offending elements and remove them manually from the HTML before invoking LXML's clean.
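Manual removal might look like the sketch below. It assumes the third-party lxml package is installed; the function name extract_article_text() and the XPath expressions passed to it are hypothetical and must be adapted to the site. To stay self-contained it parses an HTML string and calls text_content() directly instead of using self.doc and LXML's clean.

```python
import lxml.html

def extract_article_text(page_html, container_xpath, junk_xpaths):
    """Drop unwanted elements from the article container, then return its text."""
    doc = lxml.html.fromstring(page_html)
    # Take the first element matching the container expression
    container = doc.xpath(container_xpath)[0]
    for junk_xpath in junk_xpaths:
        for element in container.xpath(junk_xpath):
            # drop_tree() removes the element and its children but keeps
            # any text that follows the element (its tail)
            element.drop_tree()
    return container.text_content().strip()
```

A call such as extract_article_text(html, "//div[@class='article']", [".//script", ".//div[@class='share']"]) would return the article text without the script contents or the share widget.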

Use `Scraper` class and test

Finally, in the driver script, loop through the list of articles and send each article to the Scraper class you created to fill the corpus with articles. Have a look at the various scrp-*.py scripts currently available to get a feel for how to use the Scraper class. The code below demonstrates the basic idea.

Make sure to set the correct language code when setting up the Source class.
Catch exceptions that occur during scraping but don't fail silently. You don't want a single badly formatted article to stop the entire process.

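ids = None #filled in from the first Source created in the loop
root = None #filled in from the first Source created in the loop
for (title, url, date) in articles: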
	try:
		source = Source(url, title=title, date=date, scraper=ScraperAzadliq, conn=conn) #replace scraper with the one you created earlier
		source.makeRoot("./", ids=ids, root=root, lang="aze") #replace language with the appropriate one
		source.add_to_archive()
		if ids is None:
			ids = source.ids
		if root is None:
			root = source.root
	except Exception as e:
		print(url + " " + str(e))

Scrape enough test articles to determine whether there is any extraneous output in the generated corpus (check the XML file created). If you discover that something is wrong, check the scraped() function again to make sure that you've removed all the bad elements.

Make sure the article IDs generated are unique.
Make sure the URL for each entry corresponds to the article's ID, its title and its publication date.
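The uniqueness check can be automated with a small helper such as the one below. It is a sketch under assumptions: duplicate_aids() is a hypothetical name, and it expects (aid, url) pairs that you have collected yourself, for example by calling url_to_aid() on every URL in the article list.

```python
def duplicate_aids(entries):
    """Given (aid, url) pairs, return {aid: [urls]} for every aid shared by distinct URLs."""
    urls_by_aid = {}
    for aid, url in entries:
        # Group the distinct URLs seen for each article id
        urls_by_aid.setdefault(aid, set()).add(url)
    # Keep only the ids that more than one URL mapped to
    return {aid: sorted(urls) for aid, urls in urls_by_aid.items() if len(urls) > 1}
```

An empty result means every aid corresponds to exactly one URL. The same URL appearing twice is not reported, since the scraper already ignores such duplicates.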

RFERL

If you are scraping RFERL content, you will need category names and numbers of only the real content categories.

Issues with newlines

Problem: The characters "& # 1 3 ;" (spaced apart intentionally) appear throughout after scraped content is written to the .xml file.
Research: Retrieving the page HTML with either curl or wget results in the problematic characters not appearing in the final .xml output; however, they reappear when the HTML is downloaded through a Python HTTPConnection. Since the characters are also not present in any preceding output of the page HTML, it is reasonable to assume that the error occurs in the lxml step: lxml.html.document_fromstring(lxml.html.clean.clean_html(lxml.html.tostring(doc.find_class('zoomMe')[1]).decode('utf-8'))). Directly following this step, the characters appear in the XML output. That still leaves the discrepancy between manually downloaded and Python-downloaded HTML unexplained. The difference is likely due to curl and wget treating the HTML differently than Python does. A diff confirms that most (i.e. 95%) of the discrepancies are whitespace. The characters represent "\r", the carriage return. Online research shows that these problems can be attributed to Windows line endings being incompatible with Linux/Unix standards: "When you code in windows, and use "DOS/Windows" line endings, the your lines will end like this "\r\n". In some xhtml editors, that "\r" is illegal so the editor coverts it to "& # 1 3"." Accordingly, running scrp-azzatyk.py shows that the offending characters consistently appear at the ends of lines in the HTML.
Suggested Solution: The simplest solution is to manually remove the "\r" from the raw HTML after download, like so: res.read().decode('utf-8').replace('\r',' '). This should have no side effects for two reasons. One, HTML generally ignores conventional whitespace, so the space substituted for each "\r" does not change how the page renders. Two, each "\r" is likely followed by a "\n", so replacing the "\r" only removes an extraneous character while the line break itself is preserved. This solves the problem because the problematic characters represent "\r". This type of solution to this seemingly not uncommon problem has been used by others and ensures compatibility with Windows-style "\r\n". This "solution" has been implemented.
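As a standalone illustration of the replacement described above, the helper below decodes downloaded bytes and substitutes a space for each carriage return. The name decode_html() is hypothetical; in the scraper itself the replacement is applied to the result of res.read().

```python
def decode_html(raw_bytes, encoding="utf-8"):
    """Decode downloaded bytes and replace each carriage return with a space."""
    # "\r\n" becomes " \n": the line break survives and the "\r" is gone
    return raw_bytes.decode(encoding).replace("\r", " ")
```

Because the substitution happens before the HTML reaches lxml, no "\r" is left for the later serialization step to turn into a character reference.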

Problem: The character "x" appears throughout after scraped content is written to the .xml file.
Research & Solution: The problem was a small error caused by not filtering out a bad class in ScraperAzattyk. The fix has been committed.

Problem: Paragraphs are not always being created correctly in scraped content, i.e. break tags are occasionally ignored.
Research: Testing shows that the problem occurs when two break tags are present on two separate lines and are directly followed by another tag, generally an em or a strong, although the same problem has been observed with other tags. When the break tags are separated by text, lxml handles them properly. When they are not, lxml fails to recognize the break tags. Test script, Test HTML
Suggested Solution: Submit a bug report to lxml. We could create custom Element classes, but even if we managed to do that, it would be fairly inelegant. A bug report has been filed. It turns out that the bug was in libxml2 rather than lxml and was addressed in a newer version of libxml2 (check the bug report).
