Introduction to Screen Scraping

Posted by Alan Barr on Sat 29 September 2018

What is screen scraping?

Screen scraping is the act of downloading webpages to take data, typically embedded in HTML, and parse it into a simpler format. Ideally you are scraping with good intentions because you want to do something with public data that is not exposed in an easier-to-download format such as an API, CSV, or JSON. Be considerate of the site you are scraping and be ethical. Often I want to collect reviews and ratings and do some kind of data science on that data, and a scraper may be required.

Before you start writing your own scraper, investigate whether it is worth doing in the first place. Often the data is already exposed through a hidden API or structured JSON, which negates any need for parsing HTML. I always start by opening the browser developer tools, selecting the network tab, and looking for requests that return JSON or XML.
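If you do find such a request, you can usually call it directly and skip HTML parsing entirely. As a rough illustration in Python with the requests library (the endpoint URL and header below are placeholders, not from any real site):

import requests

# Placeholder endpoint copied from the browser's network tab; substitute the real one.
url = "https://example.com/api/reviews?page=1"

# Some sites expect a browser-like User-Agent header.
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
response.raise_for_status()

data = response.json()  # already structured JSON, no HTML parsing required
print(data)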

Getting Started

Now that you have determined there is no easier way to get your data except parsing HTML, the first thing you want to do is download the webpages. Ideally you want to separate obtaining the pages and scraping them into two steps; this will make the whole process more robust. If the website is dynamically generated on the client side, you will need a headless browser to download the rendered content. You can see an example of this here; otherwise, if the site is all server rendered, things will be much easier.
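As a rough sketch of the headless-browser case (an illustration in Python with Selenium and headless Chrome, not the example linked above; it assumes Chrome and a matching chromedriver are installed):

from selenium import webdriver

# Run Chrome without a visible window
options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")
    # page_source holds the DOM after client-side JavaScript has run;
    # slow pages may also need an explicit wait before reading it.
    with open("0.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)
finally:
    driver.quit()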

Downloading the page

This can be accomplished in a variety of ways, and for learning purposes saving the pages to a local folder works well enough. Start by collecting a list of pages that need to be downloaded. Below are a couple of shell scripts to accomplish the download in PowerShell or Bash. You also want to keep track of any webpage request that failed to download.

PowerShell

$sites = @("https://alanmbarr.github.io/HackMidWestTimeline/", "https://example.com")
$counter = 0
foreach ($site in $sites) {
    try {
        Invoke-WebRequest -Uri $site -OutFile "C:\$counter.html"
    }
    catch {
        # Keep track of pages that failed so they can be retried later
        Write-Warning "Failed to download $site"
    }
    $counter++
}

Bash

#!/bin/bash
set -euo pipefail

counter=0
sites=("https://alanmbarr.github.io/HackMidWestTimeline/" "https://example.com")
for site in "${sites[@]}"
do
   echo "$site"
   # -f makes curl treat HTTP errors as failures; log them so they can be retried
   curl -fsS "$site" -o "$counter.html" || echo "$site" >> failed.txt
   counter=$((counter + 1))
done

Parsing the files

Now that you have all these files on your machine, you want to choose an HTML parser to work with. Your goal is to deal with missing data and come up with a format that will be workable when you use this data later. For simplicity's sake, let's use tab-delimited files (fields separated by \t).

Most of the logic around parsing the HTML files is going to be about dealing with missing data. For example, if I want to grab all the reviews for my company, I want any data related to each review, and a review might have data unique to the reviewer that not all reviews have.

The HTML parser you choose should ideally support XPath and/or CSS selectors. Choose generic paths that do not depend on a specific hierarchy of elements to select your data, and hopefully the structure of the HTML does not change much from page to page.

For PHP I like Simple HTML DOM, and there is a great resource on how to use it here. For other languages there is a great list of options here.
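As an illustration of the parsing step in Python (using lxml, which supports XPath; the review-related class names below are invented for the example and need to be adjusted to the real markup):

from lxml import html

def text_or_default(elements, default=""):
    # Return the text of the first matched element, or a safe default when missing
    return elements[0].text_content().strip() if elements else default

tree = html.parse("0.html")

rows = []
# Generic, class-based XPath rather than a brittle full element hierarchy
for review in tree.xpath("//div[contains(@class, 'review')]"):
    rows.append({
        "rating": text_or_default(review.xpath(".//span[contains(@class, 'rating')]")),
        "author": text_or_default(review.xpath(".//span[contains(@class, 'author')]")),
        "body": text_or_default(review.xpath(".//p[contains(@class, 'text')]")),
    })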

Creating a Data Model

In object-oriented languages this would likely be a class, but it could be as simple as a key-value data structure like a dictionary. As you parse your files, collect the data into this data model and then, in our case, write it to a tab-delimited text file. On the first run, view the file to confirm the data looks correct.

As you tweak this script, you want safe defaults for when data is missing or contains strange content.
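Continuing the Python illustration, a minimal sketch of that data model and the tab-delimited output (the rows list and field names are the hypothetical ones from the parsing sketch above):

import csv

fields = ["rating", "author", "body"]

with open("reviews.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields, delimiter="\t")
    writer.writeheader()
    for row in rows:
        # Safe defaults: missing keys become empty strings, and stray tabs or
        # newlines are replaced so they cannot break the delimited format.
        cleaned = {k: str(row.get(k, "")).replace("\t", " ").replace("\n", " ") for k in fields}
        writer.writerow(cleaned)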

Scaling out

If you are going to take this any further, you'll need to write even more robust code around the scrape process. Hartley Brody has an excellent list of tips. In particular, if you take scraping to the next level, using a database for saving results is great.
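If you do move to a database, a minimal sketch with Python's built-in sqlite3 module (reusing the hypothetical review fields and rows list from the earlier sketches) might look like this:

import sqlite3

conn = sqlite3.connect("reviews.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS reviews (
        rating TEXT,
        author TEXT,
        body TEXT
    )"""
)
# rows is the list of dictionaries built while parsing
conn.executemany(
    "INSERT INTO reviews (rating, author, body) VALUES (:rating, :author, :body)",
    rows,
)
conn.commit()
conn.close()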