Scraping HTML to use for data science purposes

Posted by Alan Barr on Sun 09 April 2017

Recently the company that I work for, Veterans United, made it to #27 on Fortune's list of the "Best Places to Work". One of the nice things about this page is that it presents a lot of easily scrapeable data about all 100 of the best places to work. My initial goal for this project was to record in a text file the diversity data stored in each page's tables. I have been spoiled by C#'s excellent support for the async/await keywords and wanted to see how Python 3.6 would change my normal workflow.

One challenge I have run into with the latest Python, and even .NET Core, is that everything is in constant flux. By the time I have written about one topic, many of the core utilities have changed underneath me. For example, .NET Core moved from a JSON configuration back to an XML .csproj file. The reasons for the change make sense, but it adds confusion later as we constantly google what the latest idiom for doing something in the language might be.

Overall I like working with Python when I want to iterate on a prototype quickly and not have to worry too much about decisions that don't matter immediately. The documentation is great and there is a wealth of modules available. To grab this data I decided to write it in the classic synchronous manner and to try an asynchronous version as well. The point of choosing one over the other is to eliminate any delay caused by an action that blocks other actions from occurring. Since this list is only the top 100 and I would only be running the program from one machine, I did not see any significant difference between the two implementations. The total time each script took appeared to depend more on the server it was talking to and its cache.

Generally I prefer to implement an asynchronous version of my code because blocking UI threads is bad for the user. If my program can do other work while something is waiting, I would rather it continue on. I could look into threads and attempt multithreaded programming, but that is really easy to get wrong. Most programs run for a short enough time that running actions sequentially is fine. Recently, when I made a Slack bot for work, it made sense to use async/await because Slack's RTM API uses a WebSocket connection that constantly feeds events to the program.
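
First, the straightforward synchronous version, which uses requests and BeautifulSoup to pull down each company's page and pick the percentages out of its table: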

    #!/usr/bin/env python3.6
    import requests
    import re
    from bs4 import BeautifulSoup
    results = list()
    base = "http://beta.fortune.com/best-companies/chg-healthcare-services-{url}"

    for url in range(1,101):
        data = requests.get(base.format(url=url))
        page = BeautifulSoup(data.content, "html5lib")
        title = page.find("title")
        nicetitle = title.text.split(":")[0]
        stats = page.find("table")
        rows = stats.find_all("tr")
        line = ""
        for row in rows:
            # The third cell in each row holds the percentage we want.
            td = row.select("td:nth-of-type(3)")
            for d in td:
                text = d.getText()
                if text == "< 1%":
                    text = "0"
                elif text == "-":
                    pass  # keep the missing-data marker as-is
                else:
                    # Strip the percent sign and convert to a decimal fraction,
                    # e.g. "45%" -> "0.45" and "5%" -> "0.05".
                    text = re.sub("%", "", text)
                    if len(text) == 2:
                        text = "0." + text
                    else:
                        text = "0.0" + text

                line += text + "\t"
        results.append(nicetitle + "\t" + line)

    # Open the output file once and append one tab-separated line per company.
    with open("stats/statssync.txt", "a") as f:
        for result in results:
            f.write(result.rstrip("\t"))
            f.write("\n")

<!-- break -->
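
And here is the asynchronous version, which does the same parsing but fetches each page with aiohttp inside an event loop: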

    #!/usr/bin/env python3.6
    import asyncio
    from aiohttp import ClientSession
    import async_timeout
    import re
    from bs4 import BeautifulSoup
    base = "http://beta.fortune.com/best-companies/chg-healthcare-services-{url}"


    loop = asyncio.get_event_loop()
    results = []
    async def fetch(session, url):
        with async_timeout.timeout(10):
            async with session.get(url) as response:
                return await response.text()

    async def processData(data,num):
        page = BeautifulSoup(data, "html5lib")
        title = page.find("title")
        nicetitle = title.text.split(":")[0]
        stats = page.find("table")
        rows = stats.find_all("tr")
        line = ""
        for row in rows:
            # The third cell in each row holds the percentage we want.
            td = row.select("td:nth-of-type(3)")
            for d in td:
                text = d.getText()
                if text == "< 1%":
                    text = "0"
                elif text == "-":
                    pass  # keep the missing-data marker as-is
                else:
                    # Strip the percent sign and convert to a decimal fraction,
                    # e.g. "45%" -> "0.45" and "5%" -> "0.05".
                    text = re.sub("%", "", text)
                    if len(text) == 2:
                        text = "0." + text
                    else:
                        text = "0.0" + text

                line += text + "\t"

        results.append(nicetitle + "\t" + line)


    async def main(loop):
        async with ClientSession(loop=loop) as session:
            for url in range(1,101):
                data = await fetch(session,base.format(url=url))
                await processData(data,url)
            # Once every page has been processed, write the results out.
            if len(results) >= 100:
                with open("stats/stats.txt", "a") as f:
                    for result in results:
                        f.write(result.rstrip("\t"))
                        f.write("\n")


    loop.run_until_complete(main(loop))
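
One thing to note about the script above is that main awaits each fetch before it starts the next one, so the downloads never actually overlap. If I wanted all the requests in flight at the same time, something like asyncio.gather could schedule them together. This is just a rough sketch reusing the fetch and processData functions above, not a version I benchmarked:

    async def main(loop):
        async with ClientSession(loop=loop) as session:
            # Schedule all 100 fetches at once instead of awaiting them one
            # at a time, then parse each page once they have all arrived.
            pages = await asyncio.gather(
                *(fetch(session, base.format(url=url)) for url in range(1, 101))
            )
            for num, data in enumerate(pages, start=1):
                await processData(data, num)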

I have found in my testing that the async version runs slower. Essentially, since I am scraping from one host and only a hundred pages, I am not going to see much benefit from going the async route. If I had to interact with more pages across more hosts, I might see better results. In my testing the synchronous version would run a few seconds faster than the asynchronous version. Once I had the data downloaded, I could load it into pandas to do some analysis and output charts with the results.
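
As a rough sketch of that last step, the tab-separated file can be read into pandas; the column names here are hypothetical, since they depend on which rows the diversity table actually contains for each company:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical column names -- adjust these to match the rows the
    # diversity table actually has for each company.
    columns = ["company", "minorities", "women"]
    df = pd.read_csv("stats/statssync.txt", sep="\t", names=columns)

    print(df.describe())

    # A quick bar chart of one of the columns.
    df.plot.bar(x="company", y="women", legend=False)
    plt.savefig("stats/women.png")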

tags: how-to