Scraping HTML to save leads to a database

Posted by Alan Barr on Wed 25 January 2017

I have spent my early programming career working with dynamic programming languages. Dynamic languages are a lot of fun, and there is much less overhead in ceremony, boilerplate, and thinking about types. In web marketing, correctness is not that big a deal: usually the websites work or they don't, they are attractive or they're not, they have the functionality or it is missing.

I am now working in the financial services industry, in a primarily .NET shop with a small amount of Linux/PHP lurking in corners. I am not much of a fan of .NET technologies, but in some ways parts of it are not that bad. I would prefer working with more Linux and open source related stacks, but the mission of the company is more important to me than the stack. If standardizing on one set of technologies makes the work more sane and less prone to errors and problems when making changes, then it's for the best. I do not see any point in being partisan about a technology choice; everything has tradeoffs, and there are personal preferences in how people approach code. If we can build consensus, get everyone's input, settle on a choice, and give reasons why the team made it, then it is much easier to accept and own our output.

.NET Core is very much still in development, and I would not expect anyone to base a business process on it just yet. Using Visual Studio Code with it is a pleasure. While not as light as a text editor and not as heavyweight as a full IDE, the experience is pretty amazing. Coming from languages without much IntelliSense-style support, it feels great to get feedback when I am doing something silly. There have been times, though, when I get phantom complaints from VS Code about my classes. My guess is that when I define a class in another file, delete it, then re-add it, VS Code generates metadata in the background and does not properly clean it up.

Coming from a PHP/JavaScript background, I have never found huge value in classes and inheritance for my problems; I have leaned on functional and procedural programming styles instead. I have found that functions provide the encapsulation I want, and the procedural style allows more tweaking and prototyping. When I want some kind of module that shares the same properties but may be implemented slightly differently, that is when I reach for object-oriented programming. Working for a company that is very C# focused, I am working on understanding more about OOP so I can better debug and find issues.

This project came out of a desire to learn more about how classes and OOP work in C#, to learn about .NET Core, and to organize leads for myself when I teach programming. There are a couple of online teaching marketplaces, and I wanted a way to store information about leads and contact them using my own platform tools.

A nice benefit of .NET Core is its focus on being cross-platform. I can develop on Windows/Linux/OSX and deploy to each using the same tooling. I personally prefer to work with Linux servers and find a GUI very distracting to work with; if the financial math changed, I would change my opinion as well. Install .NET Core for your local environment. I ended up installing the 1.0.3 LTS version because of library dependencies my HTML scraper has. As Core is still pretty new, check all your dependencies before diving into the latest version.

Once it is installed, run dotnet new on the command line and by default a new console/command-line application will be generated in the current directory. There are other flags you can pass if you want to start a web service instead. For this project I just want the app to run via cron, and not bother with API endpoints.

Once the files are in place there will be a Program.cs file with a basic hello world program ready to run. A big difference between the old .NET and the new Core is a focus on JSON configuration files instead of XML ones. There seems to be some support for XML-style configuration, but I did not explore much of it. When I finally reached the point of deploying and running my program I ran into some trouble. First, project.json has a frameworks section; if you leave the platform type in there, it causes problems, because it assumes the server that will run the code already has the framework installed, and in my case it did not. Second, a top-level runtimes section needs to list each target runtime as a key with an empty object as its value. This generates content in the bin/debug/netcoreapp1.0.3/ folder, with subfolders for the different runtimes and publish folders under those. So when I went to scp the files to my server, copying centos.7-x64 and running the DLL there was not enough; I had to use the publish subfolder underneath it to run the code on the server.
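
Roughly, the relevant project.json pieces end up looking like this (a sketch: the "type": "platform" attribute has been removed from Microsoft.NETCore.App, and you would adjust the runtime list and version to your own targets):

```json
{
  "frameworks": {
    "netcoreapp1.0": {
      "dependencies": {
        "Microsoft.NETCore.App": {
          "version": "1.0.3"
        }
      }
    }
  },
  "runtimes": {
    "centos.7-x64": {}
  }
}
```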

The last annoying configuration issue I faced was setting publishOptions to publish my appsettings.json and my project.json; otherwise my code would not run. Once all that was settled, my challenge became how to leverage HttpClient to log in to the website as myself, how to handle the HTML content, and then how to post the values I needed to my database, eventually adding logs so that I could monitor the app without having to change much code to do so.
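
The publishOptions section ended up looking something like this (a sketch; include whatever files your app reads at runtime):

```json
{
  "publishOptions": {
    "include": [
      "appsettings.json",
      "project.json"
    ]
  }
}
```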

One place I must give C# credit is its support for async/await, which makes asynchronous code read more linearly. It does a good job of keeping developers productive and not worrying so much about callbacks and continuation passing. I'm a little uneasy with LINQ, though; coming from a more functional JavaScript background, I get the feeling LINQ is doing a lot for me under the hood.
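
As a minimal sketch of what that linear style looks like (the class, method, and URL here are illustrative, not from the real project):

```csharp
using System.Net.Http;
using System.Threading.Tasks;

public static class PageFetcher
{
    public static async Task<string> FetchPageAsync(HttpClient client, string url)
    {
        // Reads top to bottom like synchronous code, but neither call blocks the thread.
        var response = await client.GetAsync(url);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```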

My program consists of three classes. The first is the main program file, which runs one asynchronous task and waits for the result. Second, I have a Lead class which defines what a student lead object looks like; most of this information comes from the HTML I scrape.
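
The overall shape is roughly the following (a sketch: the property names and the body of the task are illustrative, not my actual code):

```csharp
using System;
using System.Threading.Tasks;

// Illustrative lead shape; the real class has whatever fields the scraped HTML provides.
public class Lead
{
    public string Id { get; set; }
    public string Name { get; set; }
    public string Subject { get; set; }
}

public class Program
{
    public static void Main(string[] args)
    {
        // Console apps of this era cannot have an async Main,
        // so the single asynchronous task is started and waited on synchronously.
        RunAsync().GetAwaiter().GetResult();
    }

    private static async Task RunAsync()
    {
        // In the real program this is where the HttpClient service logs in,
        // scrapes the page, and posts the leads.
        await Task.Delay(100);
        Console.WriteLine("done scraping");
    }
}
```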

Finally, I have a pretty big HttpClient service class that does most of the work. While researching how to do this, I ran into a suggestion not to wrap the HttpClient in a using statement. In C#, some classes implement IDisposable, which means an instance should be disposed at a certain point so excess resources are not held. However, if you follow this pattern with HttpClient it can leave a bunch of lingering connections, so it is better to instantiate one HttpClient, mark it static, and use it for the duration of the program. Looking into the issue more, the best practice is still rather unclear. This program does not run very long nor scrape many pages, so the risk of leaks is low. As always, monitor your application and its resource usage.
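
A minimal sketch of the two patterns (the shared client is the one I went with; names and URLs are illustrative):

```csharp
using System.Net.Http;
using System.Threading.Tasks;

public static class Http
{
    // Created once and reused for the lifetime of the program.
    private static readonly HttpClient Client = new HttpClient();

    public static Task<string> GetAsync(string url) => Client.GetStringAsync(url);

    // The pattern to avoid: a new client per request, disposed immediately,
    // which can leave connections lingering after each request.
    public static async Task<string> GetWithUsingAsync(string url)
    {
        using (var client = new HttpClient())
        {
            return await client.GetStringAsync(url);
        }
    }
}
```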

Inside this HttpClient class I set up a bunch of default fields to hold data about the endpoints and links I need to query. In the constructor I set up the HttpClient with an HttpClientHandler so that a cookie container acts as a cookie jar for my requests. One issue I ran into was that the host I was scraping would not respond when I did not provide a user agent. I created a big function that handles logging in, requesting the page to scrape, and passing its content into the AngleSharp library. AngleSharp is what constrained me to the LTS version of .NET Core, but it had everything I needed for scraping and navigating HTML with CSS selectors.
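
A rough sketch of that setup (the user agent string, method names, and return type usage are placeholders rather than the real values):

```csharp
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using AngleSharp.Parser.Html;

public class HttpClientService
{
    private static readonly CookieContainer Cookies = new CookieContainer();

    // One client for the whole run; the cookie container keeps the login
    // session alive between requests.
    private static readonly HttpClient Client = new HttpClient(
        new HttpClientHandler { CookieContainer = Cookies });

    static HttpClientService()
    {
        // The site would not respond without a user agent.
        Client.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (compatible; LeadScraper/1.0)");
    }

    public async Task<AngleSharp.Dom.Html.IHtmlDocument> GetLeadsPageAsync(string url)
    {
        var html = await Client.GetStringAsync(url);

        // AngleSharp parses the page and exposes CSS selectors for navigation.
        var parser = new HtmlParser();
        return parser.Parse(html);
    }
}
```

From the returned document, QuerySelectorAll with a CSS selector pulls out the elements I care about.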

Once I had scraped the HTML for the content I need, I instantiated a new lead object, set some default parameters that are always there, then did some checks and set the optional parameters as well. I added these leads to a list, then sent that list to another function that pushes my leads to another service. On my server I have CouchDB set up, but I did not want to spend extra time configuring more data stores, exposing a port on my server, and so forth. Instead I relied on Azure Functions (similar to AWS Lambda) to receive my data at an endpoint and do an action with it.
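
The sending side is roughly this (a sketch: the function URL is a placeholder, and I am assuming Newtonsoft.Json for serialization):

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json;

public static class LeadSender
{
    private static readonly HttpClient Client = new HttpClient();

    public static async Task SendAsync<T>(List<T> leads, string functionUrl)
    {
        foreach (var lead in leads)
        {
            var json = JsonConvert.SerializeObject(lead);
            var content = new StringContent(json, Encoding.UTF8, "application/json");

            // The Azure Function behind the URL decides whether the lead is new
            // and whether it should be written to the storage table.
            var response = await Client.PostAsync(functionUrl, content);
            response.EnsureSuccessStatusCode();
        }
    }
}
```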

I created two endpoints to send my data to. One saves the data to an Azure Storage table. The other checks whether an ID is already in the table and, if so, notifies my program not to bother saving that lead to the database.

I appreciate how simple it is to hook up an Azure Storage Table to an Azure Function Endpoint as an input or an output. Normally I do this kind of endpoint in JavaScript but since my goal this year is to master C# I am attempting to do it all in C#. The Azure Storage Explorer is a very handy way to navigate storage tables.

Essentially, the Azure Function has to have a parameter on its Run method that binds to the input or output table. If you are querying the table it is an IQueryable; if you are adding to the table it is an ICollector. There might be more nuance as well, but this was good enough to save my data to the storage table.
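
My actual setup uses two separate functions, but to show both binding styles in one place, a rough C# script (run.csx) sketch looks something like this (entity and parameter names are illustrative, and the table bindings themselves are declared in function.json):

```csharp
#r "Microsoft.WindowsAzure.Storage"

using System.Linq;
using System.Net;
using System.Net.Http;
using Microsoft.WindowsAzure.Storage.Table;

public class LeadEntity : TableEntity
{
    public string Name { get; set; }
}

// "id" is assumed to come from the HTTP route; inTable and outTable match
// table bindings defined in function.json.
public static HttpResponseMessage Run(
    HttpRequestMessage req,
    string id,
    IQueryable<LeadEntity> inTable,
    ICollector<LeadEntity> outTable)
{
    // Check whether this lead ID is already stored.
    bool exists = inTable.Where(l => l.RowKey == id).ToList().Any();
    if (exists)
    {
        // Tell the scraper not to bother saving this lead again.
        return req.CreateResponse(HttpStatusCode.Conflict, "already saved");
    }

    // Otherwise add a new row to the storage table.
    outTable.Add(new LeadEntity { PartitionKey = "leads", RowKey = id });
    return req.CreateResponse(HttpStatusCode.Created);
}
```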

The next step for me is to either build a front end or leverage an existing API service to manage my leads and mark them as skipped or pursued.

Once I had completed the program I decided I needed a logger, and using this article I implemented logging with the Microsoft logging extensions for the abstraction and Serilog for a rolling file saved to disk. I created an ApplicationLogging.cs class and used the tips from the article to instantiate the logger in my Program.cs and my HTTPClientService.cs. I deployed to my server and now have a cron job scraping hourly and saving logs, so I can see if any random errors are occurring.
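
The ApplicationLogging class is roughly this (a sketch, assuming the Serilog.Extensions.Logging and Serilog.Sinks.RollingFile packages; the file path is a placeholder):

```csharp
using Microsoft.Extensions.Logging;
using Serilog;

public static class ApplicationLogging
{
    // One logger factory for the whole app, with Serilog writing a rolling file.
    public static ILoggerFactory LoggerFactory { get; } = new LoggerFactory()
        .AddSerilog(new LoggerConfiguration()
            .WriteTo.RollingFile("logs/scraper-{Date}.log")
            .CreateLogger());

    public static Microsoft.Extensions.Logging.ILogger CreateLogger<T>() =>
        LoggerFactory.CreateLogger<T>();
}

// In Program.cs and HTTPClientService.cs the logger is then created once, e.g.:
//   private static readonly Microsoft.Extensions.Logging.ILogger Logger =
//       ApplicationLogging.CreateLogger<Program>();
//   Logger.LogInformation("Scrape finished");
```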

tags: how-to