Scraping Dynamic Websites with CasperJS/PhantomJS and PHP / SimpleHTMLDOM

Posted by Alan Barr on Fri 07 October 2016

How to Scrape Dynamic HTML

Stack Overflow question: "Simple HTML DOM get dynamic content loaded with JS"

Much of the web is now a combination of static and dynamic content. With many websites becoming single-page web applications built in Angular or other frameworks, scraping content has become more difficult. Most scrapers are built for static markup and never see data that is rendered dynamically by JavaScript.

To scrape such pages, we need a headless browser that executes the JavaScript and renders the content before we process it.
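To illustrate why a headless browser is needed, here is a minimal sketch that parses a page shell without executing its JavaScript. The markup is a hypothetical stand-in for a single-page application, and PHP's built-in DOMDocument is used only to keep the example dependency-free.

```php
<?php
// Hypothetical single-page-app shell: the link is created by JavaScript at runtime.
$staticShell = <<<'HTML'
<html>
  <body>
    <div id="app"></div>
    <script>
      var link = document.createElement('a');
      link.href = '/loaded-later';
      link.textContent = 'Loaded later';
      document.getElementById('app').appendChild(link);
    </script>
  </body>
</html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML($staticShell);
libxml_clear_errors();

// A static parser never executes the <script>, so the <a> element does not exist yet.
$anchorCount = $doc->getElementsByTagName('a')->length;
echo "Anchors found without a browser: ", $anchorCount, PHP_EOL;
```

The link only appears once something actually runs the script, which is the job PhantomJS does here.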

In this particular Stack Overflow question, the user needs to parse the DOM in PHP, although the same approach works in a variety of languages. The library may need some tweaks depending on whether you are on Linux or Windows, because the casperjs executable is located in different places. Typically casperjs is symlinked into /usr/local/bin/; since this may vary, update the path in the library to match your system.
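One way to find out where casperjs actually lives is to search the PATH. The sketch below is an illustration only: findExecutable() is a hypothetical helper, not part of php-casperjs, and how the resulting path is handed to the library depends on the version you installed.

```php
<?php
/**
 * Hypothetical helper: return the first file named $name found in the
 * given PATH string (defaults to the PATH environment variable), or null.
 */
function findExecutable(string $name, ?string $path = null): ?string
{
    $path = $path ?? (getenv('PATH') ?: '');
    $isWindows = PHP_OS_FAMILY === 'Windows';
    // npm creates .cmd shims for global packages on Windows, so try those too.
    $extensions = $isWindows ? ['.cmd', '.exe', '.bat', ''] : [''];

    foreach (explode(PATH_SEPARATOR, $path) as $dir) {
        if ($dir === '') {
            continue;
        }
        foreach ($extensions as $ext) {
            $candidate = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $name . $ext;
            // On Windows only check that the file exists; elsewhere require the executable bit.
            if (is_file($candidate) && ($isWindows || is_executable($candidate))) {
                return $candidate;
            }
        }
    }
    return null;
}

$casperPath = findExecutable('casperjs');
echo $casperPath ?? 'casperjs not found on PATH', PHP_EOL;
```

If this prints a directory other than /usr/local/bin/, that is the value to use when adjusting the library.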

  • npm install -g phantomjs
  • npm install -g casperjs
  • composer require phpcasperjs/phpcasperjs
  • composer require sunra/php-simple-html-dom-parser

JavaScript to PHP Example

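<?php
require("vendor/autoload.php");

use Sunra\PhpSimple\HtmlDomParser;
use Browser\Casper;

$casper = new Casper();
// May need to set more options due to SSL issues
$casper->setOptions(array('ignore-ssl-errors' => 'yes'));
// Queue the navigation and give the page's JavaScript 5 seconds to render
$casper->start('https://www.reddit.com');
$casper->wait(5000);
// Nothing is executed until run() is called
$casper->run();
// Raw CasperJS output, only populated after run(); useful for debugging
$output = $casper->getOutput();
// Rendered HTML of the page after the JavaScript has executed
$html = $casper->getHtml();
$dom = HtmlDomParser::str_get_html($html);
// str_get_html() returns false when the HTML is empty or too large
if ($dom === false) {
    exit("Could not parse the rendered HTML\n");
}
// Print the target of every link on the rendered page, one per line
foreach ($dom->find("a") as $e) {
    echo $e->href . PHP_EOL;
}
?>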
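Once the rendered HTML is in hand, the parsing step no longer depends on CasperJS. As a rough sketch of post-processing the links, the example below de-duplicates them and skips fragments and javascript: pseudo-links. It uses PHP's built-in DOMDocument rather than Simple HTML DOM so that it runs without Composer packages; extractLinks() and the sample markup are hypothetical.

```php
<?php
/**
 * Hypothetical helper: collect unique, usable href values from an HTML string.
 */
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        // Skip anchors without a target, in-page fragments, and javascript: pseudo-links.
        if ($href === '' || $href[0] === '#' || stripos($href, 'javascript:') === 0) {
            continue;
        }
        $links[$href] = true; // array keys de-duplicate
    }
    // Numeric-looking keys are cast to int by PHP, so convert back to strings.
    return array_map('strval', array_keys($links));
}

// Hypothetical sample standing in for the rendered page returned by CasperJS.
$sample = '<div><a href="/r/php">PHP</a><a href="#top">Top</a>'
        . '<a href="/r/php">PHP again</a><a>no href</a>'
        . '<a href="https://example.com/">Example</a></div>';

foreach (extractLinks($sample) as $link) {
    echo $link, PHP_EOL;
}
```

In the script above, the same function could be given the string returned by $casper->getHtml() in place of $sample.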