Scraping Dynamic Websites with CasperJS/PhantomJS and PHP / SimpleHTMLDOM

This article is mostly still accurate, except that PhantomJS and CasperJS are no longer actively maintained. Please refer to this updated article. A word of warning: the Simple HTML DOM library may not work on PHP 7 onwards. There are fixes to the library on the web, but consider choosing a different HTML parser.

How to Scrape Dynamic HTML

Simple HTML DOM get dynamic content loaded with JS

Much of the web is now a combination of static and dynamic content. With many websites becoming single-page applications built in Angular or other frameworks, scraping content has become more difficult. Most scrapers are built for static data, not for data that is rendered dynamically.

At this point we need a headless browser to render the content before we can process it properly.
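To illustrate why rendering matters, here is a minimal sketch using PHP's built-in DOMDocument (swapped in here instead of Simple HTML DOM so that it runs without extra packages). The two HTML strings and the `extractLinks` helper are hypothetical: the first string imitates what a plain HTTP fetch of a single-page app might return, the second what a headless browser might hand back after the JavaScript has run.

```php
<?php
// Hypothetical helper: collect the href of every <a> element in an HTML string.
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    // Real-world markup is rarely valid, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

// Made-up markup: what the server sends before any JavaScript runs.
$staticHtml = '<html><body><div id="app"></div><script src="/app.js"></script></body></html>';
// Made-up markup: the same page after a headless browser has rendered it.
$renderedHtml = '<html><body><div id="app"><a href="/a">A</a><a href="/b">B</a></div></body></html>';

echo count(extractLinks($staticHtml)), " link(s) before rendering\n";
echo count(extractLinks($renderedHtml)), " link(s) after rendering\n";
```

The parser itself is not at fault in the first case: the links simply do not exist in the markup until a browser has executed the page's scripts.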

In this particular Stack Overflow question, the user needs to parse the DOM in PHP, although this could be done in a variety of languages. The library may need some tweaks depending on whether you are on Linux or Windows, because the casperjs executable will be located in different places. Typically casperjs is symlinked into /usr/local/bin/; since this may vary, update the path in the library if needed.

  • npm install -g phantomjs
  • npm install -g casperjs
  • composer require phpcasperjs/phpcasperjs
  • composer require sunra/php-simple-html-dom-parser
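Once the packages are installed, it can help to check where the casperjs executable actually ended up, since that location differs between systems. The following is a rough sketch, not part of the php-casperjs library: `findExecutable` is a hypothetical helper that walks the PATH environment variable, and the extra Windows extensions are an assumption about how npm typically installs command shims there.

```php
<?php
// Hypothetical helper: search a PATH-style string for an executable.
// Returns the full path of the first match, or null if nothing is found.
function findExecutable(string $name, ?string $pathEnv = null): ?string
{
    $pathEnv = $pathEnv ?? (getenv('PATH') ?: '');
    $isWindows = DIRECTORY_SEPARATOR === '\\';
    // Assumption: on Windows the command may be a shim such as casperjs.cmd.
    $extensions = $isWindows ? ['', '.exe', '.cmd', '.bat'] : [''];

    foreach (explode(PATH_SEPARATOR, $pathEnv) as $dir) {
        if ($dir === '') {
            continue;
        }
        foreach ($extensions as $ext) {
            $candidate = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $name . $ext;
            // The executable bit is only checked on non-Windows systems.
            if (is_file($candidate) && ($isWindows || is_executable($candidate))) {
                return $candidate;
            }
        }
    }
    return null;
}

$casperPath = findExecutable('casperjs');
echo $casperPath !== null
    ? "casperjs found at: $casperPath\n"
    : "casperjs not found on PATH\n";
```

If the reported path is not the one the library expects, that is the value to adjust.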

Warning: newer PHP versions might break PHP-Simple-HTML-DOM-Parser. If your code does not work, try:

  • composer require kub-at/php-simple-html-dom-parser
  • replace use Sunra\PhpSimple\HtmlDomParser; with use KubAT\PhpSimple\HtmlDomParser;
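If the same script has to run on machines where either package might be installed, one option is to pick the parser class at runtime instead of editing the use statement. This is only a sketch: `pickFirstAvailableClass` is a hypothetical helper, and the autoloader path assumes a standard Composer layout next to the script.

```php
<?php
// Assumption: Composer's autoloader lives in ./vendor; skip it quietly if absent.
if (is_file(__DIR__ . '/vendor/autoload.php')) {
    require __DIR__ . '/vendor/autoload.php';
}

// Hypothetical helper: return the first class name that can be loaded, or null.
function pickFirstAvailableClass(array $candidates): ?string
{
    foreach ($candidates as $class) {
        if (class_exists($class)) {
            return $class;
        }
    }
    return null;
}

// Prefer the newer fork, fall back to the original package.
$parserClass = pickFirstAvailableClass([
    'KubAT\PhpSimple\HtmlDomParser',
    'Sunra\PhpSimple\HtmlDomParser',
]);

if ($parserClass === null) {
    echo "No Simple HTML DOM package is installed\n";
} else {
    // Static call through the class name held in a variable.
    $dom = $parserClass::str_get_html('<a href="/example">example</a>');
    echo $dom->find('a', 0)->href, "\n";
}
```

The rest of the scraping code can then call `$parserClass::str_get_html()` without caring which fork is present.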

JavaScript to PHP Example

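    require 'vendor/autoload.php'; // Composer autoloader for the packages installed above

    use Sunra\PhpSimple\HtmlDomParser;
    use Browser\Casper;

    $casper = new Casper();
    // May need to set more options due to SSL issues
    $casper->setOptions(array('ignore-ssl-errors' => 'yes'));
    // Placeholder URL: replace with the page you want to render
    $casper->start('https://example.com/');
    // Execute the generated CasperJS script; nothing is fetched until run() is called
    $casper->run();
    $output = $casper->getOutput(); // raw CasperJS output, useful for debugging
    $html = $casper->getHtml();     // HTML of the page after rendering
    // Parse the rendered HTML and collect every anchor element
    $dom = HtmlDomParser::str_get_html($html);
    $elems = $dom->find("a");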
    foreach($elems as $e){
        echo $e->href;