Scraping HTML with PHP, Node, and Puppeteer

Scraping in 2018

Interestingly enough, I still receive a decent number of hits on an earlier blog post about web scraping. Not much has changed, except that PhantomJS is no longer the most common tool for the job. Since the Google Chrome team shipped headless Chrome, Puppeteer and similar tools have come along and provide a better experience. I personally do not use PHP as much as I did in the past, but a lot of people still use it.

Today I started by spinning up an Ubuntu Linux virtual machine in Azure and running the command below to install everything headless Chrome requires.

sudo apt install -y gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget

Then I installed PHP, Composer, and Node.js. For Node.js, I recommend going to the Node.js website and following their install steps.

sudo apt install -y php composer php-mbstring zip

I found a nice PHP bridge to Puppeteer called PuPHPeteer, and ran

composer require nesk/puphpeteer sunra/php-simple-html-dom-parser

Warning: newer PHP versions might break PHP-Simple-HTML-DOM-Parser. If your code does not work, try:

  • composer require kub-at/php-simple-html-dom-parser
  • replace use Sunra\PhpSimple\HtmlDomParser; with use KubAT\PhpSimple\HtmlDomParser;

and then

npm install @nesk/puphpeteer

Then I wrote my script against a website I made with Vue.js that renders list elements from a JSON blob.

    <?php
    require("vendor/autoload.php");

    use Nesk\Puphpeteer\Puppeteer;
    use Nesk\Rialto\Data\JsFunction;
    use Sunra\PhpSimple\HtmlDomParser;

    // Launch headless Chrome through the Node.js bridge.
    $puppeteer = new Puppeteer;
    $browser = $puppeteer->launch();

    $page = $browser->newPage();
    $page->goto('https://alanmbarr.github.io/HackMidWestTimeline/');

    // Grab the HTML of the page as rendered in the browser, then parse it in PHP.
    $data = $page->evaluate(JsFunction::createWithBody('return document.documentElement.outerHTML'));
    $dom = HtmlDomParser::str_get_html($data);
    $browser->close();

    // Print the text of every <span> on the page.
    foreach ($dom->find('span') as $element) {
        echo $element->plaintext . "\n";
    }

    // Free the parser's memory.
    $dom->clear();
    ?>
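If the third-party parser keeps giving you trouble on a newer PHP version, the parsing half of the script can probably be done with PHP's built-in DOM extension instead. The following is only a minimal sketch under a few assumptions: extractSpanText() is a hypothetical helper name I made up, the sample HTML string stands in for the $data value returned by $page->evaluate(...), and the DOM extension is installed (on Ubuntu it should come with the php-xml package).

```php
<?php
// Sketch only: extractSpanText() is a hypothetical helper, not part of PuPHPeteer
// or PHP-Simple-HTML-DOM-Parser. It relies on PHP's built-in DOM extension.

function extractSpanText(string $html): array
{
    // DOMDocument::loadHTML() rejects an empty string, so bail out early.
    if ($html === '') {
        return [];
    }

    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Collect the trimmed text content of every <span> element.
    $texts = [];
    foreach ($doc->getElementsByTagName('span') as $span) {
        $texts[] = trim($span->textContent);
    }

    return $texts;
}

// Assumed sample markup, standing in for the rendered page HTML.
$sampleHtml = '<html><body><ul>'
    . '<li><span>First event</span></li>'
    . '<li><span>Second event</span></li>'
    . '</ul></body></html>';

foreach (extractSpanText($sampleHtml) as $text) {
    echo $text . "\n";
}
```

In the real script you would pass $data to the helper instead of $sampleHtml. One thing to watch for: loadHTML() falls back to ISO-8859-1 when the markup does not declare a charset, so non-ASCII text may need extra care.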

Personally, I would rather do most of this in Node.js, but if you’re used to PHP and not JavaScript, this should be a workable solution.