Scraping HTML with PHP, Node, and Puppeteer

Posted by Alan Barr on Fri 28 September 2018

Scraping in 2018

Interestingly enough, I receive a decent number of hits on an earlier blog post about web scraping. Not much has changed, except that PhantomJS is no longer the most common tool for the job. Since the Google Chrome team shipped headless Chrome, Puppeteer and similar tools have come along and provide a better experience. I personally do not use PHP as much as I did in the past, but a lot of people still do.

Today I started by spinning up an Ubuntu Linux virtual machine in Azure and running the command below to install everything headless Chrome requires.

sudo apt install -y gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget

Then I installed PHP, Composer, and Node.js. For Node.js, I recommend going to the Node.js website and following their install steps.

sudo apt install -y php composer php-mbstring

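I found a nice PHP wrapper for Puppeteer called PuPHPeteer and installed both halves of it, the PHP package with Composer and its Node counterpart with npm:

composer require nesk/puphpeteer

npm install @nesk/puphpeteer

The script below also uses the HtmlDomParser class, which comes from the sunra/php-simple-html-dom-parser package, so that needs to be installed as well:

composer require sunra/php-simple-html-dom-parser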

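Then I wrote my script against a website I made with VueJS that renders list elements from a JSON blob. Because the list is built in the browser, a plain HTTP fetch would not see it, which is where headless Chrome comes in.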

    <?php
    require("vendor/autoload.php");

    use Nesk\Puphpeteer\Puppeteer;
    use Nesk\Rialto\Data\JsFunction;
    use Sunra\PhpSimple\HtmlDomParser;

    // Start headless Chrome through the Node bridge.
    $puppeteer = new Puppeteer;
    $browser = $puppeteer->launch();

    // Load the page and let its JavaScript run.
    $page = $browser->newPage();
    $page->goto('https://alanmbarr.github.io/HackMidWestTimeline/');

    // Grab the rendered HTML, then hand it to the PHP DOM parser.
    $data = $page->evaluate(JsFunction::createWithBody('return document.documentElement.outerHTML'));
    $dom = HtmlDomParser::str_get_html($data);
    $browser->close();

    // Print the text of every <span> on the page.
    foreach ($dom->find('span') as $element) {
        echo $element->plaintext . "\n";
    }

    // Free the parser's memory.
    $dom->clear();
    ?>
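The parsing half of the script does not depend on Puppeteer at all: once the rendered HTML is in a PHP string, any HTML parser will do. As a rough sketch of an alternative, and not what I ran above, here is the same span extraction using PHP's built-in DOMDocument. It assumes the DOM extension is available (on Ubuntu that is usually the php-xml package), and `extractSpanText` and the sample HTML string are made-up names for illustration.

```php
<?php
// Hypothetical helper: returns the trimmed text of every <span> in an HTML string.
function extractSpanText(string $html): array
{
    $doc = new DOMDocument();

    // Real-world pages often trip libxml's HTML parser warnings;
    // collect them internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName('span') as $span) {
        $texts[] = trim($span->textContent);
    }
    return $texts;
}

// Stand-in for the string returned by $page->evaluate(...) in the script above.
$html = '<html><body><ul>'
      . '<li><span>First event</span></li>'
      . '<li><span>Second event</span></li>'
      . '</ul></body></html>';

foreach (extractSpanText($html) as $text) {
    echo $text . "\n";
}
```

One caveat with this approach: without a charset declaration in the markup, DOMDocument::loadHTML treats the input as ISO-8859-1, so non-ASCII text may need a charset hint in the HTML before parsing.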

Personally, I would rather do most of this in NodeJS, but if you're more used to PHP than JavaScript, this should be a pretty workable solution.