Vancouver Data Blog by Neil McGuigan: Web Scraping AJAX Pages

This is part four of a series of video tutorials on web scraping and web crawling.

You can probably skip this one, and go to the easy version.

Part 1: Web scraping with Google Spreadsheets and XPath

Part 2: Web Crawling with RapidMiner

Part 3: Web Scraping with RapidMiner and XPath

This post explains how to capture HTML from pages generated by Ajax / JavaScript.

Here is the accompanying video.

The first thing you should know is that it is a major, major pain in the ass. Set aside half a day in your calendar, and get some hard liquor. The scraping itself is easy, but whoever wrote the installers for these programs has serious issues.

This method uses PHP, but it is likely simpler to do in Java if you already know it.

The main idea is to use Selenium, a functional testing framework that can automate web browsers such as Chrome. Point it at a URL, have Chrome render the page (including the Ajax content), wait a few seconds, get the HTML from the browser, and save it to a file. This is all done automatically.

I am going to gloss over most of the software installation steps, as they are lengthy, and explained (poorly) elsewhere. I will also not answer any comment questions about the software installation, but I encourage you to help one another. Try stackoverflow.com for help too.

install Java Runtime Environment if you do not already have it
install Selenium Server 2, and run it from the command line: java -jar selenium-server.jar
install PHP, make sure it's in your system path
install PEAR
install PHPUnit with all dependencies (using PEAR, read their site)
install PHPUnit_Selenium with all dependencies (using PEAR)
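
Before moving on, it can save some head-scratching to confirm that Selenium Server is actually running. Here is a minimal sketch of such a check, assuming the server runs on the same machine and uses its default port, 4444. The isPortOpen function is a made-up helper for illustration, not part of Selenium or PHPUnit. An open port only shows that something is listening, not that it is Selenium.

```php
<?php

// Hypothetical helper: returns true if something accepts TCP connections on $host:$port.
function isPortOpen($host, $port, $timeoutSeconds = 2.0)
{
    $errno = 0;
    $errstr = '';

    // @ hides the warning fsockopen emits when the connection fails
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeoutSeconds);
    if ($socket === false) {
        return false;
    }

    fclose($socket);
    return true;
}

// 4444 is the port Selenium Server listens on by default
if (isPortOpen('localhost', 4444)) {
    echo "Something is listening on port 4444, probably Selenium Server\n";
} else {
    echo "Nothing is listening on port 4444; start the server first\n";
}
```

Save it under any name you like and run it with php from the command line while the server is supposed to be up.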

create a folder 'tests', add phpunit.xml:

<phpunit
  colors="false"
  convertErrorsToExceptions="true"
  convertNoticesToExceptions="true"
  convertWarningsToExceptions="true"
  stopOnFailure="false"
  verbose="true">

 <selenium>
  <browser name="Chrome" browser="*googlechrome" />
  <!-- <browser name="Internet Explorer" browser="*iexplore" /> -->
 </selenium>

</phpunit>

create a file tests/functional/AjaxTest.php:

<?php

class AjaxTest extends PHPUnit_Extensions_SeleniumTestCase {

    public function setUp(){
        parent::setUp();

        //you would set this to whatever website you want to scrape
        $this->setBrowserUrl('http://dev.sencha.com/');
    }

    public function testA(){

        //this is an example of a page that you would want to scrape
        $this->open('deploy/ext-4.0.0/examples/feed-viewer/feed-viewer.html');

        //give the page's JavaScript time to run
        $this->waitForCondition('', 5000); //5 seconds

        //save the html output to a file
        file_put_contents("ajax_output.html", $this->getHtmlSource());
    }
}

and you're done.

Then, on the command line, go to your tests folder and run phpunit functional. The HTML is saved as ajax_output.html in the folder you ran the command from.
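
Once you have the saved file, you can pull data out of it with XPath, as in the earlier parts of this series. What follows is only a sketch using PHP's built-in DOM extension. The extractText function is a made-up helper, and the '//a' expression is a placeholder, so swap in whatever XPath matches the elements you are after.

```php
<?php

// Hypothetical helper: returns the trimmed text of every node in $html matching $xpath.
function extractText($html, $xpath)
{
    if ($html === '') {
        return array();
    }

    // Scraped HTML is rarely valid; collect parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $finder = new DOMXPath($doc);
    $nodes = $finder->query($xpath);

    $results = array();
    if ($nodes === false) {
        // query() returns false when the XPath expression is malformed
        return $results;
    }

    foreach ($nodes as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}

// ajax_output.html is the file written by the test; '//a' is a placeholder
if (is_file('ajax_output.html')) {
    print_r(extractText(file_get_contents('ajax_output.html'), '//a'));
}
```

Run it from the same folder as ajax_output.html. If the file is not there, the script does nothing.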


Thursday, February 9, 2012
