Thursday, February 9, 2012

Web Scraping AJAX Pages

This is part four of a series of video tutorials on web scraping and web crawling.

You can probably skip this one, and go to the easy version.


This post explains how to capture the HTML of pages whose content is generated by Ajax / JavaScript.

Here is the accompanying video.

The first thing you should know is that it is a major, major pain in the ass. Set aside half a day in your calendar, and get some hard liquor. The scraping itself is easy, but whoever wrote the installers for these programs has serious issues.

This method uses PHP, but it is likely simpler if you already know Java.

The main idea is to use the functional testing framework Selenium, which can automate web browsers such as Chrome. Point it to a URL, have Chrome render the page (including the Ajax content), wait a few seconds, get the HTML from the browser, and save it to a file. This is all done automatically.

I am going to gloss over most of the software installation steps, as they are lengthy, and explained (poorly) elsewhere. I will also not answer any comment questions about the software installation, but I encourage you to help one another. Try stackoverflow.com for help too.

install Java Runtime Environment if you do not already have it
install Selenium Server 2, and run it from the command line: java -jar selenium-server.jar
install PHP, make sure it's in your system path
install PEAR
install PHPUnit with all dependencies (using PEAR, read their site)
install PHPUnit_Selenium with all dependencies (using PEAR)

Create a folder 'tests' and add phpunit.xml:

<phpunit
  colors="false"
  convertErrorsToExceptions="true"
  convertNoticesToExceptions="true"
  convertWarningsToExceptions="true"
  stopOnFailure="false"
  verbose="true">

 <selenium>
  <browser name="Chrome" browser="*googlechrome" />
  <!-- <browser name="Internet Explorer" browser="*iexplore" /> -->
 </selenium>

</phpunit>
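
A malformed phpunit.xml is an easy mistake to make at this step, so here is a minimal sketch of a sanity check. It is not part of the original setup: seleniumBrowsers() is a hypothetical helper name, and it assumes PHP's SimpleXML extension is enabled. It returns the browser strings declared under the selenium element, or null if the file is not well-formed XML.

```php
<?php
// Hypothetical helper (not from the tutorial): list the browser launcher
// strings configured in a phpunit.xml document, or null if it is not
// well-formed XML.
function seleniumBrowsers($xml)
{
    // Collect parser errors internally instead of printing warnings.
    $previous = libxml_use_internal_errors(true);
    $config = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($config === false) {
        return null;
    }

    $browsers = array();
    foreach ($config->xpath('/phpunit/selenium/browser') as $browser) {
        // Read the browser="..." attribute of each <browser> element.
        $browsers[] = (string) $browser['browser'];
    }
    return $browsers;
}

// Usage: run from the tests folder, next to phpunit.xml.
if (is_file('phpunit.xml')) {
    var_dump(seleniumBrowsers(file_get_contents('phpunit.xml')));
}
```

With the configuration above, the commented-out Internet Explorer entry is ignored, so only the Chrome launcher string should be listed.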

Create a file tests/functional/AjaxTest.php:

<?php
class AjaxTest extends PHPUnit_Extensions_SeleniumTestCase {

    public function setUp() {
        parent::setUp();

        // you would set this to whatever website you want to scrape
        $this->setBrowserUrl('http://dev.sencha.com/');
    }

    public function testA() {
        // this is an example of a page that you would want to scrape
        $this->open('deploy/ext-4.0.0/examples/feed-viewer/feed-viewer.html');

        // give the page's Ajax calls 5 seconds (5000 ms) to finish
        $this->waitForCondition('', 5000);

        // save the html output to a file
        file_put_contents("ajax_output.html", $this->getHtmlSource());
    }
}
Then, on the command line, go to your tests folder and run phpunit functional

And you're done. The rendered HTML ends up in ajax_output.html.
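
If you want to confirm that the capture actually picked up the Ajax-rendered markup rather than an empty page shell, a quick look at the saved file helps. The sketch below is an addition, not part of the original method: countMatches() is a hypothetical helper, the '//a' query is just an example, and it assumes PHP's DOM extension (DOMDocument and DOMXPath) is available.

```php
<?php
// Hypothetical helper (not from the tutorial): count the elements in an
// HTML string that match an XPath query.
function countMatches($html, $xpathQuery)
{
    if ($html === '' || $html === false) {
        return 0; // nothing was captured
    }

    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($xpathQuery);
    return $nodes === false ? 0 : $nodes->length;
}

// Usage: run from the folder that contains ajax_output.html.
if (is_file('ajax_output.html')) {
    $html = file_get_contents('ajax_output.html');
    echo 'links found: ' . countMatches($html, '//a') . "\n";
}
```

A count of zero for an element you can see in the browser is a hint that the page had not finished loading when the HTML was grabbed, in which case a longer wait is the first thing to try.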
