If you read my previous post, you'll see that scraping ajax pages can be a pain. So I wrote a little Java program to make it easier. It takes a list of URLs to scrape, renders each one in a browser, and saves the rendered HTML (both the plain and the ajax version) and screenshots to a folder.
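In rough terms (Python here, and purely a hypothetical sketch, not the tool's actual source), the loop the program performs looks something like this:

```python
import re

def url_to_filename(url):
    # turn a URL into a filesystem-safe name stem for the saved files
    return re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_")

def scrape_all(urls, render):
    """For each URL, render it in a browser and save the resulting HTML.

    `render` stands in for the browser step (in the real tool, Selenium
    drives Firefox and also takes a screenshot); here it is any callable
    that maps a URL to the rendered page source.
    """
    saved = []
    for url in urls:
        html = render(url)  # browser renders the page, ajax included
        path = url_to_filename(url) + ".html"
        with open(path, "w") as f:
            f.write(html)
        saved.append(path)
    return saved

# usage with a stubbed renderer instead of a live browser
files = scrape_all(["http://example.com/page?id=1"],
                   lambda u: "<html><body>rendered</body></html>")
```

The point of taking `render` as a callable is that the browser-driving part is the messy bit; everything around it is just file bookkeeping.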
Here's the how-to video:
You need Firefox 3+ installed, as well as Java 1.6. This is a beta project, and no warranty is implied. You can get the file here:
http://dl.dropbox.com/u/1015920/VancouverData/ajaxscraper.zip
Mad props to the Selenium team
Saturday, February 11, 2012
Thursday, February 9, 2012
Web Scraping AJAX Pages
This is part four of a series of video tutorials on web scraping and web crawling.
You can probably skip this one, and go to the easy version.
This post explains how to capture HTML from Ajax / Javascript generated pages.
Here is the accompanying video.
The first thing you should know is that it is a major, major pain in the ass. Set aside half a day in your calendar, and get some hard liquor. The scraping itself is easy, but whoever wrote the installers for these programs has serious issues.
This method uses PHP, but the same approach is probably simpler in Java if you already know it.
The main idea is to use the functional testing framework Selenium, which can automate web browsers, such as Chrome. Point it to a URL, have Chrome render the page (including ajax), wait a few seconds, get the HTML from the browser, and save it to a file. This is all done automatically.
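The render-wait-save step can be sketched like this (Python for brevity; the post's actual code is PHP, and the page-source getter here is just a stand-in for the Selenium driver's getHtmlSource()):

```python
import time

def save_rendered_html(get_source, out_path, wait_seconds=5):
    """Wait a few seconds for the ajax to finish, then dump the
    browser's rendered HTML to a file.

    `get_source` is any zero-argument callable returning the current
    page source -- in a real run, the Selenium driver's page-source call.
    """
    time.sleep(wait_seconds)  # crude fixed wait, as in the post
    html = get_source()
    with open(out_path, "w") as f:
        f.write(html)
    return out_path

# usage with a stand-in source instead of a live browser
save_rendered_html(lambda: "<html><body>loaded by ajax</body></html>",
                   "ajax_output.html", wait_seconds=0)
```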
I am going to gloss over most of the software installation steps, as they are lengthy, and explained (poorly) elsewhere. I will also not answer any comment questions about the software installation, but I encourage you to help one another. Try stackoverflow.com for help too.
install Java Runtime Environment if you do not already have it
install Selenium Server 2, and run it from the command line: java -jar selenium-server.jar
install PHP, make sure it's in your system path
install PEAR
install PHPUnit with all dependencies (using PEAR, read their site)
install PHPUnit_Selenium with all dependencies (using PEAR)
create a folder 'tests', and add a phpunit.xml file:

<phpunit colors="false"
         convertErrorsToExceptions="true"
         convertNoticesToExceptions="true"
         convertWarningsToExceptions="true"
         stopOnFailure="false"
         verbose="true">
  <selenium>
    <browser name="Chrome" browser="*googlechrome" />
    <!-- <browser name="Internet Explorer" browser="*iexplore" /> -->
  </selenium>
</phpunit>
create a file tests/functional/AjaxTest.php:
<?php
class AjaxTest extends PHPUnit_Extensions_SeleniumTestCase {

    public function setUp() {
        parent::setUp();
        // you would set this to whatever website you want to scrape
        $this->setBrowserUrl('http://dev.sencha.com/');
    }

    public function testA() {
        // this is an example of a page that you would want to scrape
        $this->open('deploy/ext-4.0.0/examples/feed-viewer/feed-viewer.html');
        $this->waitForCondition('', 5000); // wait 5 seconds for the ajax to run
        // save the html output to a file
        file_put_contents("ajax_output.html", $this->getHtmlSource());
    }
}

and you're done.
Then, on the command line, go to your tests folder and run: phpunit functional
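One caveat: the test above waits a flat 5 seconds, which either wastes time or isn't enough. A more robust (hypothetical) variant is to poll for a condition instead, along these lines:

```python
import time

def wait_for(condition, timeout=5.0, poll=0.1):
    """Poll `condition` (a zero-argument callable) until it returns
    truthy or the timeout expires; returns whether it succeeded."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(poll)
    return False

# e.g. instead of sleeping blindly, wait until the rendered source
# contains some marker the ajax is known to insert
page_source = "<div id='feed'>loaded</div>"  # pretend the ajax just finished
assert wait_for(lambda: "loaded" in page_source, timeout=1.0)
```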