Vancouver Data Blog by Neil McGuigan: Less Painful AJAX / Javascript Web Scraping

Saturday, February 11, 2012

Less Painful AJAX / Javascript Web Scraping

If you read my previous post, you'll know that scraping AJAX pages can be a pain, so I wrote a little Java program to make it easier. It takes a list of URLs to scrape, renders each one in a browser, and saves the rendered HTML (both normal and AJAX) and screenshots to a folder.
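
For readers who want to see the general shape of that workflow, here is a minimal sketch using the Python Selenium bindings. It is not the Java program from this post: the folder names, the file naming scheme, and the helper names `output_stem` and `scrape` are all assumptions made for illustration.

```python
import re
import urllib.request
from pathlib import Path


def output_stem(url: str) -> str:
    """Turn a URL into a filesystem-safe file name stem (hypothetical naming scheme)."""
    stem = re.sub(r"^[a-z]+://", "", url.lower())
    stem = re.sub(r"[^a-z0-9]+", "_", stem).strip("_")
    return stem or "page"


def scrape(urls, out_dir="output"):
    """Save the raw HTML, the browser-rendered HTML, and a screenshot for each URL."""
    # Imported here so output_stem() can be used without Selenium installed.
    from selenium import webdriver

    out = Path(out_dir)
    # Assumed layout: one subfolder for HTML, one for screenshots.
    (out / "html").mkdir(parents=True, exist_ok=True)
    (out / "screenshots").mkdir(parents=True, exist_ok=True)

    driver = webdriver.Firefox()
    try:
        for url in urls:
            stem = output_stem(url)

            # "Normal" HTML: what the server sends before any script runs.
            with urllib.request.urlopen(url) as response:
                (out / "html" / f"{stem}_normal.html").write_bytes(response.read())

            # "AJAX" HTML: the DOM as serialized by the browser after loading the page.
            driver.get(url)
            (out / "html" / f"{stem}_ajax.html").write_text(
                driver.page_source, encoding="utf-8"
            )
            driver.save_screenshot(str(out / "screenshots" / f"{stem}.png"))
    finally:
        # Always close the browser, even if one of the URLs fails.
        driver.quit()
```

Calling `scrape(["http://vancouverdata.blogspot.ca/"])` would need Firefox and the `selenium` package installed. Pages that load content some time after the initial page load may also need an explicit wait before `page_source` is read.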

Here's the how-to video:

You need Firefox 3+ installed, as well as Java 1.6. This is a beta project, and no warranty is implied. You can get the file here:

http://dl.dropbox.com/u/1015920/VancouverData/ajaxscraper.zip

Mad props to the Selenium team

9 comments:

Neil McGuigan, February 26, 2012 at 10:29 AM
@emre glad you like it. And glad it helped. Please share it around! Cheers

Neil

Neil Patel, March 27, 2012 at 10:05 AM
Thanks for all your videos. I love them. I have a project I'm going to attempt that would use the clustering examples from your videos, and I have one question about how my data is initially structured. I work in the oil and gas markets, and I want to profile and cluster groups of wells by their decline rates over time. My data would be in an Excel spreadsheet, with each well on a row and, aside from the first column, each column holding a monthly oil production number, like this:

Well NAME   Month1  Month2  Month3  Month4  Month5  Month6  Month7...
Gonzales    100     96      93      85      70      65      30
Dewitt      300     200     100     50      20      10      2
Vancouver   50      45      30      28      23      22      21
Horizon     450     430     420     410     380     350     200
...
...
..

Would you have any idea how I can initially feed my data into RapidMiner before I use the cluster operators?

Neil McGuigan, March 27, 2012 at 10:15 AM
@Neil

Looks like you have a time series there. The layout looks right, in that the attributes are on the top and the entities are on the left.

You might start by changing the numbers to their rates of change. For example, instead of Horizon Month1->Month2 as 450->430, you could have Horizon ToMonth2 as -4.44%, and so on, so that everything is comparable.
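
A minimal sketch of that conversion, using the sample rows from the question. The function name `rates_of_change` is hypothetical, and rounding to two decimals is an assumption made to match the -4.44% figure.

```python
def rates_of_change(volumes):
    """Return the percent change between each pair of consecutive months.

    Assumes no monthly volume is zero, since each value is divided by the
    previous month's volume.
    """
    return [
        round((current - previous) / previous * 100, 2)
        for previous, current in zip(volumes, volumes[1:])
    ]


# Sample rows copied from the question above.
wells = {
    "Gonzales": [100, 96, 93, 85, 70, 65, 30],
    "Dewitt": [300, 200, 100, 50, 20, 10, 2],
    "Vancouver": [50, 45, 30, 28, 23, 22, 21],
    "Horizon": [450, 430, 420, 410, 380, 350, 200],
}

changes = {name: rates_of_change(volumes) for name, volumes in wells.items()}
print(changes["Horizon"][0])  # ToMonth2 for Horizon, prints -4.44
```

Each well ends up with one fewer attribute than it has months, because the first month has nothing to compare against.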

Clustering would find observations (wells in this case) that are similar based on the provided attributes, which are the monthly volumes or rates of change. Is that what you're looking for?

Matthias, December 21, 2012 at 1:23 PM
Hi Neil,

Recently I was doing a lot of web mining with RapidMiner. Nowadays you can hardly avoid websites with dynamically loaded content (using AJAX). Many months have passed since my last RapidMiner activities, so I searched the forums to find out whether there had been any progress in handling JavaScript. I found nothing on that, only a topic where you pointed to Selenium and Chrome for AJAX scraping. That finally led me to your blog and your ajaxscraper tool. The video demonstrates a nice piece of software and made me want to have a look at it. Sadly, it doesn't work for me.

I haven't searched for alternatives yet, but I wanted to let you know about possible issues. I have JRE 1.6.0 (update 38) installed and tried the 32-bit as well as the 64-bit version. When executing the jar file, I get the "Opening browser..." line printed to the console, but nothing else happens. Only the output folder with its two subfolders is generated. Firefox is at 17.0.1. Any ideas? Maybe some more debug output would help?

Regards
Matthias

Anonymous, January 11, 2013 at 12:15 AM
I am working on extracting information about third-party advertisements on a given webpage. I tried HTML parsers such as HtmlUnit, but realized that most third-party ads are dynamic and their information cannot be extracted using static parsing. Most of them are inside iframe tags. Is there any way I can get information about the ads that are embedded inside these iframes?

Can I use HtmlUnit or Selenium to do something like this? These web drivers just simulate the functions of web browsers, so I thought I could use them in Java.
OR
Can I make use of the Adblock Plus libraries in some way to do the required task? Adblock Plus removes ads, so instead of blocking the ads, I could use it just to get information about them. Is this possible? How?

I have been working on this for the past 10 days and I am kind of stuck. I am asking you personally because I have asked this question on several forums but have not received a satisfying response. It would be great if you could give me some clue so that I can start working on it. Any help would be greatly appreciated.

Anonymous, January 11, 2013 at 2:20 AM
Can you please share the source code as well?

Anonymous, February 20, 2013 at 7:48 AM
I have the same problem as @Matthias... I just get the "Opening browser..." text and then nothing happens.

Chris, April 1, 2013 at 8:49 AM
Hi Neil, first, thank you for posting videos and tutorials. I appreciate your efforts very much. I am currently trying to mine a website called https://www.cdproject.net/ for research. I have followed your instructions, but I end up with the following errors:

[I have successfully installed phpunit and selenium]
(1) Hard way of scraping
After I run phpunit functional I get: '".\php.exe" is not recognized as an internal or external command...'
(I have added the path variable, and on PEAR I have added the following:
SET "PHP_PEAR_PHP_BIN=php\php.exe")

What's interesting is after I run ' pear install phpunit/PHPUnit '
and run ' phpunit functional ' again I get this:

require_once(File/Iterator/Autoload.php) .... in C:\php\pear\phpunit\Autoload.php on line 45

I checked this path to make sure I have autoload in that directory.
I have also added the following: include_path = "c:\php\pear" in my php.ini-dist file.

I was wondering if you have any suggestions on what to check.

(2) Easy way of scraping
I recently installed Firefox to try the easy scraping option. Unfortunately, it freezes at "Opening browser..."
(scraping default URL of vancouverdata.blogspot.ca and rapidminer).

I will try restarting my computer to see if that does anything.

Last question:
Is it possible to mine a site with a login? E.g. https://www.cdproject.net/

Thank you for all the help!
Chris

Unknown, June 7, 2013 at 5:43 AM
Hi,

Great program, but I'm having trouble getting it to work when opening the browser.
The error is below. Do you have any idea what causes this issue?
Opening browser...
Unable to bind to locking port 7054 within 45000 ms
Build info: version: '2.19.0', revision: '15849', time: '2012-02-08 16:12:19'
System info: os.name: 'Windows 7', os.arch: 'amd64', os.version: '6.1', java.version: '1.7.0_05'
Driver info: driver.version: FirefoxDriver

Thank you