Vancouver Data Blog by Neil McGuigan: Web scraping with Google Spreadsheets and XPath

Sunday, February 27, 2011

Web scraping with Google Spreadsheets and XPath

This is part one of a series of video tutorials on web scraping and web crawling.

In this first video, I show how to grab parts of a web page (scraping) using Google Docs Spreadsheets and XPath.

Google Spreadsheets has a nice function called importXML which will read in a web page. You can then apply an XPath to that page, to grab various parts of it, such as one particular value, or all of the hyperlinks. This is a convenient method, as your data will be in a format that is easily downloadable in Excel.

Watch the video here:

Part B of the video is here (sorry about the crap sound, working on it):

Useful XPaths:

//a
grabs all the anchors (hyperlinks) in a document

//a/@href
grabs all the URLs in hyperlinks in a document

//div[starts-with(@class, 'left')]
grabs all the div elements whose css class start with 'left'

//input[@type='text']/..
grabs the parent element of all input text elements

count(//p)
returns the number of paragraph elements in a page

//a[contains(@href, 'craigslist')]/@href
find all the hyperlinks that contain the word 'craigslist'

//blockquote/p[not(@align='center')]
find all the paragraphs that do NOT have center alignment

You can read more about XPath here:

https://developer.mozilla.org/en/XPath/Functions

and here:

http://www.w3schools.com/XPath/xpath_syntax.asp

Part 1: Web scraping with Google Spreadsheets and XPath
Part 2: Web Crawling with RapidMiner
Part 3: Web Scraping with RapidMiner and Xpath
Part 4: Web Scraping AJAX Pages

16 comments:

AnonymousFebruary 28, 2011 at 3:42 AM
Great job! Thx!
ReplyDelete
Replies
AnonymousMarch 1, 2011 at 7:44 AM
HURRAY!
Thank you!
Now we only have to hope that this is going to be an at least 10 part series :))
ReplyDelete
Replies
AnonymousMarch 1, 2011 at 1:47 PM
The Xpath idea is really great for allowing real time data. I played around a bit with HTTRack in the past, but the Xpath option seems to be much more flexible and effective. Thanks!
ReplyDelete
Replies
AnonymousMarch 11, 2011 at 1:17 PM
Tnx! More please :-)
ReplyDelete
Replies
Web Page ScraperMarch 13, 2011 at 9:39 PM
Informative video, which gives us more knowledge about web scraing with google spreadsheet. This spreadsheet will scrap a table from an HTML web page. The url of the target web page and the target table element need to be in double quotes.
ReplyDelete
Replies
AnonymousMarch 18, 2011 at 6:49 PM
Have any of you guys tried Helium Scraper (http://www.heliumscraper.com)? Looks kinda promising...
ReplyDelete
Replies
ProxyMeshMarch 31, 2011 at 6:21 PM
This is really cool, I'm looking forward to the other posts in the series.
ReplyDelete
Replies
AnonymousApril 23, 2011 at 3:14 AM
Thanks a lot! This is just great.

I have a question regarding the hyperlinks:

- I want to put the results returned by crawling the hyperlinks returned with //a/@href, in a different sheet, but I do not know how to make reference to them, since each hyperlink is in a cell with the following reference: CONTINUE(B5,2,1), CONTINUE(B5, 3,1), etc.

Any thoughts on how I could procede?

Thanks a lot!
ReplyDelete
Replies
UnknownApril 25, 2011 at 8:16 PM
Thanks for these videos, They are really informative... Just a thought tho, you really pound on that keyboard, can't be good for it lol

Thanks again!
ReplyDelete
Replies
Neil McGuiganMay 23, 2011 at 10:45 PM
@fash, ya, the microphone is built into my laptop, so it picks up all the keystrokes really loudly! i just got a nice external mic, so hopefully the next video sounds better.
ReplyDelete
Replies
Neil McGuiganMay 23, 2011 at 10:46 PM
@anonymous, you should be able to reference they cell values the same as any other value. the continue() function just allows another function to return values to multiple cells.
ReplyDelete
Replies
AnonymousAugust 1, 2011 at 7:03 AM
If you look for web data scraping tool, you can try http://smokedoc.org - a best known data extracting software!
Quick, precise and very efficient!

Helen.
ReplyDelete
Replies
Chris LeSeptember 25, 2011 at 7:05 AM
Great post!! I'm definitely going to check out RapidMiner! I also started a collection of ImportXML & XPaths because we use them at work. I would love to share more recipes with anyone! :) Check it out ..

ImportXML Cookbook:
http://www.seerinteractive.com/blog/importxml-cookbook/2011/09/17/
ReplyDelete
Replies
AnonymousFebruary 2, 2012 at 1:00 PM
Great videos! How does one obtain the links of the images that an ad has? They seem to be embedded into a table but I've tried accessing them and I can't seem to make it work. I would really love to know how this is done. Thanks again!
ReplyDelete
Replies
Neil McGuiganFebruary 2, 2012 at 5:29 PM
@anonymous, you should be able to find img tags and use their @src attribute. hth
ReplyDelete
Replies
SirniöFebruary 6, 2014 at 12:37 AM
This is the first time that i watch this video tutorial and i can say that i learn about web scraping .To be honest i am doing data analysis but aside from that i wanna learn different kind of things specially in online things.
ReplyDelete
Replies

Pages

Sunday, February 27, 2011

Web scraping with Google Spreadsheets and XPath

16 comments:

Archive