Sunday, February 27, 2011

Web scraping with Google Spreadsheets and XPath

This is part one of a series of video tutorials on web scraping and web crawling.

In this first video, I show how to grab parts of a web page (scraping) using Google Docs Spreadsheets and XPath.

Google Spreadsheets has a handy function called importXML, which reads in a web page. You can then apply an XPath expression to that page to grab parts of it, such as one particular value or all of the hyperlinks. This is convenient because the data lands in a spreadsheet, which is easy to download in Excel format.
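If you'd like to see the same idea outside of a spreadsheet, here is a rough Python sketch of "read a page, then apply a path to it". It is not how Google implements importXML. The `import_xml` helper, the sample markup, and the example.com URLs are all made up for illustration, and the page is an inline string so that nothing gets downloaded. It uses the standard library's ElementTree, which only understands a subset of XPath and needs well-formed XML, so real-world HTML would likely need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a downloaded page (must be well-formed XML).
PAGE = """<html><body>
<h1 id="price">42.50</h1>
<p>See <a href="http://example.com/a">A</a> and <a href="http://example.com/b">B</a>.</p>
</body></html>"""


def import_xml(markup, path):
    """Rough analogue of =importXML(url, xpath), minus the download step."""
    root = ET.fromstring(markup)
    return root.findall(path)


# One particular value: the text of the element with id="price".
price = import_xml(PAGE, ".//h1[@id='price']")[0].text

# All of the hyperlinks: ElementTree cannot select attributes directly,
# so the href is read off each matched anchor.
links = [a.get("href") for a in import_xml(PAGE, ".//a")]

print(price)
print(links)
```

In the spreadsheet itself, the equivalent is a single cell formula that takes the page URL and the XPath as its two arguments.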

Watch the video here:

Part B of the video is here (sorry about the crap sound, working on it):

Useful XPaths:

//a
grabs all the anchors (hyperlinks) in a document

//a/@href
grabs all the URLs in hyperlinks in a document

//div[starts-with(@class, 'left')]
grabs all the div elements whose CSS class starts with 'left'

//input[@type='text']/..
grabs the parent element of all input text elements

count(//p)
returns the number of paragraph elements in a page

//a[contains(@href, 'craigslist')]/@href
finds the URLs of all the hyperlinks whose address contains the word 'craigslist'

//p[not(@align='center')]
finds all the paragraphs that do NOT have center alignment
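To sanity-check what each of these selects, here is a small Python sketch that runs the same queries against a made-up document. The sample markup and URLs are invented for illustration. The XPath in each comment is the expression I'd expect for that description, so treat those as assumptions too. Python's built-in ElementTree has no starts-with(), contains(), not() or count(), so those are emulated with plain Python; a full XPath engine would take the expressions as written.

```python
import xml.etree.ElementTree as ET

# Invented sample page (well-formed XML so ElementTree can parse it).
DOC = ET.fromstring("""<html><body>
<div class="left-nav"><a href="http://vancouver.craigslist.org/">cl</a></div>
<div class="right"><a href="http://example.com/">ex</a></div>
<form><input type="text" name="q"/><input type="submit"/></form>
<p align="center">centered</p>
<p>plain</p>
<p align="left">left</p>
</body></html>""")

# //a : every anchor element
anchors = DOC.findall(".//a")

# //a/@href : the URL of every anchor that has one
urls = [a.get("href") for a in DOC.findall(".//a[@href]")]

# //div[starts-with(@class, 'left')] : emulated with str.startswith
left_divs = [d for d in DOC.iter("div")
             if d.get("class", "").startswith("left")]

# //input[@type='text']/.. : parent of each text input
text_input_parents = DOC.findall(".//input[@type='text']/..")

# count(//p) : emulated with len()
paragraph_count = len(DOC.findall(".//p"))

# //a[contains(@href, 'craigslist')]/@href : emulated with "in"
craigslist_urls = [u for u in urls if "craigslist" in u]

# //p[not(@align='center')] : emulated with a Python comparison
not_centered = [p.text for p in DOC.iter("p") if p.get("align") != "center"]

print(urls, paragraph_count, craigslist_urls, not_centered)
```

Note that the last query keeps paragraphs with no align attribute at all, which matches how not(@align='center') behaves in XPath.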

You can read more about XPath here:

and here:

Part 1: Web scraping with Google Spreadsheets and XPath
Part 2: Web Crawling with RapidMiner
Part 3: Web Scraping with RapidMiner and XPath
Part 4: Web Scraping AJAX Pages