Sunday, February 27, 2011

Web scraping with Google Spreadsheets and XPath

This is part one of a series of video tutorials on web scraping and web crawling.

In this first video, I show how to grab parts of a web page (scraping) using Google Docs Spreadsheets and XPath.

Google Spreadsheets has a nice function called importXML, which fetches a web page and applies an XPath query to it. That lets you grab various parts of the page, such as one particular value or all of the hyperlinks. This is a convenient method, as your data ends up in a spreadsheet that is easy to download as an Excel file.
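For example, a formula along these lines, where http://example.com is just a placeholder URL, pulls every link URL on a page into a column of cells (the XPath can be swapped for any of the expressions listed further down):

=importXML("http://example.com", "//a/@href")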

Watch the video here:



Part B of the video is here (sorry about the crap sound, working on it):



Useful XPaths:

//a
grabs all the anchors (hyperlinks) in a document

//a/@href
grabs all the URLs in hyperlinks in a document

//div[starts-with(@class, 'left')]
grabs all the div elements whose class attribute starts with 'left'

//input[@type='text']/..
grabs the parent element of all input text elements

count(//p)
returns the number of paragraph elements in a page

//a[contains(@href, 'craigslist')]/@href
grabs the URLs of all the hyperlinks whose href contains 'craigslist'

//blockquote/p[not(@align='center')]
grabs all the paragraphs inside blockquotes that are NOT center-aligned
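To give a rough idea of how these plug into a spreadsheet, the craigslist expression above could be dropped straight into importXML like this (the URL is again just a placeholder):

=importXML("http://example.com/listings", "//a[contains(@href, 'craigslist')]/@href")

Each matching URL then spills into its own row below the cell holding the formula.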

You can read more about XPath here:

https://developer.mozilla.org/en/XPath/Functions

and here:

http://www.w3schools.com/XPath/xpath_syntax.asp

Part 1: Web scraping with Google Spreadsheets and XPath
Part 2: Web Crawling with RapidMiner
Part 3: Web Scraping with RapidMiner and XPath
Part 4: Web Scraping AJAX Pages