Sunday, February 27, 2011

Web scraping with Google Spreadsheets and XPath

This is part one of a series of video tutorials on web scraping and web crawling.

In this first video, I show how to grab parts of a web page (scraping) using Google Docs Spreadsheets and XPath.

Google Spreadsheets has a nice function called importXML, which reads in a web page from a URL. You pass it an XPath expression as the second argument to grab various parts of that page, such as one particular value or all of the hyperlinks; for example, =importXML("http://example.com", "//a/@href"). This is a convenient method, as your data ends up in a spreadsheet that is easy to download as an Excel file.
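If you want to see the same "load a page, apply a path, get a column of values" idea outside a spreadsheet, here is a minimal Python sketch. It is not how Google implements importXML; the function name import_xml and the sample page are made up for illustration, and it uses Python's built-in ElementTree, which only accepts well-formed markup and a subset of XPath (hence the ".//a" form instead of "//a").

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page. A real page would have to be downloaded first,
# and ElementTree can only parse it if the markup is well-formed XML.
SAMPLE_PAGE = """<html><body>
<h1>Listings</h1>
<a href="http://example.com/one">One</a>
<a href="http://example.com/two">Two</a>
</body></html>"""


def import_xml(markup, path):
    """Return the text of every element matching path, one entry per match.

    This mimics the way a spreadsheet fills one cell per matched node.
    """
    root = ET.fromstring(markup)
    return [(element.text or "").strip() for element in root.findall(path)]


link_texts = import_xml(SAMPLE_PAGE, ".//a")   # text of all the hyperlinks
title = import_xml(SAMPLE_PAGE, ".//h1")       # one particular value

print(link_texts)
print(title)
```

Each list entry corresponds to one cell that the spreadsheet function would fill in.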

Watch the video here:



Part B of the video is here (sorry about the crap sound, working on it):



Useful XPaths:

//a
grabs all the anchors (hyperlinks) in a document

//a/@href
grabs all the URLs in hyperlinks in a document

//div[starts-with(@class, 'left')]
grabs all the div elements whose CSS class starts with 'left'

//input[@type='text']/..
grabs the parent element of each text input element

count(//p)
returns the number of paragraph elements in a page

//a[contains(@href, 'craigslist')]/@href
grabs the URLs of all the hyperlinks whose href contains the word 'craigslist'

//blockquote/p[not(@align='center')]
grabs all the paragraphs directly inside a blockquote that do NOT have center alignment
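To try these expressions without a spreadsheet, the sketch below runs rough equivalents against a small made-up document using Python's built-in ElementTree. Treat it as an approximation: ElementTree understands only part of XPath (no starts-with(), contains(), count() or /@href step), so those pieces are emulated with ordinary Python, and the sample markup and variable names are assumptions of mine, not part of the original tutorial.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample document used only for illustration.
SAMPLE = """<html><body>
<div class="left-nav"><a href="http://sfbay.craigslist.org/">Craigslist</a></div>
<div class="right"><a href="http://example.com/">Example</a></div>
<form><input type="text" name="q"/><input type="submit"/></form>
<blockquote><p align="center">Centered</p><p>Plain</p></blockquote>
</body></html>"""

root = ET.fromstring(SAMPLE)

# //a  and  //a/@href : all anchors, then the URL stored in each one
anchors = root.findall(".//a")
hrefs = [a.get("href") for a in anchors]

# //div[starts-with(@class, 'left')] : prefix test done in Python
left_divs = [d for d in root.findall(".//div")
             if d.get("class", "").startswith("left")]

# //input[@type='text']/.. : parent of each text input
text_input_parents = root.findall(".//input[@type='text']/..")

# count(//p) : number of paragraph elements
paragraph_count = len(root.findall(".//p"))

# //a[contains(@href, 'craigslist')]/@href : substring test done in Python
craigslist_hrefs = [h for h in hrefs if h and "craigslist" in h]

# //blockquote/p[not(@align='center')] : paragraphs that are not centered,
# including those with no align attribute at all
non_centered = [p for p in root.findall(".//blockquote/p")
                if p.get("align") != "center"]

print(hrefs)
print([d.get("class") for d in left_divs])
print([parent.tag for parent in text_input_parents])
print(paragraph_count)
print(craigslist_hrefs)
print([p.text for p in non_centered])
```

A full XPath engine would accept the expressions exactly as written above; the Python filtering here only stands in for the functions the built-in parser lacks.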

You can read more about XPath here:

https://developer.mozilla.org/en/XPath/Functions

and here:

http://www.w3schools.com/XPath/xpath_syntax.asp

Part 1: Web scraping with Google Spreadsheets and XPath
Part 2: Web Crawling with RapidMiner
Part 3: Web Scraping with RapidMiner and XPath
Part 4: Web Scraping AJAX Pages