This is part one of a series of video tutorials on web scraping and web crawling.
In this first video, I show how to grab parts of a web page (scraping) using Google Docs Spreadsheets and XPath.
Google Spreadsheets has a handy function called importXML, which reads in a web page and applies an XPath expression to it, so you can grab specific parts such as one particular value or all of the hyperlinks. For example, =importXML("http://www.example.com", "//a/@href") lists the URL of every link on that page. This is a convenient method, as your data ends up in a spreadsheet that you can easily download as an Excel file.
Watch the video here:
Part B of the video is here (sorry about the crap sound, working on it):
Useful XPaths:
//a
grabs all the anchors (hyperlinks) in a document
//a/@href
grabs all the URLs in hyperlinks in a document
//div[starts-with(@class, 'left')]
grabs all the div elements whose CSS class starts with 'left'
//input[@type='text']/..
grabs the parent element of each text input element
count(//p)
returns the number of paragraph elements in a page
//a[contains(@href, 'craigslist')]/@href
finds the URLs of all the hyperlinks whose href contains the word 'craigslist'
//blockquote/p[not(@align='center')]
finds all the paragraphs directly inside a blockquote that do NOT have align='center'
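If you want to try these expressions outside of a spreadsheet, here is a minimal Python sketch. It is not what the video uses; it assumes a small, well-formed sample page that I made up for illustration, and it uses only the standard library's ElementTree, which supports just a subset of XPath. Attribute selection (/@href) and the functions starts-with(), contains(), count() and not() are therefore emulated with plain Python. A full XPath engine (lxml, for example) would accept the expressions above directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (invented for this example).
SAMPLE = """<html><body>
<div class="left-col"><form><input type="text" name="q"/></form></div>
<div class="right-col"><a href="http://sfbay.craigslist.org/">CL</a></div>
<blockquote><p align="center">centered</p><p>plain</p></blockquote>
<p><a href="http://example.com/">Example</a></p>
</body></html>"""

root = ET.fromstring(SAMPLE)

# //a  -> all anchors in the document
anchors = root.findall(".//a")

# //a/@href  -> ElementTree cannot select attributes, so read them in Python
hrefs = [a.get("href") for a in anchors]

# //div[starts-with(@class, 'left')]  -> no XPath functions, so filter in Python
left_divs = [d for d in root.findall(".//div")
             if d.get("class", "").startswith("left")]

# //input[@type='text']/..  -> this one is supported as written
input_parents = root.findall(".//input[@type='text']/..")

# count(//p)  -> count the matches in Python
p_count = len(root.findall(".//p"))

# //a[contains(@href, 'craigslist')]/@href
craigslist_urls = [h for h in hrefs if h and "craigslist" in h]

# //blockquote/p[not(@align='center')]
not_centered = [p for p in root.findall(".//blockquote/p")
                if p.get("align") != "center"]

print(hrefs)
print([d.get("class") for d in left_divs])
print([e.tag for e in input_parents])
print(p_count)
print(craigslist_urls)
print([p.text for p in not_centered])
```

Real web pages are often not well-formed XML, so ET.fromstring may refuse to parse them; the sample string here sidesteps that problem on purpose.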
You can read more about XPath here:
https://developer.mozilla.org/en/XPath/Functions
and here:
http://www.w3schools.com/XPath/xpath_syntax.asp
Part 1: Web scraping with Google Spreadsheets and XPath
Part 2: Web Crawling with RapidMiner
Part 3: Web Scraping with RapidMiner and XPath
Part 4: Web Scraping AJAX Pages