This is part one of a series of video tutorials on web scraping and web crawling.
In this first video, I show how to grab parts of a web page (scraping) using Google Docs Spreadsheets and XPath.
Google Spreadsheets has a nice function called importXML, which reads in a web page. You can then apply an XPath expression to that page to grab various parts of it, such as one particular value or all of the hyperlinks. This is a convenient method, as your data ends up in a spreadsheet that is easy to download as an Excel file.
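For readers who want to try the same idea outside a spreadsheet, here is a minimal Python sketch of the two steps: read in some markup, then apply a path expression to it and put the results in rows. The `import_xml` helper and the `SAMPLE` markup are hypothetical names made up for illustration, not part of Google Spreadsheets. The sketch uses the standard library's ElementTree, which only handles well-formed markup and only a subset of XPath.

```python
import csv
import io
import xml.etree.ElementTree as ET


def import_xml(markup: str, path: str) -> list[str]:
    """Parse well-formed markup and return the text of every element matched by path."""
    root = ET.fromstring(markup)
    return [(el.text or "").strip() for el in root.findall(path)]


# Hypothetical sample page (an assumption, not taken from the video).
SAMPLE = (
    "<html><body><h1>Listings</h1>"
    "<a href='/a'>First</a><a href='/b'>Second</a>"
    "</body></html>"
)

# ".//a" is ElementTree's spelling of the XPath expression //a.
links = import_xml(SAMPLE, ".//a")

# Write one value per row, like cells in a spreadsheet column.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["link_text"])
writer.writerows([text] for text in links)
csv_text = buffer.getvalue()
```

In the spreadsheet itself, the equivalent is a single cell formula of the form `=importXML(url, xpath_query)`.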
Watch the video here:
Part B of the video is here (sorry about the crap sound, working on it):
grabs all the anchors (hyperlinks) in a document
grabs all the URLs in hyperlinks in a document
grabs all the div elements whose CSS class starts with 'left'
grabs the parent element of all input text elements
returns the number of paragraph elements in a page
finds all the hyperlinks that contain the word 'craigslist'
finds all the paragraphs that do NOT have center alignment
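The descriptions above can be matched with XPath expressions along the following lines. Treat this as a sketch under assumptions: the XPath 1.0 strings in the comments are my best guess at expressions that fit each description, the sample page is made up, and 'contains the word craigslist' is assumed to refer to the link's href. Python's standard ElementTree supports only part of XPath, so the function-based expressions (starts-with, contains, count, not) are emulated in plain Python here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not from the original post).
PAGE = """<html><body>
  <div class="left-nav">
    <a href="http://www.craigslist.org/about">About</a>
    <a href="http://example.com/">Example</a>
  </div>
  <div class="right"><form><input type="text" name="q"/></form></div>
  <p align="center">Centered</p>
  <p>Plain</p>
  <p align="left">Left-aligned</p>
</body></html>"""

root = ET.fromstring(PAGE)

# XPath 1.0: //a
anchors = root.findall(".//a")

# XPath 1.0: //a/@href
# ElementTree cannot select attributes directly, so read them from the elements.
urls = [a.get("href") for a in root.findall(".//a[@href]")]

# XPath 1.0: //div[starts-with(@class, 'left')]
left_divs = [d for d in root.iter("div") if d.get("class", "").startswith("left")]

# XPath 1.0: //input[@type='text']/..
input_parents = root.findall(".//input[@type='text']/..")

# XPath 1.0: count(//p)
paragraph_count = len(root.findall(".//p"))

# XPath 1.0: //a[contains(@href, 'craigslist')]
craigslist_links = [a for a in root.iter("a") if "craigslist" in a.get("href", "")]

# XPath 1.0: //p[not(@align='center')]
non_centered = [p for p in root.iter("p") if p.get("align") != "center"]
```

In Google Spreadsheets, an expression like these would go in as the second argument of importXML, for example `=importXML(url, "//a/@href")`.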
You can read more about XPath here:
Part 1: Web Scraping with Google Spreadsheets and XPath
Part 2: Web Crawling with RapidMiner
Part 3: Web Scraping with RapidMiner and XPath
Part 4: Web Scraping AJAX Pages