Sunday, April 10, 2011

A rainy Sunday in downtown Vancouver

My blog should look better on mobile devices now.

Monday, April 4, 2011

Web Scraping with RapidMiner and XPath

In this video I show how to load 500 HTML files from a previous web crawl, loop through them, use XPath to grab values from each page, and put those values in a data table for later analysis.
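For readers who want to try the same load, loop, extract, tabulate idea outside RapidMiner, here is a rough sketch in Python using only the standard library. It is not what the video does. It assumes the saved pages are well-formed XHTML (real-world HTML often is not, and would need a more forgiving parser), and the function name, the "price" class, and the column names are all made up for illustration.

```python
import glob
import os
import xml.etree.ElementTree as ET

# Namespace of XHTML documents; the "h" prefix is just a local alias.
NS = {"h": "http://www.w3.org/1999/xhtml"}


def scrape_pages(directory):
    """Return one row (a dict) per .html file found in `directory`.

    Hypothetical helper: assumes every file is well-formed XHTML.
    """
    rows = []
    for path in sorted(glob.glob(os.path.join(directory, "*.html"))):
        root = ET.parse(path).getroot()
        # findtext returns None when the path matches nothing.
        title = root.findtext(".//h:title", namespaces=NS)
        # "price" is an assumed class name, purely for illustration.
        price = root.findtext(".//h:span[@class='price']", namespaces=NS)
        rows.append({"file": os.path.basename(path), "title": title, "price": price})
    return rows
```

Calling scrape_pages on a folder of crawled pages gives a list of dicts, one per page, which can be written out as CSV or loaded into whatever tool you use for the analysis step.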




Web Crawling with RapidMiner

Here is part 2 of my series of videos on web crawling with RapidMiner. In this video I show how to crawl about 500 pages from a site, and discuss user agents, crawling rules, and robots exclusion files (robots.txt).
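To make the user agent and robots exclusion ideas a bit more concrete, here is a small Python sketch using the standard library's urllib.robotparser. It is separate from the RapidMiner process in the video. The robots.txt content, the user agent string, and the function name are invented for the example, and nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content and user agent, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""
USER_AGENT = "example-crawler"


def allowed_urls(urls, robots_txt=ROBOTS_TXT, user_agent=USER_AGENT):
    """Keep only the URLs that the robots.txt rules let `user_agent` fetch."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no download is needed here.
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]
```

A polite crawler would run a check like this before requesting each page, and would send the same user agent string in its HTTP requests so site owners can see who is crawling.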




Sunday, April 3, 2011

More XPath Goodness

Got a RapidMiner crawling/scraping video coming up, but for now, here are some more XPath ideas to play with:

//*
return all element nodes in the document

//*[contains(., 'Search Text')]
return all elements whose text content contains Search Text. The search is case sensitive, and every ancestor of a matching element matches too.

//div[@id='div1']/following-sibling::*
return all following siblings of a specific node; add [1] at the end to get only the next sibling (not sure if this works in RapidMiner)

//div[@id='div1']/..
return the parent node of a specific node (no trailing slash; an expression that ends in / is not valid XPath)

In RapidMiner, prefix every element name with "h:", for example: //h:div[@class='abc']/h:a
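The "h:" prefix is presumably there because the elements live in the XHTML namespace, so an unprefixed name matches nothing. Here is a small Python sketch of that effect with the standard library's ElementTree. The sample markup is made up, and ElementTree only supports a subset of XPath (no contains() function and no following-sibling axis), so only the prefixed path and the parent step are shown.

```python
import xml.etree.ElementTree as ET

# Made-up sample document in the XHTML namespace.
XHTML = """\
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="div1" class="abc"><a href="/one">One</a></div>
    <div id="div2"><a href="/two">Two</a></div>
  </body>
</html>
"""

# Map the "h" prefix to the XHTML namespace.
NS = {"h": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)

# Prefixed form of //div[@class='abc']/a
links = root.findall(".//h:div[@class='abc']/h:a", NS)
hrefs = [a.get("href") for a in links]

# Without the prefix nothing matches, because every element is namespaced.
unprefixed = root.findall(".//div[@class='abc']/a")

# Parent step: the element that contains div1.
parents = root.findall(".//h:div[@id='div1']/..", NS)
parent_tags = [p.tag for p in parents]
```

Here hrefs holds the single link inside the "abc" div, unprefixed is an empty list, and parent_tags shows the body element written with its full namespace in braces.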