Sunday, April 10, 2011
Monday, April 4, 2011
Web Scraping with RapidMiner and XPath
In this video I show how to load 500 HTML files from a previous web crawl, loop through them, use XPath to extract values from each page, and store those values in a data table for later analysis.
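For readers who want to try the same load, loop, extract, and tabulate flow outside RapidMiner, here is a minimal Python sketch. It is not what the video does internally. It assumes the saved pages are well-formed XHTML (real crawled HTML usually needs a more tolerant parser), and the file names, the title and price fields, and the helper names are all hypothetical.

```python
import csv
import io
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Map the "h" prefix to the XHTML namespace (assumed for the sample pages).
NS = {"h": "http://www.w3.org/1999/xhtml"}


def scrape_directory(directory):
    """Return one row (a dict) per .html file in directory, in file-name order."""
    rows = []
    for path in sorted(Path(directory).glob("*.html")):
        root = ET.parse(path).getroot()
        # Hypothetical fields: the page title and a span with class "price".
        title = root.findtext(".//h:title", default="", namespaces=NS)
        price = root.findtext(".//h:span[@class='price']", default="", namespaces=NS)
        rows.append({"file": path.name, "title": title, "price": price})
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text, a simple stand-in for a data table."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["file", "title", "price"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Demo: write two made-up pages to a temporary folder, then scrape them.
with tempfile.TemporaryDirectory() as tmp:
    for i, price in enumerate(["9.99", "19.50"], start=1):
        Path(tmp, f"page{i}.html").write_text(
            '<html xmlns="http://www.w3.org/1999/xhtml">'
            f"<head><title>Item {i}</title></head>"
            f'<body><span class="price">{price}</span></body></html>',
            encoding="utf-8",
        )
    table = scrape_directory(tmp)

print(rows_to_csv(table))
```

Each file becomes one row, so 500 crawled pages would give a 500-row table ready for analysis.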
Part 2: Web Crawling with RapidMiner
Web Crawling with RapidMiner
Here is part 2 of my series of videos on web crawling with RapidMiner. In this video I show how to crawl about 500 pages from a site, and discuss user agents, crawling rules, and robot exclusion files.
Part 2: Web Crawling with RapidMiner
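The three ideas discussed in the video (user agents, crawling rules, and robot exclusion files) can also be tried in a few lines of Python. This is only a sketch of the concepts, not how RapidMiner implements them. The robots.txt content, the crawler name, the example.com URLs, and the follow rule are all made up for illustration, and nothing is fetched from the network.

```python
import re
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-crawler"  # hypothetical user agent string

# Hypothetical crawling rule: only follow .html pages on example.com.
FOLLOW_RULE = re.compile(r"^http://example\.com/.*\.html$")

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

candidates = [
    "http://example.com/index.html",
    "http://example.com/private/page.html",  # blocked by robots.txt
    "http://example.com/logo.png",           # rejected by the crawling rule
]

# Keep a URL only if it matches the rule and robots.txt allows it.
to_crawl = [
    url
    for url in candidates
    if FOLLOW_RULE.match(url) and robots.can_fetch(USER_AGENT, url)
]

# Seconds to wait between requests, if the site asks for a delay.
delay = robots.crawl_delay(USER_AGENT)

# Build (but do not send) a request that identifies the crawler.
request = Request(to_crawl[0], headers={"User-Agent": USER_AGENT})

print(to_crawl, delay)
```

Identifying the crawler with an honest user agent and respecting the exclusion file keeps a 500-page crawl polite to the site being crawled.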
Sunday, April 3, 2011
More X-Path Goodness
Got a RapidMiner crawling/scraping video coming up, but for now, here are some more XPath ideas to play with:
//*
returns all element nodes in the document
//*[contains(., 'Search Text')]
returns every element whose text content contains Search Text. The search is case sensitive, and ancestors of the matching text (such as html and body) match too.
//div[@id='div1']/following-sibling::*
returns all following siblings of a specific node; append [1] to get only the next sibling (not sure if this works in RapidMiner)
//div[@id='div1']/..
returns the parent node of a specific node (no trailing slash)
In RapidMiner, prefix every element name with "h:", for example: //h:div[@class='abc']/h:a
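To experiment with the expressions listed above outside RapidMiner, here is a small Python sketch using the standard-library ElementTree module. ElementTree implements only a subset of XPath, so the parent and "h:" prefixed paths run as real path queries, while contains() and following-sibling are emulated in plain Python. The sample page is made up, and mapping "h" to the XHTML namespace is an assumption chosen to mirror the prefix used in RapidMiner.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, written as well-formed XHTML so the parser accepts it.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="div1" class="abc"><a href="/one">Search Text here</a></div>
    <p>first sibling</p>
    <p>second sibling</p>
  </body>
</html>"""

root = ET.fromstring(PAGE)
NS = {"h": "http://www.w3.org/1999/xhtml"}  # assumed meaning of the "h:" prefix


def local_name(element):
    """Strip the '{namespace}' part that ElementTree puts in front of tag names."""
    return element.tag.split("}")[1]


# //h:div[@class='abc']/h:a
link_targets = [a.get("href") for a in root.findall(".//h:div[@class='abc']/h:a", NS)]

# //h:div[@id='div1']/..  (parent of a specific node)
parents = root.findall(".//h:div[@id='div1']/..", NS)
parent_tags = [local_name(p) for p in parents]

# //*  (all elements, in document order)
all_tags = [local_name(el) for el in root.iter()]

# //*[contains(., 'Search Text')], emulated: compare against each element's full text.
matching = [
    local_name(el) for el in root.iter() if "Search Text" in "".join(el.itertext())
]

# //h:div[@id='div1']/following-sibling::*, emulated via the parent's child list.
div = root.find(".//h:div[@id='div1']", NS)
siblings = list(parents[0])
following = siblings[siblings.index(div) + 1:]
following_text = [el.text for el in following]

print(link_targets, parent_tags, all_tags, matching, following_text)
```

Note that the contains() emulation also reports html and body, because an element's text content includes the text of everything nested inside it.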