Vancouver Data Blog by Neil McGuigan: Web Scraping with RapidMiner and XPath

Monday, April 4, 2011

Web Scraping with RapidMiner and XPath

In this video I show how to load 500 HTML files from a previous web crawl, loop through them, use XPath to grab values from each page, and put those values in a data table for later analysis.
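The same workflow (load saved pages, loop over them, run one query per column, collect a table) can also be sketched outside RapidMiner. What follows is a minimal Python sketch, not the process shown in the video. It assumes the crawled pages are well-formed XHTML (real-world HTML usually needs a tolerant parser first) and uses the standard library's ElementTree, which supports only a subset of XPath. The column names and paths in `QUERIES` and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Bind the "h" prefix to the XHTML namespace, matching the h: prefix
# used in the XPath queries quoted in the comments below.
NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical column -> path pairs; adjust these to the pages being scraped.
QUERIES = {
    "title": ".//h:title",
    "heading": ".//h:h1",
}


def extract_row(xhtml_text, queries=QUERIES):
    """Run every query against one page and return a dict of column -> text."""
    root = ET.fromstring(xhtml_text)
    row = {}
    for column, path in queries.items():
        node = root.find(path, NS)
        # itertext() gathers text from nested tags as well; None marks a miss.
        row[column] = "".join(node.itertext()).strip() if node is not None else None
    return row


def scrape_directory(directory, pattern="*.html"):
    """Loop over the saved pages in a folder and build one row per file."""
    rows = []
    for path in sorted(Path(directory).glob(pattern)):
        row = {"file": path.name}
        row.update(extract_row(path.read_text(encoding="utf-8")))
        rows.append(row)
    return rows
```

Calling `scrape_directory` on the folder that holds the crawl output returns a list of dictionaries, one per page, which can then be written to CSV or loaded into a table for analysis.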

Part 1: Web scraping with Google Spreadsheets and XPath

Part 2: Web Crawling with RapidMiner

Part 3: Web Scraping with RapidMiner and XPath

Part 4: Web Scraping AJAX Pages

6 comments:

ehk116, February 27, 2012 at 9:39 PM
I am having trouble with the Extract Information step. I am getting a JDOMException. Can you provide an example of the XPath query? I think I entered it incorrectly. Just one example with the h: prefix would be great.
Thanks for all your hard work. Much appreciated.
Ryan S, March 9, 2012 at 7:33 PM
Thanks for posting the video, I've learned a lot! I am new to this, so I apologize for my ignorance. Is there a way to extract data from an HTML table? I've successfully crawled a couple hundred HTML pages, but I'm having trouble extracting the information. I've tried several variations of the XPath you've described and I can't quite get it to work. I'm trying to find a way to compile competitive product information from this website (the following is a typical page): jerrysartarama.com/discount-art-supplies/paper/drawing-and-multimedia-paper-and-boards/canson/artist-drawing-paper-pads/illustration-pads.htm

The XPath I am trying is:

//h:*[contains(., 'SIZE')]/../h:td[last()]/text()

When I try the similar XPath "//*[contains(., 'SIZE')]/../tr[12]" in Google Docs, I get the 12th row of information (or whatever row I specify), and it is displayed the way I want. How can I get RapidMiner to give me the information in an Excel format like this? I essentially want the table on the website in Excel so I can sort, compare and modify the data. Again, sorry for my ignorance, I'm fumbling through this for the first time. Any help would be GREATLY appreciated!
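For readers with the same goal, here is a minimal Python sketch of pulling a whole table into rows and writing it as CSV, which Excel can open. It is not a RapidMiner process and not the author's answer. It assumes the page is well-formed XHTML in the XHTML namespace, and it picks the first table containing a marker string such as 'SIZE'. The standard library's ElementTree does not support `contains()`, so that check is done in Python. The function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
NS = {"h": XHTML}
# Fully qualified tag names for header and data cells.
CELL_TAGS = {f"{{{XHTML}}}th", f"{{{XHTML}}}td"}


def cell_text(cell):
    """Return the cell's text, including nested tags, with whitespace collapsed."""
    return " ".join("".join(cell.itertext()).split())


def table_rows(xhtml_text, marker="SIZE"):
    """Return the rows of the first table that has a cell containing `marker`."""
    root = ET.fromstring(xhtml_text)
    for table in root.iterfind(".//h:table", NS):
        rows = [
            [cell_text(c) for c in tr if c.tag in CELL_TAGS]
            for tr in table.iterfind(".//h:tr", NS)
        ]
        if any(marker in text for row in rows for text in row):
            return rows
    return []  # no table matched the marker


def to_csv(rows):
    """Serialize rows as CSV text that a spreadsheet can import."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Saving the output of `to_csv(table_rows(page_text))` to a `.csv` file gives one spreadsheet row per table row, which can then be sorted and compared.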
Raúl Bernardo Rados, May 9, 2012 at 6:39 PM
Hi Neil!
My name is Raúl. I watched your videos, which are pretty cool, and I have a question. But first I have to say it was a big surprise that you're from Vancouver: although I'm from Chile, I lived in Canada for a year, and I made a short trip to Vancouver just for the day when the Canucks played the third game against Boston! Go Canucks go!

Well, I've been doing what I learned from your videos, but I cannot extract just the texts "SANTIAGO" and "08-05-2012" (these values can change, but their position stays the same). The relevant part of the code is:
div...
"
Fecha Publicacion :08-05-2012"

"
Ubicación:
SANTIAGO

"
/div

The website is:

http://www.propiedades.emol.com/propiedad/buscar?region=Metropolitana+de+Santiago&operacion=Arriendo&estado%5B%5D=nueva&estado%5B%5D=usada&propiedad%5B%5D=departamento&dormitorios=1&banos=0&regiondespliegue=15&comuna%5B%5D=Santiago&zona_curr=&moneda=pesos&precioCLPHasta=

I'd appreciate some direction. Regards.
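When the wanted values sit after fixed labels inside one block of text, as in the snippet above, one option is to take the whole text of the div and pick the values out with regular expressions. The sketch below is a hedged Python illustration, not a RapidMiner recipe. It assumes the div's text looks like the lines quoted in the comment, and the function name and output keys are hypothetical.

```python
import re

# The labels are taken from the snippet quoted in the comment above.
DATE_RE = re.compile(r"Fecha Publicacion\s*:\s*(\d{2}-\d{2}-\d{4})")
PLACE_RE = re.compile(r"Ubicación\s*:\s*(\S[^\n]*)")


def parse_listing(text):
    """Pull the publication date and location out of one listing's text."""
    date = DATE_RE.search(text)
    place = PLACE_RE.search(text)
    return {
        "fecha_publicacion": date.group(1) if date else None,
        "ubicacion": place.group(1).strip() if place else None,
    }
```

The input would be the text content of the listing's div, for example `"".join(div.itertext())` when the page has been parsed with ElementTree. Matching on the labels rather than on position keeps the extraction working when the values change.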