
Monday, April 4, 2011

Web Scraping with RapidMiner and XPath

In this video I show how to load 500 HTML files from a previous web crawl, loop through each of them, use XPath to grab values from each page, and put those values in a data table for later analysis.
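
For readers who want to see the same idea outside RapidMiner, here is a minimal Python sketch of the loop described above. It is not the RapidMiner process from the video. It assumes the saved pages are well-formed XHTML (real crawled pages often are not and would need a more forgiving parser), and the function names, file names, sample pages, and column names are all hypothetical. The `h` prefix is bound to the XHTML namespace, which is the same role the `h:` prefix plays in the XPath queries discussed in the comments.

```python
import csv
import io
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Bind the "h" prefix to the XHTML namespace (assumes the pages declare it).
NS = {"h": "http://www.w3.org/1999/xhtml"}


def make_sample_pages(folder, count=3):
    """Write a few tiny, well-formed XHTML pages (hypothetical stand-ins for a crawl)."""
    for i in range(count):
        page = (
            '<html xmlns="http://www.w3.org/1999/xhtml">'
            f"<head><title>Product {i}</title></head>"
            "<body><table>"
            f"<tr><td>SIZE</td><td>{9 + i}x12</td></tr>"
            "</table></body></html>"
        )
        (Path(folder) / f"page{i}.html").write_text(page, encoding="utf-8")


def extract_rows(folder):
    """Loop over every .html file in the folder and pull one row of values per page."""
    rows = []
    for path in sorted(Path(folder).glob("*.html")):
        root = ET.parse(path).getroot()
        # Page title, addressed with the h: prefix.
        title = root.findtext(".//h:title", default="", namespaces=NS)
        # Text of the last cell in the first table row that has cells.
        last_cell = root.findtext(".//h:tr/h:td[last()]", default="", namespaces=NS)
        rows.append(
            {"file": path.name, "title": title.strip(), "last_cell": last_cell.strip()}
        )
    return rows


def to_csv(rows):
    """Render the extracted rows as CSV text, which a spreadsheet can open."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["file", "title", "last_cell"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        make_sample_pages(tmp)
        print(to_csv(extract_rows(tmp)))
```

Note that Python's standard library only supports a subset of XPath (for example, it has no `contains()`), so the queries here are deliberately simple.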




6 comments:

  1. This comment has been removed by a blog administrator.

  2. I am having trouble with the Extract Information operator. I am getting a JDOMException. Can you provide an example of the XPath query? I think I entered it incorrectly. Just one example with the h: prefix would be great.
    Thanks for all your hard work. Much appreciated.

  3. Thanks for posting the video, I've learned a lot! I am new to this, so I apologize for my ignorance... Is there a way to extract data from an HTML table? I've successfully crawled a couple hundred HTML pages, but I'm having trouble extracting the information. I've tried several variations of the XPath you described and I can't quite get it to work. I'm trying to find a way to compile competitive product information from this website (the following is a typical page): jerrysartarama.com/discount-art-supplies/paper/drawing-and-multimedia-paper-and-boards/canson/artist-drawing-paper-pads/illustration-pads.htm

    The XPath I am trying is:

    //h:*[contains(., 'SIZE')]/../h:td[last()]/text()

    When I try the similar XPath "//*[contains(., 'SIZE')]/../tr[12]" in Google Docs, I get the 12th row of information (or whichever row I specify), displayed the way I want. How can I get RapidMiner to give me the information in an Excel-style format like this? I essentially want the table on the website in Excel so I can sort, compare, and modify the data. Again, sorry for my ignorance; I'm fumbling through this for the first time... Any help would be GREATLY appreciated!

  4. Hi Neil!
    My name is Raúl and I watched your videos, which are pretty cool. I have a question, but first I have to say it was a big surprise that you're from Vancouver: although I'm from Chile, I lived in Canada for a year, and I took a short day trip to Vancouver when the Canucks played the third game against Boston! Go Canucks go!

    Well, I've been applying what I learned from your videos, but I can't extract just the texts "SANTIAGO" and "08-05-2012" (these values can change, but their position stays the same). The relevant part of the code is:
    div...
    "
    Fecha Publicacion :08-05-2012"


    "
    Ubicación:
    SANTIAGO


    "
    /div

    The website is:

    http://www.propiedades.emol.com/propiedad/buscar?region=Metropolitana+de+Santiago&operacion=Arriendo&estado%5B%5D=nueva&estado%5B%5D=usada&propiedad%5B%5D=departamento&dormitorios=1&banos=0&regiondespliegue=15&comuna%5B%5D=Santiago&zona_curr=&moneda=pesos&precioCLPHasta=

    I'd appreciate some direction. Regards.

    Replies
    1. Raul, I've inspected this HTML. These are sequential HTML elements, which are not easy to extract with XPath. I recommend reading about it at http://extract-web-data.com/extracting-sequential-html-element/
      Also, Raul, I've scraped it for you using the Scraper GC extension; see here: https://docs.google.com/spreadsheet/pub?key=0AmNIZgbwy5TmdDhlMXZnVFF0YXN1d1BrTEptb0JUZnc&output=html

  5. This comment has been removed by a blog administrator.
