Sunday, April 3, 2011

More X-Path Goodness

Got a RapidMiner crawling/scraping video coming up, but for now, here are some more X-Path ideas to play with:

//*
return all element nodes in the document

//*[contains(., 'Search Text')]
return all elements whose text content contains Search Text. This also matches every ancestor of the element that holds the text. The search is case sensitive.

//div[@id='div1']/following-sibling::*
return all following siblings of a specific node; append [1] to get only the next one (not sure if this works in RapidMiner)

//div[@id='div1']/..
return the parent node of a specific node
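
To try these expressions outside RapidMiner, here is a minimal Python sketch. It assumes a small made-up sample page and uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath: paths are relative (hence the leading "."), and contains() and the following-sibling axis are not available, so those two are emulated in plain Python rather than run as XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, made up for illustration only.
SAMPLE = """<html>
  <body>
    <div id="div1">first <b>Search Text</b></div>
    <div id="div2">second</div>
    <p>third</p>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)

# //*  : every element below the root (ElementTree paths start at the
# element you call them on, so the root itself is not included).
all_nodes = root.findall(".//*")

# //*[contains(., 'Search Text')]  : emulated, because ElementTree has no
# contains(). The XPath string value of an element is all of its
# descendant text joined together, which itertext() provides.
def string_value(el):
    return "".join(el.itertext())

containing = [el for el in root.iter() if "Search Text" in string_value(el)]

# //div[@id='div1']/..  : the parent of a specific node.
parent = root.find(".//div[@id='div1']/..")

# //div[@id='div1']/following-sibling::*  : emulated by slicing the
# parent's children after the target element.
target = root.find(".//div[@id='div1']")
children = list(parent)
following = children[children.index(target) + 1:]

print([el.tag for el in all_nodes])
print([el.tag for el in containing])
print(parent.tag)
print([el.tag for el in following])
```

A full XPath 1.0 engine would accept the four expressions as written above; the emulation here is only a way to see what each one selects.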

In RapidMiner, prefix every element name with "h:", for example: //h:div[@class='abc']/h:a
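
The "h:" prefix is presumably needed because the page is handled as XHTML, where every element lives in the XHTML namespace; that explanation is an assumption, not something confirmed here. The sketch below shows the same effect in Python with a made-up XHTML snippet: the unprefixed path finds nothing, while the prefixed one does once "h" is bound to the namespace URI.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML snippet; the xmlns attribute puts every element
# into the XHTML namespace.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="abc"><a href="/one">One</a><a href="/two">Two</a></div>
    <div class="other"><a href="/three">Three</a></div>
  </body>
</html>"""

# The prefix name "h" is arbitrary; only the URI it maps to matters.
NS = {"h": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)

# Without a prefix, the path looks for elements in no namespace.
unprefixed = root.findall(".//div[@class='abc']/a")

# With the prefix, the path matches the namespaced elements.
prefixed = root.findall(".//h:div[@class='abc']/h:a", NS)
hrefs = [a.get("href") for a in prefixed]

print(len(unprefixed))
print(hrefs)
```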

7 comments:

  1. Hi Neil,
    Just playing with XPath and I've somewhat hit a wall. You (and the W3 tutorial) seem to assume that a piece of information is stored in one single cell, or at least between two tags. But I just don't get how I can extract something from cells that contain more elements. For example, on one real estate site I'm playing around with, lots of data is squeezed between the same tags. How can I pick something out of running text?
    Thanks so much :)

  2. @anonymous, email me the URL you are trying to scrape and the data you want, and I'll take a look.

  3. Hi Neil,
    How do we extract multiple values from the same tag in one HTML document in RapidMiner? For example,
    how would I extract all of the td class="ar" values below?

    <div class="historical-results">
    <div class="left">
    <table class="results-table" cellspacing="0">

    <tbody>
    <tr class="down">
    <td class="date">Tue, Oct 12 2010</td>
    <td class="ar">0.955</td>
    <td class="ar">0.960</td>
    <td class="ar">0.945</td>
    <td class="ar underline">0.950</td>
    <td class="ac"><img src="/images/result-arrow-down.png" alt="" /></td>
    <td class="ar">-0.005</td>
    <td class="ar">-0.52</td>
    <td class="ar">56,443</td>
    </tr>

    Thank you, and I look forward to hearing from you.

    regards,

  4. @anonymous

    //td[@class='ar']

    should work. It should return all of the matching elements, not just one. In RapidMiner, remember the "h:" prefix: //h:td[@class='ar']
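
As a rough check of that expression, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The table is a cleaned-up, assumed reconstruction of the row quoted in the question (the image cell is left out), not the real page. One thing it shows: @class='ar' is an exact comparison, so the cell with class="ar underline" is skipped. The second query, a plain-Python check on the space-separated class list, is one possible way to include it.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the quoted row, for illustration only.
ROW = """<table class="results-table">
  <tbody>
    <tr class="down">
      <td class="date">Tue, Oct 12 2010</td>
      <td class="ar">0.955</td>
      <td class="ar">0.960</td>
      <td class="ar">0.945</td>
      <td class="ar underline">0.950</td>
      <td class="ar">-0.005</td>
      <td class="ar">-0.52</td>
      <td class="ar">56,443</td>
    </tr>
  </tbody>
</table>"""

table = ET.fromstring(ROW)

# //td[@class='ar']  : every cell whose class attribute is exactly "ar".
exact = [td.text for td in table.findall(".//td[@class='ar']")]

# Cells that carry "ar" as one of several class names, checked in Python
# because the class attribute is a space-separated list.
by_token = [
    td.text
    for td in table.iter("td")
    if "ar" in td.get("class", "").split()
]

print(exact)
print(by_token)
```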

  5. Hi, Neil!
    I was searching the web for an answer and found your blog.
    Do you know what XPath expression I can use to find the price on this page:
    "http://www.colombo.com.br/produto/Portateis/Panificadora-Multi-Pane-Britania?utm_content=120799-Panificadora-Multi-Pane-Britania&utm_source=Buscape&utm_campaign=Portateis"

  6. @Vannucci: this should help

    https://docs.google.com/spreadsheet/ccc?key=0AreO9JhY28gcdHRONEt1RVliTlIwR3BYX2pIMmlyOHc

  7. Neil: I've done a web scrape from http://www.afghan-bios.info. This is a Joomla! site containing a database on Afghan personalities. I've stored the files locally; each entry in the database is stored as an .html file. For the last couple of days I've been trying to extract the data from these files (HTML pages) into something useful, e.g. a database with names, ethnicity and so forth for each individual. The plan is to use these for a social graph in Gephi.

    However, I cannot get it to work. I've managed to extract some word lists, but nothing usable. In addition, the reason I scraped the site with another tool was that I couldn't get it to work in RapidMiner. I got zero results when I tried crawling. I checked out your tutorials, but to no avail. I'm sure I'm doing it wrong, but I don't know where.

    Any input / suggestions would be welcome!
