Monday, April 4, 2011

Web Crawling with RapidMiner

Here is part 2 of my series of videos on web crawling with RapidMiner. In this video I show how to crawl about 500 pages from a site, and discuss user agents, crawling rules, and robot exclusion files.




23 comments:

  1. Can you please tell me why I can't crawl other websites (e.g. Yahoo, Google, etc.)? I followed your video and it worked perfectly on the site that you used, but when I do the same thing with other sites it does not work. Can you please help me?
    Thank you so much for the valuable information.

  2. Can you tell me what you wrote in the crawling rules? I can't see them clearly.

  3. I tried some American websites and it works perfectly, but with some European websites it doesn't work...

    Any idea?

    Ex:
    http://www.marca.com
    http://www.bild.de
    http://rtp.pt

  4. 1. if a website uses AJAX, you have to crawl it differently.

    2. these were my crawling rules:

    store_with_matching_url .+suiteid.+
    follow_link_with_matching_url .+pagenum.+|.+suiteid.+

    3. regarding the european sites, they may check the user agent, so you could try changing that to your browser's, or they may check a cookie.

    A later video will cover issues 1 and 3. thanks

    neil
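
The two rules above can also be tried outside RapidMiner. Below is a minimal sketch in Python, assuming the rule patterns are applied as full-string regular expression matches (which would explain why each keyword is wrapped in .+). The example.com URLs and the classify helper are hypothetical and only illustrate which pages each rule would store or follow.

```python
import re

# Patterns copied from the two crawling rules above.
STORE_RULE = re.compile(r".+suiteid.+")
FOLLOW_RULE = re.compile(r".+pagenum.+|.+suiteid.+")


def classify(url):
    """Report whether a URL would be stored and/or followed.

    Assumption: the whole URL must match the pattern (full match).
    """
    return {
        "store": STORE_RULE.fullmatch(url) is not None,
        "follow": FOLLOW_RULE.fullmatch(url) is not None,
    }


# Hypothetical URLs, used only for illustration.
urls = [
    "http://www.example.com/list?pagenum=2",
    "http://www.example.com/item?suiteid=1234",
    "http://www.example.com/about.html",
]

for url in urls:
    print(url, classify(url))
```

Under that assumption, a listing page containing pagenum is followed but not stored, a page containing suiteid is both followed and stored, and anything else is ignored.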

    Replies
    1. Hi Neil,

      Any update on crawling AJAX websites? The best solution I have come up with is manually saving the search pages I am interested in as "completed webpages" using Mozilla/Chrome and going from there. I can display up to 500 search results at once for the particular website I am crawling, but I can see this being a hassle if it's capped at 50 results a page. I think my next step will be to automate this saving of the "post-ajax" site to further streamline the overall process. I'll be sure to post back here if I find anything simple enough.

      Thanks again for the great work on the blog. I was able to get 15 attributes for each of the 2,000 examples in 2 minutes 7 seconds!

      Looking forward to your future posts!

      Yaro

  5. Can you give us some suggestions for common expressions?

    I noticed that with .+key.+ I only get sites which are http://www.meh.com/mehh/key.html or http://www.meh.com/mehh/mehh/key.html but not http://www.meh.com/mehh/key/site/mehhh/index.html

    Any ideas for how to use common expressions on urls? It seems quite tricky for someone new to this.

  6. @anonymous: regular expressions are powerful, but can definitely be tricky. I would recommend checking out this to learn by example:

    http://www.regular-expressions.info/examples.html

    and using this to learn by doing:

    http://burkeware.com/software/regex_playground.html

    good luck
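
As a quick check of the .+key.+ question, here is a small Python sketch, assuming the rule is applied as a full-string match. On that assumption the pattern matches all three example URLs, so if the deeper page is not being stored, the cause may lie elsewhere (for instance the crawl depth or the follow rule) rather than in the expression itself. The narrower pattern .+/key/.+ is only a hypothetical illustration of how to target the directory form.

```python
import re

# The three URLs from the question above.
urls = [
    "http://www.meh.com/mehh/key.html",
    "http://www.meh.com/mehh/mehh/key.html",
    "http://www.meh.com/mehh/key/site/mehhh/index.html",
]

broad = re.compile(r".+key.+")      # pattern from the question
narrow = re.compile(r".+/key/.+")   # hypothetical: "key" as a directory only

# Assumption: the whole URL must match the pattern (full match).
broad_hits = [u for u in urls if broad.fullmatch(u)]
narrow_hits = [u for u in urls if narrow.fullmatch(u)]

print("broad:", broad_hits)
print("narrow:", narrow_hits)
```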

  7. The burkeware link is awesome! Makes it so much easier to get regular expressions going. Thanks so much for sharing!

  8. This is incredibly informative. Thank you so much!

  9. How can I use this to crawl Google? With other websites RapidMiner is so powerful, but it can't handle Google. Could you show me how?

  10. I think Google disallows crawling of their website. You can see that here: http://www.google.com/robots.txt
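
To see how a robot exclusion file is interpreted, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt content, the MyCrawler user agent name, and the example.com URLs are all hypothetical and are not taken from any real site; the allowed helper is likewise only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not copied from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /about
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given user agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "MyCrawler", "http://www.example.com/search/results"))
print(allowed(ROBOTS_TXT, "MyCrawler", "http://www.example.com/about"))
```

A crawler that obeys robot exclusion would skip the first URL and fetch the second.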

  11. To the last commenter: Why would they tell you how to scrape their own website in their own documentation?

    https://docs.google.com/support/bin/answer.py?answer=155184

  12. Neil, this is really useful. Thanks!
    Is there any way to save the page using its 'Title' tag rather than 0, 1, 2, 3, 4...?

  13. Hi Neil,

    Any update on issue #1 (crawling AJAX-based websites)? It seems the website I am interested in crawling (www.uship.com/find) is AJAX-based.

    I've noticed I can save the "post-ajax" webpage using the "web complete" save option in Chrome/Mozilla. Luckily I can save up to 500 search results at a time. This way I think I can get the hyperlinks for each shipment, along with other vital information, by processing about 20 HTML files using RapidMiner. I am still unsure exactly how to process the HTML files (my current thought is to convert them to XML first and use XPath, but I'm not sure yet). I also just checked, and each of the hyperlinks goes to a page where other information is available in the source code, so I think RapidMiner would have no trouble grabbing the info if I fed it all the URLs from the "post-ajax" webpage.

    Any help or suggestions are greatly appreciated, especially on crawling AJAX websites and on whether there is a feature in RapidMiner that can simplify searching the HTML file. I am new to this, so sorry if I've stated something wrong; it's just what I've gathered from reading today.

    Thanks for posting all of this!

  14. i also can't seem to get certain sites to work. for example:
    http://www.immobilienscout24.de/Suche/S-T/P-2/Wohnung-Miete/Hamburg/Hamburg/-/-/-/EURO--1000,00

    It doesn't use cookies and doesn't seem to use AJAX (maybe I am wrong about that?). I set the user agent properly, unchecked "obey robot exclusion" and checked "really ignore exclusion".
    Does anyone know why?

  15. @Thomas, I am not sure exactly, though that page is quite large, and slow to open. Can you try upping the max size and timeout?

    You could also try my AJAX scraper (latest post as of today), even though that page doesn't use AJAX.

  16. This comment has been removed by the author.

  17. I have the XML below, from which I want to pick up the anchor text "VANCOUVER Airport, Canada" instead of the URL. I tried the following in RapidMiner:

    //h:*[contains(.,'Departure airport:')]/../h:td[2]/a and also
    //h:*[contains(.,'Departure airport:')]/../h:td[2]/a/text()
    but no luck... Can you suggest something?
    ----------------------------------------------------
    tr
    td class="caption">Departure airport: /td
    td class="desc">VANCOUVER Airport Canada/td
    /tr
    -----------------------------------------------------
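
One way to test this kind of lookup outside RapidMiner is sketched below in Python with the standard library's ElementTree. It assumes the snippet above is a two-cell table row in the XHTML namespace (the h: prefix) and that the airport name sits as text inside the second cell, possibly wrapped in a link. The SAMPLE markup and the value_for_caption helper are hypothetical reconstructions, not the actual page. If the second cell holds plain text with no a element, an XPath ending in /a would select nothing, which may be why the expressions above return no result.

```python
import xml.etree.ElementTree as ET

# Namespace map for the h: prefix used in the XPath expressions above.
XHTML_NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical markup, reconstructed from the snippet for illustration only.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body><table>
<tr>
<td class="caption">Departure airport: </td>
<td class="desc">VANCOUVER Airport, Canada</td>
</tr>
</table></body></html>"""


def value_for_caption(xhtml, caption):
    """Return the text of the cell that follows the caption cell, or None."""
    root = ET.fromstring(xhtml)
    for row in root.iterfind(".//h:tr", XHTML_NS):
        cells = row.findall("h:td", XHTML_NS)
        # itertext() also collects text nested inside child tags such as <a>.
        if len(cells) >= 2 and caption in "".join(cells[0].itertext()):
            return "".join(cells[1].itertext()).strip()
    return None


print(value_for_caption(SAMPLE, "Departure airport:"))
```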

  18. Hi, I recently completed my UG in IT and am interested in web design and SEO. I am just a newbie to web crawling and RapidMiner. I read and gathered a lot of basic info about this from WebSpiders.biz. They present some good facts about web design and crawling. Check it out...

  19. Hello,
    @ Neil sir,
    I have followed all the steps of your video exactly, but I am still not able to get the list of web pages as output. In the results view I only see the headings: "Row No. Link Document_source".

    Please help me out.

  20. No filename given for result file, using stdout for logging results!
    Process //NewLocalRepository/crawl_web_pages starts
    PM INFO: Loading initial data.
    Saving results.
    Process //NewLocalRepository/crawl_web_pages finished successfully after 11 s

    After rerunning the Crawl Web operator I get the lines above in the log, and the example set is empty (i.e. only column headings). Please help me out.

  21. Excellent video,

    I tried this with RapidMiner V5.3.015 and Web Mining 5.3.1, which is checked.
    But I cannot find the Web Mining extension in the Operators view, nor the Crawl Web operator.
    Any ideas?
