Here is part 2 of my series of videos on web crawling with RapidMiner. In this video I show how to crawl about 500 pages from a site, and discuss user agents, crawling rules, and robot exclusion files.
Nice!
Please can you tell me why I can't crawl other websites, e.g. Yahoo, Google, etc.? I followed your video and it worked perfectly on the site that you used, but if I do the same thing with other sites it does not work. Can you please help me?
Thank you so much for the valuable information.
Can you tell me what you wrote in the crawling rules? I can't see them clearly.
1. If a website uses AJAX, you have to crawl it differently.
2. These were my crawling rules (see the sketch after this comment):
store_with_matching_url .+suiteid.+
follow_link_with_matching_url .+pagenum.+|.+suiteid.+
3. Regarding the European sites, they may check the user agent, so you could try changing that to your browser's, or they may check a cookie.
A later video will cover issues 1 and 3. thanks
neil
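For readers new to these rules, here is a minimal sketch of how the two patterns behave, using Python's re module as a stand-in for RapidMiner's rule matching; the URLs are made up for illustration:
----------------------------------------------------
import re

# The two crawling rules from the comment above
store = re.compile(r".+suiteid.+")                # store_with_matching_url
follow = re.compile(r".+pagenum.+|.+suiteid.+")   # follow_link_with_matching_url

# Hypothetical URLs standing in for the site crawled in the video
urls = [
    "http://example.com/results?pagenum=2",    # followed, not stored
    "http://example.com/detail?suiteid=123",   # followed and stored
    "http://example.com/about",                # ignored
]

for url in urls:
    actions = [name for name, rule in (("follow", follow), ("store", store))
               if rule.match(url)]
    print(url, "->", ", ".join(actions) or "ignored")
----------------------------------------------------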
Hi Neil,
Any update on crawling AJAX websites? The best solution I have come up with is manually saving the search pages I am interested in as "complete webpages" using Mozilla/Chrome and going from there. I can display up to 500 search results at once for the particular website I am crawling, but I can see this being a hassle if it's capped at 50 results a page. I think the next step I will take is to automate this saving process of the "post-AJAX" site to further streamline the overall process. I'll be sure to post back here if I find anything simple enough.
Thanks again for the great work on the blog. I was able to get 15 attributes for each of the 2,000 examples in 2 minutes 7 seconds!
Looking forward to your future posts!
Yaro
Can you give us some suggestions for regular expressions?
I noticed that with .+key.+ I only get sites which are http://www.meh.com/mehh/key.html or http://www.meh.com/mehh/mehh/key.html, but not http://www.meh.com/mehh/key/site/mehhh/index.html.
Any ideas for how to use regular expressions on URLs? It seems quite tricky for someone new to this.
@anonymous: regular expressions are powerful, but can definitely be tricky. I would recommend checking out this site to learn by example:
http://www.regular-expressions.info/examples.html
and using this to learn by doing:
http://burkeware.com/software/regex_playground.html
good luck
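One hedged note on the question above: all three example URLs do in fact match .+key.+, so if the deep page never shows up in the results, the more likely cause is that the crawler never follows a chain of links that reaches it. A quick check in Python:
----------------------------------------------------
import re

# The three example URLs from the question (missing slashes restored)
urls = [
    "http://www.meh.com/mehh/key.html",
    "http://www.meh.com/mehh/mehh/key.html",
    "http://www.meh.com/mehh/key/site/mehhh/index.html",
]

pattern = re.compile(r".+key.+")
for url in urls:
    print(url, "->", bool(pattern.match(url)))   # all three print True
----------------------------------------------------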
The burkeware link is awesome! Makes it so much easier to get regular expressions going. Thanks so much for sharing!
This is incredibly informative. Thank you so much!
How do I use this to crawl Google? With other websites RapidMiner is so powerful, but it can't handle Google. Could you show me how?
I think Google disallows crawling their website. You can see that here: http://www.google.com/robots.txt
To the last commenter: Why would they tell you how to scrape their own website in their own documentation?
https://docs.google.com/support/bin/answer.py?answer=155184
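As a side note, a site's robots.txt can be checked programmatically; here is a small sketch using Python's standard urllib.robotparser:
----------------------------------------------------
from urllib.robotparser import RobotFileParser

# Fetch and parse Google's robots.txt (linked in the comment above)
rp = RobotFileParser("http://www.google.com/robots.txt")
rp.read()

# Ask whether a generic crawler ("*") may fetch the search results page;
# expect False, since Google's robots.txt disallows /search
print(rp.can_fetch("*", "http://www.google.com/search?q=rapidminer"))
----------------------------------------------------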
Neil, this is really useful. Thanks!
Is there any way to save the pages using their 'Title' tag rather than 0, 1, 2, 3, 4...?
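This question went unanswered in the thread. One workaround, sketched here under the assumption that the crawler has already written pages to a folder as 0.html, 1.html, and so on, is to rename the files afterwards based on each page's <title> tag:
----------------------------------------------------
import re
from pathlib import Path

# Hypothetical folder of pages the crawler saved as 0.html, 1.html, ...
for page in Path("crawl_output").glob("*.html"):
    text = page.read_text(errors="ignore")
    match = re.search(r"<title>(.*?)</title>", text, re.IGNORECASE | re.DOTALL)
    if match:
        # Keep only filename-safe characters from the title
        # (note: pages with duplicate titles would collide)
        safe = re.sub(r"[^\w\- ]", "", match.group(1)).strip()
        if safe:
            page.rename(page.with_name(safe + ".html"))
----------------------------------------------------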
Hi Neil,
Any update on issue #1 (crawling AJAX-based websites)? It seems the website I am interested in crawling (www.uship.com/find) is AJAX-based.
I've noticed I can save the "post-AJAX" webpage using Chrome/Mozilla "save webpage, complete". Luckily I can save up to 500 search results at a time. This way I think I can get the hyperlinks for each shipment, along with other vital information, by processing about 20 HTML files using RapidMiner. I am still unsure how exactly to process the HTML files. (My thought process is to convert to XML first and use XPath; not sure yet.) I also just checked, and each of the hyperlinks goes to a page where other information is available in the source code, so I think RapidMiner would have no trouble grabbing the info if I fed it all the URLs from the "post-AJAX" webpage.
Any help or suggestions are greatly appreciated, especially on crawling AJAX websites and whether there is a feature in RapidMiner that can simplify searching the HTML files. I am new to this, so sorry if I've stated something wrong; it's just what I've gathered from reading today.
Thanks for posting all of this!
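A minimal sketch of the processing step Yaro describes, assuming the pages were saved to a local folder; lxml parses the saved HTML directly (no XML conversion step needed), and the shipment-link XPath is a made-up placeholder for uship's real markup:
----------------------------------------------------
from pathlib import Path
from lxml import html

# Hypothetical folder holding the ~20 "post-AJAX" pages saved from the browser
for page in Path("saved_pages").glob("*.html"):
    tree = html.parse(str(page))
    # Made-up selector: anchors whose href looks like a shipment link
    for href in tree.xpath("//a[contains(@href, 'shipment')]/@href"):
        print(href)
----------------------------------------------------
The printed URLs could then be fed back into RapidMiner, e.g. via the Web Mining extension's Get Pages operator, to grab the per-shipment details.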
I also can't seem to get certain sites to work. For example:
http://www.immobilienscout24.de/Suche/S-T/P-2/Wohnung-Miete/Hamburg/Hamburg/-/-/-/EURO--1000,00
It doesn't use cookies and doesn't seem to use AJAX (maybe I am wrong about that?). I set the user agent properly, unchecked "obey robot exclusion", and checked "really ignore exclusion".
Does anyone know why?
@Thomas, I am not sure exactly, though that page is quite large, and slow to open. Can you try upping the max size and timeout?
You could also try my AJAX scraper (latest post as of today), even though it doesn't use AJAX.
I have the XML below, where I want to pick up the anchor text "VANCOUVER Airport, Canada" instead of the URL. I tried using the following in RapidMiner:
//h:*[contains(.,'Departure airport:')]/../h:td[2]/a and also
//h:*[contains(.,'Departure airport:')]/../h:td[2]/a/text()
but no luck... can you suggest something?
----------------------------------------------------
<tr>
  <td class="caption">Departure airport:</td>
  <td class="desc"><a href="...">VANCOUVER Airport, Canada</a></td>
</tr>
----------------------------------------------------
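This one also went unanswered in the thread. As a hedged suggestion, stepping to the sibling cell with following-sibling is an alternative to /../h:td[2]; here is a sketch with lxml against the reconstructed snippet above (outside RapidMiner, so without the h: namespace prefix):
----------------------------------------------------
from lxml import etree

# Reconstructed markup from the comment above (the real href is not shown there)
snippet = """
<table><tr>
  <td class="caption">Departure airport:</td>
  <td class="desc"><a href="#">VANCOUVER Airport, Canada</a></td>
</tr></table>
"""

tree = etree.fromstring(snippet)
# Find the caption cell, step to its sibling cell, take the anchor's text
print(tree.xpath(
    "//td[@class='caption'][contains(., 'Departure airport:')]"
    "/following-sibling::td[1]/a/text()"
))   # ['VANCOUVER Airport, Canada']
----------------------------------------------------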
Hi, I completed my UG in IT last year and am interested in web designing and SEO. I am just a newbie to web crawling and RapidMiner. I read and gathered a lot of basic info about this from WebSpiders.biz; they present good facts about web designing and crawling. Check it out.
Hello,
@Neil sir,
I have followed all the steps of your video exactly, but I am still not able to get the list of webpages as output. In the Results view I only see the headings: "Row No., Link, Document_source".
Please help me out.
No filename given for result file, using stdout for logging results!
Process //NewLocalRepository/crawl_web_pages starts
PM INFO: Loading initial data.
Saving results.
Process //NewLocalRepository/crawl_web_pages finished successfully after 11 s
After running the Crawl Web operator I get the lines above in the log, and the example set is empty (i.e. only column headings). Please help me out.
Excellent video,
I tried with my RapidMiner 5.3.015 and Web Mining 5.3.1, which is checked.
But I cannot find the Web Mining extension in the Operators view, nor the Crawl Web operator.
Any ideas?