Some RapidMiner, some JMP, some Google Docs
Can you please tell me why I can't crawl other websites (e.g. Yahoo, Google, etc.)? I followed your video and it worked perfectly on the site that you used, but when I did the same thing with other sites it does not work. Can you please help me? Thank you so much for the valuable information.
Can you tell me what you wrote in the crawling rules? I can't see them clearly.
I tried some American websites and it works perfectly, but with some European websites it doesn't work... Any idea? Ex: http://www.marca.com, http://www.bild.de, http://rtp.pt
1. If a website uses AJAX, you have to crawl it differently.
2. These were my crawling rules:
store_with_matching_url .+suiteid.+
follow_link_with_matching_url .+pagenum.+|.+suiteid.+
3. Regarding the European sites, they may check the user agent, so you could try changing that to your browser's, or they may check a cookie.
A later video will cover issues 1 and 3.
Thanks,
Neil
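To get a feel for what the two rules in point 2 would select, here is a minimal Python sketch. It assumes the crawler requires each regular expression to match the whole URL (I have not verified that RapidMiner works this way), and the example URLs are made up for illustration.

```python
import re

# Crawling rules quoted in the comment above.
STORE_RULE = r".+suiteid.+"
FOLLOW_RULE = r".+pagenum.+|.+suiteid.+"


def classify(url):
    """Return (store, follow) flags for a URL, assuming whole-URL matching."""
    store = re.fullmatch(STORE_RULE, url) is not None
    follow = re.fullmatch(FOLLOW_RULE, url) is not None
    return store, follow


# Hypothetical URLs, for illustration only.
urls = [
    "http://example.com/list?pagenum=2",
    "http://example.com/item?suiteid=17",
    "http://example.com/about",
]

for url in urls:
    print(url, classify(url))
```

Under that assumption, listing pages containing "pagenum" are followed but not stored, pages containing "suiteid" are both followed and stored, and everything else is ignored.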
Hi Neil, any update on crawling AJAX websites? The best solution I have come up with is just saving the search pages I am interested in manually as "complete webpages" using Mozilla/Chrome and going from there. I can display up to 500 search results at once for the particular website I am crawling, but I can see this being a hassle if it's capped at 50 results a page. I think the next step I will take is to automate this saving of the "post-AJAX" site to further streamline the overall process. I'll be sure to post back here if I find anything simple enough.
Thanks again for the great work on the blog. I was able to get 15 attributes for each of the 2,000 examples in 2 minutes 7 seconds! Looking forward to your future posts!
Yaro
Can you give us some suggestions for common expressions? I noticed that with .+key.+ I only get sites which are http://www.meh.com/mehh/key.html or http://www.meh.com/mehh/mehh/key.html but not http://www.meh.com/mehh/key/site/mehhh/index.html
Any ideas for how to use common expressions on URLs? It seems quite tricky for someone new to this.
@anonymous: Regular expressions are powerful, but can definitely be tricky. I would recommend checking out this to learn by example:
http://www.regular-expressions.info/examples.html
and using this to learn by doing:
http://burkeware.com/software/regex_playground.html
Good luck.
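One quick way to learn by doing is to test a pattern against the URLs in question. The sketch below uses Python's `re` module, not RapidMiner itself, and checks the pattern `.+key.+` against the three example URLs from the question, requiring the whole URL to match.

```python
import re

# Pattern from the question above.
PATTERN = re.compile(r".+key.+")

# The three example URLs from the question.
urls = [
    "http://www.meh.com/mehh/key.html",
    "http://www.meh.com/mehh/mehh/key.html",
    "http://www.meh.com/mehh/key/site/mehhh/index.html",
]


def matches(url):
    """True if the whole URL matches the pattern."""
    return PATTERN.fullmatch(url) is not None


results = {url: matches(url) for url in urls}
for url, ok in results.items():
    print(ok, url)
```

In Python the pattern matches all three URLs. If RapidMiner's regular expressions behave the same way here, the missing page may be excluded for another reason, for example crawl depth or the page not being linked from any followed page. That is only a guess, not something I have confirmed.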
The burkeware link is awesome! Makes it so much easier to get regular expressions going. Thanks so much for sharing!
This is incredibly informative. Thank you so much!
How can I use this to crawl Google? With other websites RapidMiner is so powerful, but it can't handle Google. Could you show me how?
I think Google disallows crawling their website. You can see that here: http://www.google.com/robots.txt
To the last commenter: Why would they tell you how to scrape their own website in their own documentation? https://docs.google.com/support/bin/answer.py?answer=155184
Neil, this is really useful. Thanks! Is there any way to save the page using its 'Title' tag rather than 0, 1, 2, 3, 4...?
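A robots.txt file can also be checked programmatically. The sketch below uses Python's standard `urllib.robotparser` on a small made-up rule set. The rules and the example.com URLs are assumptions for illustration, not a copy of Google's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, NOT a copy of Google's actual file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /about
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(parser.can_fetch("*", "http://www.example.com/search?q=rapidminer"))
print(parser.can_fetch("*", "http://www.example.com/about"))
```

To check a live site instead, `RobotFileParser` also provides `set_url()` and `read()`, which download and parse the file at the given address.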
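I don't know of a RapidMiner setting for this, but one workaround is to rename the saved files afterwards. The sketch below is plain Python, outside RapidMiner, and the helper names are hypothetical. It reads the title out of an HTML string and turns it into a safe file name.

```python
import re
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def title_to_filename(html, fallback="untitled"):
    """Build a file name from the page title, or use the fallback if there is none."""
    parser = TitleParser()
    parser.feed(html)
    # Replace every run of characters other than letters, digits, dots, and hyphens with "_".
    name = re.sub(r"[^A-Za-z0-9.-]+", "_", parser.title.strip()).strip("_")
    return (name or fallback) + ".html"


print(title_to_filename("<html><head><title>My Page: Home</title></head></html>"))
```

Two pages with the same title would get the same name, so a real script would need to add a counter or similar to keep names unique.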
Hi Neil, any update on issue #1 (crawling AJAX-based websites)? It seems the website I am interested in crawling (www.uship.com/find) is AJAX-based. I've noticed I can save the "post-AJAX" webpage using Chrome/Mozilla "web complete". Luckily I can save up to 500 search results at a time. This way I think I can get the hyperlinks for each shipment, along with other vital information, by processing about 20 HTML files using RapidMiner. I am still unsure how exactly to process the HTML files. (My thought is to convert to XML first and use XPath, but I'm not sure yet.) I also just checked, and each of the hyperlinks actually goes to a page where other information is available in the source code, so I think RapidMiner would have no trouble grabbing the info if I fed it all the URLs from the "post-AJAX" webpage.
Any help or suggestions are greatly appreciated, especially on crawling AJAX websites and on whether there is a feature in RapidMiner that can simplify searching the HTML file. I am new to this, so sorry if I've stated something wrong; it's just what I've gathered from reading today.
Thanks for posting all of this!
@Thomas, I am not sure exactly, though that page is quite large and slow to open. Can you try upping the max size and timeout? You could also try my AJAX scraper (latest post as of today), even though it doesn't use AJAX.
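For the step of pulling hyperlinks out of the saved "post-AJAX" pages, a minimal sketch outside RapidMiner could look like this. It uses only Python's standard library, and the sample HTML and link paths are made up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical fragment standing in for a saved search results page.
sample = (
    '<ul><li><a href="/shipments/1">One</a></li>'
    '<li><a href="/shipments/2">Two</a></li></ul>'
)
print(extract_links(sample))
```

For real files, read each saved page into a string first and pass it to `extract_links`. The resulting list of URLs could then be fed to whatever fetches the detail pages.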
I have the XML below, in which I want to pick up the anchor text "VANCOUVER Airport, Canada" instead of the URL. I tried using the following in RapidMiner:
//h:*[contains(.,'Departure airport:')]/../h:td/a
and also
//h:*[contains(.,'Departure airport:')]/../h:td/a/text()
but no luck... Can you suggest something?
----------------------------------------------------
<tr><td class="caption">Departure airport: </td><td class="desc">VANCOUVER Airport Canada</td></tr>
-----------------------------------------------------
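One thing that may be worth trying (untested in RapidMiner) is prefixing the anchor as well, since the other elements in the expression use the h: prefix, for example //h:td[contains(.,'Departure airport:')]/following-sibling::h:td/h:a/text(). To show the idea of "find the caption cell, then read the text of the cell next to it", here is a minimal Python sketch. The `<a href="#">` element and the surrounding `<table>` are assumptions, because the anchor did not survive in the pasted snippet.

```python
import xml.etree.ElementTree as ET

# Sample row based on the snippet above; the <a> element and <table> wrapper are assumed.
SAMPLE = """
<table>
  <tr>
    <td class="caption">Departure airport: </td>
    <td class="desc"><a href="#">VANCOUVER Airport Canada</a></td>
  </tr>
</table>
"""


def departure_airport(xml_text):
    """Return the text of the 'desc' cell in the row captioned 'Departure airport:'."""
    root = ET.fromstring(xml_text)
    for row in root.iter("tr"):
        caption = row.find("td[@class='caption']")
        desc = row.find("td[@class='desc']")
        if caption is None or desc is None:
            continue
        if (caption.text or "").strip().startswith("Departure airport:"):
            # itertext() gathers text from nested elements such as <a>.
            return "".join(desc.itertext()).strip()
    return None


print(departure_airport(SAMPLE))
```

Because `itertext()` collects text from nested elements, this returns the visible text whether or not the cell wraps it in an anchor.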