Monday, April 4, 2011

Web Crawling with RapidMiner

Here is part 2 of my series of videos on web crawling with RapidMiner. In this video I show how to crawl about 500 pages from a site, and discuss user agents, crawling rules, and robot exclusion files.

22 comments:

  1. Please can you tell me why I can't crawl other websites, e.g. Yahoo, Google, etc.? I followed your video and it worked perfectly on the site that you used, but when I do the same thing with other sites it does not work. Can you please help me?
    Thank you so much for the valuable information.

  2. Can you tell me what you wrote in the crawling rules? I can't see it clearly.

  3. 1. if a website uses AJAX, you have to crawl it differently.

    2. these were my crawling rules:

    store_with_matching_url .+suiteid.+
    follow_link_with_matching_url .+pagenum.+|.+suiteid.+

    3. regarding the European sites, they may check the user agent, so you could try changing that to your browser's, or they may check a cookie.

    A later video will cover issues 1 and 3. thanks

    neil
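
    For anyone who can't make out those rules in the video, here is a quick sketch of how the two patterns behave. RapidMiner itself applies Java regular expressions against the full URL; Python's `re.fullmatch` mirrors that behavior for these patterns, and the example URLs below are made up for illustration:

```python
import re

# The two crawling rules from the comment above. RapidMiner matches each
# rule against the entire URL (Java's String.matches semantics), which
# re.fullmatch reproduces.
store_rule = re.compile(r".+suiteid.+")
follow_rule = re.compile(r".+pagenum.+|.+suiteid.+")

# Hypothetical URLs, just to show which rule fires on which shape.
urls = [
    "http://example.com/list?pagenum=2",      # followed, not stored
    "http://example.com/detail?suiteid=123",  # followed and stored
    "http://example.com/about",               # ignored
]

for url in urls:
    store = bool(store_rule.fullmatch(url))
    follow = bool(follow_rule.fullmatch(url))
    print(url, "store:", store, "follow:", follow)
```

    So listing pages (pagenum) get followed for links, while detail pages (suiteid) get both followed and stored.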

    1. Hi Neil,

      Any update on crawling AJAX websites? The best solution I have come up with is just manually saving the search pages I am interested in as "completed webpages" using Mozilla/Chrome and going from there. I can display up to 500 search results at once for the particular website I am crawling, but I can see this being a hassle if it's capped at 50 results a page. I think the next step I will take is to automate this saving of the "post-ajax" site to further streamline the overall process. I'll be sure to post back here if I find anything simple enough.

      Thanks again for the great work on the blog. I was able to get 15 attributes for each of the 2,000 examples in 2 minutes 7 seconds!

      Looking forward to your future posts!

      Yaro

  4. Can you give us some suggestions for regular expressions?

    I noticed that with .+key.+ I only get sites which are http://www.meh.com/mehh/key.html or http://www.meh.com/mehh/mehh/key.html but not http://www.meh.com/mehh/key/site/mehhh/index.html

    Any ideas for how to use regular expressions on URLs? It seems quite tricky for someone new to this.

  5. @anonymous: regular expressions are powerful, but can definitely be tricky. I would recommend checking out this to learn by example:

    http://www.regular-expressions.info/examples.html

    and using this to learn by doing:

    http://burkeware.com/software/regex_playground.html

    good luck
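
    As a side note, `.+key.+` should actually accept all three of the URL shapes in the question, since the pattern only requires "key" to appear somewhere with characters on both sides. A quick Python sketch (using the made-up meh.com URLs from the question, with the `http:/` typo corrected; `re.fullmatch` mirrors how RapidMiner matches the whole URL):

```python
import re

pattern = re.compile(r".+key.+")

# The three URL shapes from the question above (hypothetical examples).
urls = [
    "http://www.meh.com/mehh/key.html",
    "http://www.meh.com/mehh/mehh/key.html",
    "http://www.meh.com/mehh/key/site/mehhh/index.html",
]

for url in urls:
    # fullmatch approximates Java's String.matches(), which RapidMiner uses
    print(url, "->", bool(pattern.fullmatch(url)))
```

    All three match, so if the deeper page never shows up, the cause is more likely crawl depth or the page never being linked from a crawled page, rather than the pattern itself.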

  6. The burkeware link is awesome! Makes it so much easier to get regular expressions going. Thanks so much for sharing!

  7. This is incredibly informative. Thank you so much!

  8. How do I use this to crawl Google? With other websites RapidMiner is so powerful, but it can't handle Google. Could you show me how?

  9. I think Google disallows crawling their website. You can see that here: http://www.google.com/robots.txt
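
    You can check what a robots.txt file permits with Python's standard-library robot parser. A minimal sketch, using a made-up snippet in the style of Google's file rather than the real thing (so no network access is needed):

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt fragment for illustration. Note that Python's
# parser applies rules in order, first match wins, so the Allow line
# is listed before the broader Disallow.
robots_txt = """\
User-agent: *
Allow: /search/about
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse() takes the file's lines

print(rp.can_fetch("*", "http://www.google.com/search?q=rapidminer"))  # False
print(rp.can_fetch("*", "http://www.google.com/search/about"))         # True
```

    A polite crawler (and RapidMiner, when "obey robot exclusion" is checked) makes this kind of check before fetching each URL.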

  10. To the last commenter: Why would they tell you how to scrape their own website in their own documentation?

    https://docs.google.com/support/bin/answer.py?answer=155184

  11. Neil, this is really useful. Thanks!
    Is there any way to save the page using its 'Title' tag rather than 0, 1, 2, 3, 4...?

  12. Hi Neil,

    Any update on issue #1 (crawling AJAX-based websites)? It seems the website I am interested in crawling (www.uship.com/find) is AJAX-based.

    I've noticed I can save the "post-ajax" webpage using Chrome/Mozilla "web complete". Luckily I can save up to 500 search results at a time. This way I think I can get the hyperlinks for each shipment, along with other vital information, by processing about 20 HTML files using RapidMiner. I am still unsure exactly how to process the HTML files. (My thought process is to convert to XML first and use XPath, but I'm not sure yet.) I also just checked, and each of the hyperlinks goes to a page where other information is available in the source code, so I think RapidMiner would have no trouble grabbing the info if I fed it all the URLs from the "post-ajax" webpage.

    Any help or suggestions are greatly appreciated, especially on crawling AJAX websites and whether or not there is a feature in RapidMiner that can simplify searching the HTML file. I am new to this, so sorry if I've stated something wrong; it's just what I've gathered from reading today.

    Thanks for posting all of this!

  13. I also can't seem to get certain sites to work. For example:
    http://www.immobilienscout24.de/Suche/S-T/P-2/Wohnung-Miete/Hamburg/Hamburg/-/-/-/EURO--1000,00

    It doesn't use cookies and doesn't seem to use AJAX (maybe I am wrong about that?). I set the user agent properly, I uncheck "obey robot exclusion", and I check "really ignore exclusion".
    Does anyone know why?

  14. @Thomas, I am not sure exactly, though that page is quite large, and slow to open. Can you try upping the max size and timeout?

    You could also try my ajax scraper (latest post as of today), even though it doesn't use ajax.

  15. This comment has been removed by the author.

  16. I have the XML below, from which I want to pick up the anchor text "VANCOUVER Airport, Canada" instead of the URL. I tried using the following in RapidMiner:

    //h:*[contains(.,'Departure airport:')]/../h:td[2]/a and also
    //h:*[contains(.,'Departure airport:')]/../h:td[2]/a/text()
    but no luck... can you suggest something?
    ----------------------------------------------------
    <tr>
    <td class="caption">Departure airport:</td>
    <td class="desc"><a href="...">VANCOUVER Airport Canada</a></td>
    </tr>
    -----------------------------------------------------
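
    One thing worth checking in that XPath: every other element step carries the h: (XHTML) prefix, but the final a step does not, so in a namespaced document it may match nothing; //h:*[contains(.,'Departure airport:')]/../h:td[2]/h:a/text() is more likely to work. The idea itself, selecting the anchor and reading its text instead of its href, can be sketched with Python's standard library (the href below is a placeholder, since the original comment does not show it, and the namespace is omitted to keep the sketch simple):

```python
import xml.etree.ElementTree as ET

# Reconstructed from the snippet above; the href value is a placeholder.
snippet = """
<tr>
  <td class="caption">Departure airport:</td>
  <td class="desc"><a href="...">VANCOUVER Airport Canada</a></td>
</tr>
"""

row = ET.fromstring(snippet)
# ElementTree's limited XPath has no text() function: select the <a>
# element itself, then read its .text attribute.
anchor = row.find('.//td[@class="desc"]/a')
print(anchor.text)  # VANCOUVER Airport Canada
```

    In RapidMiner's full XPath 1.0 support, the /text() step plays the same role as reading .text here.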

  17. Hi, I completed my UG in IT last year and am interested in web design and SEO. I am just a newbie to web crawling and RapidMiner. I read and gathered a lot of basic info about this from WebSpiders.biz. They present good facts about web design and crawling. Check it out.

  18. Hello,
    @ Neil sir,
    I have followed all the steps of your video exactly, but I am still not able to get the list of webpages as output. In the results view I only see the headings: "Row No., Link, Document_source".

    Please help me out.

  19. No filename given for result file, using stdout for logging results!
    Process //NewLocalRepository/crawl_web_pages starts
    PM INFO: Loading initial data.
    Saving results.
    Process //NewLocalRepository/crawl_web_pages finished successfully after 11 s

    After running the Crawl Web operator I get the above lines in the log, and the example set is null (i.e., only column headings). Please help me out.

  20. Excellent video,

    I tried with my RapidMiner v5.3.015 and Web Mining 5.3.1, which is checked.
    But I cannot find the Web Mining extension in the Operators view, or the Crawl Web operator, to use it.
    Any ideas?
