Comments on Vancouver Data Blog by Neil McGuigan: Web Crawling with RapidMiner

Excellent video, I tried with my RapidMiner V5.3....

2013-12-18T13:00:50.983-08:00

Excellent video,

I tried this with RapidMiner 5.3.015 and the Web Mining extension 5.3.1, which is checked.
But I cannot find the Web Mining extension in the Operators view, nor the Crawl Web operator.
Any ideas?

No filename given for result file, using stdout fo...

2013-06-20T06:34:39.681-07:00

No filename given for result file, using stdout for logging results!
Process //NewLocalRepository/crawl_web_pages starts
PM INFO: Loading initial data.
Saving results.
Process //NewLocalRepository/crawl_web_pages finished successfully after 11 s

After rerunning the Crawl Web operator, I am getting the lines above in the log, and the example set is null (i.e. only column

Hello, @ Neil sir, I have followed all the steps ...

2013-06-19T22:56:46.168-07:00

Hello,
@ Neil sir,
I have followed all the steps of your video exactly, but I am still not able to get the list of web pages as output. In the result mode I only see the headings: "Row No. Link Document_source".

please help me out.

Hi, i completed my UG in IT last and interested in...

2013-03-24T23:30:39.018-07:00

Hi, I recently completed my UG in IT and am interested in web design and SEO. I am a newbie to web crawling and RapidMiner. I read and gathered a lot of basic info about this from WebSpiders.biz. They present good facts about web design and crawling. Check it out.....

I have this below XML wherein i wanted to pick up ...

2012-03-17T13:21:26.507-07:00

I have the XML below, in which I want to pick up the anchor text "VANCOUVER Airport, Canada" instead of the URL. I tried the following in RapidMiner:

//h:*[contains(.,'Departure airport:')]/../h:td[2]/a and also
//h:*[contains(.,'Departure airport:')]/../h:td[2]/a/text()
but no luck... Can you suggest something?
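
One thing that stands out in the two expressions above: every step carries the `h:` prefix except the final `a`. In XPath 1.0 an unprefixed name only matches elements in no namespace, so if the page is XHTML the `a` step would select nothing. The sketch below illustrates that effect with Python's standard library rather than RapidMiner. The XML fragment, the URL in it, and the helper name `departure_airport` are all made up for illustration, since the commenter's actual XML is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment (an assumption; the original XML is not shown).
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml"><body><table>
<tr><td>Departure airport:</td>
<td><a href="http://example.com/yvr">VANCOUVER Airport, Canada</a></td></tr>
</table></body></html>"""

# Bind the "h" prefix to the XHTML namespace.
NS = {"h": "http://www.w3.org/1999/xhtml"}
root = ET.fromstring(XHTML)

# An unprefixed "a" looks for elements in no namespace and finds nothing.
unprefixed = root.findall(".//a")
# A prefixed "h:a" finds the link.
prefixed = root.findall(".//h:a", NS)


def departure_airport(tree):
    """Return the link text in the cell next to the 'Departure airport:' label."""
    for row in tree.iterfind(".//h:tr", NS):
        cells = row.findall("h:td", NS)
        if len(cells) >= 2 and "Departure airport:" in "".join(cells[0].itertext()):
            link = cells[1].find("h:a", NS)
            if link is not None:
                return link.text
    return None


print(len(unprefixed), len(prefixed))
print(departure_airport(root))
```

If the same cause applies in RapidMiner, ending the expression in `h:a/text()` instead of `a/text()` might be worth trying.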

2012-03-17T13:10:38.654-07:00

This comment has been removed by the author.

@Thomas, I am not sure exactly, though that page i...

2012-02-21T09:58:37.102-08:00

@Thomas, I am not sure exactly, though that page is quite large, and slow to open. Can you try upping the max size and timeout?

You could also try my ajax scraper (latest post as of today), even though it doesn't use ajax.

i also can't seem to get certain sites to work...

2012-02-19T16:30:08.938-08:00

I also can't seem to get certain sites to work. For example:
http://www.immobilienscout24.de/Suche/S-T/P-2/Wohnung-Miete/Hamburg/Hamburg/-/-/-/EURO--1000,00

It doesn't use cookies and doesn't seem to use AJAX (maybe I am wrong about that?). I set the user agent properly, unchecked "obey robot exclusion", and checked "really ignore exclusion".
anyone knows

Hi Neil, Any update on crawling AJAX websites? B...

2012-02-07T14:37:31.260-08:00

Hi Neil,

Any update on crawling AJAX websites? The best solution I have come up with is manually saving the search pages I am interested in as "completed webpages" using Mozilla/Chrome and going from there. I can display up to 500 search results at once for the particular website I am crawling, but I can see this being a hassle if it's capped at 50 results a page. I think the

Hi Neil, Any update on issue #1(crawling ajax bas...

2012-02-06T17:09:54.545-08:00

Hi Neil,

Any update on issue #1 (crawling AJAX-based websites)? It seems the website I am interested in crawling (www.uship.com/find) is AJAX-based.

I've noticed I can save the "post-ajax" webpage using chrome/mozilla "web complete". Luckily I can save up to 500 search results at a time. This way I think I can get the hyperlinks for each shipment

Neil, this is really useful. Thanks! Is there anyw...

2011-09-28T02:55:14.519-07:00

Neil, this is really useful. Thanks!
Is there any way to save the page using its 'Title' tag rather than 0,1,2,3,4....

To the last commenter: Why would they tell you how...

2011-09-25T07:11:54.097-07:00

To the last commenter: Why would they tell you how to scrape their own website in their own documentation?

https://docs.google.com/support/bin/answer.py?answer=155184

I think Google disallow crawling their website. yo...

2011-09-07T15:40:40.587-07:00

I think Google disallows crawling of its website. You can see that here: http://www.google.com/robots.txt

how to use this to crawl google??? with other webs...

2011-06-27T23:02:35.490-07:00

How can I use this to crawl Google? RapidMiner is so powerful with other websites, but it can't handle Google. Could you show me how?

This is incredibly informative. Thank you so much!...

2011-06-17T21:28:38.987-07:00

This is incredibly informative. Thank you so much!

The burkeware link is awesome! Makes it so much ea...

2011-04-27T17:09:13.603-07:00

The burkeware link is awesome! Makes it so much easier to get regular expressions going. Thanks so much for sharing!

@anonymous: regular expressions are powerful, but ...

2011-04-26T09:46:19.034-07:00

@anonymous: regular expressions are powerful, but can definitely be tricky. I would recommend checking out this to learn by example:

http://www.regular-expressions.info/examples.html

and using this to learn by doing:

http://burkeware.com/software/regex_playground.html

good luck

Can you give us some suggestions for common expres...

2011-04-26T04:52:42.865-07:00

Can you give us some suggestions for regular expressions?

I noticed that with .+key.+ I only get pages such as http://www.meh.com/mehh/key.html or http://www.meh.com/mehh/mehh/key.html but not http://www.meh.com/mehh/key/site/mehhh/index.html

Any ideas on how to use regular expressions on URLs? It seems quite tricky for someone new to this.
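
A quick way to check whether a pattern is really the culprit is to test it outside the crawler. The sketch below uses Python's `re` module and assumes the rule has to match the whole URL (an assumption about how the crawler applies it). Under that assumption `.+key.+` matches all three example URLs, which suggests the missing page is more likely down to the follow rules or crawl depth than to the pattern. The second pattern, `.+/key/.+`, is a hypothetical tightening that only accepts `key` as a path segment.

```python
import re

# The pattern from the question, tested as a whole-URL match.
PATTERN = re.compile(r".+key.+")
# Hypothetical stricter variant: "key" must be a full path segment.
SEGMENT = re.compile(r".+/key/.+")

urls = [
    "http://www.meh.com/mehh/key.html",
    "http://www.meh.com/mehh/mehh/key.html",
    "http://www.meh.com/mehh/key/site/mehhh/index.html",
]

# Map each URL to whether the original pattern matches it completely.
matches = {url: PATTERN.fullmatch(url) is not None for url in urls}
# Keep only the URLs where "key" is a directory name.
segment_matches = [url for url in urls if SEGMENT.fullmatch(url)]

for url, ok in matches.items():
    print(ok, url)
print(segment_matches)
```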

1. if a website uses AJAX, you have to crawl it di...

2011-04-11T17:21:08.569-07:00

1. If a website uses AJAX, you have to crawl it differently.

2. These were my crawling rules:

store_with_matching_url .+suiteid.+
follow_link_with_matching_url .+pagenum.+|.+suiteid.+

3. Regarding the European sites, they may check the user agent, so you could try changing that to your browser's, or they may check a cookie.

A later video will
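
To see what the two rules in point 2 do, here is a minimal sketch in Python (not RapidMiner). It assumes each rule must match the entire URL, and the three URLs and the `classify` helper are hypothetical, chosen only to resemble a paginated listing, a detail page, and an unrelated page.

```python
import re

# The two rules quoted above, compiled as whole-URL patterns (assumed semantics).
STORE_RULE = re.compile(r".+suiteid.+")
FOLLOW_RULE = re.compile(r".+pagenum.+|.+suiteid.+")


def classify(url):
    """Report whether a URL would be stored and/or followed under the rules."""
    return {
        "store": STORE_RULE.fullmatch(url) is not None,
        "follow": FOLLOW_RULE.fullmatch(url) is not None,
    }


# Hypothetical URLs for illustration only.
listing = "http://example.com/search?pagenum=2"
detail = "http://example.com/listing?suiteid=123"
other = "http://example.com/about"

for url in (listing, detail, other):
    print(url, classify(url))
```

Read this way, listing pages are followed but not stored, detail pages are both followed and stored, and everything else is ignored.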

Can you tell me what you wrote in the crawling rul...

2011-04-11T14:16:27.578-07:00

Can you tell me what you wrote in the crawling rules? I can't see them clearly.

Please can you tell me why I can't crawl other...

2011-04-05T23:28:05.456-07:00

Can you please tell me why I can't crawl other websites (e.g. Yahoo, Google, etc.)? I followed your video and it worked perfectly on the site that you used, but when I did the same thing with other sites it did not work. Can you please help me?
Thank you so much for the valuable information.

Nice!

2011-04-05T08:49:21.019-07:00

Nice!