Vancouver Data Blog by Neil McGuigan
Some RapidMiner, some JMP, some Google Docs
Author: Neil McGuigan (http://www.blogger.com/profile/14122981831780837323)
Last updated 2024-03-17

Most of my blogging is on databasepatterns.com now
Published 2016-08-05

Go to http://blog.databasepatterns.com

JMP 11 statistics sneak peek
Published 2013-07-30

The JMP 11 sneak peek just came out today.

Text Mining Performance in RapidMiner
Published 2013-05-26

I ran a load test with RapidMiner 5.3 on my laptop (Core i3, 8GB RAM, non-SSD hard drive). Here are the results.
I set up Java to use 6500 MB of memory (max).
I used the Read Database operator to get the documents, which were made of random Latin words and were 20 to 500 words long.
The text processing was purposefully simple: tokenize the document and get the binary word vector.
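As a rough illustration of that processing step outside RapidMiner, here is a minimal Python sketch. It assumes tokens are split on non-letter characters and lowercased; the function names `tokenize` and `binary_word_vectors` are made up for this example and are not RapidMiner operators.

```python
import re


def tokenize(text):
    """Split text into lowercase tokens made of letters only (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


def binary_word_vectors(documents):
    """Return (vocabulary, vectors).

    Each vector holds 1 where the vocabulary word occurs in the document
    and 0 otherwise; repeated words still count as 1.
    """
    token_sets = [set(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*token_sets))
    vectors = [
        [1 if word in tokens else 0 for word in vocabulary]
        for tokens in token_sets
    ]
    return vocabulary, vectors


# Two tiny stand-in documents (not the data from the load test).
docs = ["Lorem ipsum dolor", "ipsum ipsum sit amet"]
vocab, vectors = binary_word_vectors(docs)
```

A binary vector only records presence, so the repeated "ipsum" in the second document has no extra effect.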
I then stored the

AWS Redshift: How Amazon Changed The Game
Published 2013-05-16

A good blog post on Amazon Redshift, their Postgres-based massive data warehouse, with some good analysis of performance and costs:
http://blog.aggregateknowledge.com/2013/05/16/aws-redshift-how-amazon-changed-the-game/

Vancouver Training: Introduction to Data Mining and Predictive Analytics with RapidMiner - Save $500
Published 2013-04-18

I'll be teaching a RapidMiner course here in Vancouver next week:
Tuesday, April 23, 2013 at 8:30 AM - Wednesday, April 24, 2013 at 5:00 PM (PDT)
Details here: http://rapid-i_us_20130423-eorg.eventbrite.com/
Save $500 with the coupon VAN_BLOG!

Google's Data Mining Research Papers
Published 2013-02-12

In case you missed it, here are Google's 104 data mining research papers:
http://research.google.com/pubs/DataMining.html

The Google F1 slides
Published 2012-12-20

Google F1 is a relational database query engine that works on top of Google Spanner, which is a distributed storage system that sits on top of Google File System. Got it? :)
Basically, it's a really big, distributed relational database, and Google is using F1 to replace MySQL for AdWords.
http://www.stanford.edu/class/cs347/slides/f1.pdf

Chomsky on Where AI Went Wrong
Published 2012-11-01

If one were to rank a list of civilization's greatest and most elusive intellectual challenges, the problem of "decoding" ourselves -- understanding the inner workings of our minds and our brains, and how the architecture of these elements is encoded in our genome -- would surely be at the top. Yet the diverse fields that took on this challenge, from philosophy and psychology to computer science

The father of fractals
Published 2012-11-01

A nice little piece on Mandelbrot in the Economist:
http://www.economist.com/node/2246127

As I predicted, Self-driving cars a reality for 'ordinary people' within 5 years, says Google's Sergey Brin
Published 2012-09-26

Link here:
http://www.computerworld.com/s/article/9231707/Self_driving_cars_a_reality_for_39_ordinary_people_39_within_5_years_says_Google_39_s_Sergey_Brin

The Google Spanner Paper
Published 2012-09-26

Google Spanner is a massively distributed database. It relies on GPS receivers and atomic clocks in its datacenters to keep time, though...
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/spanner-osdi2012.pdf

The Google Dremel Paper
Published 2012-09-07 (updated 2012-10-27)

Here is the paper describing Google Dremel, which may replace Hive one day. There does not seem to be anyone working on an open-source version, though.
Link (PDF)
Update: Apache Drill is the open source version of Dremel (hat tip to Zoltan).
Also, Cloudera's Impala looks similar.

Self-driving cars: The next revolution
Published 2012-09-07

Here is a recent report from KPMG about self-driving cars:
http://www.kpmg.com/US/en/IssuesAndInsights/ArticlesPublications/Documents/self-driving-cars-next-revolution.pdf

Google’s Self-Driving Cars Are Going to Change Everything
Published 2012-08-07 (updated 2012-10-24)

Recent News:
Google’s Self-Driving Cars Complete 300K Miles Without Accident, Deemed Ready For Commuting
http://techcrunch.com/2012/08/07/google-cars-300000-miles-without-accident/
Here's what is going to happen in the next 5-10 years. It won't all happen right away.
The car insurance industry will cease to exist. These cars aren't going to crash. Even if there are hold-outs that drive

Less Painful AJAX / Javascript Web Scraping
Published 2012-02-11 (updated 2012-02-12)

If you read my previous post, you'll see that scraping Ajax pages can be a pain. So I wrote a little Java program to make it easier. It takes a list of URLs to scrape, renders each one in a browser, and saves the rendered HTML (normal and Ajax) and screenshots to a folder.
Here's the how-to video:
You need Firefox 3+ installed, as well as Java 1.6. This is a beta project, and no warranty

Web Scraping AJAX Pages
Published 2012-02-09 (updated 2012-06-11)

This is part four of a series of video tutorials on web scraping and web crawling.
You can probably skip this one, and go to the easy version.
Part 1: Web scraping with Google Spreadsheets and XPath
Part 2: Web Crawling with RapidMiner
Part 3: Web Scraping with RapidMiner and XPath
This post explains how to capture HTML from Ajax / Javascript generated pages.
Here is the

On Making Videos
Published 2012-01-29

Here is what I use to make my videos:
1. CamStudio. This is a nice free and open-source desktop video capture program. Make sure to use their Lossless Codec, and go with these settings:
Set Keyframes Every 30 frames
Capture Frames Every = 50 milliseconds
Playback Rate = 20 frames per second
Video codec: CamStudio Lossless Codec
Quality: 70%
2. HandBrake Video Transcoder. This will help

Happy New Year
Published 2011-12-31

75,000 pageviews this year! Thanks to everyone for visiting. I will post some new material in the new year.
Have a safe and fun 2012
Neil

My new blog about learning ExtJS
Published 2011-11-04 (updated 2012-09-18)

I have a new blog. It's about learning to use ExtJS, a great rich internet application library in JavaScript. Here it is:
http://extjs-tutorials.blogspot.com/
Check it out. Thanks
Don't worry, I'll keep posting here too.

How Obama's data-crunching prowess may get him re-elected
Published 2011-10-09

An article on CNN about how the Obama 2012 campaign has hired many data miners and statisticians to help boost fundraising and support.
http://www.cnn.com/2011/10/09/tech/innovation/obama-data-crunching-election/index.html?hpt=hp_c1

Text Analytics with RapidMiner Part 6 of 6 - Applying the Model to New Documents
Published 2011-10-08 (updated 2012-11-06)

After my last series, I got a lot of questions about how to apply a model to new data, so here is the real final installment in the series.
I show how to save a wordlist and model to the repository. I use them later to read the wordlist and model and apply them to new documents that RapidMiner hasn't seen before. It correctly labels 11 of the 12 documents.
Files from the video.

September sunset
Published 2011-09-02

RapidMiner ETL - Transforming Attributes with Functions
Published 2011-08-27 (updated 2011-08-28)

In this video I show how to transform features in RapidMiner using operators such as log, sqrt, absolute value, and multiplying columns.
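For readers without RapidMiner at hand, the same kinds of transformations can be sketched in plain Python. This is only an illustration: the column names `a` and `b`, the sample rows, and the `transform` helper are invented here and do not come from the video.

```python
import math

# Hypothetical example rows; "a" must be positive for log and sqrt.
rows = [
    {"a": 4.0, "b": -2.0},
    {"a": 100.0, "b": 3.0},
]


def transform(row):
    """Return a copy of the row with derived attributes added."""
    return {
        **row,
        "log_a": math.log(row["a"]),       # natural logarithm
        "sqrt_a": math.sqrt(row["a"]),     # square root
        "abs_b": abs(row["b"]),            # absolute value
        "a_times_b": row["a"] * row["b"],  # product of two columns
    }


transformed = [transform(row) for row in rows]
```

Each derived value is added as a new attribute, so the original columns stay available for comparison.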

RapidMiner ETL - Normalizing, Discretizing, Recoding
Published 2011-08-27 (updated 2011-08-28)

In this video I show how to normalize an attribute, including z-normalization, how to discretize a column, and how to recode values.
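A small Python sketch of the three ideas, for illustration only. It assumes z-normalization with the population standard deviation and equal-width bins for discretizing; RapidMiner's operators may use different defaults. The helper names and sample values are hypothetical.

```python
import statistics


def z_normalize(values):
    """Rescale values to mean 0 and standard deviation 1 (population std dev assumed)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / stdev for v in values]


def discretize(values, bins):
    """Map each value to an equal-width bin index from 0 to bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin `bins`, so clamp it to the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


def recode(values, mapping):
    """Replace values found in mapping; leave all other values unchanged."""
    return [mapping.get(v, v) for v in values]


ages = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = z_normalize(ages)
age_bins = discretize(ages, 3)
answers = recode(["y", "n", "y", "?"], {"y": "yes", "n": "no"})
```

Equal-width binning is the simplest choice; equal-frequency bins would be a reasonable alternative when the values are heavily skewed.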

RapidMiner ETL - Sampling, Selecting Rows, Attributes
Published 2011-08-25 (updated 2011-08-26)

In this video I show how to sample rows, including balancing class labels and bootstrap sampling. I also show how to filter rows by value, and select a subset of attributes.
You can get the dataset here
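
The sampling and selection steps from this post can be approximated in plain Python as follows. This is a sketch under assumptions: rows are dictionaries, balancing means taking at most the same number of rows from each class, and all function names and the toy data are invented for illustration rather than taken from RapidMiner or the linked dataset.

```python
import random


def balanced_sample(rows, label, per_class, rng):
    """Draw up to per_class rows from each class, without replacement."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label], []).append(row)
    sample = []
    for members in by_class.values():
        sample.extend(rng.sample(members, min(per_class, len(members))))
    return sample


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so rows may repeat."""
    return [rng.choice(rows) for _ in rows]


def filter_rows(rows, attribute, value):
    """Keep only rows whose attribute equals value."""
    return [row for row in rows if row[attribute] == value]


def select_attributes(rows, names):
    """Keep only the named attributes in every row."""
    return [{name: row[name] for name in names} for row in rows]


# Toy data: 2 "yes" rows and 6 "no" rows.
rng = random.Random(0)  # fixed seed so repeated runs draw the same sample
rows = [{"id": i, "label": "yes" if i < 2 else "no", "x": i * 1.5} for i in range(8)]

balanced = balanced_sample(rows, "label", 2, rng)
boot = bootstrap_sample(rows, rng)
yes_rows = filter_rows(rows, "label", "yes")
slim = select_attributes(rows, ["id", "label"])
```

Passing the random generator in explicitly keeps the sampling reproducible, which helps when comparing models trained on different samples.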