Vancouver Data Blog by Neil McGuigan

Some RapidMiner, some JMP, some Google Docs

Most of my blogging is on database-patterns.blogspot.com now (posted 2016-08-05)

Go to https://database-patterns.blogspot.com/

JMP 11 statistics sneak peek (posted 2013-07-30)

The JMP 11 Sneak Peek just came out today.

Text Mining Performance in RapidMiner (posted 2013-05-26)

Did load testing with RapidMiner 5.3 on my laptop (Core i3, 8GB RAM, non-SSD hard drive). Here are the results. I set up Java to use 6500 MB of memory (max). I used the Read Database operator to get the documents. They were random Latin words, of 20 to 500 words in length. The text processing was purposefully simple: tokenize the document and get the binary word vector. I then stored the
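As a rough illustration of the "tokenize and get the binary word vector" step described above, here is a minimal Python sketch. It is not the RapidMiner process from the test; the function names and the two sample documents are made up for the example, and the tokenizer simply keeps runs of letters.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_wordlist(documents):
    """Return a sorted list of every distinct token across all documents."""
    return sorted({tok for doc in documents for tok in tokenize(doc)})

def binary_vector(text, wordlist):
    """Mark each wordlist entry 1 if it occurs in the text at least once, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in wordlist]

# Hypothetical sample documents (the original test used random Latin words).
docs = ["Lorem ipsum dolor", "ipsum ipsum amet"]
wordlist = build_wordlist(docs)
vectors = [binary_vector(d, wordlist) for d in docs]

print(wordlist)
print(vectors)
```

A binary vector ignores how often a word appears, which is why the repeated "ipsum" in the second document still counts once.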

AWS Redshift: How Amazon Changed The Game (posted 2013-05-16)

A good blog post on Amazon Redshift, their Postgres-based massive data warehouse, with some good analysis of performance and costs: http://blog.aggregateknowledge.com/2013/05/16/aws-redshift-how-amazon-changed-the-game/

Vancouver Training: Introduction to Data Mining and Predictive Analytics with RapidMiner - Save $500 (posted 2013-04-18)

I'll be teaching a RapidMiner course here in Vancouver next week: Tuesday, April 23, 2013 at 8:30 AM - Wednesday, April 24, 2013 at 5:00 PM (PDT).

Details here: http://rapid-i_us_20130423-eorg.eventbrite.com/

Save $500 with the coupon VAN_BLOG!

Google's Data Mining Research Papers (posted 2013-02-12)

In case you missed it, here are Google's 104 data mining research papers: http://research.google.com/pubs/DataMining.html

The Google F1 slides (posted 2012-12-20)

Google F1 is a relational database query engine that works on top of Google Spanner, which is a distributed storage system that sits on top of Google File System. Got it? :) Basically, it's a really big, distributed relational database, and Google is using F1 to replace MySQL for AdWords. http://www.stanford.edu/class/cs347/slides/f1.pdf

Chomsky on Where AI Went Wrong (posted 2012-11-01)

If one were to rank a list of civilization's greatest and most elusive intellectual challenges, the problem of "decoding" ourselves -- understanding the inner workings of our minds and our brains, and how the architecture of these elements is encoded in our genome -- would surely be at the top. Yet the diverse fields that took on this challenge, from philosophy and psychology to computer science

The father of fractals (posted 2012-11-01)

A nice little piece on Mandelbrot in the Economist: http://www.economist.com/node/2246127

As I predicted, Self-driving cars a reality for 'ordinary people' within 5 years, says Google's Sergey Brin (posted 2012-09-26)

Link here: http://www.computerworld.com/s/article/9231707/Self_driving_cars_a_reality_for_39_ordinary_people_39_within_5_years_says_Google_39_s_Sergey_Brin

The Google Spanner Paper (posted 2012-09-26)

Google Spanner is a massively distributed database. It relies on GPS receivers and atomic clocks in each datacenter to work, though... http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/spanner-osdi2012.pdf

The Google Dremel Paper (posted 2012-09-07)

Here is the paper describing Google Dremel, which may replace Hive one day. There does not seem to be anyone working on an open-source version, though. Link (PDF)

Update: Apache Drill is the open-source version of Dremel (hat tip to Zoltan). Also, Cloudera's Impala looks similar.

Self-driving cars: The next revolution (posted 2012-09-07)

Here is a recent report from KPMG about self-driving cars: http://www.kpmg.com/US/en/IssuesAndInsights/ArticlesPublications/Documents/self-driving-cars-next-revolution.pdf

Google’s Self-Driving Cars Are Going to Change Everything (posted 2012-08-07)

Recent News: Google’s Self-Driving Cars Complete 300K Miles Without Accident, Deemed Ready For Commuting http://techcrunch.com/2012/08/07/google-cars-300000-miles-without-accident/ Here's what is going to happen in the next 5-10 years. It won't all happen right away. The car insurance industry will cease to exist. These cars aren't going to crash. Even if there are hold-outs that drive

Less Painful AJAX / Javascript Web Scraping (posted 2012-02-11)

If you read my previous post, you'll see that scraping AJAX pages can be a pain. So I wrote a little Java program to make it easier. It takes a list of URLs to scrape, renders them in a browser, and saves the rendered HTML (normal and AJAX) and screenshots to a folder. Here's the how-to video: You need Firefox 3+ installed, as well as Java 1.6. This is a beta project, and no warranty

Web Scraping AJAX Pages (posted 2012-02-09)

This is part four of a series of video tutorials on web scraping and web crawling. You can probably skip this one, and go to the easy version.

Part 1: Web scraping with Google Spreadsheets and XPath
Part 2: Web Crawling with RapidMiner
Part 3: Web Scraping with RapidMiner and XPath

This post explains how to capture HTML from Ajax / Javascript generated pages. Here is the

On Making Videos (posted 2012-01-29)

Here is what I use to make my videos:

1. CamStudio. This is a nice free and open-source desktop video capture program. Make sure to use their Lossless Codec, and go with these settings:
Set Keyframes Every 30 frames
Capture Frames Every = 50 milliseconds
Playback Rate = 20 frames per second
Video codec: CamStudio Lossless Codec
Quality: 70%

2. Handbrake Video Transcoder. This will help

Happy New Year (posted 2011-12-31)

75,000 pageviews this year! Thanks to everyone for visiting. I will post some new material in the new year. Have a safe and fun 2012.

Neil

My new blog about learning ExtJS (posted 2011-11-04)

I have a new blog. It's about learning to use ExtJS, a great rich internet application library in JavaScript. Here it is: http://extjs-tutorials.blogspot.com/ Check it out. Thanks. Don't worry, I'll keep posting here too.

How Obama's data-crunching prowess may get him re-elected (posted 2011-10-09)

An article on CNN about how the Obama 2012 campaign has hired many data miners and statisticians to help boost fundraising and support. http://www.cnn.com/2011/10/09/tech/innovation/obama-data-crunching-election/index.html?hpt=hp_c1

Text Analytics with RapidMiner Part 6 of 6 - Applying the Model to New Documents (posted 2011-10-08)

After my last series, I got a lot of questions about how to apply a model to new data, so here is the real final installment in the series. I show how to save a wordlist and model to the repository. I use them later to read the wordlist and model and apply them to new documents that RapidMiner hasn't seen before. It correctly labels 11 of the 12 documents. Files from the video.
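The save-then-apply idea can be sketched outside RapidMiner too. The Python below is only an illustration, not what the video does: the wordlist, the per-label weights standing in for a "model", the file name `text_model.json`, and the sample document are all hypothetical. The point it shows is that new documents must be vectorized against the saved wordlist so their columns line up with what the model was trained on.

```python
import json
import os
import re
import tempfile

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def vectorize(text, wordlist):
    """Binary vector over the saved wordlist; words not in the list are ignored."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in wordlist]

def predict(vector, model):
    """Score each label as a dot product of its weights with the vector."""
    scores = {
        label: sum(w * x for w, x in zip(weights, vector))
        for label, weights in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical wordlist and toy model (one weight per wordlist entry).
wordlist = ["goal", "match", "stock", "trade"]
model = {"sports": [1.0, 1.0, 0.0, 0.0], "finance": [0.0, 0.0, 1.0, 1.0]}

# Save both together, standing in for storing them in a repository.
path = os.path.join(tempfile.mkdtemp(), "text_model.json")
with open(path, "w") as f:
    json.dump({"wordlist": wordlist, "model": model}, f)

# Later: load them back and label a document the model has not seen.
with open(path) as f:
    saved = json.load(f)

new_doc = "A late goal decided the match, whatever the stock market did"
label = predict(vectorize(new_doc, saved["wordlist"]), saved["model"])
print(label)
```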

September sunset (posted 2011-09-02)

RapidMiner ETL - Transforming Attributes with Functions (posted 2011-08-27)

In this video I show how to transform features in RapidMiner using operators such as log, sqrt, absolute value, and multiplying columns.
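For readers who want to see the same kinds of transformations in code, here is a small Python sketch. It does not use RapidMiner; the column names (`price`, `quantity`, `change`) and the values are invented for the example.

```python
import math

# Hypothetical example rows; the column names are assumptions.
rows = [
    {"price": 100.0, "quantity": 4.0, "change": -2.5},
    {"price": 20.0, "quantity": 9.0, "change": 1.5},
]

def transform(row):
    """Return a copy of the row with derived attributes added."""
    out = dict(row)
    out["log_price"] = math.log(row["price"])          # natural log
    out["sqrt_quantity"] = math.sqrt(row["quantity"])  # square root
    out["abs_change"] = abs(row["change"])             # absolute value
    out["revenue"] = row["price"] * row["quantity"]    # product of two columns
    return out

transformed = [transform(r) for r in rows]
for r in transformed:
    print(r)
```

Note that `math.log` and `math.sqrt` raise `ValueError` on negative inputs (and `math.log` on zero), so real data may need filtering or shifting first.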

RapidMiner ETL - Normalizing, Discretizing, Recoding (posted 2011-08-27)

In this video I show how to normalize an attribute, including z-normalization, how to discretize a column, and how to recode values.
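Here is a rough Python sketch of the three operations, independent of RapidMiner. The data, bin edges, and labels are hypothetical, and the z-normalization here divides by the population standard deviation, which is an assumption; a tool may use the sample standard deviation instead.

```python
import statistics

# Hypothetical numeric attribute.
ages = [20.0, 30.0, 40.0, 50.0, 60.0]

# z-normalization: subtract the mean, divide by the standard deviation.
mean = statistics.mean(ages)
stdev = statistics.pstdev(ages)  # population standard deviation (assumption)
z_scores = [(a - mean) / stdev for a in ages]

def discretize(value, edges, labels):
    """Return the label of the first bin whose upper edge exceeds the value."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]  # values at or above the last edge fall in the last bin

# Discretize into three bins: below 30, 30 up to 50, 50 and above.
age_bins = [discretize(a, [30, 50], ["young", "middle", "senior"]) for a in ages]

# Recode: map raw codes to readable values, leaving unknown codes unchanged.
mapping = {"M": "male", "F": "female"}
genders = ["M", "F", "F", "X", "M"]
recoded = [mapping.get(g, g) for g in genders]

print(z_scores)
print(age_bins)
print(recoded)
```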

RapidMiner ETL - Sampling, Selecting Rows, Attributes (posted 2011-08-25)

In this video I show how to sample rows, including balancing class labels and bootstrap sampling. I also show how to filter rows by value, and select a subset of attributes. You can get the dataset here
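A minimal Python sketch of the same ideas, not tied to RapidMiner. The toy rows, the column names, and the helper names are made up; balancing here means downsampling every class to the size of the smallest one, which is one of several possible strategies.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical dataset: 3 "yes" rows and 7 "no" rows.
rows = [
    {"id": i, "label": "yes" if i < 3 else "no", "score": i * 10}
    for i in range(10)
]

def balanced_sample(rows, label_key, rng):
    """Downsample each class to the size of the smallest class."""
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row)
    n = min(len(group) for group in groups.values())
    sample = []
    for group in groups.values():
        sample.extend(rng.sample(group, n))  # without replacement
    return sample

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

balanced = balanced_sample(rows, "label", rng)
boot = bootstrap_sample(rows, rng)

# Filter rows by value, then keep only a subset of attributes.
filtered = [r for r in rows if r["score"] >= 50]
selected = [{k: r[k] for k in ("id", "label")} for r in rows]

print(len(balanced), len(boot), len(filtered))
```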
