Vancouver Data Blog by Neil McGuigan

Friday, August 5, 2016

Most of my blogging is on database-patterns.blogspot.com now

Go to https://database-patterns.blogspot.com/

Tuesday, July 30, 2013

JMP 11 statistics sneak peek

Sunday, May 26, 2013

Text Mining Performance in RapidMiner

I load-tested RapidMiner 5.3 on my laptop (Core i3, 8 GB RAM, non-SSD hard drive). Here are the results.

I configured Java to use a maximum of 6,500 MB of memory.

I used the Read Database operator to get the documents. Each document consisted of random Latin words and was 20 to 500 words long.

The text processing was purposefully simple: tokenize the document and get the binary word vector.
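For readers who want to see what that step amounts to outside RapidMiner, here is a minimal Python sketch. It is not the RapidMiner implementation: the function names `tokenize` and `binary_word_vectors` are hypothetical, and splitting on non-letter characters is an assumption about the tokenizer settings.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens on non-letter characters (assumed rule)."""
    return [token for token in re.split(r"[^a-zA-Z]+", text.lower()) if token]


def binary_word_vectors(documents):
    """Return (vocabulary, vectors); each vector holds 1 if the word occurs, else 0."""
    tokenized = [set(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*tokenized))
    vectors = [
        [1 if word in tokens else 0 for word in vocabulary]
        for tokens in tokenized
    ]
    return vocabulary, vectors


# Two tiny made-up documents, only for illustration.
docs = ["Lorem ipsum dolor", "ipsum ipsum amet"]
vocab, vecs = binary_word_vectors(docs)
print(vocab)
print(vecs)
```

A binary vector records only whether a word occurs, so the repeated "ipsum" in the second document still yields a single 1.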

I then stored the results in the RapidMiner repository, which creates a binary file.

In a separate process, I then read the stored results and applied a Naive Bayes model to them. I didn't run this step for every record count, but the timings I did collect differed little. As you can see, model application is quite fast.
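To give a feel for why applying the model is cheap, here is a rough Python sketch of a Bernoulli-style Naive Bayes over binary word vectors. This is an assumption-laden illustration, not RapidMiner's Naive Bayes operator: the function names, the add-one smoothing, and the toy data are all made up for the example.

```python
import math


def train_bernoulli_nb(vectors, labels):
    """Estimate class priors and smoothed P(word present | class) per feature."""
    n_features = len(vectors[0])
    model = {}
    for label in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == label]
        prior = len(rows) / len(vectors)
        # Add-one (Laplace) smoothing keeps every probability strictly in (0, 1).
        probs = [
            (sum(row[j] for row in rows) + 1) / (len(rows) + 2)
            for j in range(n_features)
        ]
        model[label] = (prior, probs)
    return model


def apply_bernoulli_nb(model, vector):
    """Return the label with the highest log-posterior for one binary vector."""
    best_label, best_score = None, -math.inf
    for label, (prior, probs) in model.items():
        score = math.log(prior)
        for present, p in zip(vector, probs):
            score += math.log(p if present else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (hypothetical): three word features, two classes.
train_vectors = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
train_labels = ["a", "a", "b", "b"]
nb_model = train_bernoulli_nb(train_vectors, train_labels)
print(apply_bernoulli_nb(nb_model, [1, 0, 0]))
print(apply_bernoulli_nb(nb_model, [0, 0, 1]))
```

Scoring one document is a single pass over its vector for each class, which is consistent with application taking only a few seconds even at 50,000 to 60,000 records.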

Records	Time to process + store (s)	Peak memory (GB)	Stored results file size (MB)	Time to apply (s)
100	0	0.400	0.223	1
1,000	1	0.576	2.1	0
10,000	8	1.3	21	1
20,000	15	2.4	42
30,000	23	2.6	63
40,000	30	2.9	84
50,000	39	3.8	105	5
60,000	48	4.0	126	5
70,000	56	4.1	148
80,000	66	4.5	168
90,000	71	4.7	190
100,000	88	5.3	211
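
The table can be boiled down to a throughput figure and a storage cost per document. The Python below is a small sketch that only redoes arithmetic on the numbers above; the variable names are made up for the example, and 1 MB is taken as 1,000 KB.

```python
# (records, seconds to process + store, stored file size in MB), copied from the table.
measurements = [
    (100, 0, 0.223),
    (1_000, 1, 2.1),
    (10_000, 8, 21),
    (20_000, 15, 42),
    (30_000, 23, 63),
    (40_000, 30, 84),
    (50_000, 39, 105),
    (60_000, 48, 126),
    (70_000, 56, 148),
    (80_000, 66, 168),
    (90_000, 71, 190),
    (100_000, 88, 211),
]

# Records per second; the 100-record run rounds to 0 s, so it is skipped.
throughput = {
    records: records / seconds
    for records, seconds, _ in measurements
    if seconds > 0
}

# Stored size per record in KB (assuming 1 MB = 1,000 KB).
kb_per_record = {
    records: size_mb * 1000 / records
    for records, _, size_mb in measurements
}

for records in sorted(throughput):
    print(f"{records:>7,} records: {throughput[records]:7.0f} records/s, "
          f"{kb_per_record[records]:.2f} KB/record")
```

Processing plus storing stays at roughly 1,000 to 1,350 records per second across the measured sizes, and the stored file grows at about 2.1 KB per document.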

The Store operator was much faster than the Write Database operator.

Thursday, May 16, 2013

AWS Redshift: How Amazon Changed The Game

A good blog post on Amazon Redshift, Amazon's massive Postgres-based data warehouse, with some good analysis of performance and costs:

http://blog.aggregateknowledge.com/2013/05/16/aws-redshift-how-amazon-changed-the-game/

Thursday, April 18, 2013

Vancouver Training: Introduction to Data Mining and Predictive Analytics with RapidMiner - Save $500

I'll be teaching a RapidMiner course here in Vancouver next week:

Tuesday, April 23, 2013 at 8:30 AM - Wednesday, April 24, 2013 at 5:00 PM (PDT)

Details here:

http://rapid-i_us_20130423-eorg.eventbrite.com/

Save $500 with the coupon VAN_BLOG!

Tuesday, February 12, 2013

Google's Data Mining Research Papers

In case you missed it, here are Google's 104 data mining research papers:

http://research.google.com/pubs/DataMining.html

Thursday, December 20, 2012

The Google F1 slides

Google F1 is a relational database query engine that works on top of Google Spanner, which is a distributed storage system that sits on top of Google File System. Got it? :)

Basically, it's a really big, distributed relational database, and Google is using F1 to replace MySQL for AdWords.

http://www.stanford.edu/class/cs347/slides/f1.pdf