
Friday, August 5, 2016

Most of my blogging is on database-patterns.blogspot.com now

Go to https://database-patterns.blogspot.com/

Tuesday, July 30, 2013

Sunday, May 26, 2013

Text Mining Performance in RapidMiner

I ran a load test with RapidMiner 5.3 on my laptop (Core i3, 8 GB RAM, non-SSD hard drive). Here are the results.

I configured Java to use a maximum of 6500 MB of memory.

I used the Read Database operator to get the documents. Each document was made of random Latin words and was 20 to 500 words long.
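For anyone who wants to build a similar test set, here is a minimal sketch of generating such documents in Python. The word list, the function names, and the seed are my own assumptions for illustration; the post does not say which Latin words were used or how the documents were loaded into the database.

```python
import random

# Hypothetical vocabulary: the actual Latin words used in the test are not listed.
LATIN_WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur",
               "adipiscing", "elit", "sed", "tempor", "incididunt", "labore"]

def make_document(rng, min_words=20, max_words=500):
    """Return one document of random words, min_words to max_words long."""
    length = rng.randint(min_words, max_words)  # randint is inclusive on both ends
    return " ".join(rng.choice(LATIN_WORDS) for _ in range(length))

def make_corpus(n_docs, seed=0):
    """Return n_docs random documents; a fixed seed keeps runs repeatable."""
    rng = random.Random(seed)
    return [make_document(rng) for _ in range(n_docs)]

corpus = make_corpus(100)
print(len(corpus), "documents; first has", len(corpus[0].split()), "words")
```

The resulting strings could then be inserted into whatever table the Read Database operator queries.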

The text processing was deliberately simple: tokenize each document and build its binary word vector.
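To make that step concrete, here is a rough Python sketch of tokenizing documents and building binary word vectors. It is not RapidMiner's implementation: lowercasing and splitting on non-letter characters are my assumptions, and the function names are made up for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it on runs of non-letter characters."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def binary_word_vectors(documents):
    """Return (vocabulary, vectors); a vector entry is 1 if the word occurs, else 0."""
    token_sets = [set(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*token_sets))
    vectors = [[1 if word in tokens else 0 for word in vocabulary]
               for tokens in token_sets]
    return vocabulary, vectors

vocab, vectors = binary_word_vectors(["Lorem ipsum dolor", "ipsum ipsum sit amet"])
print(vocab)    # ['amet', 'dolor', 'ipsum', 'lorem', 'sit']
print(vectors)  # [[0, 1, 1, 1, 0], [1, 0, 1, 0, 1]]
```

Note that a binary vector only records whether a word occurs, so the repeated "ipsum" in the second document still yields a single 1.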

I then stored the results in the RapidMiner repository, which creates a binary file.

In a different process, I then read the stored results and applied a Naive Bayes model to them. I didn't apply the model to every result set, but the timings differed little across the ones I did run. As you can see, the model application is quite fast.
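Part of the reason applying a Naive Bayes model is cheap is that each record needs only one pass over its features per class. The sketch below shows this with a Bernoulli-style Naive Bayes on binary vectors in plain Python. It is a stand-in written for illustration, not RapidMiner's Naive Bayes operator, and the toy training data, the add-one smoothing, and the function names are all assumptions of mine.

```python
import math

def train_bernoulli_nb(vectors, labels):
    """Estimate class priors and per-word presence probabilities (add-one smoothed)."""
    model = {}
    n_features = len(vectors[0])
    for label in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == label]
        prior = len(rows) / len(vectors)
        # P(word present | class); smoothing keeps every probability away from 0 and 1
        probs = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(n_features)]
        model[label] = (prior, probs)
    return model

def apply_bernoulli_nb(model, vector):
    """Return the class with the highest log posterior for one binary vector."""
    def log_score(prior, probs):
        return math.log(prior) + sum(
            math.log(p if x else 1.0 - p) for p, x in zip(probs, vector))
    return max(model, key=lambda label: log_score(*model[label]))

# Toy, made-up training data: two features, two classes.
train_x = [[1, 0], [1, 0], [0, 1], [0, 1]]
train_y = ["a", "a", "b", "b"]
model = train_bernoulli_nb(train_x, train_y)
print(apply_bernoulli_nb(model, [1, 0]))  # a
print(apply_bernoulli_nb(model, [0, 1]))  # b
```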


# Records   Time to process + store (s)   Peak memory (GB)   Stored results file size (MB)   Time to apply (s)
100         0                             0.400              0.223                           1
1,000       1                             0.576              2.1                             0
10,000      8                             1.3                21                              1
20,000      15                            2.4                42                              -
30,000      23                            2.6                63                              -
40,000      30                            2.9                84                              -
50,000      39                            3.8                105                             5
60,000      48                            4.0                126                             5
70,000      56                            4.1                148                             -
80,000      66                            4.5                168                             -
90,000      71                            4.7                190                             -
100,000     88                            5.3                211                             -

(- = model not applied at this size)


The Store operator was much faster than the Write Database operator.
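The numbers in the table can also be turned into rates. The short Python sketch below takes three rows from the table and computes records processed per second and stored kilobytes per record. The `summarize` helper is a name I made up, and I'm assuming 1 MB = 1000 KB.

```python
# (records, seconds to process + store, stored file size in MB), copied from the table
rows = [(10_000, 8, 21), (50_000, 39, 105), (100_000, 88, 211)]

def summarize(records, seconds, size_mb):
    """Return (records per second, stored KB per record), assuming 1 MB = 1000 KB."""
    return records / seconds, size_mb * 1000 / records

for records, seconds, size_mb in rows:
    rate, kb_per_record = summarize(records, seconds, size_mb)
    print(f"{records:>7,} records: {rate:,.0f} records/s, {kb_per_record:.2f} KB/record")
```

For these rows that works out to roughly 1,100 to 1,300 records per second and about 2.1 KB per stored record, so both processing time and file size grow close to linearly with the number of records.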

Thursday, May 16, 2013

AWS Redshift: How Amazon Changed The Game

A good blog post on Amazon Redshift, their Postgres-based massive data warehouse, with some good analysis of performance and costs:

http://blog.aggregateknowledge.com/2013/05/16/aws-redshift-how-amazon-changed-the-game/

Thursday, April 18, 2013

Vancouver Training: Introduction to Data Mining and Predictive Analytics with RapidMiner - Save $500


I'll be teaching a RapidMiner course here in Vancouver next week:

Tuesday, April 23, 2013 at 8:30 AM - Wednesday, April 24, 2013 at 5:00 PM (PDT)

Details here:

http://rapid-i_us_20130423-eorg.eventbrite.com/

Save $500 with the coupon VAN_BLOG!

Tuesday, February 12, 2013

Google's Data Mining Research Papers

In case you missed it, here are Google's 104 data mining research papers:

http://research.google.com/pubs/DataMining.html


Thursday, December 20, 2012

The Google F1 slides

Google F1 is a relational database query engine that runs on top of Google Spanner, a distributed storage system that in turn sits on top of Colossus, the successor to Google File System. Got it? :)

Basically, it's a really big, distributed relational database, and Google is using F1 to replace MySQL for AdWords.

http://www.stanford.edu/class/cs347/slides/f1.pdf