Vancouver Data Blog by Neil McGuigan
Some RapidMiner, some JMP, some Google Docs
Friday, August 5, 2016
Tuesday, July 30, 2013
Sunday, May 26, 2013
Text Mining Performance in RapidMiner
I did some load testing with RapidMiner 5.3 on my laptop (Core i3, 8 GB RAM, non-SSD hard drive). Here are the results.
I set up Java to use 6500 MB of memory (max).
I used the Read Database operator to get the documents. Each document was made of random Latin words and was 20 to 500 words long.
The text processing was purposefully simple: tokenize the document and get the binary word vector.
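For readers who want to see what this step amounts to outside RapidMiner, here is a minimal Python sketch of tokenizing documents and building binary word vectors (1 if a word occurs in the document, 0 otherwise). It is an illustration only, not what RapidMiner runs internally: the function names `tokenize` and `binary_word_vectors` and the split-on-non-letters rule are assumptions made for the example.

```python
import re

def tokenize(text):
    """Split text into lowercase tokens on any run of non-letter characters."""
    return [t for t in re.split(r"[^A-Za-z]+", text.lower()) if t]

def binary_word_vectors(documents):
    """Return (vocabulary, rows); each row holds 0/1 flags for word presence."""
    tokenized = [set(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*tokenized))
    rows = [[1 if word in tokens else 0 for word in vocabulary]
            for tokens in tokenized]
    return vocabulary, rows

# Two tiny hypothetical documents made of Latin filler words
docs = ["lorem ipsum dolor", "ipsum ipsum amet"]
vocab, rows = binary_word_vectors(docs)
print(vocab)  # ['amet', 'dolor', 'ipsum', 'lorem']
print(rows)   # [[0, 1, 1, 1], [1, 0, 1, 0]]
```

Note that the repeated "ipsum" in the second document still yields a single 1, which is what distinguishes a binary vector from a term-count vector.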
I then stored the results in the RapidMiner repository, which creates a binary file.
In a different process, I then read the stored results and applied a Naive Bayes model to them. I didn't time this for every data set size, but the sizes I did time showed little difference. As you can see, the model application is quite fast.
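To give a feel for why applying a Naive Bayes model is cheap, here is a small Python sketch of a Bernoulli-style Naive Bayes over binary word vectors. This is an assumption-laden illustration: the post does not say how RapidMiner's Naive Bayes operator treats the attributes, and the names `train_bernoulli_nb` and `predict`, the Laplace smoothing parameter `alpha`, and the toy data are all made up for the example.

```python
import math

def train_bernoulli_nb(rows, labels, alpha=1.0):
    """Estimate a log prior and per-word presence probabilities for each class."""
    n_features = len(rows[0])
    model = {}
    for c in sorted(set(labels)):
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        n = len(class_rows)
        # Laplace-smoothed probability that word j is present in class c
        probs = [(sum(r[j] for r in class_rows) + alpha) / (n + 2 * alpha)
                 for j in range(n_features)]
        model[c] = (math.log(n / len(rows)), probs)
    return model

def predict(model, row):
    """Score one binary vector against every class and return the best class."""
    best_class, best_score = None, -math.inf
    for c, (log_prior, probs) in model.items():
        score = log_prior + sum(
            math.log(p if x else 1.0 - p) for x, p in zip(row, probs))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical training data: three words, two classes
rows = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
labels = ["a", "a", "b", "b"]
model = train_bernoulli_nb(rows, labels)
print(predict(model, [1, 0, 0]))  # a
print(predict(model, [0, 0, 1]))  # b
```

Scoring a record is one pass over its attributes per class, with no iteration or search, which fits the short apply times reported here.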
The Store operator was much faster than the Write Database operator.
| # Records | Time to process + store (s) | Peak memory (GB) | Stored results file size (MB) | Time to apply (s) |
|---|---|---|---|---|
| 100 | 0 | 0.400 | 0.223 | 1 |
| 1,000 | 1 | 0.576 | 2.1 | 0 |
| 10,000 | 8 | 1.3 | 21 | 1 |
| 20,000 | 15 | 2.4 | 42 | |
| 30,000 | 23 | 2.6 | 63 | |
| 40,000 | 30 | 2.9 | 84 | |
| 50,000 | 39 | 3.8 | 105 | 5 |
| 60,000 | 48 | 4.0 | 126 | 5 |
| 70,000 | 56 | 4.1 | 148 | |
| 80,000 | 66 | 4.5 | 168 | |
| 90,000 | 71 | 4.7 | 190 | |
| 100,000 | 88 | 5.3 | 211 | |
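The table is easier to read as rates. The short Python sketch below recomputes records per second and stored size per record from the figures above. The helper name `throughput` is hypothetical, and the size calculation assumes 1 MB = 1,000 KB. Times were recorded in whole seconds, so the smallest runs are too coarse to say much.

```python
# (records, process + store seconds, stored file size in MB), copied from the table
results = [
    (100, 0, 0.223), (1_000, 1, 2.1), (10_000, 8, 21), (20_000, 15, 42),
    (30_000, 23, 63), (40_000, 30, 84), (50_000, 39, 105), (60_000, 48, 126),
    (70_000, 56, 148), (80_000, 66, 168), (90_000, 71, 190), (100_000, 88, 211),
]

def throughput(records, seconds):
    """Records per second, or None when the timer reads 0 s."""
    return records / seconds if seconds > 0 else None

for records, seconds, size_mb in results:
    rate = throughput(records, seconds)
    kb_per_record = size_mb * 1000 / records  # assumes 1 MB = 1,000 KB
    rate_text = f"{rate:,.0f} records/s" if rate is not None else "n/a"
    print(f"{records:>7,}: {rate_text:>18}, {kb_per_record:.2f} KB/record")
```

From 10,000 records upward, throughput stays between roughly 1,100 and 1,350 records per second, and the stored file grows by about 2.1 KB per record, so both processing time and file size scale close to linearly over this range.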
Thursday, May 16, 2013
AWS Redshift: How Amazon Changed The Game
A good blog post on Amazon Redshift, their Postgres-based massive data warehouse, with some good analysis of performance and costs:
http://blog.aggregateknowledge.com/2013/05/16/aws-redshift-how-amazon-changed-the-game/
Thursday, April 18, 2013
Vancouver Training: Introduction to Data Mining and Predictive Analytics with RapidMiner - Save $500
I'll be teaching a RapidMiner course here in Vancouver next week:
Tuesday, April 23, 2013 at 8:30 AM - Wednesday, April 24, 2013 at 5:00 PM (PDT)
Details here:
http://rapid-i_us_20130423-eorg.eventbrite.com/
Save $500 with the coupon VAN_BLOG!
Tuesday, February 12, 2013
Google's Data Mining Research Papers
In case you missed it, here are Google's 104 data mining research papers:
http://research.google.com/pubs/DataMining.html
Thursday, December 20, 2012
The Google F1 slides
Google F1 is a relational database query engine that works on top of Google Spanner, which is a distributed storage system that sits on top of Google File System. Got it? :)
Basically, it's a really big, distributed relational database, and Google is using F1 to replace MySQL for AdWords.
http://www.stanford.edu/class/cs347/slides/f1.pdf