Vancouver Data Blog by Neil McGuigan: Text Mining Performance in RapidMiner

Sunday, May 26, 2013

Text Mining Performance in RapidMiner

I ran a load test with RapidMiner 5.3 on my laptop (Core i3, 8 GB RAM, non-SSD hard drive). Here are the results.

I configured Java to use at most 6,500 MB of memory.

I used the Read Database operator to fetch the documents. Each document consisted of random Latin words and was 20 to 500 words long.

The text processing was purposefully simple: tokenize the document and get the binary word vector.
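As a rough illustration of what "tokenize and get the binary word vector" means, here is a minimal Python sketch. It is not RapidMiner's implementation: the function names are made up for this example, and splitting on non-letter characters is an assumption about how the tokens are formed.

```python
import re

def tokenize(text):
    # Assumption: a token is any run of letters; everything else is a separator.
    return re.findall(r"[A-Za-z]+", text)

def binary_word_vectors(documents):
    # One set of distinct tokens per document.
    token_sets = [set(tokenize(doc)) for doc in documents]
    # The vocabulary is every distinct token across all documents, sorted
    # so that the column order is stable.
    vocabulary = sorted(set().union(*token_sets))
    # Binary vector: 1 if the word occurs in the document, 0 otherwise.
    # How often the word occurs is deliberately ignored.
    vectors = [[1 if word in tokens else 0 for word in vocabulary]
               for tokens in token_sets]
    return vocabulary, vectors

docs = ["lorem ipsum dolor", "ipsum sit amet ipsum"]
vocabulary, vectors = binary_word_vectors(docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes one row with one column per vocabulary word, so the result grows with both the number of documents and the number of distinct words.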

I then stored the results in the RapidMiner repository, which creates a binary file.

In a separate process, I then read the stored results and applied a Naive Bayes model to them. I didn't time this step for every data set size, but the times I did measure varied little. As the table shows, applying the model is quite fast.

Records	Time to process + store (s)	Peak memory (GB)	Stored results file size (MB)	Time to apply (s)
100	0	0.400	0.223	1
1,000	1	0.576	2.1	0
10,000	8	1.3	21	1
20,000	15	2.4	42	
30,000	23	2.6	63	
40,000	30	2.9	84	
50,000	39	3.8	105	5
60,000	48	4.0	126	5
70,000	56	4.1	148	
80,000	66	4.5	168	
90,000	71	4.7	190	
100,000	88	5.3	211	
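To put the table in per-record terms, the short Python sketch below derives throughput and stored size per record from three of the rows. The numbers are copied from the table; the `summarize` helper is just for this example, and treating 1 MB as 1,000 KB is an assumption.

```python
# (records, seconds to process + store, stored file size in MB), copied from the table.
measurements = [
    (10_000, 8, 21),
    (50_000, 39, 105),
    (100_000, 88, 211),
]

def summarize(records, seconds, size_mb):
    """Return (records per second, stored kilobytes per record)."""
    # Assumption: 1 MB = 1,000 KB.
    return records / seconds, size_mb * 1000 / records

for records, seconds, size_mb in measurements:
    rate, kb_per_record = summarize(records, seconds, size_mb)
    print(f"{records:>7,} records: {rate:,.0f} records/s, {kb_per_record:.2f} KB/record")
```

For these rows, processing plus storing runs at roughly 1,100 to 1,300 records per second, and the stored file takes about 2.1 KB per record.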

Storing the results with the Store operator was much faster than writing them with the Write Database operator.

1 comment:

Anonymous, October 21, 2013 at 7:31 PM
Dear Neil,

I have been following you for about a year. Recently, I did a binary classification in RapidMiner using three different algorithms (SVM, k-NN, and Naive Bayes). I got the results, but my supervisor has asked me to report the threshold for the classification. How can I report the threshold that RapidMiner used for each of them? I know there is an operator named Find Threshold (Meta), but I was not sure it was the correct one. It gave me the same threshold for all the algorithms! I would be grateful if you could point me in the right direction.

Regards,
Reza