I set up Java with a maximum heap of 6500 MB.
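That limit corresponds to the JVM's maximum heap size. As a minimal sketch, the equivalent command-line option looks like this (the jar name here is illustrative; in RapidMiner Studio the limit is normally set through the preferences instead):

```sh
# Cap the JVM heap at 6500 MB (hypothetical invocation, jar name is illustrative)
java -Xmx6500m -jar rapidminer-studio.jar
```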
I used the Read Database operator to fetch the documents, which consisted of random Latin words and ranged from 20 to 500 words in length.
The text processing was deliberately simple: tokenize each document and generate its binary word vector.
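For readers unfamiliar with the representation, here is a minimal sketch of what "tokenize + binary word vector" produces; this is my own illustration, not RapidMiner's Process Documents internals. Each document becomes a 0/1 row indicating which vocabulary words it contains:

```java
import java.util.*;

// Tokenize documents on whitespace and build one binary vector per document:
// 1 if the vocabulary word occurs in the document, else 0.
public class BinaryWordVectors {
    public static void main(String[] args) {
        List<String> docs = List.of("lorem ipsum dolor", "ipsum sit amet", "dolor sit");

        // Build the vocabulary from all tokens across all documents.
        SortedSet<String> vocab = new TreeSet<>();
        for (String doc : docs) vocab.addAll(Arrays.asList(doc.split("\\s+")));
        List<String> words = new ArrayList<>(vocab);

        // Emit the binary vector for each document.
        for (String doc : docs) {
            Set<String> tokens = new HashSet<>(Arrays.asList(doc.split("\\s+")));
            int[] vector = new int[words.size()];
            for (int i = 0; i < words.size(); i++)
                vector[i] = tokens.contains(words.get(i)) ? 1 : 0;
            System.out.println(doc + " -> " + Arrays.toString(vector));
        }
    }
}
```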
I then stored the results in the RapidMiner repository, which creates a binary file.
In a separate process, I then read the stored results and applied a Naive Bayes model to them. I didn't apply the model at every record count (hence the blank cells), but where I did, the times were similar. As the table shows, model application is quite fast; the sketch after the table suggests why.
| # Records | Time to process + store (s) | Peak memory (GB) | Stored results file size (MB) | Time to apply (s) |
|----------:|----------------------------:|-----------------:|------------------------------:|------------------:|
| 100       | 0                           | 0.400            | 0.223                         | 1                 |
| 1,000     | 1                           | 0.576            | 2.1                           | 0                 |
| 10,000    | 8                           | 1.3              | 21                            | 1                 |
| 20,000    | 15                          | 2.4              | 42                            |                   |
| 30,000    | 23                          | 2.6              | 63                            |                   |
| 40,000    | 30                          | 2.9              | 84                            |                   |
| 50,000    | 39                          | 3.8              | 105                           | 5                 |
| 60,000    | 48                          | 4.0              | 126                           | 5                 |
| 70,000    | 56                          | 4.1              | 148                           |                   |
| 80,000    | 66                          | 4.5              | 168                           |                   |
| 90,000    | 71                          | 4.7              | 190                           |                   |
| 100,000   | 88                          | 5.3              | 211                           |                   |
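For intuition on why applying the model stays cheap, here is a minimal sketch of a Bernoulli-style Naive Bayes over binary word vectors. The class and its structure are my own illustration, not RapidMiner's internals; scoring is a single pass over each vector, so it scales linearly with the number of records:

```java
// A minimal Bernoulli Naive Bayes over binary word vectors (illustrative only).
public class BernoulliNB {
    // logPrior[c] = log P(c); logPresent[c][w] = log P(word w present | class c)
    private final double[] logPrior;
    private final double[][] logPresent, logAbsent;

    public BernoulliNB(int[][] vectors, int[] labels, int numClasses) {
        int numWords = vectors[0].length;
        logPrior = new double[numClasses];
        logPresent = new double[numClasses][numWords];
        logAbsent = new double[numClasses][numWords];
        int[] classCount = new int[numClasses];
        int[][] wordCount = new int[numClasses][numWords];
        for (int i = 0; i < vectors.length; i++) {
            classCount[labels[i]]++;
            for (int w = 0; w < numWords; w++)
                wordCount[labels[i]][w] += vectors[i][w];
        }
        for (int c = 0; c < numClasses; c++) {
            logPrior[c] = Math.log((double) classCount[c] / vectors.length);
            for (int w = 0; w < numWords; w++) {
                // Laplace smoothing so unseen words don't zero out the score.
                double p = (wordCount[c][w] + 1.0) / (classCount[c] + 2.0);
                logPresent[c][w] = Math.log(p);
                logAbsent[c][w] = Math.log(1.0 - p);
            }
        }
    }

    // Scoring: one pass over the vector per class, summing log-probabilities.
    public int predict(int[] vector) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < logPrior.length; c++) {
            double score = logPrior[c];
            for (int w = 0; w < vector.length; w++)
                score += (vector[w] == 1) ? logPresent[c][w] : logAbsent[c][w];
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        int[][] X = {{1, 0, 1}, {0, 1, 1}, {1, 1, 0}};
        int[] y = {0, 1, 0};
        BernoulliNB model = new BernoulliNB(X, y, 2);
        System.out.println("Predicted class: " + model.predict(new int[]{1, 0, 0}));
    }
}
```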
The Store operator was much faster than the Write Database operator.