Thursday, August 25, 2011

RapidMiner ETL - Sampling, Selecting Rows, Attributes

In this video I show how to sample rows, including balancing class labels and bootstrap sampling. I also show how to filter rows by value and select a subset of attributes.

You can get the dataset here
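
For readers who want to see the same operations outside RapidMiner, here is a rough sketch in plain Python. It is not what the video does internally, only an illustration of the ideas: a simple random sample, a sample balanced by class label, a bootstrap sample (drawn with replacement), a row filter, and an attribute selection. The toy rows and the names `id`, `age`, and `label` are made up for the example and are not from the dataset above.

```python
import random
from collections import defaultdict

# Hypothetical toy data; the attribute names are assumptions for illustration.
rows = [
    {"id": i, "age": 20 + i % 30, "label": "yes" if i % 5 == 0 else "no"}
    for i in range(100)
]

rng = random.Random(42)  # fixed seed so the sampling can be reproduced

# 1. Simple random sample of 10 rows, without replacement.
simple_sample = rng.sample(rows, 10)

# 2. Balanced sample: draw the same number of rows from each class label,
#    limited by the size of the smallest class.
by_label = defaultdict(list)
for row in rows:
    by_label[row["label"]].append(row)
per_class = min(len(group) for group in by_label.values())
balanced_sample = [
    row for group in by_label.values() for row in rng.sample(group, per_class)
]

# 3. Bootstrap sample: same size as the input, drawn with replacement,
#    so some rows can appear more than once and others not at all.
bootstrap_sample = rng.choices(rows, k=len(rows))

# 4. Filter rows by value.
over_40 = [row for row in rows if row["age"] > 40]

# 5. Select a subset of attributes.
keep = ("id", "label")
selected = [{name: row[name] for name in keep} for row in rows]
```

Balancing here undersamples the larger class down to the size of the smaller one; oversampling the smaller class is another option, depending on how much data you can afford to throw away.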


  1. I tend to push all of this work down to the database level. However, I can run into memory problems if I'm just selecting rows out (so you need to do a table read). The bigger question is: with RapidMiner, is it better practice to read from a database directly into RapidMiner, or to write the database read to a flat file that is then read into RapidMiner? Which is more efficient, and which is "good form"?

  2. I like using RM for ETL as it's easy to test and save the process. It's easy to do ETL in many programs, but how easy is it to reproduce and share?