Monday, December 27, 2010

10,000 views on my Youtube videos, 5,000 on my blog. Thanks everyone!

Blown away with the number of visitors!

While I'm here, here's a good article from Wired magazine about AI:

The A.I. Revolution Is On

Thursday, December 16, 2010

Next video series: Web crawling and scraping

I'll be working on a video series on web crawling and scraping over Christmas, for release at the end of December or the first week of January at the latest.

Web crawling and social network analysis were neck and neck on the poll, with social slightly ahead, but I am going to do web crawling first as I'm working on a web crawling project, so it will be fresher in my mind.

Tuesday, December 14, 2010

How to Filter By Value in RapidMiner

Use the Filter Examples operator on an ExampleSet (Data Transformation > Filtering).

Set condition class to attribute_value_filter.

Set parameter string to something like attribute=value, for example:




Category=customer service|healthcare

  • Both attribute and value are case-sensitive
  • Spaces are allowed
  • Use the | character for the "or" operator.
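To make the matching rules concrete, here is a rough Python analogue of the filter above. It is only a sketch: the filter_examples function and the sample rows are made up for illustration, and I'm assuming the value side behaves like a regular expression (which is why | works as "or"). It is not RapidMiner's own code.

```python
import re

def filter_examples(rows, parameter_string):
    """Keep rows whose attribute matches one of the '|'-separated values."""
    attribute, _, pattern = parameter_string.partition("=")
    # Assumption: the value side is a regular expression, so '|' means "or".
    # fullmatch keeps the comparison exact and case-sensitive.
    regex = re.compile(pattern)
    return [row for row in rows if regex.fullmatch(str(row.get(attribute, "")))]

# Hypothetical example set
rows = [
    {"Category": "customer service", "Title": "Call centre agent"},
    {"Category": "healthcare", "Title": "Nurse"},
    {"Category": "Healthcare", "Title": "Care aide"},
    {"Category": "finance", "Title": "Analyst"},
]

kept = filter_examples(rows, "Category=customer service|healthcare")
print([row["Title"] for row in kept])
```

The "Healthcare" row is dropped because the match is case-sensitive, and the space in "customer service" is matched literally.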

Custom stemming dictionary

You can create your own stemming dictionary in RapidMiner.

Add the Text Processing -> Stemming -> Stem (Dictionary) operator, and choose your dictionary file (plain text).

Your format should be like this:




will turn fished into fish.

You can also use wildcards:


will turn fished, fishes, fishing or anything beginning with fish into fish.

You should put longer versions of similar words at the top. For example, to stem these words correctly:

computer, computerise, computerize, computerized, computerised, computers, compute, computed, computes

You should use


and not


assuming computer and compute are not the same stem.
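Here is a small Python sketch of why the order matters. It assumes the rules are tried from top to bottom and the first matching pattern wins, which is what the advice above implies. The (pattern, stem) pairs and the stem function are hypothetical and are not RapidMiner's dictionary file format.

```python
from fnmatch import fnmatchcase

def stem(word, rules):
    """Return the stem of the first rule whose pattern matches, else the word."""
    for pattern, target in rules:
        if fnmatchcase(word, pattern):
            return target
    return word

# Hypothetical (pattern, stem) pairs. The longer pattern comes first.
good_rules = [("computer*", "computer"), ("comput*", "compute")]
# Same rules in the wrong order: "comput*" swallows everything.
bad_rules = [("comput*", "compute"), ("computer*", "computer")]

words = ["computers", "computerised", "computed", "computes"]
print([stem(w, good_rules) for w in words])
print([stem(w, bad_rules) for w in words])
```

With the longer pattern first, computers and computerised stem to computer while computed and computes stem to compute. With the order reversed, everything ends up as compute.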

A regular expression to find "word A near word B" in RapidMiner

You can use the Text Processing->Extract Information operator to match regular expressions.

If you put the Extract Information operator inside a Process Documents operator, it will add a column to your dataset with the results of the match. Turn on the "add meta information" option on the Process Documents operator.

Here's a simple regular expression to find a word near another word:


(word1\W+(?:\w+\W+){1,max}?word2)

This will produce a match if "word1" has at least one and no more than "max" words between it and "word2". Example:

"The quick brown fox jumped over the lazy dog"

(quick\W+(?:\w+\W+){1,5}?lazy) will match, but

(quick\W+(?:\w+\W+){1,5}?dog) will not (it has 6 words in between)
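If you want to try the pattern outside RapidMiner, here is a quick check in Python, whose re module handles this particular syntax the same way. The near helper is my own name for illustration and is not part of RapidMiner.

```python
import re

def near(word1, word2, max_words):
    """Build a pattern: word1, then 1..max_words words, then word2."""
    return re.compile(
        r"(" + re.escape(word1) + r"\W+(?:\w+\W+){1," + str(max_words) + r"}?"
        + re.escape(word2) + r")"
    )

text = "The quick brown fox jumped over the lazy dog"

print(bool(near("quick", "lazy", 5).search(text)))  # 5 words in between
print(bool(near("quick", "dog", 5).search(text)))   # 6 words in between
print(bool(near("quick", "dog", 6).search(text)))   # allow up to 6
```

The first and third searches match and the second does not. Note that the {1,max} quantifier needs at least one word in between, so this pattern will not match two adjacent words.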

Saturday, December 11, 2010

So long old media, hello new media

"Financial services company Standard & Poor's announced changes yesterday to its S&P 500 stock market index, a widely respected compendium of large-cap U.S. public companies: Netflix is on the list for the first time. In the same announcement, rather poignantly, S&P announced that newspaper giant The New York Times Co. has been demoted to its index of mid-size companies (the MidCap 400), as has photography equipment manufacturer Eastman Kodak."

Wednesday, December 8, 2010

Which RapidMiner videos would you like to see next?

There's a poll on the top right of my blog. Let me know what you'd like to see next! Voting ends in one week.

Add comments to this post if you want something that's not on the list.



Tuesday, November 30, 2010

In Friday's Globe & Mail: The algorithm method - Programming our lives away

A non-technical look at data mining

"Increasingly, algorithms are used to determine whether we can get access to credit, insurance and government services. They are posing a challenge to human decision-making in the arts. They are being used by prospective employers to decide if we should be hired. They can determine whether your online business will succeed or fail, and they have revolutionized the world of high finance."

Hat tip to Ron C!

Thursday, November 18, 2010

Indeed Job Trends: More C# jobs than C++

Indeed is a slick job board. I'm sure I am behind the times, but I just found their job trends application, much like Google Trends.

Here is C++ versus Java versus C# (USA only):

And some others...

Friday, November 12, 2010

Text Analytics With Rapidminer Part 5 of 6 - Automatic Document Categorization

This is the second-to-last installment of a six-part series on text mining in RapidMiner. This video describes how to automatically categorize documents. This could be useful for a research project or, say, finance.

You could use it to classify documents as "positive" or "negative", thus doing sentiment analysis. You could do it with financial news text, and classify documents as "stock went up" or "stock went down" after the release, and make (short-term) predictions of future stock movements. You can also see which words are important discriminants. Once you've trained a learning algorithm, you can use it on unseen data.

Topics covered:
  • Cross-validation
  • The nearest neighbor learning algorithm
  • The naive bayes learning algorithm
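For anyone curious what naive Bayes does under the hood, here is a minimal Python sketch of a multinomial naive Bayes text classifier with add-one smoothing. The four training headlines and the "up"/"down" labels are invented for illustration, and this is not RapidMiner's implementation. Cross-validation is left out to keep it short.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Count documents and words per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(text, model):
    """Return the class with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Start from the log prior of the class
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data
training = [
    ("profits surge on record sales", "up"),
    ("shares rally after strong earnings", "up"),
    ("company warns of weak sales", "down"),
    ("shares slump after profit warning", "down"),
]
model = train(training)
print(classify("strong sales and record earnings", model))
print(classify("weak profit warning", model))
```

The first headline comes out as "up" and the second as "down", because each shares more words with that class in the training set.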

Here is part 6

If you're not familiar with RapidMiner, see my other videos on my Youtube Channel.

Thanks for watching. Leave a comment for what you'd like to see next!

Also, check out the awesome RapidMiner finance videos on Neural Market Trends.

Text Analytics With Rapidminer Part 4 of 6 - Document Similarity and Clustering

Thanks for watching.

This is part four of a six-part series on text mining in RapidMiner. This video describes how to calculate the TF-IDF score for terms, calculate the similarity between documents, and cluster documents together. This can be useful for finding duplicate documents or database entries, and to show similar documents on a web page.

In the context of a job board, you could use it to find an interesting job, and then to find related ones as well.

Topics covered:
  • creating a word vector and calculating the terms' TF-IDF scores
  • calculating the similarity between documents using their cosine similarity
  • clustering documents using the K-Means algorithm
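As a rough illustration of the first two topics, here is a small pure-Python sketch of TF-IDF and cosine similarity. It uses one common TF-IDF variant (term frequency divided by document length, times the log of inverse document frequency), so RapidMiner's exact numbers may differ. The three job titles are made up, and K-Means is left out.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Number of documents each term appears in
    doc_freq = Counter(term for words in tokenized for term in set(words))
    n_docs = len(docs)
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append({
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents
docs = [
    "data mining analyst with rapidminer experience",
    "senior data mining analyst",
    "line cook for busy restaurant",
]
vectors = tfidf_vectors(docs)
print(round(cosine(vectors[0], vectors[1]), 3))
print(round(cosine(vectors[0], vectors[2]), 3))
```

The two data mining postings get a similarity above zero, while the cook posting shares no terms with the first one and scores exactly zero.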

If you're not familiar with the free and open-source RapidMiner, see my other videos on my Youtube Channel

Up next, automatically categorizing documents.

Thursday, November 11, 2010

Graph of the month. "Analytics" versus "Data Mining" on Google Trends

They mean the same thing, but it looks like the Analytics name has caught on.

Data Mining

Wednesday, November 10, 2010

The Data Analytics Boom, in Forbes

Well, if it's in Forbes, it must be true.

Why analytics is taking off right now:
  1. Total Quality and Six-Sigma taught statistics to engineers.
  2. Goldman Sachs made mad money with statistical finance. This is a good read.
  3. There's a crap-load of data available now. 2 exabytes a day of new data, though frankly it's mostly cat videos and Bieber tweets.
  4. Cheap computers, cheap cloud computing, and good open source software like R and RapidMiner
  5. read more below...

Frankly, I think Competing on Analytics had a fair amount to do with it, as did the million dollar Netflix Prize.

How Canada became an open data and data journalism powerhouse

Open data and data journalism are blowing up.

Expect some good open data analysis coming up here soon!

Text Analytics With Rapidminer Part 3 of 6 - Association Rule Learning

Thanks for watching, and welcome Reddit!

This is part three of a six-part series on text mining in RapidMiner. This video describes how to find association rules in a collection of documents. An example would be if a job posting includes "data" and "mining" then it is also likely to include "RapidMiner". This is known as market basket analysis when applied to grocery stores :)

In this example, it can be useful for finding phrases and concepts that are important to job recruiters. You can use these phrases and concepts in your cover letter and resume, and increase your chances of getting them read.

Topics covered:
  • reading documents from a database
  • processing the text
  • creating a word vector
  • finding frequent itemsets using the FP-Growth algorithm
  • finding association rules
  • visualizing association rules
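To show what "likely to include" means in numbers, here is a short Python sketch that computes support and confidence for the rule {data, mining} -> {rapidminer}. The five postings are invented, and this only computes the two metrics. It does not implement FP-Growth.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(premise, conclusion, transactions):
    """support(premise and conclusion) divided by support(premise)."""
    return support(premise | conclusion, transactions) / support(premise, transactions)

# Hypothetical job postings, each reduced to a set of words
postings = [
    {"data", "mining", "rapidminer"},
    {"data", "mining", "rapidminer", "sql"},
    {"data", "mining", "sas"},
    {"data", "entry"},
    {"sales", "manager"},
]

premise, conclusion = {"data", "mining"}, {"rapidminer"}
print(support(premise | conclusion, postings))
print(confidence(premise, conclusion, postings))
```

Here two of the five postings contain all three words (support of 0.4), and two of the three postings with "data" and "mining" also mention "rapidminer" (confidence of about 0.67).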

If you're not familiar with RapidMiner, see my other videos on my Youtube Channel

Up next, calculating the similarity between documents.

Tuesday, November 9, 2010

Text Analytics With Rapidminer Part 2 of 6 - Processing Text

Wow, several hundred hits yesterday, thanks for watching everyone!

This is part two of a six-part video series on text mining in RapidMiner. This video describes how to process text to get a word frequency table. Topics covered include:
  • reading hundreds of documents from a database into RapidMiner
  • stripping HTML from content from a popular job posting board
  • tokenizing a document (splitting it into words)
  • removing overly common words (stopwords)
  • finding roots of words (stemming)
  • finding phrases in documents (n-grams)
  • and generating a word frequency table that describes important words in your documents
If you're not familiar with RapidMiner, you can see my other videos on my Youtube Channel.
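
For readers who like to see the steps as code, here is a rough Python sketch of the same pipeline: strip HTML, tokenize, drop stopwords, build two-word phrases, and count. The stopword list and sample postings are made up, the HTML stripping is deliberately crude, and stemming is left out. RapidMiner's operators do all of this for you.

```python
import re
from collections import Counter

# Tiny hypothetical stopword list
STOPWORDS = {"a", "an", "and", "the", "for", "of", "to", "in", "with", "is"}

def strip_html(html):
    """Crude tag removal; fine for a sketch, not for messy real-world HTML."""
    return re.sub(r"<[^>]+>", " ", html)

def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def bigrams(tokens):
    """Join neighbouring tokens into two-word phrases."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def word_frequencies(html_docs):
    """Count tokens and bigrams across all documents."""
    counts = Counter()
    for html in html_docs:
        tokens = [t for t in tokenize(strip_html(html)) if t not in STOPWORDS]
        counts.update(tokens)
        counts.update(bigrams(tokens))
    return counts

# Hypothetical job postings
docs = [
    "<p>Looking for a <b>data mining</b> analyst</p>",
    "<div>Data mining and text mining experience is required</div>",
]
freq = word_frequencies(docs)
print(freq.most_common(3))
```

In this toy example "mining" is the most frequent term, and the phrase "data_mining" shows up in both postings.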

Here's the video:

Your feedback and sharing is appreciated!

Up tomorrow, finding association rules in documents.

Monday, November 8, 2010

Join the International Open Data Hackathon, Dec 4

This is gonna be big. 37 cities signed up already. Add your app ideas and sign up here:

Memphis cuts crime 31% with predictive analytics

Can Vancouver do the same?

I haven't seen much on what the VPD does, do you have any links to share?

Here's the story

Found an answer:

"Vancouver-based Police Records Information Management Environment (PRIME) Inc. will start using IBM’s entity analytics to cut duplicates in its province-wide records sharing platform. How Canada’s privacy regulations make for fertile ground for IBM’s anonymous analytics"


Text Analytics with RapidMiner Part 1 of 6 - Loading Text

I'll be releasing a new video on text mining with RapidMiner every day this week.

They're all about 10 minutes long, and go into a fair amount of detail, and should be easy to understand. Your feedback is appreciated!

Here is the first one. It's about loading text into RapidMiner in a variety of ways, from copy and paste, to HTML files, to database reads.

*NOTE: You may need to use the Nominal To Text operator to turn your text field into a field that RapidMiner understands as "text". It's under Data Transformation, Type Conversion.

Later this week:

Tuesday: Processing Text in RapidMiner - tokenizing, stripping HTML, stemming, stopwords, n-grams, and word frequency tables.

Wednesday: Association rules with text in RapidMiner - making word vectors, finding frequent item-sets and high-confidence association rules in text documents.

Thursday: Finding similar documents: how to automatically calculate the similarity between documents. TF-IDF, cosine similarity and K-Means clustering are covered.

Friday: Automatic classification: How to classify documents into classes (like positive/negative reviews, or spam/not spam or sports/finance/leisure news), and which words are important.

NEW: Applying A Model To New Documents

Hope you enjoy them.

See my other data mining videos here

Saturday, November 6, 2010

A five part video series on text mining with RapidMiner starts Monday

Stay tuned.

There will be five videos, with a sample application based on a popular job posting board:

Friday, October 22, 2010

RapidMiner 5 Tutorial, Video 2 - Running RapidMiner for the First Time

Migrating some of my YouTube videos over to my blog, so I can add some text notes.

This video discusses what happens when you first run RapidMiner, including creating a repository, checking for updates, adding extensions,  and where RapidMiner keeps its associated files.

Tuesday, September 28, 2010

Using Custom WMS Maps in JMP 9

Man, this took a while to figure out. I almost had to bust out WireShark to get things working.

Graphs in JMP 9 can use the new Background Map feature. If you have an X co-ordinate and a Y co-ordinate on a graph (or your table has place names that reference co-ordinates in another table), you can use various map servers to provide a background map for your graph.

JMP 9 comes with two built-in world maps, and the NASA maps seem to be working now. NASA maps are good enough for rough city level data. The built-in maps are not very detailed. For example, you can't zoom down to city level, at least in Canada.

So, I decided to use one of the Canadian federal government's free Web Map Service servers to get a detailed map of Vancouver.

I'll throw up a video soon, but for now, here are the basic steps to use a custom WMS map of Vancouver in JMP 9:
  1. Open a data table that has co-ordinate information in it (i.e. longitudes and latitudes).
  2. Open Graph Builder and drag your longitude column to the bottom of the graph, and your latitude column to the left of the graph.
  3. Right click your graph, and choose Graph - Background Map...
  4. Select Web Map Service, and set the URL to this:

and set the Layer field to feature_names,hydrography,vegetation

[You can see the full version of the map here]

The above is Natural Resource Canada's WMS map server. It only works for Canadian co-ordinates. Other WMS servers are available. And now you have a nice detailed map of Vancouver.
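If you're wondering what JMP asks the server for, a WMS map request is just a URL with a standard set of parameters. Here is a Python sketch that builds a WMS 1.1.1 GetMap URL. The base URL is a placeholder (not the real server address), the bounding box is only my rough guess at Vancouver's extent, and the layer names are the ones from the Layer field above.

```python
from urllib.parse import urlencode

def wms_getmap_url(base_url, layers, bbox, width=800, height=600):
    """Build a WMS 1.1.1 GetMap URL for a lon/lat bounding box."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",
        "SRS": "EPSG:4326",  # plain longitude/latitude
        "BBOX": ",".join(str(v) for v in bbox),  # min lon, min lat, max lon, max lat
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return base_url + "?" + urlencode(params, safe=",:/")

url = wms_getmap_url(
    "https://wms.example.org/wms",  # placeholder, substitute your WMS server
    ["feature_names", "hydrography", "vegetation"],
    (-123.3, 49.15, -122.9, 49.35),  # approximate Vancouver area (assumption)
)
print(url)
```

Pasting a URL like this into a browser is a handy way to see the server's error messages, since JMP 9 doesn't show them.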

JMP 9 would be better if it saved your custom WMS URLs, showed error responses from the WMS server, and showed a selectable list of all the available layers. There is now a JMP add-in called "WMS Explorer" that does this.

I'll post more soon about integrating Vancouver data with maps.

Thursday, September 23, 2010

Using JMP 9 and R Together

An R Graph in JMP

JMP 9 is now able to talk to the popular open source statistical software R.

JMP Scripting Language (JSL) is what allows JMP and R to work together.

Here are the most useful JSL functions for R:

//initialize the R session:
int RInit()

//send a JMP object to R:
int RSend(JMPObject)

//get an object from R:
object = RGet(RVariableName)

//submit R code to R for processing: 
int RSubmit(RCode)

//get the last graph made by R:
picture = RGetGraphics(format)

//terminate the session: 
int RTerm() 

Here is a simple program:
  • Open a sample table in JMP
  • Send the data to R
  • Make a scatterplot matrix in R
  • Return the graph to JMP and display it

//open a JMP table:
dt = Open("$SAMPLE_DATA/Big Class.JMP", invisible(true));

//initialize the R session (R must be installed on your computer):
result = RInit();

//send the JMP data table to R:
result = RSend(dt);

//execute the R code. Make a scatterplot matrix:
RCode = "plot(dt)";
result = RSubmit(RCode);

//get the plotted graphic from R:
graphic = RGetGraphics(png);

//plot the graphic in JMP:
NewWindow("Graph from R", PictureBox(graphic));

//terminate the session:
result = RTerm();

Easy huh? The graphic at the top of this posting was done via the above code.

Wednesday, September 22, 2010

JMP 9 Review and Video Preview

Got myself a copy of JMP 9, fresh off the press. I'll give you some details about the software, and walk through the new features with some videos.

Download & Installation

   1. Download size is 352 MB for Windows, 236 MB for Mac.
   2. 32-bit and 64-bit Windows are supported.
   3. A 30 day trial will be available.
   4. Installed size is 325 MB on Windows, with one language, and none of the add-ons
   5. Online activation is required
   6. There is a new product called JMP Pro that includes extra features, such as boosted trees.
   7. Language support includes:
         1. English
         2. Japanese
         3. Simplified Chinese
         4. French
         5. German
         6. Italian

Starting JMP 9

   1. The first time you start it, JMP 9 will ask you if you want to import your customized menus from previous versions of JMP, which is nice of em.
   2. There are some changes to the main window and tool bars. See screenshot below
   3. You can find new JMP add-ins at
   4. Windows are “non-modal” now. For example, if you minimize all windows, your JMP starter will be its own icon on the task bar

The JMP Home Window

Initial Thoughts
  • The Excel Add-In Profiler could be simpler if you only had to set the output field, and it automatically used all the input fields related to that function. This is what the DecisionTools TopRank tool does.
  • I like the preview windows in the Home Window, but I find switching between windows a bit confusing. I'm sure I'll figure it out.
  • R integration in JSL seems straightforward. I'll put up a video soon.
  • Not sure I like the idea of JMP Pro. It's nice to have one version of JMP and get all the features. 
  • Mapping in Graph Builder should be sweet, but their map servers aren't up yet, so I'll have to figure out a way around this for now. I'm looking forward to a map/bubble plot combo.

Video: JMP 9 Excel Add-In Basics

Video: JMP 9 Excel Add-In Profiler

Tuesday, September 21, 2010

What is the Work of Dogs In This City?

What are the top industries in the City of Vancouver?

Knowledge workers include information, finance, insurance, real estate, science, management, and admin.
Obviously, there is not much mining going on in the city, but many mining companies are located here.

Digging deeper, what are the top sub-industries?

Automobile Dealers is slightly misleading, as Jim Pattison Group is a large and diversified company but is classified as an Automobile Dealer.

What are the most profitable industries in the City of Vancouver?

Looks like it pays to mine gold! These numbers are in thousands of dollars. 

What are the most profitable companies in Vancouver?

Those numbers are in thousands of dollars, so Teck made $2.655 billion. Finning is a company that sells and services Caterpillar equipment.

What is the highest revenue numbered corporation, and who is it?

417289 B.C. LTD had 2009 revenues of $126 million and has 800 employees. Care to guess who it is? The answer is at the bottom.

What is the biggest holding company in Vancouver, and who is it?

Shato Holdings had $958 million of revenue in 2008 and 6000 employees. Who is it better known as?

More Coming!


Numbered: Concord Security
Holding: White Spot. But Shato owns many other companies as well.

Tuesday, September 14, 2010

JMP 9 Discovery Summit is in Progress

Some new features that I haven't mentioned:

  • Default local variables in scripts
  • Turn script into add-in. New Add-Ins menu and filetype
  • JMP will use 8 cores if you have em
  • Can colour individual cells in the datatable
  • Alias-Optimal experiments
  • Excel Add-In lets you do Monte Carlo in Excel
  • Random Forests
Live blog of John Sall's talk

Monday, September 13, 2010

New info on JMP 9 pricing, and a slightly new due date

Looks like SAS realized that October 9 was a Saturday, so now JMP 9 will be out October 12. Pricing seems to be up slightly, comparison coming. Features as expected.

Thursday, September 9, 2010

Friday, August 13, 2010

JMP 9 Features - UPDATED

JMP 9 is "due October 12, 2010"

Here is a list of new features in JMP 9:
  • Geographic maps in Graph Builder. With boundaries too. See my upcoming video shortly. 
  • There will be a product called JMP Pro, with extra features, such as Bootstrap Forests and Boosted Trees. Can handle "10s of millions of data points".
  • An Excel Add-In that lets you get your Excel data into JMP very easily, as well as use JMP's Profiler.
  • Train-Test-Validate for data mining
  • Integration with R using the JSL scripting language
  • Analysis of Means (ANOM)
  • An Add-In feature to create your own JMP add-ins and use other people's
  • Export data to SAS
  • Accelerated Life DOE
  • Degradation analysis 
  • "Revamped" neural network platform
  • Some changes to JSL scripting, such as namespaces and scoping
  • Multidimensional Scaling (MDS)
  • Structural Equation Modeling (SEM)

UPDATE: Looks like there is a multidimensional scaling (MDS) add-in for JMP now:

Saturday, April 3, 2010

Video 1. Downloading and Installing RapidMiner 5

This video shows you how to download and install RapidMiner 5.


RapidMiner 5.0

RapidMiner registry keys:

HKEY_USERS\{your user ID}\Software\RapidI
HKEY_USERS\{your user ID}\Software\RapidMiner 5

Installation Folder:

C:\Program Files\Rapid-I\RapidMiner5

Settings Folder:

C:\documents and settings\{your user name}\.RapidMiner5