Vancouver Data Blog by Neil McGuigan: November 2010

Tuesday, November 30, 2010

In Friday's Globe & Mail: The algorithm method - Programming our lives away

A non-technical look at data mining

"Increasingly, algorithms are used to determine whether we can get access to credit, insurance and government services. They are posing a challenge to human decision-making in the arts. They are being used by prospective employers to decide if we should be hired. They can determine whether your online business will succeed or fail, and they have revolutionized the world of high finance."

Hat tip to Ron C!

Thursday, November 18, 2010

Indeed Job Trends: More C# jobs than C++

Indeed is a slick job board. I'm sure I am behind the times, but I just found their job trends application, which works much like Google Trends.

Here is C++ versus Java versus C# (USA only):

And some others...

Friday, November 12, 2010

Text Analytics With RapidMiner Part 5 of 6 - Automatic Document Categorization

This is the ~~final~~ second-to-last installment of a six-part series on text mining in RapidMiner. This video describes how to automatically categorize documents. This could be useful for a research project or, say, in finance.

You could use it to classify documents as "positive" or "negative", thus doing sentiment analysis. You could do it with financial news text, and classify documents as "stock went up" or "stock went down" after the release, and make (short-term) predictions of future stock movements. You can also see which words are important discriminants. Once you've trained a learning algorithm, you can use it on unseen data.

Topics covered:

Cross-validation
The nearest neighbor learning algorithm
The naive bayes learning algorithm
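
To make the naive Bayes idea concrete outside of RapidMiner, here is a minimal sketch in plain Python. It is not the process built in the video: the class name `NaiveBayesText`, the toy headlines, and the "up"/"down" labels are all made up for illustration, and it uses simple add-one smoothing. Cross-validation and nearest neighbor are not shown.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Multinomial naive Bayes with add-one (Laplace) smoothing.
    Hypothetical helper for illustration, not a RapidMiner operator."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: one class label per document
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, tokens):
        # Log prior plus the sum of smoothed log likelihoods, per class
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for t in tokens:
                if t in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[t] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)

# Made-up training data: headline text and what the stock did afterwards
train = [
    ("profits surge on strong sales", "up"),
    ("record profits beat forecast", "up"),
    ("losses widen as sales fall", "down"),
    ("shares fall after weak forecast", "down"),
]
model = NaiveBayesText().fit(
    [text.split() for text, _ in train],
    [label for _, label in train],
)

print(model.predict("strong profits".split()))
print(model.predict("losses fall".split()))
```

Four headlines are far too few to learn anything real. The point is only to show how word counts per class turn into a prediction for unseen text.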

Here is part 6

If you're not familiar with RapidMiner, see my other videos on my YouTube channel.

Thanks for watching. Leave a comment for what you'd like to see next!

Also, check out the awesome RapidMiner finance videos on Neural Market Trends.

Text Analytics With RapidMiner Part 4 of 6 - Document Similarity and Clustering

Thanks for watching.

This is part four of a six-part series on text mining in RapidMiner. This video describes how to calculate the TF-IDF score for terms, calculate the similarity between documents, and cluster documents together. This can be useful for finding duplicate documents or database entries, and to show similar documents on a web page.

In the context of a job board, you could use it to find an interesting job, and then to find related ones as well.

Topics covered:

creating a word vector and calculating the terms' TF-IDF scores
calculating the similarity between documents using their cosine similarity
clustering documents using the K-Means algorithm
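
The first two topics can be sketched in a few lines of plain Python. This is an illustration under assumptions, not what RapidMiner does internally: it uses one common TF-IDF variant (term frequency times the natural log of inverse document frequency), and the function names and toy documents are made up. K-Means clustering is left out to keep it short.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up job titles standing in for full postings
docs = [
    "data mining analyst".split(),
    "data mining engineer".split(),
    "pastry chef".split(),
]
vectors = tfidf_vectors(docs)

print(cosine_similarity(vectors[0], vectors[1]))  # shares "data" and "mining"
print(cosine_similarity(vectors[0], vectors[2]))  # no words in common
```

The two data mining titles get a positive similarity, while the pastry chef posting scores zero against both because it shares no terms with them.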

If you're not familiar with the free and open-source RapidMiner, see my other videos on my YouTube channel.

Up next, automatically categorizing documents.

Thursday, November 11, 2010

Graph of the month. "Analytics" versus "Data Mining" on Google Trends

They mean the same thing, but it looks like the Analytics name has caught on.

[Google Trends chart: "Analytics" versus "Data Mining"]

Wednesday, November 10, 2010

The Data Analytics Boom, in Forbes

Well, if it's in Forbes, it must be true.

Why analytics is taking off right now:

Total Quality and Six-Sigma taught statistics to engineers.
Goldman Sachs made mad money with statistical finance. This is a good read.
There's a crap-load of data available now. 2 exabytes a day of new data, though frankly it's mostly cat videos and Bieber tweets.
Cheap computers, cheap cloud computing, and good open source software like R and RapidMiner
read more below...

http://www.forbes.com/2010/11/05/google-facebook-computing-technology-data.html

Frankly, I think Competing on Analytics had a fair amount to do with it, as did the million-dollar Netflix Prize.

How Canada became an open data and data journalism powerhouse

Open data and data journalism are blowing up.

http://www.guardian.co.uk/news/datablog/2010/nov/09/canada-open-data

Expect some good open data analysis coming up here soon!

Text Analytics With RapidMiner Part 3 of 6 - Association Rule Learning

Thanks for watching, and welcome Reddit!

This is part three of a six-part series on text mining in RapidMiner. This video describes how to find association rules in a collection of documents. For example, if a job posting includes "data" and "mining", then it is also likely to include "RapidMiner". This is known as market basket analysis when applied to grocery stores :)

In this example, it can be useful for finding phrases and concepts that are important to job recruiters. You can use these phrases and concepts in your cover letter and resume, and increase your chances of getting them read.

Topics covered:

reading documents from a database
processing the text
creating a word vector
finding frequent itemsets using the FP-Growth algorithm
finding association rules
visualizing association rules
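
For a feel of what frequent itemsets and association rules are, here is a small plain-Python sketch. Note that it does not implement FP-Growth: it counts every candidate itemset by brute force, which is only workable for tiny examples. The function names, thresholds, and the four toy "postings" are all made up for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force search (not FP-Growth). Returns {frozenset: support}."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            # Support: fraction of transactions containing the whole itemset
            support = sum(1 for t in transactions if itemset <= t) / n
            if support >= min_support:
                result[itemset] = support
    return result

def association_rules(itemsets, min_confidence):
    """Return (premise, conclusion, confidence) for rules premise -> conclusion."""
    rules = []
    for itemset, support in itemsets.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for premise in combinations(sorted(itemset), k):
                premise = frozenset(premise)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already in the dictionary.
                confidence = support / itemsets[premise]
                if confidence >= min_confidence:
                    rules.append((premise, itemset - premise, confidence))
    return rules

# Each made-up posting is reduced to the set of words it contains
postings = [
    {"data", "mining", "rapidminer"},
    {"data", "mining", "rapidminer"},
    {"data", "mining", "sql"},
    {"data", "sql"},
]
itemsets = frequent_itemsets(postings, min_support=0.5)
rules = association_rules(itemsets, min_confidence=0.6)

for premise, conclusion, confidence in rules:
    print(sorted(premise), "->", sorted(conclusion), round(confidence, 2))
```

In this toy data, three postings contain both "data" and "mining" and two of those also contain "rapidminer", so the rule {data, mining} -> {rapidminer} has a confidence of two thirds.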

If you're not familiar with RapidMiner, see my other videos on my YouTube channel.

Up next, calculating the similarity between documents.

Tuesday, November 9, 2010

Text Analytics With RapidMiner Part 2 of 6 - Processing Text

Wow, several hundred hits yesterday, thanks for watching everyone!

This is part two of a six-part video series on text mining in RapidMiner. This video describes how to process text to get a word frequency table. Topics covered include:

reading hundreds of documents from a database into RapidMiner
stripping HTML from content from a popular job posting board
tokenizing a document (splitting it into words)
removing overly common words (stopwords)
finding roots of words (stemming)
finding phrases in documents (n-grams)
and generating a word frequency table that describes important words in your documents
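
The steps in the list above can be sketched in plain Python. This is only a rough illustration, not the RapidMiner process from the video: the stopword list is a tiny made-up one, the `stem` function is a crude suffix stripper standing in for a real stemmer, HTML is stripped with a simple regex, and the two sample postings are invented. The database read is skipped.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "with", "is"}

def strip_html(text):
    """Crudely replace tags with spaces (fine for a sketch, not for messy HTML)."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    """Very naive suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n=2):
    """Join each run of n consecutive tokens with an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def word_frequencies(html_docs):
    """Count stemmed words and two-word phrases across all documents."""
    counts = Counter()
    for doc in html_docs:
        tokens = [stem(t) for t in remove_stopwords(tokenize(strip_html(doc)))]
        counts.update(tokens)
        counts.update(ngrams(tokens, 2))
    return counts

# Invented sample postings
postings = [
    "<p>Seeking a <b>data mining</b> analyst</p>",
    "<p>Data mining and reporting for the sales team</p>",
]
freq = word_frequencies(postings)
print(freq.most_common(5))
```

Because the stemmer is so crude, "mining" becomes "min" here. A proper stemmer would give more sensible roots, but the shape of the pipeline is the same.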

If you're not familiar with RapidMiner, you can see my other videos on my YouTube channel.

Here's the video:

Your feedback and sharing are appreciated!

Up tomorrow, finding association rules in documents.

Monday, November 8, 2010

Join the International Open Data Hackathon, Dec 4

This is gonna be big. 37 cities signed up already. Add your app ideas and sign up here:

http://www.opendataday.org/

Memphis cuts crime 31% with predictive analytics

Can Vancouver do the same?

I haven't seen much on what the VPD does. Do you have any links to share?

Here's the story

Found an answer:

"Vancouver-based Police Records Information Management Environment (PRIME) Inc. will start using IBM’s entity analytics to cut duplicates in its province-wide records sharing platform. How Canada’s privacy regulations make for fertile ground for IBM’s anonymous analytics"

Link

Text Analytics with RapidMiner Part 1 of 6 - Loading Text

I'll be releasing a new video on text mining with RapidMiner every day this week.

They're all about 10 minutes long, go into a fair amount of detail, and should be easy to understand. Your feedback is appreciated!

Here is the first one. It's about loading text into RapidMiner in a variety of ways, from copy and paste, to HTML files, to database reads.

*NOTE: You may need to use the Nominal To Text operator to turn your text field into a field that RapidMiner understands as "text". It's under Data Transformation, Type Conversion.

Later this week:

Tuesday: Processing Text in RapidMiner - tokenizing, stripping HTML, stemming, stopwords, n-grams, and word frequency tables.

Wednesday: Association rules with text in RapidMiner - making word vectors, finding frequent item-sets and high-confidence association rules in text documents.

Thursday: Finding similar documents: how to automatically calculate the similarity between documents. TF-IDF, cosine similarity and K-Means clustering are covered.

Friday: Automatic classification: How to classify documents into classes (like positive/negative reviews, or spam/not spam or sports/finance/leisure news), and which words are important.

NEW: Applying A Model To New Documents

Hope you enjoy them.

See my other data mining videos here

Saturday, November 6, 2010

A five-part video series on text mining with RapidMiner starts Monday

Stay tuned.

There will be five videos, with a sample application based on a popular job posting board:

loading text into RapidMiner (paste, file, group of files in folders, database)
processing text in RapidMiner (strip html, tokenize, n-grams, stemming, stopwords, frequency tables)
word vectorization and association rules with text
calculating the similarity between documents, clustering
automatically classifying documents and determining which words are important
