A non-technical look at data mining
"Increasingly, algorithms are used to determine whether we can get access to credit, insurance and government services. They are posing a challenge to human decision-making in the arts. They are being used by prospective employers to decide if we should be hired. They can determine whether your online business will succeed or fail, and they have revolutionized the world of high finance."
Hat tip to Ron C!
Tuesday, November 30, 2010
Thursday, November 18, 2010
Indeed Job Trends: More C# jobs than C++
Indeed is a slick job board. I'm sure I am behind the times, but I just found their job trends application, much like Google Trends.
Here is C++ versus Java versus C# (USA only):
And some others...
Friday, November 12, 2010
Text Analytics With RapidMiner Part 5 of 6 - Automatic Document Categorization
This is the second-to-last installment of a six-part series on text mining in RapidMiner. This video describes how to automatically categorize documents. This could be useful for a research project or, say, in finance.
You could use it to classify documents as "positive" or "negative", thus doing sentiment analysis. You could do it with financial news text, and classify documents as "stock went up" or "stock went down" after the release, and make (short-term) predictions of future stock movements. You can also see which words are important discriminants. Once you've trained a learning algorithm, you can use it on unseen data.
Topics covered:
- Cross-validation
- The nearest neighbor learning algorithm
- The naive Bayes learning algorithm
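If you'd like to see the idea outside the RapidMiner GUI, here is a rough Python sketch of the three topics above. It is not what the video does step by step: the tiny review data set, the function names, the Laplace smoothing, and the choice of Jaccard overlap for the nearest-neighbor distance are all my own assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Count classes and words. examples is a list of (tokens, label) pairs."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocab = {t for tokens, _ in examples for t in tokens}
    return class_counts, word_counts, vocab

def predict_naive_bayes(model, tokens):
    """Pick the class with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            if t in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def predict_nearest_neighbor(examples, tokens):
    """1-nearest neighbor: copy the label of the most similar training document."""
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0
    return max(examples, key=lambda ex: jaccard(ex[0], tokens))[1]

def leave_one_out_accuracy(examples, fit_predict):
    """Cross-validation with one document held out per fold."""
    correct = 0
    for i, (tokens, label) in enumerate(examples):
        training = examples[:i] + examples[i + 1:]
        correct += int(fit_predict(training, tokens) == label)
    return correct / len(examples)

# Made-up toy data, purely for illustration.
data = [
    ("great phone love it".split(), "positive"),
    ("love the great battery".split(), "positive"),
    ("great camera love this".split(), "positive"),
    ("terrible screen hate it".split(), "negative"),
    ("hate the terrible speaker".split(), "negative"),
    ("terrible charger hate this".split(), "negative"),
]

nb_accuracy = leave_one_out_accuracy(
    data, lambda train, tokens: predict_naive_bayes(train_naive_bayes(train), tokens)
)
nn_accuracy = leave_one_out_accuracy(data, predict_nearest_neighbor)

# Once trained on everything, the model can be applied to unseen text.
model = train_naive_bayes(data)
unseen_label = predict_naive_bayes(model, "love this great phone".split())
print(nb_accuracy, nn_accuracy, unseen_label)
```

With real documents you would want far more training data and proper text processing first; the point here is only the shape of train, cross-validate, then apply to unseen data.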
Here is part 6
If you're not familiar with RapidMiner, see my other videos on my YouTube channel.
Thanks for watching. Leave a comment for what you'd like to see next!
Also, check out the awesome RapidMiner finance videos on Neural Market Trends.
Text Analytics With RapidMiner Part 4 of 6 - Document Similarity and Clustering
Thanks for watching.
This is part four of a six-part series on text mining in RapidMiner. This video describes how to calculate the TF-IDF score for terms, calculate the similarity between documents, and cluster documents together. This can be useful for finding duplicate documents or database entries, and to show similar documents on a web page.
In the context of a job board, you could use it to find an interesting job, and then to find related ones as well.
Topics covered:
- creating a word vector and calculating the terms' TF-IDF scores
- calculating the similarity between documents using their cosine similarity
- clustering documents using the K-Means algorithm
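For the curious, the three steps above can be sketched in a few lines of plain Python. This is a minimal illustration under my own assumptions, not RapidMiner's implementation: the four toy documents are made up, TF-IDF is computed as (term count / document length) * log(N / document frequency), and K-Means is started from two hand-picked documents so the result is repeatable.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Return the vocabulary and one TF-IDF vector per document."""
    n_docs = len(token_lists)
    vocab = sorted({t for tokens in token_lists for t in tokens})
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append(
            [counts[t] / len(tokens) * math.log(n_docs / doc_freq[t]) for t in vocab]
        )
    return vocab, vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, initial_centroids, iterations=10):
    """Plain K-Means: assign to the nearest centroid, then recompute the means."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        for k in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Made-up job titles, purely for illustration.
docs = [
    "data mining analyst",
    "data mining engineer",
    "pastry chef bakery",
    "pastry chef restaurant",
]
vocab, vectors = tfidf_vectors([d.split() for d in docs])

similar = cosine_similarity(vectors[0], vectors[1])    # two data mining jobs
unrelated = cosine_similarity(vectors[0], vectors[2])  # no words in common
labels = kmeans(vectors, [vectors[0], vectors[2]])
print(similar, unrelated, labels)
```

In practice K-Means is usually started from random centroids and run several times; fixing the starting points here just keeps the toy example deterministic.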
If you're not familiar with the free and open-source RapidMiner, see my other videos on my YouTube channel.
Up next, automatically categorizing documents.
Thursday, November 11, 2010
Wednesday, November 10, 2010
The Data Analytics Boom, in Forbes
Well, if it's in Forbes, it must be true.
Why analytics is taking off right now:
- Total Quality and Six-Sigma taught statistics to engineers.
- Goldman Sachs made mad money with statistical finance. This is a good read.
- There's a crap-load of data available now. 2 exabytes a day of new data, though frankly it's mostly cat videos and Bieber tweets.
- Cheap computers, cheap cloud computing, and good open source software like R and RapidMiner
- read more below...
Frankly, I think Competing on Analytics had a fair amount to do with it, as did the million dollar Netflix Prize.
How Canada became an open data and data journalism powerhouse
Open data and data journalism are blowing up.
http://www.guardian.co.uk/news/datablog/2010/nov/09/canada-open-data
Expect some good open data analysis coming up here soon!
Text Analytics With RapidMiner Part 3 of 6 - Association Rule Learning
Thanks for watching, and welcome Reddit!
This is part three of a six-part series on text mining in RapidMiner. This video describes how to find association rules in a collection of documents. For example, if a job posting includes "data" and "mining", then it is also likely to include "RapidMiner". This is known as market basket analysis when applied to grocery stores :)
In this example, it can be useful for finding phrases and concepts that are important to job recruiters. You can use these phrases and concepts in your cover letter and resume, and increase your chances of getting them read.
Topics covered:
- reading documents from a database
- processing the text
- creating a word vector
- finding frequent itemsets using the FP-Growth algorithm
- finding association rules
- visualizing association rules
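To make "support" and "confidence" concrete, here is a small Python sketch. Note that it does not implement FP-Growth: it simply counts every candidate itemset by brute force, which gives the same frequent itemsets on a toy example but would be far too slow on real data. The five word sets, the thresholds, and the function names are assumptions I made up for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force search: keep every itemset whose support meets the threshold."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            support = sum(1 for t in transactions if cset <= t) / n
            if support >= min_support:
                result[cset] = support
    return result

def association_rules(itemsets, min_confidence):
    """Split each frequent itemset into premise -> conclusion rules."""
    rules = []
    for itemset, support in itemsets.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for premise in combinations(sorted(itemset), k):
                premise = frozenset(premise)
                # Every subset of a frequent itemset is itself frequent,
                # so the premise is guaranteed to be in the dictionary.
                confidence = support / itemsets[premise]
                if confidence >= min_confidence:
                    rules.append((premise, itemset - premise, support, confidence))
    return rules

# Each "transaction" is the set of words found in one made-up job posting.
postings = [
    {"data", "mining", "rapidminer"},
    {"data", "mining", "rapidminer", "sql"},
    {"data", "mining", "sql"},
    {"data", "sql"},
    {"java", "sql"},
]

itemsets = frequent_itemsets(postings, min_support=0.4)
rules = association_rules(itemsets, min_confidence=0.6)

for premise, conclusion, support, confidence in rules:
    print(sorted(premise), "->", sorted(conclusion),
          f"support={support:.2f} confidence={confidence:.2f}")
```

Support is the fraction of postings containing all the words in the rule, and confidence is how often the conclusion appears given the premise. In this toy data, "data" and "mining" together imply "rapidminer" in two of the three postings that contain them.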
If you're not familiar with RapidMiner, see my other videos on my YouTube channel.
Up next, calculating the similarity between documents.
Tuesday, November 9, 2010
Text Analytics With RapidMiner Part 2 of 6 - Processing Text
Wow, several hundred hits yesterday, thanks for watching everyone!
This is part two of a six-part video series on text mining in RapidMiner. This video describes how to process text to get a word frequency table. Topics covered include:
- reading hundreds of documents from a database into RapidMiner
- stripping HTML from content from a popular job posting board
- tokenizing a document (splitting it into words)
- removing overly common words (stopwords)
- finding roots of words (stemming)
- finding phrases in documents (n-grams)
- and generating a word frequency table that describes important words in your documents
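The processing steps above (minus the database read) look roughly like this in plain Python. Treat it as a sketch under assumptions: the stopword list is tiny, the stemmer is a crude suffix chopper rather than a real algorithm like Porter's, and the two sample postings are invented.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "with", "is", "we", "are"}

def strip_html(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    """Chop a few common suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bigrams(tokens):
    """Join neighboring tokens into two-word phrases (n-grams with n = 2)."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def word_frequencies(documents):
    """Build one frequency table of stems and bigrams over all documents."""
    counts = Counter()
    for doc in documents:
        tokens = [
            crude_stem(t) for t in tokenize(strip_html(doc)) if t not in STOPWORDS
        ]
        counts.update(tokens)
        # Bigrams are built after stopword removal, so they can skip over stopwords.
        counts.update(bigrams(tokens))
    return counts

# Made-up postings, purely for illustration.
postings = [
    "<p>We are hiring a <b>data mining</b> analyst.</p>",
    "<div>Data mining and text mining skills required.</div>",
]
freq = word_frequencies(postings)
print(freq.most_common(5))
```

The order of the steps matters: stripping HTML has to come before tokenizing, and whether you remove stopwords before or after building n-grams changes which phrases you end up with.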
Here's the video:
Your feedback and sharing are appreciated!
Up tomorrow, finding association rules in documents.
Monday, November 8, 2010
Join the International Open Data Hackathon, Dec 4
This is gonna be big. 37 cities signed up already. Add your app ideas and sign up here:
http://www.opendataday.org/
Memphis cuts crime 31% with predictive analytics
Can Vancouver do the same?
I haven't seen much on what the VPD does. Do you have any links to share?
Here's the story
Found an answer:
"Vancouver-based Police Records Information Management Environment (PRIME) Inc. will start using IBM’s entity analytics to cut duplicates in its province-wide records sharing platform. How Canada’s privacy regulations make for fertile ground for IBM’s anonymous analytics"
Link
Text Analytics with RapidMiner Part 1 of 6 - Loading Text
I'll be releasing a new video on text mining with RapidMiner every day this week.
They're all about 10 minutes long, and go into a fair amount of detail, and should be easy to understand. Your feedback is appreciated!
Here is the first one. It's about loading text into RapidMiner in a variety of ways, from copy and paste, to HTML files, to database reads.
*NOTE: You may need to use the Nominal To Text operator to turn your text field into a field that RapidMiner understands as "text". It's under Data Transformation, Type Conversion.
Later this week:
Tuesday: Processing Text in RapidMiner - tokenizing, stripping HTML, stemming, stopwords, n-grams, and word frequency tables.
Wednesday: Association rules with text in RapidMiner - making word vectors, finding frequent item-sets and high-confidence association rules in text documents.
Thursday: Finding similar documents: how to automatically calculate the similarity between documents. TF-IDF, cosine similarity and K-Means clustering are covered.
Friday: Automatic classification: How to classify documents into classes (like positive/negative reviews, or spam/not spam or sports/finance/leisure news), and which words are important.
NEW: Applying A Model To New Documents
Hope you enjoy them.
See my other data mining videos here
Saturday, November 6, 2010
A five-part video series on text mining with RapidMiner starts Monday
Stay tuned.
There will be five videos, with a sample application based on a popular job posting board:
- loading text into RapidMiner (paste, file, group of files in folders, database)
- processing text in RapidMiner (strip html, tokenize, n-grams, stemming, stopwords, frequency tables)
- word vectorization and association rules with text
- calculating the similarity between documents, clustering
- automatically classifying documents and determining which words are important