Blown away by the number of visitors!
While I'm here, here's a good article from Wired magazine about AI:
The A.I. Revolution Is On
Monday, December 27, 2010
Thursday, December 16, 2010
Next video series: Web crawling and scraping
I'll be working on a video series on web crawling and scraping over Christmas, for release at the end of December or the first week of January at the latest.
Web crawling and social network analysis were neck and neck on the poll, with social slightly ahead, but I am going to do web crawling first as I'm working on a web crawling project, so it will be fresher in my mind.
Tuesday, December 14, 2010
How to Filter By Value in RapidMiner
Use the Filter Examples operator on an exampleset (Data Transformation > Filtering)
set condition class to attribute_value_filter
set parameter string like attribute=value
example:
Category=healthcare
or
Category=customer service|healthcare
- Both attribute and value are case-sensitive
- Spaces are allowed
- Use the | character for the "or" operator.
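To make the matching rules above concrete, here is a minimal Python sketch that imitates them outside RapidMiner. It is only an illustration of the behaviour described in this post (case-sensitive, spaces allowed, | as "or"), not RapidMiner's own implementation; the function names and the sample rows are made up for the example.

```python
def parse_condition(condition):
    """Split 'attribute=value1|value2' into (attribute, set of allowed values)."""
    attribute, _, values = condition.partition("=")
    return attribute, set(values.split("|"))


def filter_examples(rows, condition):
    """Keep rows whose attribute exactly equals one of the allowed values."""
    attribute, allowed = parse_condition(condition)
    return [row for row in rows if row.get(attribute) in allowed]


# Hypothetical example set, one dict per example.
rows = [
    {"Id": 1, "Category": "healthcare"},
    {"Id": 2, "Category": "customer service"},
    {"Id": 3, "Category": "Healthcare"},  # different case, so it never matches
    {"Id": 4, "Category": "retail"},
]

single = filter_examples(rows, "Category=healthcare")
either = filter_examples(rows, "Category=customer service|healthcare")
print([row["Id"] for row in single])
print([row["Id"] for row in either])
```

The row with "Healthcare" is dropped in both cases because the comparison is case-sensitive, and "customer service" matches even though it contains a space.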
Custom stemming dictionary
You can create your own stemming dictionary in RapidMiner.
Add the Text Processing -> Stemming -> Stem (Dictionary) operator, and choose your dictionary file (plain text).
Your format should be like this:
stem:inflection
stem:inflection
example:
fish:fished
will turn fished into fish.
You can also use wildcards:
fish:fish.*
will turn fished, fishes, fishing or anything beginning with fish into fish.
You should put longer versions of similar words at the top. For example, to stem these words correctly:
computer, computerise, computerize, computerized, computerised, computers, compute, computed, computes
You should use
computer:computer.*
compute:compute.*
and not
compute:compute.*
computer:computer.*
assuming computer and compute are not the same stem.
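The effect of rule order can be checked with a small Python sketch. It assumes, as the advice above implies, that rules are tried from top to bottom and the first pattern that matches the whole word wins; that is my reading of the behaviour, not something taken from RapidMiner's source. The helper names are invented for the example.

```python
import re


def load_rules(text):
    """Parse 'stem:pattern' lines into an ordered list of (stem, compiled regex)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stem, _, pattern = line.partition(":")
        rules.append((stem, re.compile(pattern)))
    return rules


def stem_word(word, rules):
    """Return the stem of the first rule whose pattern matches the whole word."""
    for stem, pattern in rules:
        if pattern.fullmatch(word):
            return stem
    return word  # no rule applies, leave the word unchanged


good_order = load_rules("computer:computer.*\ncompute:compute.*")
bad_order = load_rules("compute:compute.*\ncomputer:computer.*")

for word in ["computerised", "computers", "computed"]:
    print(word, stem_word(word, good_order), stem_word(word, bad_order))
```

With the longer pattern first, computerised and computers stem to computer and computed stems to compute. With the order reversed, compute.* catches everything, so all three words end up as compute.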
A regular expression to find "word A near word B" in RapidMiner
You can use the Text Processing->Extract Information operator to match regular expressions.
If you put the Extract Information operator inside a Process Documents operator, it will add a column to your dataset with the results of the match. Turn on the "add meta information" option on the Process Documents operator.
Here's a simple regular expression to find a word near another word:
(word1\W+(?:\w+\W+){1,max}?word2)
This will produce a match if "word1" has at least one and no more than "max" words between it and "word2". Example:
"The quick brown fox jumped over the lazy dog"
(quick\W+(?:\w+\W+){1,5}?lazy)
will match, but
(quick\W+(?:\w+\W+){1,5}?dog)
will not (it has 6 words in between).
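If you want to try the pattern before putting it into RapidMiner, the same expression can be tested in Python, whose re module treats \w, \W, non-capturing groups and lazy quantifiers the same way for this pattern. This is just a quick test harness; near_pattern is a helper name I made up for it.

```python
import re


def near_pattern(word1, word2, max_words):
    """Build the 'word1 near word2' regex with up to max_words words in between."""
    return re.compile(
        r"(%s\W+(?:\w+\W+){1,%d}?%s)"
        % (re.escape(word1), max_words, re.escape(word2))
    )


text = "The quick brown fox jumped over the lazy dog"

lazy_match = near_pattern("quick", "lazy", 5).search(text)    # 5 words in between
dog_match = near_pattern("quick", "dog", 5).search(text)      # 6 words in between
brown_match = near_pattern("quick", "brown", 5).search(text)  # 0 words in between

print(lazy_match.group(1) if lazy_match else None)
print(dog_match)
print(brown_match)
```

The third case shows a detail worth knowing: because the quantifier is {1,max}, two words that sit directly next to each other do not match. Changing it to {0,max} would allow adjacent words as well.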
Saturday, December 11, 2010
So long old media, hello new media
"Financial services company Standard & Poor's announced changes yesterday to its S&P 500 stock market index, a widely respected compendium of large-cap U.S. public companies: Netflix is on the list for the first time. In the same announcement, rather poignantly, S&P announced that newspaper giant The New York Times Co. has been demoted to its index of mid-size companies (the MidCap 400), as has photography equipment manufacturer Eastman Kodak."
http://news.cnet.com/8301-13577_3-20025279-36.html
Wednesday, December 8, 2010
Which RapidMiner videos would you like to see next?
There's a poll on the top right of my blog. Let me know what you'd like to see next! Voting ends in one week.
Add comments to this post if you want something that's not on the list.
Thanks
Neil