Vancouver Data Blog by Neil McGuigan: December 2010

Monday, December 27, 2010

10,000 views on my YouTube videos, 5,000 on my blog. Thanks, everyone!

Blown away by the number of visitors!

While I'm here, here's a good article from Wired magazine about AI:

The A.I. Revolution Is On

Thursday, December 16, 2010

Next video series: Web crawling and scraping

I'll be working on a video series on web crawling and scraping over Christmas, for release at the end of December or the first week of January at the latest.

Web crawling and social network analysis were neck and neck on the poll, with social slightly ahead, but I am going to do web crawling first as I'm working on a web crawling project, so it will be fresher in my mind.

Tuesday, December 14, 2010

How to Filter By Value in RapidMiner

Use the Filter Examples operator (Data Transformation > Filtering) on an example set.

Set condition class to attribute_value_filter.

Set parameter string to an expression of the form attribute=value.

Example:

Category=healthcare

or

Category=customer service|healthcare

Both attribute and value are case-sensitive.
Spaces are allowed.
Use the | character for the "or" operator.
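
To make those rules concrete, here's a rough Python sketch of the same filtering logic. It is not RapidMiner code: the function name filter_by_value and the sample rows are made up for illustration, and I'm assuming each value is compared for exact equality.

```python
def filter_by_value(rows, parameter_string):
    """Keep rows whose attribute equals one of the |-separated values."""
    # Split "attribute=value1|value2" at the first "=" only.
    attribute, _, value_part = parameter_string.partition("=")
    allowed = value_part.split("|")
    # Plain string comparison, so attribute names and values are case-sensitive.
    return [row for row in rows if row.get(attribute) in allowed]


# Hypothetical sample data, one dict per example (row).
rows = [
    {"Category": "healthcare"},
    {"Category": "customer service"},
    {"Category": "Healthcare"},
    {"Category": "retail"},
]

matches = filter_by_value(rows, "Category=customer service|healthcare")
print(matches)
```

Only the first two rows are kept: "Healthcare" with a capital H is treated as a different value, and the space in "customer service" is fine.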

Custom stemming dictionary

You can create your own stemming dictionary in RapidMiner.

Add the Text Processing -> Stemming -> Stem (Dictionary) operator, and choose your dictionary file (plain text).

Your format should be like this:

stem:inflection

stem:inflection

Example:

fish:fished

will turn fished into fish.

You can also use wildcards:

fish:fish.*

will turn fished, fishes, fishing or anything beginning with fish into fish.

You should put the longer of two similar stems at the top, since a pattern like compute.* also matches computer, computers and so on. For example, to stem these words correctly:

computer, computerise, computerize, computerized, computerised, computers, compute, computed, computes

You should use

computer:computer.*

compute:compute.*

and not

compute:compute.*

computer:computer.*

assuming computer and compute are not the same stem.
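
Here's a small Python sketch of why the order matters. It is only a model of the behaviour described above, not how RapidMiner implements it: I'm assuming rules are tried from top to bottom, the first match wins, and each pattern has to match the whole word. The names load_rules and stem are hypothetical.

```python
import re


def load_rules(dictionary_text):
    """Parse 'stem:pattern' lines into an ordered list of (stem, regex) pairs."""
    rules = []
    for line in dictionary_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rules
        stem_value, pattern = line.split(":", 1)
        rules.append((stem_value, re.compile(pattern)))
    return rules


def stem(word, rules):
    """Return the stem of the first rule whose pattern matches the whole word."""
    for stem_value, pattern in rules:
        if pattern.fullmatch(word):
            return stem_value
    return word  # no rule matched, leave the word unchanged


good_order = load_rules("computer:computer.*\ncompute:compute.*")
bad_order = load_rules("compute:compute.*\ncomputer:computer.*")

for word in ["computers", "computerised", "computed"]:
    print(word, stem(word, good_order), stem(word, bad_order))
```

With the longer stem first, computers and computerised become computer and computed becomes compute. With the order reversed, compute.* catches everything, so all three words end up as compute.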

A regular expression to find "word A near word B" in RapidMiner

You can use the Text Processing->Extract Information operator to match regular expressions.

If you put the Extract Information operator inside a Process Documents operator, it will add a column to your dataset with the results of the match. Turn on the "add meta information" option on the Process Documents operator.

Here's a simple regular expression to find a word near another word:

(word1\W+(?:\w+\W+){1,max}?word2)

Replace "max" with a number. This will produce a match if "word1" has at least one and no more than "max" words between it and "word2". Example:

"The quick brown fox jumped over the lazy dog"

(quick\W+(?:\w+\W+){1,5}?lazy) will match, but

(quick\W+(?:\w+\W+){1,5}?dog) will not (it has 6 words in between)
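
If you want to try the pattern outside RapidMiner, here's a quick sketch using Python's re module, which reads this particular expression the same way. The helper name near_pattern is made up for illustration; it just fills the two words and the maximum gap into the pattern above.

```python
import re


def near_pattern(word1, word2, max_words):
    """Build the 'word1 near word2' regex with 1 to max_words words in between."""
    return re.compile(
        "(" + re.escape(word1)
        + r"\W+(?:\w+\W+){1," + str(max_words) + "}?"
        + re.escape(word2) + ")"
    )


text = "The quick brown fox jumped over the lazy dog"

lazy_match = near_pattern("quick", "lazy", 5).search(text)
dog_match = near_pattern("quick", "dog", 5).search(text)

# 5 words sit between quick and lazy, so this prints the matched span.
print(lazy_match.group(1) if lazy_match else None)
# 6 words sit between quick and dog, so this prints None.
print(dog_match.group(1) if dog_match else None)
```

Raising the limit to 6 makes the second pattern match as well.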

Saturday, December 11, 2010

So long old media, hello new media

"Financial services company Standard & Poor's announced changes yesterday to its S&P 500 stock market index, a widely respected compendium of large-cap U.S. public companies: Netflix is on the list for the first time. In the same announcement, rather poignantly, S&P announced that newspaper giant The New York Times Co. has been demoted to its index of mid-size companies (the MidCap 400), as has photography equipment manufacturer Eastman Kodak."

http://news.cnet.com/8301-13577_3-20025279-36.html

Wednesday, December 8, 2010

Which RapidMiner videos would you like to see next?

There's a poll on the top right of my blog. Let me know what you'd like to see next! Voting ends in one week.

Add comments to this post if you want something that's not on the list.

Thanks

Neil
