Saturday, December 31, 2011
75,000 pageviews this year! Thanks to everyone for visiting. I will post some new material in the new year.
Have a safe and fun 2012.
Neil
Friday, November 4, 2011
My new blog about learning ExtJS
I have a new blog. It's about learning to use ExtJS, a great JavaScript library for building rich internet applications. Here it is:
http://extjs-tutorials.blogspot.com/
Check it out. Thanks!
Don't worry, I'll keep posting here too.
Sunday, October 9, 2011
How Obama's data-crunching prowess may get him re-elected
An article on CNN describes how the Obama 2012 campaign has hired data miners and statisticians to help boost fundraising and support.
http://www.cnn.com/2011/10/09/tech/innovation/obama-data-crunching-election/index.html?hpt=hp_c1
Saturday, October 8, 2011
Text Analytics with RapidMiner Part 6 of 6 - Applying the Model to New Documents
After my last series, I got a lot of questions about how to apply a model to new data, so here is the real final installment in the series.
I show how to save a wordlist and model to the repository, then read them back and apply them to new documents that RapidMiner hasn't seen before. The model correctly labels 11 of the 12 documents.
Files from the video.
Friday, September 2, 2011
Saturday, August 27, 2011
RapidMiner ETL - Transforming Attributes with Functions
In this video I show how to transform attributes in RapidMiner using functions such as log, sqrt, and absolute value, and how to multiply columns together.
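To give a taste of what's in the video, RapidMiner's Generate Attributes operator takes pairs of a new attribute name and an expression; a rough sketch of the kinds of expressions I mean (the attribute names here are made up):
log_price = log(price)
root_area = sqrt(area)
size_gap = abs(height - width)
volume = width * height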
RapidMiner ETL - Normalizing, Discretizing, Recoding
In this video I show how to normalize an attribute (including z-normalization), how to discretize a column, and how to recode values.
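For reference, z-normalization rescales each value as z = (x - mean) / standard deviation, so the transformed attribute has mean 0 and standard deviation 1.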
Thursday, August 25, 2011
RapidMiner ETL - Sampling, Selecting Rows, Attributes
In this video I show how to sample rows, including balancing class labels and bootstrap sampling (drawing rows with replacement, so the same row can appear more than once). I also show how to filter rows by value and select a subset of attributes.
You can get the dataset here
RapidMiner ETL - Combining Datasets
In this video, I show how to combine multiple datasets into one: joining columns (matching rows on a shared key, like a database join) and appending rows (stacking examples from datasets with the same attributes).
And We're Back. A video series on ETL with RapidMiner
Back with some more videos! Sorry for the long wait, and thanks for your patience.
This series is on ETL: Extract, Transform, Load with RapidMiner.
The first video shows how to combine multiple datasets into one, by joining columns and appending rows.
The second video is on sampling and selecting rows and attributes.
More videos coming soon.
Sunday, April 10, 2011
Monday, April 4, 2011
Web Scraping with RapidMiner and XPath
In this video I show how to load 500 HTML files from a previous web crawl, loop through each of them, use XPath to grab values from each page, and put the results in a data table for later analysis.
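The video does all of this with RapidMiner operators, but the same idea fits in a few lines of Python with lxml if you want a standalone version; here is a minimal sketch (the folder name is made up, and I just count links rather than grab a site-specific value):
import glob
from lxml import html

rows = []
for path in glob.glob("crawl/*.html"):  # the saved pages from the crawl
    tree = html.parse(path)  # parse one HTML file
    title = tree.xpath("string(//title)")  # a single value from the page
    links = tree.xpath("//a/@href")  # all hyperlink URLs on the page
    rows.append({"file": path, "title": title, "link_count": len(links)})
# 'rows' is now one record per page, ready to dump into a data table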
Part 2: Web Crawling with RapidMiner
Web Crawling with RapidMiner
Here is part 2 of my series of videos on web crawling with RapidMiner. In this video I show how to crawl about 500 pages from a site, and discuss user agents, crawling rules, and robot exclusion files.
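For reference, a robot exclusion file is just a plain-text file named robots.txt at the site root; User-agent names a crawler (* means all crawlers), and Disallow lists paths they shouldn't fetch. A minimal example (the path is hypothetical):
User-agent: *
Disallow: /private/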
Part 2: Web Crawling with RapidMiner
Sunday, April 3, 2011
More XPath Goodness
Got a RapidMiner crawling/scraping video coming up, but for now, here are some more XPath ideas to play with:
//*
return all nodes
//*[contains(., 'Search Text')]
return all nodes whose content contains 'Search Text' (the search is case-sensitive)
//div[@id='div1']/following-sibling::*
return the siblings that follow a specific node (not sure if this works in RapidMiner)
//div[@id='div1']/..
return the parent node of a specific node
in RapidMiner, precede all element names with "h:" (pages are converted to XHTML, so elements live in a namespace), for example: //h:div[@class='abc']/h:a
Sunday, February 27, 2011
Web scraping with Google Spreadsheets and XPath
This is part one of a series of video tutorials on web scraping and web crawling.
In this first video, I show how to grab parts of a web page (scraping) using Google Docs Spreadsheets and XPath.
Google Spreadsheets has a nice function called importXML which will read in a web page. You can then apply an XPath expression to that page to grab various parts of it, such as one particular value or all of the hyperlinks. This is convenient, as your data ends up in a spreadsheet you can easily download to Excel.
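For example, putting this formula in a cell (the URL is just a placeholder) fills a column with every hyperlink URL on the page:
=importXML("http://www.example.com", "//a/@href")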
Watch the video here:
Part B of the video is here (sorry about the crap sound, working on it):
Useful XPaths:
//a
grabs all the anchors (hyperlinks) in a document
//a/@href
grabs all the URLs in hyperlinks in a document
//div[starts-with(@class, 'left')]
grabs all the div elements whose CSS class starts with 'left'
//input[@type='text']/..
grabs the parent element of each text input element
count(//p)
returns the number of paragraph elements in a page
//a[contains(@href, 'craigslist')]/@href
grabs the URLs of all hyperlinks whose href contains 'craigslist'
//blockquote/p[not(@align='center')]
finds all the paragraphs inside blockquotes that are NOT center-aligned
You can read more about XPath here:
https://developer.mozilla.org/en/XPath/Functions
and here:
http://www.w3schools.com/XPath/xpath_syntax.asp
Part 1: Web scraping with Google Spreadsheets and XPath
Part 2: Web Crawling with RapidMiner
Part 3: Web Scraping with RapidMiner and XPath
Part 4: Web Scraping AJAX Pages
Friday, January 7, 2011
A Data Explosion Remakes Retailing
From the New York Times:
http://www.nytimes.com/2010/01/03/business/03unboxed.html
"Retailing is emerging as a real-world incubator for testing how computer firepower and smart software can be applied to social science — in this case, how variables like household economics and human behavior affect shopping."
Monday, January 3, 2011
Computers That Trade on the News
I missed this one around Christmas time...must have been the drive up to Prince George.
In the NY Times:
"The number-crunchers on Wall Street are starting to crunch something else: the news. Math-loving traders are using powerful computers to speed-read news reports, editorials, company Web sites, blog posts and even Twitter messages — and then letting the machines decide what it all means for the markets."