Saturday, December 31, 2011
75,000 pageviews this year! Thanks to everyone for visiting. I will post some new material in the new year.
Have a safe and fun 2012.
Neil
Friday, November 4, 2011
My new blog about learning ExtJS
I have a new blog. It's about learning to use ExtJS, a great JavaScript library for building rich internet applications. Here it is:
http://extjs-tutorials.blogspot.com/
Check it out. Thanks!
Don't worry, I'll keep posting here too.
Sunday, October 9, 2011
How Obama's data-crunching prowess may get him re-elected
An article on CNN describes how the Obama 2012 campaign has hired data miners and statisticians to help boost fundraising and support.
http://www.cnn.com/2011/10/09/tech/innovation/obama-data-crunching-election/index.html?hpt=hp_c1
Saturday, October 8, 2011
Text Analytics with RapidMiner Part 6 of 6 - Applying the Model to New Documents
After my last series, I got a lot of questions about how to apply a model to new data, so here is the real final installment in the series.
I show how to save a wordlist and model to the repository, then read them back and apply them to new documents that RapidMiner hasn't seen before. The model correctly labels 11 of the 12 documents.
Files from the video.
Friday, September 2, 2011
Saturday, August 27, 2011
RapidMiner ETL - Transforming Attributes with Functions
In this video I show how to transform attributes in RapidMiner using functions such as log, sqrt, and absolute value, and how to multiply columns together.
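To give a taste of what's in the video, RapidMiner's Generate Attributes operator takes pairs of a new attribute name and an expression; a rough sketch of the kinds of expressions I mean (the attribute names here are made up):
log_price = log(price)
root_area = sqrt(area)
size_gap = abs(height - width)
volume = width * height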
RapidMiner ETL - Normalizing, Discretizing, Recoding
In this video I show how to normalize an attribute (including z-normalization), how to discretize a column, and how to recode values.
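For reference, z-normalization rescales each value as z = (x - mean) / standard deviation, so the transformed attribute has mean 0 and standard deviation 1.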
Thursday, August 25, 2011
RapidMiner ETL - Sampling, Selecting Rows, Attributes
In this video I show how to sample rows, including balancing class labels and bootstrap sampling (drawing rows with replacement, so the same row can appear more than once). I also show how to filter rows by value and select a subset of attributes.
You can get the dataset here
RapidMiner ETL - Combining Datasets
In this video, I show how to combine multiple datasets into one: joining columns (matching rows on a shared key, like a database join) and appending rows (stacking examples from datasets with the same attributes).
And We're Back. A video series on ETL with RapidMiner
Back with some more videos! Sorry for the long wait, and thanks for your patience.
This series is on ETL: Extract, Transform, Load with RapidMiner.
The first video shows how to combine multiple datasets into one, by joining columns and appending rows.
The second video is on sampling and selecting rows and attributes.
More videos coming soon.
Sunday, April 10, 2011
Monday, April 4, 2011
Web Scraping with RapidMiner and XPath
In this video I show how to load 500 HTML files from a previous web crawl, loop through each of them, use XPath to grab values from each page, and put the results in a data table for later analysis.
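The video does all of this with RapidMiner operators, but the same idea fits in a few lines of Python with lxml if you want a standalone version; here is a minimal sketch (the folder name is made up, and I just count links rather than grab a site-specific value):
import glob
from lxml import html

rows = []
for path in glob.glob("crawl/*.html"):  # the saved pages from the crawl
    tree = html.parse(path)  # parse one HTML file
    title = tree.xpath("string(//title)")  # a single value from the page
    links = tree.xpath("//a/@href")  # all hyperlink URLs on the page
    rows.append({"file": path, "title": title, "link_count": len(links)})
# 'rows' is now one record per page, ready to dump into a data table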
Part 2: Web Crawling with RapidMiner
Web Crawling with RapidMiner
Here is part 2 of my series of videos on web crawling with RapidMiner. In this video I show how to crawl about 500 pages from a site, and discuss user agents, crawling rules, and robot exclusion files.
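For reference, a robot exclusion file is just a plain-text file named robots.txt at the site root; User-agent names a crawler (* means all crawlers), and Disallow lists paths they shouldn't fetch. A minimal example (the path is hypothetical):
User-agent: *
Disallow: /private/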
Part 2: Web Crawling with RapidMiner
Sunday, April 3, 2011
More XPath Goodness
Got a RapidMiner crawling/scraping video coming up, but for now, here are some more XPath ideas to play with:
//*
return all nodes
//*[contains(., 'Search Text')]
return all nodes whose content contains 'Search Text' (the search is case-sensitive)
//div[@id='div1']/following-sibling::*
return the siblings that follow a specific node (not sure if this works in RapidMiner)
//div[@id='div1']/..
return the parent node of a specific node
in RapidMiner, precede all element names with "h:" (pages are converted to XHTML, so elements live in a namespace), for example: //h:div[@class='abc']/h:a
Sunday, February 27, 2011
Web scraping with Google Spreadsheets and XPath
This is part one of a series of video tutorials on web scraping and web crawling.
In this first video, I show how to grab parts of a web page (scraping) using Google Docs Spreadsheets and XPath.
Google Spreadsheets has a nice function called importXML which will read in a web page. You can then apply an XPath expression to that page to grab various parts of it, such as one particular value or all of the hyperlinks. This is convenient, as your data ends up in a spreadsheet you can easily download to Excel.
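For example, putting this formula in a cell (the URL is just a placeholder) fills a column with every hyperlink URL on the page:
=importXML("http://www.example.com", "//a/@href")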
Watch the video here:
Part B of the video is here (sorry about the crap sound, working on it):
Useful XPaths:
//a
grabs all the anchors (hyperlinks) in a document
//a/@href
grabs all the URLs in hyperlinks in a document
//div[starts-with(@class, 'left')]
grabs all the div elements whose CSS class starts with 'left'
//input[@type='text']/..
grabs the parent element of each text input element
count(//p)
returns the number of paragraph elements in a page
//a[contains(@href, 'craigslist')]/@href
grabs the URLs of all hyperlinks whose href contains 'craigslist'
//blockquote/p[not(@align='center')]
finds all the paragraphs inside blockquotes that are NOT center-aligned
You can read more about XPath here:
https://developer.mozilla.org/en/XPath/Functions
and here:
http://www.w3schools.com/XPath/xpath_syntax.asp
Part 1: Web scraping with Google Spreadsheets and XPath
Part 2: Web Crawling with RapidMiner
Part 3: Web Scraping with RapidMiner and XPath
Part 4: Web Scraping AJAX Pages
Friday, January 7, 2011
A Data Explosion Remakes Retailing
From the New York Times:
http://www.nytimes.com/2010/01/03/business/03unboxed.html
"Retailing is emerging as a real-world incubator for testing how computer firepower and smart software can be applied to social science — in this case, how variables like household economics and human behavior affect shopping."
Monday, January 3, 2011
Computers That Trade on the News
I missed this one around Christmas time...must have been the drive up to Prince George.
In the NY Times:
"The number-crunchers on Wall Street are starting to crunch something else: the news. Math-loving traders are using powerful computers to speed-read news reports, editorials, company Web sites, blog posts and even Twitter messages — and then letting the machines decide what it all means for the markets."