Vancouver Data Blog by Neil McGuigan: Text Analytics With Rapidminer Part 4 of 6

Friday, November 12, 2010

Text Analytics With Rapidminer Part 4 of 6 - Document Similarity and Clustering

Thanks for watching.

This is part four of a six-part series on text mining in RapidMiner. This video describes how to calculate the TF-IDF score for terms, calculate the similarity between documents, and cluster documents together. This can be useful for finding duplicate documents or database entries, and for showing similar documents on a web page.

In the context of a job board, you could use it to find an interesting job, and then to find related ones as well.

Topics covered:

creating a word vector and calculating the terms' TF-IDF scores
calculating the similarity between documents using their cosine similarity
clustering documents using the K-Means algorithm
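
The three topics above can be sketched outside RapidMiner in plain Python. This is only an illustration, not the RapidMiner process from the video. The whitespace tokenizer, the weighting (term count divided by document length, times log(N / document frequency)), the farthest-point choice of starting centroids, and the sample job titles are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (assumed weighting, see above)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def to_dense(vectors):
    """Turn sparse dicts into equal-length lists over a shared, sorted vocabulary."""
    vocab = sorted({term for vec in vectors for term in vec})
    return [[vec.get(term, 0.0) for term in vocab] for vec in vectors]


def kmeans(points, k, iterations=10):
    """K-Means with Euclidean distance; returns one cluster index per point."""
    # Deterministic seeding (an assumption): start with the first point, then
    # repeatedly add the point farthest from the centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical job-board titles, invented for this example.
docs = [
    "java developer wanted for web team",
    "senior java developer for web projects",
    "registered nurse for hospital night shift",
    "hospital seeks registered nurse",
]
vectors = tfidf_vectors(docs)
labels = kmeans(to_dense(vectors), k=2)
print(cosine_similarity(vectors[0], vectors[1]))
print(labels)
```

On a real corpus you would want proper tokenization, stop-word removal, and a sparse matrix library. The point here is only to show how the word vector, the cosine similarity, and the clustering step fit together.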

If you're not familiar with the free and open-source RapidMiner, see my other videos on my YouTube channel.

Up next, automatically categorizing documents.

15 comments:

Anonymous, November 12, 2010 at 8:45 PM
Great job! Thx!

Anonymous, December 2, 2010 at 11:43 AM
These are really great tutorials!!! Thanks!
I have a question, however: is it possible to evaluate a clustering algorithm somehow to get something like precision and recall (as for classification), or to apply cross-validation techniques to clustering to get some indexes for further analysis?
Any ideas or explanations will be appreciated.
Thanks,
Chris

Neil McGuigan, January 25, 2011 at 11:55 PM
The best way to evaluate clustering is with an economics approach. In economics, the most efficient decision is when the marginal (incremental) cost equals the marginal benefit. Keep increasing the number of clusters until the marginal cost of clustering equals the marginal benefit of clustering.

For example, take marketing segmentation. If you have one, identical, "campaign" for all customers, it will be cheap, but ineffective. If you have an individualized campaign for each customer, it will be effective, but prohibitively costly.

If you go from 1 to 2 segments though, you will improve your response rate/revenue, with a slight increase in cost. Keep segmenting until the marginal cost of segmenting equals the marginal benefit of segmenting.
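
A minimal sketch of that stopping rule, assuming you can estimate total benefit and total cost for a given number of segments. The `benefit` and `cost` functions and their numbers below are hypothetical, made up only to show the mechanics.

```python
def choose_segment_count(benefit, cost, max_segments):
    """Add segments while the next one brings more benefit than it costs."""
    k = 1
    while k < max_segments:
        marginal_benefit = benefit(k + 1) - benefit(k)
        marginal_cost = cost(k + 1) - cost(k)
        if marginal_benefit <= marginal_cost:
            break  # the next segment no longer pays for itself
        k += 1
    return k


# Hypothetical curves: revenue with diminishing returns, constant cost per segment.
def benefit(k):
    return 1000 * (1 - 0.5 ** k)


def cost(k):
    return 50 * k


best_k = choose_segment_count(benefit, cost, max_segments=20)
print(best_k)
```

In practice the benefit curve would come from measured response rates or revenue per segment rather than a formula.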

Anonymous, March 2, 2011 at 10:01 AM
Hi, I have a set of documents in text files. I would like to calculate TF-IDF and rank them using RapidMiner. The problem is that I am not that familiar with RapidMiner. Can you guide me? Thanks

Neil McGuigan, March 2, 2011 at 10:24 AM
Please watch the rest of my videos on my YouTube channel (top right of my blog). When you're on YouTube, you will also see RapidMiner videos by other people. Good luck!

Anonymous, August 11, 2011 at 4:34 AM
Hi,

Apart from document clustering, is it possible to cluster the words present in the documents ? If yes then please let me know the procedure to achieve it.

Thanks & Regards,
Vinay

Y.S. Chen, October 3, 2011 at 10:29 PM
Thank you for teaching. I have a question to ask.
I get TF-IDF values from data read from a database.
How do I write them to a CSV file?
Thank you

Ahmed Khamassi, November 24, 2011 at 12:55 PM
This comment has been removed by the author.

Anonymous, December 7, 2011 at 12:42 AM
My gosh this is outstanding. Thanks!

Earl Tharpe, January 4, 2012 at 8:17 AM
Thank you for the excellent tutorials. I have a question on this one - I followed your instructions and got the document comparison to work well.

However, as I scaled up the number of documents compared (upwards of 1000), RapidMiner started to crash. In the tutorial, you mention that this solution is appropriate for smaller numbers of documents. Any thoughts on how to do something similar with larger numbers?

Regards & thank you

Anonymous, January 17, 2012 at 4:42 AM
Excellent tutorials on a tough subject.

Seth, June 21, 2012 at 3:03 PM
Great tutorials. Thanks a million, Neil. Was dying for a way to quantify document relatedness.

Lars, June 27, 2013 at 10:57 AM
Hi Neil,

Is it possible to get the data set you use within the tutorial? I wanna create a document similarity project for my university. Hope to hear from you soon.

Cheers, Lars

Anonymous, February 12, 2014 at 6:43 AM
I want a tool to find the semantic similarity between two sentences. If anyone knows of one, please tell me.