Friday, November 12, 2010

Text Analytics With RapidMiner Part 4 of 6 - Document Similarity and Clustering

Thanks for watching.

This is part four of a six-part series on text mining in RapidMiner. This video describes how to calculate the TF-IDF score for terms, calculate the similarity between documents, and cluster documents together. This can be useful for finding duplicate documents or database entries, and to show similar documents on a web page.

In the context of a job board, you could use it to find an interesting job, and then to find related ones as well.

Topics covered:
  • creating a word vector and calculating the terms' TF-IDF scores
  • calculating the similarity between documents using their cosine similarity
  • clustering documents using the K-Means algorithm
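For readers who want to see the arithmetic behind the first two topics outside RapidMiner, here is a minimal Python sketch. It assumes a plain term-frequency times log inverse-document-frequency weighting, which may differ from the exact weighting and normalization RapidMiner applies. The function names (`tfidf_vectors`, `cosine_similarity`) and the three toy job titles are made up for illustration and do not come from the video.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document; each doc is a list of tokens."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Weight = relative term frequency * log(N / document frequency).
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical job titles, already lower-cased and tokenized.
docs = [
    "java developer web applications".split(),
    "senior java developer backend".split(),
    "registered nurse hospital night shift".split(),
]
vectors = tfidf_vectors(docs)
sim_jobs = cosine_similarity(vectors[0], vectors[1])   # two Java postings
sim_other = cosine_similarity(vectors[0], vectors[2])  # Java vs. nursing
print(sim_jobs, sim_other)
```

The two Java postings share terms, so their similarity is above zero, while the Java and nursing postings share none and score zero. A clustering step such as K-Means would then group documents using these same vectors.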

If you're not familiar with the free and open-source RapidMiner, see my other videos on my YouTube channel.

Up next, automatically categorizing documents.


  1. These are really great tutorials!!! Thanks!
    I have a question, however: is it possible to evaluate a clustering algorithm to get something like precision and recall (as for classification)? And how can cross-validation techniques be applied to clustering to get indexes for further analysis?
    Any ideas or explanations would be appreciated.

  2. The best way to evaluate clustering is with an economics approach. In economics, the most efficient decision is when the marginal (incremental) cost equals the marginal benefit. Keep increasing the number of clusters until the marginal cost of clustering equals the marginal benefit of clustering.

    For example, take marketing segmentation. If you have one, identical, "campaign" for all customers, it will be cheap, but ineffective. If you have an individualized campaign for each customer, it will be effective, but prohibitively costly.

    If you go from 1 to 2 segments though, you will improve your response rate/revenue, with a slight increase in cost. Keep segmenting until the marginal cost of segmenting equals the marginal benefit of segmenting.
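The stopping rule described in the reply above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `choose_segment_count`, `example_benefit`, and `example_cost` are hypothetical names, and the revenue and cost figures are invented to show diminishing returns. They are not measurements from any real campaign.

```python
def choose_segment_count(benefit, cost, max_segments):
    """Add segments one at a time while the extra benefit covers the extra cost."""
    k = 1
    while k < max_segments:
        marginal_benefit = benefit(k + 1) - benefit(k)
        marginal_cost = cost(k + 1) - cost(k)
        if marginal_benefit < marginal_cost:
            break  # the next segment would cost more than it brings in
        k += 1
    return k


# Hypothetical figures: revenue gains halve with each new segment,
# while every additional segment costs a flat 40.
def example_benefit(k):
    return 1000 * (1 - 0.5 ** k)


def example_cost(k):
    return 40 * k


best_k = choose_segment_count(example_benefit, example_cost, max_segments=20)
print(best_k)
```

With these made-up numbers the marginal benefit falls from 250 to 125 to 62.5 to 31.25, so the loop stops once it drops below the marginal cost of 40. In practice both curves would have to be estimated from your own data.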

    1. Dear Neil, thanks for your great tutorials. I have a question: how can the appropriateness of the generated clusters be measured?
      How can we quantify the fitness of the clusters?
      Is it possible to compute RMSSTD (root mean square standard deviation) or RS (R-squared) in RapidMiner to evaluate the clustering process?
      I'd really appreciate it if you could answer my question.
      Thanks for your help.

  3. Hi, I have a set of documents saved as text files. I would like to calculate TF-IDF and rank them using RapidMiner. The problem is that I am not very familiar with RapidMiner. Can you guide me? Thanks.

  4. Please watch the rest of my videos on my YouTube channel (top right of my blog). When you're on YouTube, you will also see RapidMiner videos by other people. Good luck!

  5. Hi,

    Apart from document clustering, is it possible to cluster the words present in the documents? If so, please let me know the procedure for doing it.

    Thanks & Regards,

  6. Thank you for teaching. I have a question.
    I get the TF-IDF values from data read from a database.
    How do I write them to a CSV file?
    Thank you.

  7. This comment has been removed by the author.

  8. My gosh this is outstanding. Thanks!

  9. Thank you for the excellent tutorials. I have a question on this one - I followed your instructions and got the document comparison to work well.

    However, as I scaled up the number of documents compared (upwards of 1000), RapidMiner started to crash. In the tutorial, you mention that this solution is appropriate for smaller numbers of documents. Any thoughts on how to do something similar with larger numbers?

    Regards & thank you

  10. Excellent tutorials on a tough subject.

  11. Great tutorials. Thanks a million, Neil. Was dying for a way to quantify document relatedness.

  12. Hi Neil,

    Is it possible to get the data set you use within the tutorial? I wanna create a document similarity project for my university. Hope to hear from you soon.

    Cheers, Lars

  13. I want a tool to find the semantic similarity between two sentences. If anyone knows of one, please tell me.