This is part four of a six-part series on text mining in RapidMiner. This video describes how to calculate the TF-IDF score for terms, calculate the similarity between documents, and cluster similar documents together. This is useful for finding duplicate documents or database entries, and for showing similar documents on a web page.
In the context of a job board, for example, you could find an interesting job and then use document similarity to find related ones.
The video covers:
- creating a word vector and calculating the terms' TF-IDF scores
- calculating the similarity between documents using their cosine similarity
- clustering documents using the K-Means algorithm
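
For readers who want to see the three steps above outside the RapidMiner GUI, here is a minimal sketch in plain Python. It is not what RapidMiner runs internally: it assumes whitespace tokenization, one common TF-IDF variant (relative term frequency times log(N / document frequency)), and a bare-bones K-Means that starts from the first k documents. The function names and the four toy job titles are made up for illustration, and RapidMiner's operators may weight and normalize terms differently.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build a word vector per document, weighted by TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            (counts[term] / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term in vocab
        ])
    return vocab, vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def kmeans(vectors, k, iterations=20):
    """Very small K-Means; uses the first k vectors as initial centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical job-board entries, used only to exercise the functions.
docs = [
    "java developer backend java",
    "nurse hospital care nurse",
    "java developer frontend",
    "nurse hospital night shift",
]

vocab, vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[2])    # two Java jobs
sim_unrelated = cosine_similarity(vectors[0], vectors[1])  # Java job vs. nursing job
labels = kmeans(vectors, k=2)

print("similar jobs:", round(sim_related, 3))
print("unrelated jobs:", round(sim_unrelated, 3))
print("cluster labels:", labels)
```

In this toy data the two Java postings share terms and so score a higher cosine similarity than a Java posting and a nursing posting, which share none. Real K-Means implementations usually pick starting centroids at random and run several restarts; the fixed start here only keeps the example repeatable.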
If you're not familiar with the free and open-source RapidMiner, see my other videos on my YouTube channel.
Up next, automatically categorizing documents.