This is part two of a six-part video series on text mining in RapidMiner. This video describes how to process text to get a word frequency table. Topics covered include:
- reading hundreds of documents from a database into RapidMiner
- stripping HTML from content taken from a popular job posting board
- tokenizing a document (splitting it into words)
- removing overly common words (stopwords)
- finding roots of words (stemming)
- finding phrases in documents (n-grams)
- and generating a word frequency table that describes important words in your documents
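
For readers who want to see these steps outside RapidMiner, here is a rough Python sketch of the same pipeline. It is not what the video does: the video uses RapidMiner's operators, while this uses only the Python standard library. All function names, the tiny stopword list, and the sample job postings are made up for illustration, and `crude_stem` is a simple suffix stripper standing in for a real stemmer such as Porter's.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "are", "with", "we"}


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the text of an HTML fragment with the tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop overly common words."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n=2):
    """Join each run of n consecutive tokens with an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def word_frequencies(documents, max_n=2):
    """Count stemmed words and n-grams (up to max_n) across all documents."""
    counts = Counter()
    for html in documents:
        tokens = [crude_stem(t) for t in remove_stopwords(tokenize(strip_html(html)))]
        counts.update(tokens)
        for n in range(2, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts


if __name__ == "__main__":
    # Hypothetical job postings, not data from the video.
    postings = [
        "<p>We are hiring <b>data miners</b></p>",
        "<div>Mining data for jobs</div>",
    ]
    for term, count in word_frequencies(postings).most_common():
        print(term, count)
```

Note how stemming folds "miners" and "Mining" into the same root, so they are counted together in the frequency table. That is the point of the stemming step, though a proper stemmer would do it far more carefully than this one.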
Here's the video:
Your feedback and sharing are appreciated!
Up tomorrow: finding association rules in documents.