Tuesday, November 9, 2010

Text Analytics With Rapidminer Part 2 of 6 - Processing Text

Wow, several hundred hits yesterday, thanks for watching everyone!

This is part two of a six-part video series on text mining in RapidMiner. This video describes how to process text to get a word frequency table. Topics covered include:
  • reading hundreds of documents from a database into RapidMiner
  • stripping HTML from content from a popular job posting board
  • tokenizing a document (splitting it into words)
  • removing overly common words (stopwords)
  • finding roots of words (stemming)
  • finding phrases in documents (n-grams)
  • and generating a word frequency table that describes important words in your documents
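For readers who want to see what these steps do mechanically, here is a minimal sketch in plain Python (not RapidMiner). The stopword list, the toy suffix-stripping stemmer, and the sample documents are made up for illustration; the real operators are far more thorough:

```python
from collections import Counter

# Toy stopword list (real lists have hundreds of entries).
STOPWORDS = {"a", "an", "and", "the", "for", "of", "to", "in", "we", "are"}

def tokenize(text):
    # Split a document into lowercase word tokens.
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()

def remove_stopwords(tokens):
    # Drop overly common words that carry little meaning.
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # Toy suffix-stripping "stemmer" (a real one, e.g. Porter, is smarter).
    for suffix in ("ing", "ers", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def word_frequencies(docs):
    # Build a word frequency table across all documents.
    counts = Counter()
    for doc in docs:
        counts.update(stem(t) for t in remove_stopwords(tokenize(doc)))
    return counts

docs = ["We are hiring Java developers.",
        "Java developer wanted for web programming."]
print(word_frequencies(docs).most_common(3))
```

Note how stemming merges "developers" and "developer" into one table row, which is exactly why the frequency counts become more useful after these steps.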
If you're not familiar with RapidMiner, you can see my other videos on my YouTube channel.

Here's the video:

Your feedback and sharing is appreciated!

Up tomorrow, finding association rules in documents.


  1. Great job! Thank you!

  2. Neil, I have done everything as described in the video, from Create Document through Process Documents, and I do not get the word list as shown. I followed each of your steps. What am I doing wrong?

  3. I am not sure. Try to simplify the process as much as possible. Also use the Debugging facility, where you can put breakpoints before or after each operator. Right click an operator to add a breakpoint, then view the results when you run the process.

  4. I am getting no errors. If I start simply with Create Document, I can connect the pass-throughs, run the process, and see the text, so I know it reads. It is when I add Process Documents, connect the pass-throughs, and run the process that no values are generated in the "Word List", and no errors are showing. Here are some process logs. The first line concerns me; is it a possible problem?

    Jan 21, 2011 2:00:27 PM INFO: No filename given for result file, using stdout for logging results!
    Jan 21, 2011 2:00:27 PM INFO: Loading initial data.
    Jan 21, 2011 2:00:27 PM INFO: Process //NewLocalRepository/New Test/New Test Process starts
    Jan 21, 2011 2:00:29 PM INFO: Saving results.
    Jan 21, 2011 2:00:29 PM INFO: Process //NewLocalRepository/New Test/New Test Process finished successfully after 1 s

    Here is Output in the Results Overview.
    Contains 0 entries.
    This list has been created on 1 documents assigned to 1 labels

    Thanks for the videos, they are great.

    1. Hey there, you have done a fantastic job. I will certainly digg it and personally suggest it to my friends. I am confident they'll benefit from this website.

    2. I am also getting the same error. Were you able to solve it?

  5. Neil, I found a means of creating a word list. I had to add the WordList to Data operator after Process Documents. All is well for now.

  6. Neil,

    Thanks for the great videos. I am a manager of an internal IT help desk. We want to do some trend analysis on the trouble calls that we receive so that we can perform root cause analysis and eliminate the root problems from our environment. I have a single-column Excel file, exported from our work order database, which contains the work order descriptions. I was hoping that I could use RapidMiner to analyze the data to find patterns. Basically I need to compare the records to themselves to see if any trends exist. Unfortunately, I'm dealing with unstructured data that can differ depending on who entered the work order in the first place. Each row in the Excel sheet represents a separate work order that was submitted. Could RapidMiner be useful in this application? I've made a few quick processes in RapidMiner using your videos, but I'm not exactly sure which operations I need to manipulate my data in a fashion that will be useful to me. Could you give me any tips to point me in the right direction?

    Thanks again for the videos!

  7. @JD,

    Yes, rapidminer can help with this. you need to decide what you want to do, but it sounds like you need some more info before you can decide.

    since you have no labeled column (or dependent/Y variable) all you can do for now is descriptive analysis and unsupervised learning, meaning you can describe each factor (which in this case would be a word), describe the relationships between each factor, and do the following unsupervised learning functions:

    1. find similar observations (rows in excel)
    2. find similar factors (words)
    3. find which factors often co-occur.

    i cover most of these in my videos. option 3 is the association analysis video. option 1 is the similarity video. the descriptive part is covered in word frequencies. you can also look at correlation among factors but that can take a lot of RAM.

    i haven't covered option 2 yet, as it involves pretty heavy matrix algebra, but i will eventually. if you want to read up on principal components analysis (PCA) or singular value decomposition (SVD), then rapidminer can do those.

    you will notice that my data is essentially the same as yours. it's just one big column of text with many rows. each cell is split into each word in rapidminer.

    otherwise, you may want to extract labeled data from your dataset, for example, the product name or severity of the issue. then you can use, say, naive bayes to see which factors are important (root causes, key indicators, etc). i'd be happy to help your company with that. feel free to email me, address on the contact page.

    good luck
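    As an aside, option 3 above (finding which words often co-occur) can be sketched in a few lines of plain Python. The sample work orders below are made up; association rule mining (FP-Growth and friends) is built on co-occurrence counts like these:

```python
from itertools import combinations
from collections import Counter

# Made-up help-desk work order descriptions.
orders = [
    "printer offline in accounting",
    "printer driver error",
    "password reset request",
    "printer offline again",
]

pair_counts = Counter()
for order in orders:
    words = sorted(set(order.lower().split()))
    pair_counts.update(combinations(words, 2))  # every unordered word pair

# Pairs seen in at least 2 work orders suggest a recurring pattern.
frequent = {pair: n for pair, n in pair_counts.items() if n >= 2}
print(frequent)
```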


    1. Hey Neil,

      This was quite helpful. Yes, I have a dataset pretty similar to yours, with a column of text and many rows, and each cell splits into words in RapidMiner.

      I am trying to reduce the dimensionality using PCA or SVD. I tried PCA, but the eigenvectors (which are ideally combinations of different words) are not adding much value, and the conversion back to the original words is another issue. I like the idea of PCA because it gives me a variability-coverage option, but the conversion back throws me off. I am also interested in doing factor analysis and reducing my dimensions that way, but I see RapidMiner does not have that option. How about Latent Semantic Analysis (LSA) based on Weka? I have read about it and it seems to work like SVD / PCA.

      Another question I have is about the word vector matrix. I see SVD or LSA needs the matrix in a particular order. Coming out of the Process Documents to Data node, I see my columns are words and my rows are documents. Is that the right input for SVD or LSA, or do I need to transpose the matrix?

      Thank you!
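      For what it's worth, the word-vector table that text processing produces can be pictured as a small matrix. This plain-Python sketch (with made-up documents) shows the one-row-per-document, one-column-per-term layout; note that LSA papers often write the transpose (terms as rows, documents as columns), and the SVD of one is just the transpose of the other, so either orientation works as long as you read the factors off accordingly:

```python
from collections import Counter

# Made-up documents, one per row of the matrix.
docs = ["server is down", "server restarted", "email is down"]

# One column per distinct term, in sorted order.
vocab = sorted({w for d in docs for w in d.split()})

# Build the document-term matrix: cell [i][j] = count of term j in doc i.
matrix = []
for d in docs:
    counts = Counter(d.split())
    matrix.append([counts[w] for w in vocab])

print(vocab)
print(matrix)
```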

    2. Oh, I forgot to mention, mine is an unsupervised problem too, very similar to the one you have described in your videos and the post above. I am trying to create clusters / categories, score the documents, and rank them. But for me to do so, I need to eliminate or reduce similar or unimportant words.

      Are you going to post a video on unsupervised learning? I know you have a clustering video, but the cluster labels don't give any information about the words contained in them. So I tried a classifier (naive bayes or decision trees), but somehow I feel reducing dimensions based on similar documents will be better, and then clustering / classification / categorization may give me more homogeneous (pure) categories.

      Thank you!

  8. Hello. I'm doing classification with RapidMiner. I generate and store the model. Then I open the model and try to classify new, unlabelled text. However, I need the word list to process the new unlabelled text. I know how to write the word list out (the WordList to Data operator), but I don't know a way to get from data back to a word list!

    any help?

  9. Hi, thanks for the videos, they are really useful to me. I need this dataset or a similar one, because I would like to learn with it and try it myself. Where or how can I download it? Thanks again.

  10. @ db_girl. You can make your own sample files just by copying and pasting some of the job postings

  11. @db_girl. Actually, for parts 1-5, the data files are available as a download link in the comments on the first post

  12. Oh, I didn't see it. Thanks a lot. :)

  13. Hi

    Your videos are very helpful and I have managed to make a lot of progress. I am having difficulty, however, in collecting phrases. I wish to collect two words put together, such as 'happy birthday', rather than their single tokens 'happy' and 'birthday'. I'm not sure how to do this, as everything I have tried so far outputs a blank sheet.

  14. Hi Neil
    I am a new user of RapidMiner and am finding your videos and blog really useful, so thank you :). My problem is a bit similar to the above in that I am trying to use RapidMiner to search my documents for a set of about 50 key terms and build a word frequency table based on whether or not the key terms appear in the doc. I have no problems reading the documents and processing them normally to create the frequency table; however, instead of all of the terms, I just want the table to be composed of the terms I am interested in. I had considered using the "Keep Document Parts" operator, but I am unsure whether this is the most efficient way of solving my problem! Any help truly appreciated.
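    A plain-Python sketch of the idea in the question above: count everything, then keep only a fixed keyword set. The key terms and documents here are made up for illustration; the point is that every document ends up with the same keyword columns, present or not:

```python
from collections import Counter

# Hypothetical set of key terms to track (the question mentions ~50).
KEY_TERMS = {"python", "java", "sql"}

def key_term_frequencies(docs):
    counts = Counter()
    for doc in docs:
        counts.update(w.strip(".,").lower() for w in doc.split())
    # Keep keywords even when absent, so every document gets the same columns.
    return {term: counts[term] for term in sorted(KEY_TERMS)}

docs = ["Java and SQL required.", "Python preferred, Java a plus."]
print(key_term_frequencies(docs))
```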

  15. @adam_redshaw, to find phrases, you will want to use the n-gram operator with a length of at least 2 (a two-word phrase like 'happy birthday' is a 2-gram)
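    To show what an n-gram actually is, here is a minimal plain-Python sketch; the sample sentence is made up, and joining tokens with an underscore mimics how generated phrase terms are often written:

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list to form phrases.
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "wishing you a happy birthday today".split()
print(ngrams(tokens, 2))
```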

  16. Hi Neil,

    Thank you for the videos. I have a question: if I want to mine a PDF or Word doc, which extraction method can be used? Thank you in advance.

  17. @Gjorge. i just tried the Read Text operator with a PDF and it worked fine. You could also try an external java library to extract text from pdfs.

    Didn't work with a DOCX file though; an external library could help there too.

  18. Hi Neil. I'm getting "com.rapidminer.operator.text.Document cannot be cast to com.rapidminer.example.ExampleSet". The sequence is: 1. Read Document (PDF) ---> 2. Process Documents from Data, containing 2a. Tokenize and 2b. Transform Cases. I'm trying to create a word vector. Thank you for your assistance.

  19. Hey there, you have done a fantastic job. I will certainly digg it and personally suggest it to my friends. I am confident they'll benefit from this website.

  20. Hi,

    I am trying to read an Excel file which has columns containing some comments. I want to tokenize that data, but it's not working as per the process shown in your video. Could you please help?

  21. Hi Neil. I want to perform the steps you've shown in one of your videos on processing documents and then do k-means.
    I did manage to get it working in the GUI, but could I break the process into sub-processes (e.g. 1. tokenize, lowercase, stopwords; 2. stemming; 3. tf-idf; 4. k-means)? If so, can you guide me on how? I want to call it from my interface, using NetBeans. If possible, can you share the code which calls the .rmp and specifies its parameters, and also code to write the results to a txt file? I really need your help. Please.
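    The tf-idf step from the list above can be sketched in plain Python. The documents are made up, this uses a simple log idf, and real implementations vary in normalization, so treat it as an illustration of the weighting idea rather than RapidMiner's exact formula:

```python
import math

def tf_idf(raw_docs):
    # Weight a term higher when it is frequent in a document
    # but rare across the collection.
    docs = [d.lower().split() for d in raw_docs]
    n = len(docs)
    vocab = {w for d in docs for w in d}
    # idf: log of (total docs / docs containing the term)
    idf = {w: math.log(n / sum(1 for d in docs if w in d)) for w in vocab}
    # tf: term count divided by document length
    return [{w: d.count(w) / len(d) * idf[w] for w in set(d)} for d in docs]

weights = tf_idf(["disk full on server", "server reboot", "disk replaced"])
print(weights[0])
```

In the first document, "full" (which appears nowhere else) gets a higher weight than "disk" or "server", which occur in other documents too.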

  22. Hi, thank you for your great videos. I am a beginner; I understand all the steps but I don't know how I can store my documents.
    I have about 30000 documents which are in XML format. How can Read Database read them?

    thank you very much

    1. Hi Roya,

      In your case you should use the Process Documents from Files operator. You can see it in action in video 4 or 5



    2. Hello Neil, your blog posts are very educative. However, I also have a lot of XML files that I want to cluster. I don't know what to do, as I am new to RapidMiner, and would like you to point me in the right direction. Thank you.

  23. Neil,
    I am working on some document clustering and have a problem. I have an Excel file with a single column of text, and each row is a document. This seems to work well with the Data to Documents operator; however, once I have done this I cannot tokenize my data, since the Tokenize operator seems to expect only a single document rather than many. I am stuck, and I have no idea how to work around this.

  24. Hello, great video. It helps a lot. For the people who are using Read Excel: you have to use Nominal to Text, otherwise it won't work.

  25. Hello, good evening.
    Thanks for the text classification videos. What is the structure of the database input? That information would be useful to me.

    Thanks for your reply.

  26. Hello, I am new to RapidMiner and I am trying to do SVD on some documents. I have one document per line in a file. How can I get RapidMiner to split this file into several documents? I tried using the Cut Document operator, but it refuses to accept the output of either the Create Document or the Read Document operator because they do not have any attributes...

  27. Hi there, I had a question. I had a similar task and I followed your method in RapidMiner. The difference is that when I select my content from the database, it comes from two categories, and I bring that field in as well. Is there any way to show which categories these words belong to? When I see the word list I cannot judge which one comes from which category. Any idea?

  28. Hi Neil,
    I have a question regarding the end result of the process you explained above. How do you relate the words in the word list, their occurrences, and the documents in which they were found? If I am comparing two websites or documents, how do I find out in the end result which word occurred in which document? For example, can we add labels or anything to the result to display that 'document A' has this word 100 times, 'document B' has this word 200 times, etc.?

  29. Hello,
    I am new to RapidMiner and I am trying to count word occurrences, term frequencies, and so on, using TF-IDF vector creation with Process Documents from Files. I have three sets of documents in different folders. I added them as text directories in the Parameters tab, then did vector creation using Tokenize and Filter Stopwords (English) and executed the process, but the final data shows up in a special-symbol format. I need the output in English words; what can I do?
    Do reply.

  30. Hi Neil,
    I'm new to RapidMiner and I am trying to find association rules in some documents. I have a problem with FP-Growth: in the process, before the Association Rules step, FP-Growth takes a long time even with a small number of files, and I'm getting a process failure about memory. How much memory is sufficient to run the process, and why is this happening? Reply.

  31. Hello, I am new to RapidMiner and I am doing text processing on different Word documents. When I count the most frequent words, I get some noisy data (special characters like ANª, ANÃ, ANÌ, ANù, AO, AOoclóª, AOã, AP, etc.) in the word list. Can you help me get rid of this noisy data? I want to do resume sorting & clustering using text mining techniques.

  32. Hi Neil McGuigan, I am new to RapidMiner and I'm trying to segregate groups of sample resumes for job purposes. In the data view I'm getting some noisy data which is not relevant to my data; another question is, how does the plot view work? Can you please help me with these problems?