Friday, November 12, 2010

Text Analytics With RapidMiner Part 5 of 6 - Automatic Document Categorization

This is the second-to-last installment of a six-part series on text mining in RapidMiner. This video describes how to automatically categorize documents. This could be useful for a research project or, say, in finance.

You could use it to classify documents as "positive" or "negative", thus doing sentiment analysis. You could do it with financial news text, and classify documents as "stock went up" or "stock went down" after the release, and make (short-term) predictions of future stock movements. You can also see which words are important discriminants. Once you've trained a learning algorithm, you can use it on unseen data.

Topics covered:
  • Cross-validation
  • The nearest neighbor learning algorithm
  • The naive Bayes learning algorithm
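
The three topics above can be illustrated outside RapidMiner with a minimal Python sketch. It is not the process from the video: the toy corpus, the function names, and the use of cosine similarity and add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (document text, class label).
DOCS = [
    ("java developer software code", "it"),
    ("nurse hospital patient care", "health"),
    ("software engineer code python", "it"),
    ("doctor patient hospital clinic", "health"),
    ("python developer code testing", "it"),
    ("clinic nurse care patient", "health"),
]

def tokenize(text):
    return text.lower().split()

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_predict(train, text, k=1):
    """Majority vote among the k training documents most similar to the text."""
    query = Counter(tokenize(text))
    ranked = sorted(train, key=lambda d: cosine(query, Counter(tokenize(d[0]))), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def nb_predict(train, text):
    """Multinomial naive Bayes with add-one smoothing; unseen words are ignored."""
    class_docs = Counter(label for _, label in train)
    word_counts = {label: Counter() for label in class_docs}
    for doc, label in train:
        word_counts[label].update(tokenize(doc))
    vocab = {w for counts in word_counts.values() for w in counts}
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / len(train))  # class prior
        for w in tokenize(text):
            if w in vocab:
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def cross_validate(docs, predict, folds):
    """Hold out each of `folds` parts once for testing; return overall accuracy."""
    correct = 0
    for i in range(folds):
        test = docs[i::folds]
        train = [d for j, d in enumerate(docs) if j % folds != i]
        correct += sum(predict(train, text) == label for text, label in test)
    return correct / len(docs)

knn_accuracy = cross_validate(DOCS, knn_predict, folds=3)
nb_accuracy = cross_validate(DOCS, nb_predict, folds=3)
```

With three folds, each document is used for testing exactly once and for training twice, so the accuracy is always measured on documents the learner did not see during training.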


Here is part 6

If you're not familiar with RapidMiner, see my other videos on my YouTube channel.

Thanks for watching. Leave a comment for what you'd like to see next!

Also, check out the awesome RapidMiner finance videos on Neural Market Trends.

43 comments:

  1. Please prepare more videos about Text Analytics!
    Nice work!

  2. Dear Neil,

    Thanks for the wonderful series of tutorials.

    Having created the model from the pre-categorized documents, how can you use it to categorize text that has not already been given a class label?


    Any help would be greatly appreciated.

    thanks
    Andrew

  3. Hi Andrew,

    Generally, you would save your model, then load it (usually in another process) and apply it to unlabeled data using the Apply Model operator. There is no cross-validation in this step.

    If you need big-time processing power, then RapidAnalytics (basically the server version, also free and open source) would be good here.

    You'll want to update your model over time as well, as new info comes in.

    Make sense?

    Cheers

    Neil
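
The save, load, and apply workflow described in this reply can be sketched in plain Python. This is only an analogy to RapidMiner's operators: the toy "model" (per-class word counts), the helper names, and the file name are assumptions made for illustration.

```python
import os
import pickle
import tempfile
from collections import Counter

def train(labeled_docs):
    """Toy 'model': per-class word counts built from labeled documents."""
    model = {}
    for text, label in labeled_docs:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model

def apply_model(model, text):
    """Pick the class whose training words overlap most with the text."""
    words = text.lower().split()
    return max(sorted(model), key=lambda label: sum(model[label][w] for w in words))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")  # hypothetical file name

    # Process 1: train on labeled data and save the model.
    model = train([
        ("great product love it", "positive"),
        ("terrible product hate it", "negative"),
    ])
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Process 2: load the stored model and apply it to unlabeled text.
    # No cross-validation happens here; the model is only applied.
    with open(path, "rb") as f:
        loaded = pickle.load(f)
    predictions = [apply_model(loaded, t) for t in ["love this", "hate this"]]
```

Retraining on new labeled data and overwriting the stored file would correspond to updating the model over time.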

  4. Dear Neil,

    Thanks for the great tutorials. I am in almost the same situation as Andrew, who asked above how to classify unlabeled data. I've tried to do that in the past, but I had problems creating an example set that is compatible with the trained model. If you have only two classes, for example (web, business), how do you use the Process Documents operator to create the unlabeled data set?

    Thanks,
    Kostas

  5. Hi,

    first of all: thanks for this great series!

    If you want to apply the trained model to unseen text data, the new text has to be processed in exactly the same way as the training data (hence using the same preprocessing operators).

    AND you have to ensure that the processed data consists of the same attributes.

    This can be ensured by storing the word list of the training process together with the learned model and giving the stored word list to the preprocessing of the new texts as well. This will do the trick.

    Maybe another video could help here (as could the Rapid-I professional support ;-)

    Ingo (Rapid-I)
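
The word-list idea in this comment can be shown with a small Python sketch. It is not RapidMiner's implementation; the function names and example texts are hypothetical. The point is that new text is mapped onto the word list stored at training time, so it ends up with exactly the same attributes.

```python
from collections import Counter

def build_word_list(texts):
    """Collect the vocabulary seen during training, in a fixed order."""
    return sorted({w for t in texts for w in t.lower().split()})

def vectorize(text, word_list):
    """Count only words from the stored word list; unknown words are dropped."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in word_list]

# Training time: build the word list and keep it alongside the model.
training_texts = ["web design and hosting", "business plan and finance"]
word_list = build_word_list(training_texts)

# Application time: reuse the stored word list for the new, unlabeled text.
new_vector = vectorize("web hosting for small business owners", word_list)
```

Words such as "small" or "owners" never appeared in training, so they produce no new attribute, and the vector keeps the length and column order the model expects.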

  6. Thanks also for the great series. I am now confident RapidMiner is a good tool for my purposes.

    You asked for input on what to cover next: I would 'vote' for web crawling and taking the job database example further.

    I am interested in analyzing trends in job postings for a particular profession using RapidMiner. It would be great if I could do the whole thing with one tool!

  7. Wonderful series. I really enjoyed going through the tutorials. Thank you!

  8. Thanks for such an awesome tutorial. You have made it very easy.

  9. Hi Neil,

    I tried running the same analysis on my table, which has two fields similar to those in your job site data: polarity (category in your DB, which is text and the label for my model) and text (jobtext in your case). But it keeps failing with this error message:
    "Process failed: The operator k-NN does not have sufficient capabilities for the given data set: polynominal attributes not supported".

    If I am not mistaken, both of the fields that you selected from your database are also text fields.

    Any help would be greatly appreciated!!!

  11. Hi Chhavi, shoot me an email (contact info at the top) and we can try to work through it. Sounds like an easy fix.

  12. Neil,
    Just wondering how valuable the book "Fundamentals of Predictive Text Mining" is that you've listed on your site. Was it fairly straightforward to map this material to RapidMiner? A few comments on its value would be greatly appreciated.
    Thanks,
    Jonathan

  13. @Jonathan. It's a good book, and it helped me to understand text mining more deeply. If you have college/university access you can find it for free on SpringerLink as well.

    Generally it corresponded quite well to RM, using similar terminology.

    It doesn't have any reviews on Amazon as it's pretty new, but it is essentially a slightly different version of this book, which is rated 4/5 stars by 5 users:

    http://www.amazon.com/Text-Mining-Predictive-Unstructured-Information/dp/1441929967/ref=sr_1_1?s=books&ie=UTF8&qid=1295640027&sr=1-1

    You should also consider the text mining book by Konchady.

    Cheers

  14. Neil, I am currently working on a system that will classify text into different categories of emotion. I currently have a data set of words expressing these emotions. How can I use RapidMiner to train this process?

    Is it also possible to save the training as a Java file, so that I can use the source code in NetBeans and with other Java APIs?

    Thanks for the tutorial.

    Charles

  15. Hi Charles,

    To classify, you need a label column, that is, one or more classes. You will train your model with those classes, and can then use that model to make predictions on new data. So this will be similar to what I did in video 5, with the emotion class as the label column.

    Also, check out RapidAnalytics, which can turn a RapidMiner process into a Java web service.

    Feel free to email me if you need some more help. Contact info is at the top.

    Neil

  16. Neil, I can build the model and test it, but how do you use the "learned" model on a new data set and classify? How do you apply the model to new data? This is a follow-on to Charles' question, more of a how-to.

  17. @anon march 7, please check and try Ingo's method. I will try to put up a video to explain it, though time is tight.

  18. Excellent work Neil, great intro. I'm amazed at the reliability rate you got (92% I think).
    It would be great if you offered a link to the model, as it's a bit hard to read on the video.

  19. Hi jusblad. I am working on video 6 in this series, and will post the data and process soon.

  20. Quoting Ingo's words: "This can be ensured by storing the word list of the training process together with the learned model and give the stored word list to the preprocessing of the new texts as well. This will do the trick."

    How can I do that? I have the same problem. I trained on data that is already labeled "positive" and "negative", and I treat that as the training set. I already have a test set and I want to apply the model to it. How do I do that? I really need your help immediately :(

    One more thing: do you have a tutorial on feature selection in RapidMiner? Kindly reply to this comment or send your reply to my email, ias.naibaho@gmail.com. Thanks a bunch :)

  21. @ias I'm working on video 6, which answers your first question.

    I have a video on YouTube about feature selection here: http://www.youtube.com/watch?v=JlhoTAk1ow8

  22. Dear Neil,

    I tried to watch the link, but I think the video is not finished yet? Please check it again. Thanks in advance :)

  23. Hi Neil,

    I tried running the same analysis on my table, which has two fields similar to those shown in the video: category (as in your DB, which is text and the label for my model) and Usr_Text (jobtext in your case). But it keeps failing with this error message:
    "Process failed: The operator k-NN does not have sufficient capabilities for the given data set: polynominal attributes not supported".

    To fix this, I added a "Numerical to Polynominal" operator between the "Process Documents from Data" and "Select Attributes" operators. I also changed the settings of the k-NN operator to the following, because with the settings from the video it was throwing the error "The operator k-NN does not have sufficient capabilities for the given data set: polynominal attributes not supported..."

    Measure types --> Nominal Measures
    Nominal Measure --> Nominal Distance

    I would also like to say that in the "Set Role" operator, where you entered category for the Name parameter in the video, I have an attribute called 'negative'.

    When I execute the process, there is a criterion selector on the left-hand side of the Output view showing accuracy and kappa. When I select accuracy and Table View, it displays the output in tabular format, but I don't see the category-wise output shown in the video. It displays pred 0, pred 0.707, pred 0.396 in one column and true 0, true 0.707, true 0.396 in the other, with some values.

    Please help me fix the issue.

    Regards,
    Vinay

    Replies
    1. Hello Neil,

      I ran into exactly the same problem and would be grateful for any help on that topic!

      And by the way: big thanks for your excellent tutorials - it's a bit hard to get into RapidMiner and your videos make it a lot easier!


      Regards

      Kevin

  24. Hello! Thanks for your support, it is really helpful.

    Can you explain how I can see the documents that have been misclassified? I only know the results via the confusion matrix; I would like to see exactly which classifications were wrong! Is that possible?
    Many thanks,
    Nuno

  25. Hi, I am trying to label a yes/no field based on text obtained from a Get Pages process and some known values. The data comes from an Excel file. The problem is that in Set Role the only available name is text and not the yes_no field. Basically, I can only change the role of one attribute (text describing a website) and assign it as a label. What I need to achieve is a completely filled-in column of yes/no based on an incomplete list of yes/no.

    Do you think the problem could come from reading from an Excel file rather than a MySQL DB?

  26. Hi Dear Neil.
    Thanks a lot for your very useful tutorials on text mining with RapidMiner. I have done all of the previous parts (1-5) using my data set in SPSS (.sav format). In the 6th part I tried to do it again. In this case I created two variables: one is jobText and the other is Category. But when I run the process, there is an error at the "Set Role" operator. It keeps saying "The attribute 'Category' doesn't exist!" and so it doesn't run. I don't know why the program doesn't recognize this variable. Even when I set a breakpoint before the "Set Role" operator and run the process, unlike you I can't see any Category variable; all the variables are just tokens extracted from the text. Do you have any idea what I should do? I also tried to find your email address but couldn't, so I had to write this long story here.
    I'd be very grateful if you could answer me. Thanks a lot.

    Replies
    1. Excuse me, I made a mistake. I have done 4 parts of the tutorials so far, not 5 (I said 5 in my comment), and my problem is in the 5th one, not the 6th.

  27. Hi Mr. McGuigan,
    I wanted to ask if you could show us a video or even post your Excel data set for the document classifier, maybe even a trimmed-down version of it, so we know what you mean by "label." Where would this field be located?

    Thanks,

    Neil

  28. @Neil, there's a link to the Excel sheet in the comments of video 1.

    A label is the column that you are trying to predict. For example, if you are trying to predict the "category" of a document, then category is the label. It is equivalent to the "Y" in a regression.
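
To make the idea of a label concrete, here is a minimal Python sketch. The rows are hypothetical; only the column names `jobtext` and `category` follow the job-site table mentioned in the comments. The label column is separated from the inputs, and rows without a label are the ones a trained model would later fill in.

```python
# Hypothetical rows: "jobtext" is the document text, "category" is the label.
rows = [
    {"jobtext": "java developer wanted", "category": "it"},
    {"jobtext": "registered nurse needed", "category": "health"},
    {"jobtext": "accountant for small firm", "category": None},  # unlabeled
]

LABEL = "category"

def split_features_and_label(rows, label):
    """Separate the inputs (X) from the column to predict (y)."""
    X = [{k: v for k, v in row.items() if k != label} for row in rows]
    y = [row[label] for row in rows]
    return X, y

# Labeled rows are used for training; unlabeled rows are what gets predicted.
labeled = [r for r in rows if r[LABEL] is not None]
unlabeled = [r for r in rows if r[LABEL] is None]
X_train, y_train = split_features_and_label(labeled, LABEL)
```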

  29. I have web mining selected in managed extensions, but the processes do not appear.

  30. Hi Neil,
    I need to categorize a lot of web sites, and I want to use their "meta keywords" tag as predefined categories in the training data. For example, YouTube has the keywords "video, sharing, camera phone, video phone, free, upload". But as I understand it, we can use only one category per document. Could you please share your thoughts on this case? How can we categorize documents when the training data has many categories for each row, not just one?

  31. Hi Neil,
    I found these tutorials really helpful while working on a project on text mining.
    I want to know how I could create my own word list in RapidMiner 5 and use it (only these words) as an input to the "Process Documents from Files" operator.
    Please reply ASAP.

    Thanks,
    Vipul

  32. Hi Neil,
    I followed your steps, but my k-NN classification accuracy is reported as 0% at the end of the process. Do you have any idea why that would be?

    Thanks,
    -Jai

  33. Hello sir,
    I am Ram. For my project, I need to find the semantic similarity between words through WordNet (the Wu and Palmer similarity measure) in order to cluster documents (WSDL files).
    Could you please share your thoughts?

  34. Hello, thank you very much for these amazing videos. I am new to data mining, so I apologize for this basic question. I am wondering how or where to get data for learning. I do not have a database as many of you do. Is there any source, for example with positive/negative feedback messages, that can be used for model creation? Thanks.

  35. Hello Neil,

    I must thank you for these videos. They have helped me no end with my university paper!

    The only improvements I can suggest: there needs to be a little more information about how the number of validations relates to the data set (as in, if I put 10 into the box, would this use ten documents for validation or 10% of all the documents?), and a little more on how "Set Role" works. It took trial and error to get that working in my case.

    But I must say, this is far more help than I could have possibly hoped for. Thank you again for taking the time to make these videos and sharing them!

  36. Hi, thank you very much. I have a question for you. I did all these steps correctly. RapidMiner created an AUC diagram for me, but it treats the wrong class as positive. How can I swap the positive and negative classes? At the moment it selects them automatically. Where can we specify which one is positive or negative? Thanks.

  37. @Neil Great tutorials you have there.
    I have just started with RapidMiner and I am still exploring its capabilities. Could you help me understand something?

    In order to apply either of the algorithms (k-NN or naive Bayes), should the data be stored in the form of a table, or can it be supplied directly as text files?

    Cheers

  38. @Neil Thank you for this tutorial. It's really helpful.

    But I have a question. I'm new to RapidMiner and I want to use naive Bayes to classify my documents into 7 categories. I have a list of words for each category that I want to use as features to determine the category of a document. I have followed your tutorial up to part 6, but I am getting very low accuracy.
    Hope to hear from you soon! :)

  39. Hi,

    Can we use a decision tree for automatic text classification? If so, what should we consider in the model?
