You could use it to classify documents as "positive" or "negative", which is sentiment analysis. You could also apply it to financial news text, classifying documents as "stock went up" or "stock went down" after the release, and so make (short-term) predictions of future stock movements. You can also see which words are important discriminants. Once you've trained a learning algorithm, you can use it on unseen data.
Topics covered:
- Cross-validation
- The nearest neighbor learning algorithm
- The naive Bayes learning algorithm
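For readers who think better in code, here is a rough sketch of the first two topics outside RapidMiner: k-fold cross-validation wrapped around a 1-nearest-neighbor classifier on word counts. This is not the process from the video. The toy documents, the function names, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count word occurrences."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def predict_1nn(train, text):
    """Return the label of the most similar training document."""
    vec = bag_of_words(text)
    best = max(train, key=lambda pair: cosine(bag_of_words(pair[0]), vec))
    return best[1]

def cross_validate(docs, k):
    """Split docs into k folds; each fold is held out once for testing."""
    correct = 0
    for fold in range(k):
        test = docs[fold::k]  # every k-th document, starting at index `fold`
        train = [d for i, d in enumerate(docs) if i % k != fold]
        correct += sum(predict_1nn(train, text) == label for text, label in test)
    return correct / len(docs)

# Hypothetical toy data, not the job data set from the videos.
DOCS = [
    ("great product love it", "positive"),
    ("terrible product hate it", "negative"),
    ("love this great service", "positive"),
    ("hate this terrible service", "negative"),
    ("great love", "positive"),
    ("terrible hate", "negative"),
]

accuracy = cross_validate(DOCS, k=3)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

Note that k is the number of folds, not the number of test documents: with k=10, each fold holds roughly a tenth of the data, and every document is used for testing exactly once.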
Here is part 6
If you're not familiar with RapidMiner, see my other videos on my YouTube channel.
Thanks for watching. Leave a comment for what you'd like to see next!
Also, check out the awesome RapidMiner finance videos on Neural Market Trends.
Please prepare more videos about Text Analytics!
Nice work!
Dear Neil,
Thanks for the wonderful series of tutorials.
Having created the model from the pre-categorised documents, how can you use it to categorise text that has not already been given a class label?
Any help would be greatly appreciated.
Thanks,
Andrew
Hi Andrew,
Generally, you would save your model, then load it (usually in another process) and apply it to unlabeled data using the Apply Model operator. No cross-validation in this step.
If you need big-time processing power, then RapidAnalytics (basically the server version, also free and open source) would be good here.
You'll want to update your model over time as well, as new info comes in.
Make sense?
Cheers
Neil
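The save, load, and apply workflow described in this reply can be sketched in plain Python. This is only an illustration under stated assumptions, not how RapidMiner stores models: the tiny naive Bayes trainer, the function names, and the training rows are all made up, and `pickle.dumps`/`pickle.loads` stand in for writing the model to disk in one process and reading it back in another.

```python
import math
import pickle
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count documents per class and words per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return {"class_counts": class_counts, "word_counts": dict(word_counts), "vocab": vocab}

def apply_model(model, text):
    """Return the class with the highest log-probability (Laplace smoothing)."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for word in text.lower().split():
            if word in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labeled training rows.
TRAINING = [
    ("python developer backend", "web"),
    ("javascript frontend developer", "web"),
    ("sales account manager", "business"),
    ("business development manager", "business"),
]

# "Process 1": train once and save the model (here to bytes instead of a file).
saved = pickle.dumps(train_naive_bayes(TRAINING))

# "Process 2": load the saved model and apply it to unlabeled text.
model = pickle.loads(saved)
for text in ["senior python developer", "regional sales manager"]:
    print(text, "->", apply_model(model, text))
```

There is no cross-validation in the second step: the model is fixed, and the new documents have no labels to validate against.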
Dear Neil,
Thanks for the great tutorials. I am in almost the same situation as Andrew, who asked above how to classify unlabeled data. I've tried to do that in the past, but I had problems creating an example set that is compatible with the trained model. If you have only two classes, for example (web, business), how do you use the Process Documents operator to create the unlabeled data set?
Thanks,
Kostas
Hi,
First of all: thanks for this great series!
If you want to apply the trained model to unseen text data, the new text data has to be processed in exactly the same way as the training data (hence using the same preprocessing operators).
AND you have to ensure that the processed data consists of the same attributes.
This can be ensured by storing the word list of the training process together with the learned model and giving the stored word list to the preprocessing of the new texts as well. This will do the trick.
Maybe another video could help here (as could the Rapid-I professional support ;-)
Ingo (Rapid-I)
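To make the word-list point concrete, here is a minimal sketch in plain Python. It is not RapidMiner code, and the function names and sample texts are assumptions. The idea it shows is the one described above: the vocabulary is fixed at training time, and new text is turned into a vector over exactly those words, in the same order, so the model sees the same attributes.

```python
def build_word_list(training_texts):
    """Collect the sorted vocabulary seen during training."""
    return sorted({w for text in training_texts for w in text.lower().split()})

def vectorize(text, word_list):
    """Count each word-list entry in the text, in word-list order.

    Words that are not in the word list are dropped, so every vector
    has the same length and the same attribute order.
    """
    tokens = text.lower().split()
    return [tokens.count(w) for w in word_list]

# Hypothetical training texts; the word list is stored alongside the model.
word_list = build_word_list(["java developer", "sales manager"])

# New, unlabeled text is processed with the stored word list, not a fresh one.
new_vector = vectorize("senior java developer java", word_list)
print(word_list)
print(new_vector)
```

Building a fresh word list from the new texts instead would produce different attributes (here, an extra "senior" column and no "sales" or "manager" columns), which is the incompatibility described in the question above.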
From the man himself! Thanks, Ingo.
Also thanks for the great series. I am now confident RapidMiner is a good tool for my purposes.
You asked for input on what to cover next: I would 'vote' for web crawling and using the job database example further.
I am interested in analyzing trends in job postings of a particular profession using RapidMiner. Would be great if I could do the whole thing with one tool!
Wonderful series. I really enjoyed going through the tutorials. Thank you!
Thanks for such an awesome tutorial. You have made it very easy.
Hi Neil,
I tried running the same analysis on my table, which has two fields similar to those in your job site data: polarity (category in your DB, which is text and the label for my model) and text (jobtext in your case). It keeps failing with this error message:
"Process failed: The operator k-NN does not have sufficient capabilities for the given data set: polynominal attributes not supported".
If I am not mistaken, both of the fields that you selected from your database are also text fields.
Any help would be greatly appreciated!
Hi Chhavi, shoot me an email (contact info at the top) and we can try to work through it. Sounds like an easy fix.
Neil,
Just wondering how valuable the book "Fundamentals of Predictive Text Mining" is that you've listed on your site. Was it fairly straightforward to map this material to RapidMiner? A few comments on its value would be greatly appreciated.
Thanks,
Jonathan
@Jonathan. It's a good book, and it helped me to understand text mining more deeply. If you have college/university access you can find it for free on SpringerLink as well.
Generally it corresponded quite well to RM, using similar terminology.
It doesn't have any reviews on Amazon as it's pretty new, but it is essentially a slightly different version of this book, which is rated 4/5 stars by 5 users:
http://www.amazon.com/Text-Mining-Predictive-Unstructured-Information/dp/1441929967/ref=sr_1_1?s=books&ie=UTF8&qid=1295640027&sr=1-1
You should also consider the text mining book by Konchady.
Cheers
Neil, I am currently working on a system that will categorize emotions into different categories. I currently have a data set of words expressing these emotions. How can I use RapidMiner to train this process?
Is it also possible to save the training as a Java document, so that I can use the source code in NetBeans and other Java APIs?
Thanks for the tutorial.
Charles
Hi Charles,
To classify, you need a label column, that is, a column holding the class of each example. You will train your model with those classes, and can then use that model to make predictions on new data. So this will be similar to what I did in video 5, with the emotion class as the label column.
Also, check out RapidAnalytics, which can turn a RapidMiner process into a Java web service.
Feel free to email me if you need some more help. Contact info at the top.
Neil
Neil, I can build the model and test it, but how do you use the "learned" model on a new data set and classify? How do you apply the model to new data? This is a follow-on to Charles' question, more of a how-to.
@anon march 7, please check and try Ingo's method. I will try to put up a video to explain it, though time is tight.
Excellent work Neil, great intro. I'm amazed at the reliability rate you got (92% I think).
It would be great if you offered a link to the model, as it's a bit hard to read on the video.
Hi jusblad. I am working on video 6 in this series, and will post the data and process soon.
Quoting Ingo's words: "This can be ensured by storing the word list of the training process together with the learned model and giving the stored word list to the preprocessing of the new texts as well. This will do the trick."
How can I do that? I have the same problem. I trained on data that is already labeled "positive" and "negative", and I treat that as the training set. I already have a test set and I want to apply the model to it. How do I do that? I really need your help immediately :(
And one more thing: do you have a tutorial on feature selection in RapidMiner? Kindly reply to this comment or send your reply to my email, ias.naibaho@gmail.com. Thanks a bunch :)
@ias Working on video 6 that explains your first question.
I have a video on YouTube about feature selection here: http://www.youtube.com/watch?v=JlhoTAk1ow8
Dear Neil,
I tried to watch the link, but I think the video is not finished yet? Please check it again. Thanks in advance :)
Hi Neil,
I tried running the same analysis on my table, which has two fields similar to those shown in the video: category (as in your DB, which is text and the label for my model) and Usr_Text (jobtext in your case). It keeps failing with this error message:
"Process failed: The operator k-NN does not have sufficient capabilities for the given data set: polynominal attributes not supported".
To fix this, I added a "Numerical to Polynominal" operator between the "Process Documents from Data" and "Select Attributes" operators. I also changed the k-NN operator's settings to the following, because with the settings from the video it threw the same error:
Measure types --> Nominal Measures
Nominal Measure --> Nominal Distance
I would also like to say that in the "Set Role" operator, where you entered category for the Name parameter in the video, I have an attribute called 'negative'.
When I execute the process, there is a criterion selector on the left-hand side of the Output view showing accuracy and kappa. When I select accuracy and Table View, the output is displayed in tabular format, but I don't see the category-wise output shown in the video. It displays pred 0, pred 0.707, pred 0.396 along one axis and true 0, true 0.707, true 0.396 along the other, with some values.
Please help me fix the issue.
Regards,
Vinay
Hello Neil,
I ran into exactly the same problem and would be grateful for any help on that topic!
And by the way: big thanks for your excellent tutorials - it's a bit hard to get into RapidMiner and your videos make it a lot easier!
Regards
Kevin
Hello! Thanks for your support, it is really helpful.
Can you explain how I can see the documents that have been misclassified? I only know the results via the confusion matrix; I would like to see exactly which classifications were wrong! Is that possible?
Many thanks,
Nuno
Hi, I am trying to label a yes/no field based on text obtained from a Get Pages process and some known values. The data comes from an Excel file. The problem is that in Set Role the only available name is text, not the yes_no field. Basically I can only change the role of one attribute (the text describing a website) and assign it as a label. What I need to achieve is a completely filled-in yes/no column based on an incomplete list of yes/no values.
Do you think the problem could come from reading from an Excel file rather than a MySQL DB?
Hi Dear Neil.
Thanks a lot for your very useful tutorials on text mining using RapidMiner. I have done all of the previous parts (1-5) using my data set in SPSS (.sav format). In the 6th part I tried to do it again. In this case I created two variables, one of them jobText and the other Category. But when I run the process there is an error from the "Set Role" operator. It keeps saying: "The attribute 'Category' doesn't exist!" and so it doesn't run. I don't know why the program doesn't recognize this variable. Even when I set a breakpoint before the "Set Role" operator and run the process, unlike you I can't see any Category variable. I mean, all the variables are just tokens extracted from the text. Do you have any idea what I should do? I also tried to find your email address but couldn't, so I was compelled to write this long story here.
I'd be very grateful if you answered me. Thanks a lot.
Excuse me, I made a mistake. I have done 4 parts of the tutorials up to now, not 5 (I said 5 in my comment). And my problem is in the 5th one, not the 6th.
Hi Mr. McGuigan,
I wanted to ask if you could show us a video or even post your Excel dataset for the document classifier. Maybe even a trimmed-down version of it, so we know what you mean by "label." Where would this field be located?
Thanks,
Neil
@Neil, there's a link to the Excel sheet in the comments of video 1.
A label is the column that you are trying to predict. For example, if you are trying to predict the "category" of a document, then category is the label. It is equivalent to the "Y" in a regression.
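If it helps to see it as data, here is a small sketch of what "label" means, in plain Python. The two rows and the helper name `split_label` are made up for illustration. The column names jobtext and category follow the ones mentioned in this thread.

```python
# Hypothetical rows: one text column plus the column we want to predict.
ROWS = [
    {"jobtext": "python developer wanted", "category": "web"},
    {"jobtext": "regional sales manager", "category": "business"},
]

def split_label(rows, label_column):
    """Separate the label (the "Y") from the remaining attributes (the "X")."""
    features = [{k: v for k, v in row.items() if k != label_column} for row in rows]
    labels = [row[label_column] for row in rows]
    return features, labels

features, labels = split_label(ROWS, "category")
print(features)
print(labels)
```

A learner is trained on the features together with the labels. For new documents only the features exist, and the model fills in the label.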
I have web mining clicked in managed extensions, but the processes do not appear.
Hi Neil,
I need to categorize a lot of web sites and I want to use their "meta keywords" tags as predefined categories in the training data, e.g. YouTube has the keywords "video, sharing, camera phone, video phone, free, upload". But as I understand it, we can use only one category per document. Could you please share your thoughts on this case? How can we categorize documents when the training data has many categories for each row, not just one?
Hi Neil,
I found these tutorials really helpful while working on a text mining project.
I want to know how I could create my own word list in RapidMiner 5 and use it (only these words) as input to the "Process Documents from Files" operator.
Please reply ASAP.
Thanks,
Vipul
Hi Neil,
I followed your steps, but my k-NN classification vector accuracy is being reported as 0% at the end of the process. Do you have any idea why that would be?
Thanks,
-Jai
Thanks for the videos.
Hello sir,
I am Ram. For my project I need to find the semantic similarity between words through WordNet (the Wu and Palmer similarity measure) in order to cluster documents (WSDL files).
Could you please share your thoughts?
Hello, thank you very much for these amazing videos. I am new to data mining, so I apologize for this basic question. I am wondering how or where to get data for learning. I do not have a database, as many of you do. Is there any source, for example of positive/negative feedback messages, that can be used for model creation? Thanks.
Hello Neil,
I must thank you for these videos. They have helped me no end in my University paper!
The only improvements I can suggest: there needs to be a little more information about how the number of validations relates to the data set (as in, if I put 10 into the box, would it use ten documents for validation or 10% of all the documents?). And go a little more into how "Set Role" works. It took trial and error to get that working in my case.
But I must say, this is far more than I could have possibly hoped for help. Thank you again for taking the time to make these videos and sharing them!
Hi, thank you very much. I have a question for you. I did all these steps correctly. RapidMiner created an AUC diagram for me, but it picks the wrong class as positive. How can I swap the positive and negative classes? Right now it selects them automatically. Where can we specify which one is positive and which is negative? Thanks.
@Neil Great tutorials you have there.
I have just started with RapidMiner and I am still exploring its capabilities. Could you help me understand something?
In order to apply either of the algorithms, k-NN or naive Bayes, should the data be stored in the form of a table, or can it be supplied directly as text files?
Cheers
@Neil Thank you for this tutorial. It's really helpful.
But I have a question: I'm new to RapidMiner and I want to use Naive Bayes to classify my documents into 7 categories. I have a list of words for each category that I want to use as classifiers to determine the category of the document. I have followed your tutorial up to part 6 but I am getting very low accuracy.
Hope to hear from you soon! :)
Hi,
Can we use a decision tree for automatic text classification? If so, what should we consider in the model?