Vancouver Data Blog by Neil McGuigan: Text Analytics With Rapidminer Part 5 of 6

Friday, November 12, 2010

Text Analytics With Rapidminer Part 5 of 6 - Automatic Document Categorization

This is the second-to-last installment of a six-part series on text mining in RapidMiner. This video describes how to automatically categorize documents, which could be useful for a research project or, say, in finance.

You could use it to classify documents as "positive" or "negative", which amounts to sentiment analysis. You could also apply it to financial news text, classifying documents as "stock went up" or "stock went down" after the release, and then make (short-term) predictions of future stock movements. You can also see which words are important discriminants between the classes. Once you've trained a learning algorithm, you can use it on unseen data.
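
To make the idea concrete, here is a minimal sketch of a Naive Bayes text classifier in plain Python. It is not the RapidMiner process from the video: the function names (`tokenize`, `train_naive_bayes`, `predict`) and the four training headlines are invented for illustration, and the add-one smoothing is an assumption of this sketch.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log-probability for the text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled headlines, invented for this example.
training = [
    ("profits rise and shares surge", "stock went up"),
    ("record earnings beat forecasts", "stock went up"),
    ("losses widen and shares slump", "stock went down"),
    ("earnings miss forecasts badly", "stock went down"),
]
model = train_naive_bayes(training)
prediction = predict(model, "shares surge on record profits")
print(prediction)
```

With so little training data the result only shows the mechanics; a real model needs many labeled documents per class.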

Topics covered:

Cross-validation
The nearest neighbor learning algorithm
The Naive Bayes learning algorithm
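
As a rough illustration of the first two topics, the sketch below runs k-fold cross-validation around a one-nearest-neighbor classifier using cosine similarity on word counts. It is plain Python rather than RapidMiner, the job snippets and category names are made up, and the simple "every k-th document" fold assignment is an assumption of this sketch (no shuffling or stratification).

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbor_predict(train, text):
    """Return the label of the single most similar training document."""
    query = vectorize(text)
    best = max(train, key=lambda pair: cosine(vectorize(pair[0]), query))
    return best[1]


def cross_validate(docs, k):
    """Split docs into k folds; each fold is held out for testing exactly once."""
    correct = 0
    for fold in range(k):
        test = docs[fold::k]
        train = [d for i, d in enumerate(docs) if i % k != fold]
        correct += sum(
            nearest_neighbor_predict(train, text) == label for text, label in test
        )
    return correct / len(docs)


# Hypothetical labeled job snippets, invented for this example.
docs = [
    ("java developer wanted", "tech"),
    ("registered nurse needed", "health"),
    ("senior java developer", "tech"),
    ("hospital nurse position", "health"),
    ("python developer role", "tech"),
    ("pediatric nurse opening", "health"),
]
accuracy = cross_validate(docs, k=3)
print(accuracy)
```

With k folds, each document is used for testing once and for training k - 1 times, so every fold holds roughly 1/k of the data.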

Here is part 6

If you're not familiar with RapidMiner, see my other videos on my YouTube channel.

Thanks for watching. Leave a comment for what you'd like to see next!

Also, check out the awesome RapidMiner finance videos on Neural Market Trends.

43 comments:

Anonymous, November 12, 2010 at 8:58 PM
Please prepare more videos about Text Analytics!
Nice work!

Anonymous, November 14, 2010 at 5:27 PM
Dear Neil,

Thanks for the wonderful series of tutorials.

Having created the model from the pre-categorised documents, how can you use it to categorise text that has not already been given a class label?

Any help would be greatly appreciated.

thanks
Andrew

Neil McGuigan, November 14, 2010 at 5:48 PM
Hi Andrew,

Generally, you would save your model, then load it (usually in another process) and apply it to unlabeled data using the Apply Model operator. No cross-validation is needed in this step.
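
The same save, load, and apply pattern can be sketched outside RapidMiner in a few lines of Python. This is only an analogy: the toy word-count "model", the `train` and `apply_model` helpers, and the `model.pkl` file name are all assumptions made for the example.

```python
import os
import pickle
import tempfile
from collections import Counter


def train(labeled_docs):
    """Toy model: per-class word counts (a stand-in for any learned model)."""
    model = {}
    for text, label in labeled_docs:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model


def apply_model(model, text):
    """Predict the class whose training words overlap most with the text."""
    words = text.lower().split()
    return max(model, key=lambda label: sum(model[label][w] for w in words))


# Training process: learn from labeled documents, then save the model.
model = train([
    ("great product love it", "positive"),
    ("terrible product hate it", "negative"),
])
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Scoring process: load the stored model and apply it to unlabeled text.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)
unlabeled = ["love this great phone", "hate this terrible phone"]
predictions = [apply_model(loaded_model, text) for text in unlabeled]
print(predictions)
```

Note that the scoring half never touches the labeled data or any validation step; it only needs the stored model.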

If you need big-time processing power, then RapidAnalytics (basically the server version, also free and open source) would be good here.

You'll want to update your model over time as well, as new info comes in.

Make sense?

Cheers

Neil

Kostas Giannakakis, November 15, 2010 at 6:54 AM
Dear Neil,

Thanks for the great tutorials. I am in almost the same situation as Andrew, who asked above how to classify unlabeled data. I've tried to do that in the past, but I had problems creating an example set that is compatible with the trained model. If you have only two classes, for example (web, business), how do you use the Process Documents operator to create the unlabeled data set?

Thanks,
Kostas

Ingo Mierswa, November 16, 2010 at 4:07 AM
Hi,

First of all: thanks for this great series!

If you want to apply the trained model on unseen text data, this new text data has to be processed in exactly the same way as the training data (hence using the same preprocessing operators).

AND you have to ensure that the processed data consists of the same attributes.

This can be ensured by storing the word list of the training process together with the learned model and giving the stored word list to the preprocessing of the new texts as well. This will do the trick.
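
The reason for reusing the word list can be seen in a small Python sketch: the stored list fixes both the set and the order of attributes, so new texts are turned into vectors with exactly the columns the model was trained on. The helper names `build_word_list` and `to_vector` and the sample texts are hypothetical, and this is not RapidMiner code.

```python
def build_word_list(training_texts):
    """Collect the vocabulary seen during training, in a fixed (sorted) order."""
    return sorted({word for text in training_texts for word in text.lower().split()})


def to_vector(text, word_list):
    """Count only the words in the stored word list, in the same column order."""
    words = text.lower().split()
    return [words.count(word) for word in word_list]


# Training time: build the word list once and keep it with the model.
training_texts = ["web design job", "business analyst job"]
word_list = build_word_list(training_texts)
train_vectors = [to_vector(text, word_list) for text in training_texts]

# Scoring time: reuse the stored word list, so unseen words are dropped
# and every vector has the same attributes as the training data.
new_vector = to_vector("web analyst wanted urgently", word_list)
print(word_list)
print(new_vector)
```

If the new texts were processed without the stored list, "wanted" and "urgently" would become extra attributes and "business", "design", and "job" would be missing, which is the mismatch described above.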

Maybe another video could help here (as could the Rapid-I professional support ;-)

Ingo (Rapid-I)

Neil McGuigan, November 16, 2010 at 9:59 AM
From the man himself! Thanks Ingo.

Anonymous, November 16, 2010 at 2:45 PM
Thanks also for the great series. I am now confident RapidMiner is a good tool for my purposes.

You asked for input on what to cover next: I would 'vote' for web crawling and using the job database example further.

I am interested in analyzing trends in job postings for a particular profession using RapidMiner. It would be great if I could do the whole thing with one tool!

Tony, November 17, 2010 at 3:52 PM
Wonderful series. I really enjoyed going through the tutorials. Thank you!

Kannan, November 23, 2010 at 2:18 PM
Thanks for such an awesome tutorial. You have made it very easy.

Chhavi, December 15, 2010 at 9:57 PM
Hi Neil,

I tried running the same analysis on my table, which has two fields similar to those in your job site data: polarity (category in your DB, which is text and the label for my model) and text (jobtext in your case). It keeps failing with this error message:
"Process failed: The operator k-NN does not have sufficient capabilities for the given data set: polynominal attributes not supported".

If I am not mistaken both the fields that you selected from your database are also text fields.

Any help would be greatly appreciated!!!

Neil McGuigan, January 20, 2011 at 10:56 PM
Hi Chhavi, shoot me an email (contact info at the top) and we can try to work through it. Sounds like an easy fix.

Jonathan Noble, January 21, 2011 at 4:14 AM
Neil,
Just wondering how valuable the book "Fundamentals of Predictive Text Mining" is that you've listed on your site. Was it fairly straightforward to map this material to RapidMiner? A few comments on its value would be greatly appreciated.
Thanks,
Jonathan

Neil McGuigan, January 21, 2011 at 12:04 PM
@Jonathan. It's a good book, and it helped me to understand text mining more deeply. If you have college/university access you can find it for free on SpringerLink as well.

Generally it corresponded quite well to RM, using similar terminology.

It doesn't have any reviews on Amazon as it's pretty new, but it is essentially a slightly different version of this book, which is rated 4/5 stars by 5 users:

http://www.amazon.com/Text-Mining-Predictive-Unstructured-Information/dp/1441929967/ref=sr_1_1?s=books&ie=UTF8&qid=1295640027&sr=1-1

You should also consider the text mining book by Konchady.

Cheers

G52HCI, February 28, 2011 at 7:55 PM
Neil, I am currently working on a system that will categorize text into different emotion categories. I currently have a data set of words expressing these emotions. How can I use RapidMiner to train this process?

Is it also possible to save the training as a Java document, so that the source code can be used in NetBeans and other Java APIs?

Thanks for the tutorial.

Charles

Neil McGuigan, March 1, 2011 at 10:23 PM
Hi Charles,

To classify, you need a label column, that is, a column holding the class of each document. You train your model with those classes, and can then use that model to make predictions on new data. So this will be similar to what I did in video 5, with the emotion class as the label column.

Also, check out RapidAnalytics, which can turn a RapidMiner process into a Java web service.

Feel free to email me if you need some more help. Contact info at the top

Neil

Anonymous, March 7, 2011 at 4:57 PM
Neil, I can build the model and test it, but how do you use the "learned" model on a new data set to classify it? How do you apply the model to new data? This is a follow-on to Charles' question, more of a how-to.

Neil McGuigan, March 8, 2011 at 8:05 PM
@anon March 7, please check and try Ingo's method. I will try to put up a video to explain it, though time is tight.

jusblad, April 20, 2011 at 7:35 AM
Excellent work Neil, great intro. I'm amazed at the reliability rate you got (92% I think).
It would be great if you offered a link to the model, as it's a bit hard to read on the video.

Neil McGuigan, April 20, 2011 at 9:50 AM
Hi jusblad. I am working on video 6 in this series, and will post the data and process soon.

Ias Naibaho, April 28, 2011 at 9:12 PM
Quoting Ingo's words: "This can be ensured by storing the word list of the training process together with the learned model and give the stored word list to the preprocessing of the new texts as well. This will do the trick."

How can I do that? I have the same problem. I trained on data already labeled "positive" and "negative", which I treat as the training set. I also have a test set and want to apply the model to it. How do I do that? I really need your help immediately :(

One more thing: do you have a tutorial on feature selection in RapidMiner? Kindly reply to this comment or send your reply to my email, ias.naibaho@gmail.com. Thanks a bunch :)

Neil McGuigan, April 29, 2011 at 10:41 AM
@ias I'm working on video 6, which addresses your first question.

I have a video on youtube about feature selection here: http://www.youtube.com/watch?v=JlhoTAk1ow8

Ias Naibaho, May 2, 2011 at 12:25 AM
Dear Neil,

I tried to watch the link, but I think the video is not finished yet? Please check it again. Thanks in advance :)

Vinay, August 11, 2011 at 3:54 AM
Hi Neil,

I tried running the same analysis on my table, which has two fields similar to those shown in the video: category (as in your DB, which is text and the label for my model) and Usr_Text (jobtext in your case). It keeps failing with this error message:
"Process failed: The operator k-NN does not have sufficient capabilities for the given data set: polynominal attributes not supported".

In order to fix this issue, I have used a "Numerical to Polynominal" operator between the "Process Documents from Data" and "Select Attributes" operators. I also changed the k-NN operator settings to the following, because with the settings from the video it was throwing the error "The operator k-NN does not have sufficient capabilities for the given data set: polynominal attributes not supported...":

Measure types --> Nominal Measures
Nominal Measure --> Nominal Distance

I would also like to say that in the "Set Role" operator, where you entered category for the Name parameter in the video, I have an attribute called 'negative'.

When I execute the process, there is a criterion selector on the left-hand side of the Output view showing accuracy and kappa. When I select accuracy and Table View, it displays the output in tabular format, but I don't see the category-wise output shown in the video. It displays pred 0, pred 0.707, pred 0.396 in the column and true 0, true 0.707, true 0.396 in the column, with some values.

Please help me fix the issue.

Regards,
Vinay

Nuno AACosta, August 29, 2011 at 9:20 AM
Hello! Thanks for your support, it is really helpful.

Can you explain how I can see the documents that have been misclassified? I only know the results via the confusion matrix; I would like to see exactly which classifications were wrong. Is that possible?
Many thanks,
Nuno

Andrea, September 13, 2011 at 5:52 PM
Hi, I am trying to label a yes/no field based on text obtained from a Get Pages process and some known values. The data comes from an Excel file. The problem is that in Set Role the only available name is text, not the yes_no field. Basically, I can only change the role of one attribute (the text describing a website) and assign it as a label. What I need is a completely filled-in yes/no column, based on an incomplete list of yes/no values.

Do you think the problem could come from reading an Excel file rather than a MySQL DB?

Parian, February 6, 2012 at 12:49 AM
Hi Dear Neil,
Thanks a lot for your very useful tutorials on text mining using RapidMiner. I have done all of the previous parts (1-5) using my data set in SPSS (.sav format). In the 6th part I tried to do it again. In this case I created two variables, one called jobText and the other Category. But when I run the process there is an error in the "Set Role" component. It keeps saying "The attribute 'Category' doesn't exist!" and so it doesn't run. I don't know why the program doesn't recognize this variable. Even when I set a breakpoint before the "Set Role" component and run the program, unlike you I can't see any Category variable. I mean, all the variables are just tokens extracted from the text. Do you have any idea what I should do? I also tried to find your email address but couldn't, so I had to write this long story here.
I'd be very grateful if you could answer me. Thanks a lot.

Neil Patel, March 12, 2012 at 9:10 AM
Hi Mr. McGuigan,
I wanted to ask if you could show us a video or even post your Excel dataset for the document classifier, maybe even a trimmed-down version of it, so we know what you mean by "label." Where would this field be located?

Thanks,

Neil

Neil McGuigan, June 21, 2012 at 12:20 PM
@Neil, there's a link to the Excel sheet in the comments of video 1.

A Label is the column that you are trying to predict. For example, if you are trying to predict the "category" of a document, then category is the label. It is equivalent to the "Y" in a regression.
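
As a small illustration of the label idea, here is a hedged Python sketch that separates a table into its input columns (the "X") and the label column (the "y"). The two rows and the column names `jobtext` and `category` are made up for the example and are not the actual spreadsheet.

```python
# Hypothetical rows; in a real project these would come from the data source.
rows = [
    {"jobtext": "java developer wanted", "category": "tech"},
    {"jobtext": "registered nurse needed", "category": "health"},
]

label_column = "category"  # the column we are trying to predict

# X holds everything except the label; y holds only the label values.
X = [{name: value for name, value in row.items() if name != label_column} for row in rows]
y = [row[label_column] for row in rows]

print(X)
print(y)
```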

LearningRisk, July 18, 2012 at 7:14 AM
I have web mining selected in the managed extensions, but the processes do not appear.

Anselmo, July 19, 2012 at 2:01 AM
Hi Neil,
I need to categorize a lot of web sites, and I want to use their "meta keywords" tag as predefined categories in the training data. For example, YouTube has the keywords "video, sharing, camera phone, video phone, free, upload". But as I understand it, we can use only one category per document. Could you please share your thoughts on this case? How can we categorize documents when the training data has many categories for each row, not just one?

Anonymous, September 24, 2012 at 5:57 AM
Hi Neil,
I found these tutorials really helpful while working on a text mining project.
I want to know how I could create my own word list in RapidMiner 5 and use it (only these words) as an input to the "Process Documents from Files" operator.
Please reply ASAP.

Thanks,
Vipul

Anonymous, January 8, 2013 at 9:25 AM
Hi Neil,
I followed your steps, but my K-NN classification vector accuracy is being reported as 0% at the end of the process. Do you have any idea why it would be so?

Thanks,
-Jai

Anonymous, January 22, 2013 at 10:31 PM
Thanks for the videos.

ram, January 22, 2013 at 10:50 PM
Hello sir,
I am Ram. For my project, I need to find the semantic similarity between words through WordNet (the Wu and Palmer similarity measure) in order to cluster documents (WSDL files).
Could you please share your thoughts?

Anonymous, February 20, 2013 at 5:21 AM
Hello, thank you very much for these amazing videos. I am new to data mining, so I apologize for this basic question. I am wondering how or where to get data for learning. I do not have a database, as many of you do. Is there a source, for example of positive/negative feedback messages, that can be used for model creation? Thanks.

Unknown, March 16, 2013 at 7:13 AM
Hello Neil,

I must thank you for these videos. They have helped me no end with my university paper!

The only improvement I can suggest is a little more information about how the number of validations relates to the data set (as in, if I put 10 into the box, would this use ten documents for validation or 10% of all the documents?), and a little more on how "Set Role" works. It took trial and error to get that working in my case.

But I must say, this is far more than I could have possibly hoped for help. Thank you again for taking the time to make these videos and sharing them!


Moohebat, March 19, 2013 at 5:33 AM
Hi, thank you very much. I have a question for you. I did all these steps correctly. RapidMiner created an AUC diagram for me, but it takes the wrong class as positive. How can I swap the positive and negative classes? Right now it selects the positive and negative class automatically. Where can we specify which one is positive or negative? Thanks.

Unknown, March 30, 2013 at 7:14 AM
@Neil Great tutorials you have there.
I have just started with RapidMiner and I am still exploring its capabilities. Could you help me understand something?

In order to apply either of the algorithms, k-NN or Naive Bayes, should the data be stored in the form of a table, or can it be supplied directly as text files?

Cheers

Unknown, November 2, 2013 at 1:18 AM
@Neil Thank you for this tutorial. It's really helpful.

But I have a question. I'm new to RapidMiner and I want to use Naive Bayes to classify my documents into 7 categories. I have a list of words for each category that I want to use as classifiers to determine the category of the document. I have followed your tutorial up to part 6, but I am getting very low accuracy.
Hope to hear from you soon! :)

Anonymous, December 13, 2013 at 5:46 AM
Hi,

Can we use the decision tree for text auto-classification? If so, what should we consider in the model?