I'll be releasing a new video on text mining with RapidMiner every day this week.
They're each about 10 minutes long, go into a fair amount of detail, and should be easy to follow. Your feedback is appreciated!
Here is the first one. It covers loading text into RapidMiner in a variety of ways, from copy and paste to HTML files to database reads.
*NOTE: You may need to use the Nominal To Text operator to turn your text field into a field that RapidMiner understands as "text". It's under Data Transformation, Type Conversion.
Later this week:
Tuesday: Processing Text in RapidMiner - tokenizing, stripping HTML, stemming, stopwords, n-grams, and word frequency tables.
Wednesday: Association rules with text in RapidMiner - making word vectors, finding frequent item-sets and high-confidence association rules in text documents.
Thursday: Finding similar documents: how to automatically calculate the similarity between documents. TF-IDF, cosine similarity and K-Means clustering are covered.
Friday: Automatic classification: How to classify documents into classes (like positive/negative reviews, or spam/not spam or sports/finance/leisure news), and which words are important.
NEW: Applying A Model To New Documents
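For readers who want to see what Tuesday's processing steps amount to outside RapidMiner, here is a minimal Python sketch of stripping HTML, tokenizing, removing stopwords, building n-grams, and counting word frequencies. It is only an illustration under stated assumptions: the stopword list, the function names, and the underscore used to join n-grams are all made up for this example, stemming is left out, and none of it is RapidMiner's actual implementation.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; an assumption, not RapidMiner's list.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}


def strip_html(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


def ngrams(tokens, n):
    """Join every run of n consecutive tokens with an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def word_frequencies(text, max_n=2):
    """Count single tokens plus all n-grams up to max_n tokens long."""
    tokens = remove_stopwords(tokenize(strip_html(text)))
    terms = list(tokens)
    for n in range(2, max_n + 1):
        terms.extend(ngrams(tokens, n))
    return Counter(terms)
```

Calling `word_frequencies("<p>The cat and the hat.</p> The cat sat.")` returns a counter in which `cat` appears twice and the bigram `cat_hat` once. Note that the bigrams are built after stopword removal, so words that were not adjacent in the original text can end up in the same n-gram.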
Hope you enjoy them.
See my other data mining videos here
thank you very much for the tutorial!
Good start to this series, Neil!
WOW!!!
I've been working with RapidMiner for two months already to make a positive/negative classification... It's been so hard as there was no help in text mining :(
And you just posted a full scenario!!!
Can't wait till Friday!
Thanks a lot!
@Anonymous. No worries. I couldn't find a lot of in depth text mining videos either, which is why I decided to make 'em. Good luck with your project. Let us know more about it when you're done!
We need a link to where the dataset can be found.
Here is the sample data.
It's in zipped xls format.
Neil
Great Video!
Can you tell me the process of exporting my RapidMiner result into an Excel file? I have more than 500,000 records in my result.
I would greatly appreciate your response.
You can use Write Excel as Neil points out, but if you've already created the table and it took a while to run, you probably don't want to re-run with the Write Excel operator just to save the data. Unfortunately, there is no built-in support for exporting the table to Excel in the free version (there's a post to this effect somewhere on the RapidMiner forum), but you can copy and paste the entire table into Excel (you'll have to get the headers some other way, though, because I don't know how to copy them). --Pat
Hi Neil,
I have all the IT customer feedback information for various cases in one Excel sheet. If I want to do a few operations like tokenizing, stemming, etc., what would be the root operator that can read the data from Excel? Hope you got my question. Please ask me if you do not get it. An early response would be really helpful.
Thanks RK.
@RK, sorry for the delay, was out of town.
You should be able to use the Read Excel operator (use the search function to find it). You can only use .xls files, and not .xlsx.
@Vani, you can use the Write Excel operator to output your results to Excel. It may be too large for Excel to handle though, in which case you may want to consider a database, such as the free MS SQL Express.
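On the size concern: a worksheet in the legacy .xls format holds at most 65,536 rows, so 500,000 records will not fit on one sheet. If a database is not an option, one workaround is to split the rows into several CSV files that each stay under that limit. The Python sketch below illustrates the idea outside RapidMiner; the function names, the file naming scheme, and the assumption that the results are already available as rows are all hypothetical.

```python
import csv
from itertools import islice

XLS_MAX_ROWS = 65_536  # row limit of one worksheet in the legacy .xls format


def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def write_csv_chunks(header, rows, prefix, max_rows=XLS_MAX_ROWS):
    """Write rows to prefix_1.csv, prefix_2.csv, ... and return the paths.

    Every file starts with the header, so it holds max_rows - 1 data rows.
    """
    paths = []
    for index, chunk in enumerate(chunked(rows, max_rows - 1), start=1):
        path = f"{prefix}_{index}.csv"
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(chunk)
        paths.append(path)
    return paths
```

With the default limit, 500,000 rows would end up in eight files, each of which opens in Excel with its header intact.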
Can you provide a simple guide on how to configure the Write Excel operator, because I can't get it to work. Thanks.
Thank you very much for these videos on text analytics! They are not only informative, but you have made them easy to follow! You were looking for input on videos for either web crawling or web scraping. I would like to put my vote in for web scraping, although I can see how web crawling would be useful first. Thank you again.
Michael Kahler
Thank you very much for this tutorial. I have been trying to extract information from texts, but I have not been able to. Specifically, I want to extract protein names from full-text biological articles. Can you give me some hints on how to do that? I would really appreciate it. Are there any text plugins for RapidMiner that extract words from full-text articles?
@DMX check out the information extraction plugin for RapidMiner. I believe there is a link to it on the RapidMiner forum.
Hi Neil,
I would like to process Twitter messages, and my dataset is in Excel, but I could not find any video for further guidance. I tried to explore it myself, but it doesn't work. Do you have any notes or a video on that? Any help is much appreciated. Thank you.
Hi Neil,
When I tried the Process Documents from Files operator on more than one file, the results show only one file being handled. How can I solve this problem? Thank you...
Hi;
I've gone through your videos on RapidMiner's text mining capabilities and found them very interesting. Agility is currently reviewing different systems that provide text analytics capabilities. We are looking at a couple, one of them being the Calais system. This system has an example application (http://viewer.opencalais.com/) that demonstrates some of its capabilities. I was wondering if you are familiar with Calais, and whether you feel RapidMiner is comparable to it with respect to the type of output generated by the Calais test application?
Peter
Please extend the videos from 10 to 15 minutes, but go a little slower so that we beginners are able to follow you.
Hi Neil,
Do you have any video on "Clustering" through RapidMiner?
Regards
Gunjan
@gunjan video 4 briefly discusses clustering
Hi Neil, nice tut. I am working with the course co-training dataset (http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-51/www/co-training/data/). But when I use the "Process Documents from Files" operator to load the pages that are in the folder, it shows me an error: 'The data "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " is not legal for a JDOM comment: Comment data cannot start with a hyphen.'
Do you have any suggestions?
hi neil,
I tried to enter some text using your suggestion for Create Document, but it shows only code. I don't know what I am doing wrong. I have just downloaded RapidMiner, so your videos are really a blessing. Keep up the good work, please.
Very useful videos for beginners. Cannot thank you enough
Hi, Neil
Your videos about text mining with RapidMiner are really good, but I am from China and it is difficult to access YouTube since it is blocked in China. Would you please share these videos with me? My Gmail account is zhxupc@gmail.com.
Thanks,
Xiao
So I was trying to do the association learning example on Wikipedia with the bread, butter, milk and beer example set, and I got it to work, just not by using Excel as my imported data source. I had to create 5 different text files, one with each customer's grocery list. When I tried to create a table in Excel (first column = customer's name, second column = sale 1, third column = sale 2, fourth column = sale 3, fifth column = sale 4; first row = labels, rows 2-6 = each customer's sales), I could not create a binary table of true and false values in RapidMiner. I think I was able to import my data correctly, because when I put a breakpoint after the Process Documents operator and examined it, it looked exactly like it does in Excel. I was wondering if you could perhaps tell me what I'm doing wrong. The only thing I can think of is that I'm splitting up the data in Excel across too many columns.
Kindest,
Neil P
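For anyone with the same question, it may help to see the shape of the table that frequent item-set mining works on: one row per customer and one true/false column per item, rather than one column per sale. The Python sketch below builds such a table and computes support and confidence by hand. The baskets are hypothetical illustration data, not the commenter's spreadsheet, and the function names are made up for this example; it does not diagnose what went wrong in the process described above.

```python
# Hypothetical baskets for illustration only; not the commenter's actual data.
baskets = {
    "customer1": ["milk", "bread"],
    "customer2": ["butter"],
    "customer3": ["beer"],
    "customer4": ["milk", "bread", "butter"],
    "customer5": ["bread"],
}


def binary_table(baskets):
    """Return the sorted item list and a customer -> {item: bool} table."""
    items = sorted({item for basket in baskets.values() for item in basket})
    table = {
        name: {item: item in basket for item in items}
        for name, basket in baskets.items()
    }
    return items, table


def support(table, itemset):
    """Fraction of rows in which every item of the itemset is True."""
    hits = sum(all(row[item] for item in itemset) for row in table.values())
    return hits / len(table)


def confidence(table, premise, conclusion):
    """support(premise and conclusion) / support(premise).

    Assumes the premise occurs in at least one row.
    """
    return support(table, premise | conclusion) / support(table, premise)


items, table = binary_table(baskets)
```

Here `table["customer2"]` is true only for `butter`, and `confidence(table, {"milk"}, {"bread"})` divides the share of baskets containing both milk and bread by the share containing milk.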
Hi, your videos are great. I was wondering if you have any idea how to combine Google Analytics reports with text mining techniques that you present here. Thanks!
Hi, your videos are a great resource. I'm working on sentiment analysis, where I have text/sentences labeled as positive, negative, or mediocre, in CSV format. I'm applying Process Documents (transform case, tokenize, stopwords, stemming, and n-grams) --> X-Validation --> Naive Bayes --> Apply Model --> Performance (Classification).
My problem is that the above process gives me the prediction, whereas I'd like to get the probability distribution of the text given the category. Is there any other operator that I can use to calculate the probability distribution for each category (positive, negative, mediocre)? Also, the cross-entropy and log loss performance measures come out as infinite.
Any pointers?? I appreciate your help..
Use the Naive Bayes Process
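To make the question concrete, here is a minimal Python sketch of a multinomial Naive Bayes classifier that returns a probability for every class instead of a single prediction. It also shows one common reason a log loss comes out infinite: the model assigns probability zero to the true class, and log(0) is unbounded. Laplace smoothing and clipping avoid that here. Whether this is what happens inside RapidMiner for this process is an assumption; the function names, the smoothing constant, and the training data in the usage note are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (tokens, label). Returns the counts the model needs."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict_proba(model, tokens, alpha=1.0):
    """Return {label: probability}; alpha is the Laplace smoothing constant."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, doc_count in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)  # class prior
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            numerator = word_counts[label][token] + alpha
            denominator = total_words + alpha * len(vocabulary)
            score += math.log(numerator / denominator)
        log_scores[label] = score
    # Normalise in log space so long documents do not underflow to zero.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}


def log_loss(true_labels, probabilities, eps=1e-15):
    """Mean negative log probability of the true class, clipped at eps."""
    losses = [-math.log(max(p[y], eps)) for y, p in zip(true_labels, probabilities)]
    return sum(losses) / len(losses)
```

For example, after `model = train([(["great", "fun"], "positive"), (["awful", "boring"], "negative")])`, the call `predict_proba(model, ["great"])` returns a probability for both classes that sums to 1, with `positive` the larger of the two.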
Hi Neil, a few weeks ago I tried to replicate the example you presented in video 3 (Association rules with text) from an Excel data source that contained people's comments, one per cell in the first column, but the binary table is not created when I apply the Tokenize operator. I also tried it with each comment split into a separate Excel file, and there it works. I would greatly appreciate it if you could help me solve this complication.
Much better than the training from the RapidMiner guys :-( Thanks a lot, Neil.
Hi Neil,
Thanks a lot for the videos, watched them all and they are very concise, to the point and really helpful for beginners to get started quickly with the tool. I was wondering if you can give me some additional hints regarding text clustering. This area is often side-lined in ML since it’s unsupervised. (working with Weka at the moment. )
I am trying to cluster a whole set of text definitions using K-means, for example: flu - common symptom of cold, etc. I have a couple of thousand definitions like this. The idea is that I present a document or abstract and, based on the clusters I created from the definitions, it decides what the document is predominantly talking about by seeing which cluster it falls closest to. Like you suggest in the video, the process is to use the bag of words, filter for stop words and create vectors. However, being an unsupervised method, I’m not quite sure how to go about it.
Again your videos are great.
Thanks,
Daniel
Hi DMAN,
Glad you liked the videos. If you check out video 4, it shows how to do text clustering. Also, the upcoming RapidMiner book has some good stuff on clustering as well.
Cheers,
Neil
Hey Neil,
Thank you for the reply.
I looked at all the videos, you have no idea how much they helped me. I just asked the question to go beyond the normal functionality, so to speak. Clustering techniques are a bit overshadowed, especially evaluation measures that give you a real understanding of the quality of results.
What book are you talking about please...
regards,
Daniel
Hi Neil!
Thanks for the videos.
I'm new to RapidMiner, and I want to classify customers' reviews from online shops. I think RapidMiner fits well. What can you suggest? How can I start? I tried to copy the reviews in as text, but it didn't give the desired result. Please help! :)
Thanks.
Hi Neil
Thanks for the videos as they are quite helpful.
I am quite new to RapidMiner and I am working on a project on text clustering with the k-means algorithm. Can you please suggest how to do this in RapidMiner? Can I use a database in Excel for text clustering with k-means?
Regards
Neil, God bless you, nicest video ever. Very helpful... Neil, what about the WordNet extension? Could you help me figure it out? I need to use it to apply clustering with synonym words inside documents.
Hi Neil
I'm doing a project in RapidMiner and I'm trying to connect these three operators: Read Excel - Get Pages - database. It should be easy, but I'm getting some errors like "Write Database....java.lang.NullPointerException: Identifier must not be null", or MySQL throws an exception. I'm quite new to this field, so I don't know where I went wrong: in my database setup/connection, or whether I get this error because all the data I'm trying to fetch is from www.daft.ie and their server is probably stopping me.
Can you help me with? Do you have any ideas that could guide me?
Many thanks.
Hi Neil,
Thanks for the great videos :). I have a question which I hope you can answer.
When reading the columns ID and MESSAGE from a CSV file, is it possible to keep the ID field when using the Process Documents from Data operator, so that after tokenizing I keep the relation between each word and its sentence? Thanks!
Hello Neil,
I am trying to extract certain words on the basis of a wordlist dictionary I created.
But somehow I am unable to do it.
Can you please suggest how to do this in RapidMiner?
Thanks!!!!
Hello,
I am using the Process Documents from Files operator.
Is there any way to extract the tokens which match one of the words in the list of words in a wordlist?
Neil,
After completing different processes of text mining, I get 10,000+ tokens in the output from 500 documents. I am trying to export them and create a database. Can you please help me by suggesting the steps to export the process results to a database / Excel sheet?
Thanks a lot.
@NEIL
Hi
I am working on a project on the extraction and analysis of faculty performance in the management discipline from student feedback. Can you please suggest methods that will help me extract the data and tokenize it? Please help me, I am in a hurry...
Hello Neil,
Thanks for all your videos and help. You're great! :) I've got a question concerning the "Process Documents from Data" operator. I want RapidMiner to open downloaded HTML files on my hard disk and process them. I let it "Read a CSV" file that contains about 50 file paths of the HTML files I'd like to process. That works well, but it doesn't open the files listed in the CSV to process their content. Is there any way to make RapidMiner open multiple file paths (that are not in the same directory), read the HTML files and process them?
I would be very thankful if you could give me some advice.
Best regards,
Enrico
Hello M. McGuigan,
Thanks for the great videos! Very, very, very helpful!!! One short question: I'm doing k-means clustering on 100 objects that were created using the «Cut Document» operator. When looking at the results in the ExampleSet sheet, the column «text» only shows the tokens used for the classification process, but not the full text of the objects. Is there something I can do to get access to the full text of the items in each cluster?
Thanks for your support
John D.
Hello, I want to check the similarity between 2 files. Which operator is best to use: the Data to Similarity operator or the Cross Distances operator?
Thanks
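Whichever operator is used, the underlying calculation for document similarity is usually the one covered in Thursday's video: turn each document into a TF-IDF vector and take the cosine of the angle between the vectors. The Python sketch below shows that calculation for a handful of documents. It is an illustration only, with made-up function names and a smoothed IDF formula chosen for this example, and it does not reproduce what either RapidMiner operator does internally.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_vectors(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [tokenize(document) for document in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF (an assumption for this sketch) keeps terms that
            # occur in every document from getting a weight of zero.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The smoothing matters when only two files are compared: with a plain log(N/df) weight, every word the two files share would score zero and the similarity would always come out as 0.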
Hi Neil,
I feel great having your guidance on using RapidMiner to process text, but with the new version of RapidMiner I am not able to process the document data using Tokenize. May I know what the problem is?
The new version of the Process Documents from Data operator doesn't have "create word vector".
Thank you so much Neil, this video is so helpful!
Hi,
Thanks for this. Part 6: "Applying A Model To New Documents" on YouTube should be included in the text analytics playlist on that site. Right now, it is not. That means you have to either find it or stumble across it separately on your YouTube channel. I only now discovered that there even was a Part 6 to this tutorial series due to that. I think for ease of use, you should fix this. I know that Part 6 was done after the fact, but it naturally belongs with the rest of the series.
Other than that small issue, these tutorials have been great and very helpful! Thank you!
Wow, thanks a lot for these tutorials. I'm new to RapidMiner and this was exactly what I was looking for on the internet.
Thanks for your excellent guide, man
Data Mining
How do I export the results in Excel format?
Very, very useful! Thank you for the effort!
I am new to RapidMiner. I have installed RapidMiner on Windows 8, but I don't have the "Update RapidMiner" option that I need in order to update Text Processing and Web Mining; I only have "Update RapidMiner Marketplace". How can I update Text Processing and Web Mining?