Saturday, October 8, 2011

Text Analytics with RapidMiner Part 6 of 6 - Applying the Model to New Documents

After my last series, I got a lot of questions about how to apply a model to new data, so here is the real final installment in the series.

I show how to save the wordlist and the model to the repository, then read them back later and apply them to new documents that RapidMiner hasn't seen before. It correctly labels 11 of the 12 new documents.
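For anyone who wants to see the same idea outside RapidMiner, here is a rough Python sketch of the equivalent workflow using scikit-learn and joblib; the file names, labels, and sample documents are purely illustrative and not the process from the video. The fitted vectorizer plays the role of the stored wordlist, and the fitted classifier plays the role of the stored model.

```python
# Rough analogue of the process in the video: store the "wordlist" and the
# model after training, then retrieve both and apply them to unseen documents.
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# --- Training phase ---
train_docs = ["the team won the game", "parliament passed the bill"]
train_labels = ["sports", "politics"]

vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
model = MultinomialNB().fit(X_train, train_labels)

joblib.dump(vectorizer, "wordlist.joblib")   # like storing the wordlist
joblib.dump(model, "model.joblib")           # like storing the model

# --- Scoring phase, later, on documents the model has never seen ---
vectorizer = joblib.load("wordlist.joblib")  # retrieve the wordlist
model = joblib.load("model.joblib")          # retrieve the model

new_docs = ["a great game for the team", "a new bill in parliament"]
X_new = vectorizer.transform(new_docs)       # reuse the SAME wordlist
print(model.predict(X_new))                  # e.g. ['sports' 'politics']
```

The key point, as in the video, is that the new documents must be vectorized with the stored wordlist rather than a freshly built one, so the features line up with what the model was trained on.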



Files from the video.

19 comments:

  1. Hello!
    I'm a big fan of your tutorials and they have been helping me a lot with my text mining thesis.
    I wanted to do the text analysis you do in your second video (tokenize, remove stopwords, filter by length)... but I wanted to keep my sentences as the final result. I only get words and their frequencies. I want to get a clean sentence as the final result (with stop words, etc.).
    Can you help me?

  2. Hi Fernanda,

    I'm glad you liked the tutorials - I enjoy doing them.

    Make sure your Text Processing extension is up to date (top menu > Help > Update RapidMiner).

    In the Tokenize operator, set Mode to "Linguistic Sentences". That will tokenize the document into sentences.

    Hope that helps, and good luck
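    P.S. If you ever need to do this step outside RapidMiner, here is a rough sketch of sentence-level tokenization in Python with NLTK (just an illustration of the idea, not what the operator does internally):

    ```python
    # Split raw text into sentences rather than words, roughly what
    # Tokenize does with Mode set to "Linguistic Sentences".
    import nltk

    nltk.download("punkt", quiet=True)  # the sentence tokenizer models

    text = "RapidMiner is a data mining tool. It has a text processing extension."
    print(nltk.sent_tokenize(text))
    # ['RapidMiner is a data mining tool.', 'It has a text processing extension.']
    ```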

  3. Mr. McGuigan,

    I need some help with RapidMiner on short notice, and I am willing to pay you for your services. Would you be willing to help me? Can we find a way to discuss the matter in private?

  4. Hi Yohel,

    If you google "neil mcguigan sauder", my email address is on the first link. You can contact me there.

    Regards,

    Neil

  5. Hi Neil,

    Thanks a ton for the great video tutorial. I'm trying to repeat your steps in this video, but somehow the "store wordlist" and "get wordlist" operators no longer exist.
    I'm using RapidMiner 5.1.011 and I made sure it's up to date.

  6. It looks like the Write/Read (io type: wordlist) operators can do the trick.

  7. Great example. Now I'm wondering: how would you do that with text coming from a database? I've successfully created the classifier, but I get a per-n-gram list instead of a per-id result, as you do...

  8. Hi Neil,

    I successfully followed your example here. Now: how can I extend this process by adding one more level of categorization and some sentiment analysis? Is it difficult? Thanks!!!

  9. Hi, thank you very much for the great tutorials!

    I followed the steps of this tutorial successfully and it works perfectly. I tried to classify Greek news documents and the results are very promising. I have one question: for the most obvious categories like sports, politics, etc., the algorithms work very well using words and their frequencies. But if I want to do a more complex analysis, is it possible to use the "tokenize" operator successfully (with Mode set to "Linguistic Sentences") for the Greek language? Greek is not available in the list, and if I choose "English" I get an error. Can I customize it?

    Also, I would like to include some word clusters or collocations (and their frequencies) as classification criteria, not only separate words. Is that possible? (I sketch below what I mean.)

    Thanks!

    Giannis
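    P.S. In Python terms, this is roughly what I mean by counting word pairs as features, sketched with scikit-learn rather than RapidMiner (a toy example of my own):

    ```python
    # Count two-word sequences alongside single words, i.e. use simple
    # collocations as extra features for classification.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the prime minister spoke", "the football team won the match"]

    # ngram_range=(1, 2) keeps single words AND word pairs
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())
    # includes entries such as 'prime minister' and 'football team'
    ```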

  10. Hi

    I have a question: where are the (store wordlist) and (store model) operators? I can't find them.

    One more question:
    when I define a new SQL Server connection and press the Test button, the following message is shown:

    SSO failed: native SSPI library not loaded, check the java.library.path system property

    Can you help?


    Thank you

  11. Neil, thanks very much for developing these videos. They are as useful as the tutorials in the Gary Miner, Dursun Delen, et al. textbook I just bought for $100, which mostly covers the commercial packages. Thanks for helping to open up this field to the newbies.

  12. I am John Morales, a journalist based in Manila, Philippines. Currently I am working on a story on the performance of our lawmakers. I would like to investigate how each of our 200-plus lawmakers is performing based on three criteria: legislation, attendance, and participation in parliamentary proceedings.

    I have already generated data on legislation (which looks at the rate at which the bills they file become law or pass the third reading) and on attendance. The last indicator is proving hard to generate.

    To assess whether a lawmaker is participating in plenary sessions, my primary source document is the congressional record, which details the activities of lawmakers in a plenary session. Each record covers one whole-day session.

    I have 84 text files, covering 84 plenary sessions. I have removed the parts where lawmakers' names appear without them necessarily participating in plenary debates, such as an invocation delivered by a lawmaker or the secretary general reading the titles of bills they authored. This is to remove the bias our method would otherwise create.

    I would like to use RapidMiner to process the text so I can look at the total occurrences of the names in the records. So, for instance, if a name occurs in all 84 plenary sessions, that lawmaker is participating 100% of the time in parliamentary proceedings (I sketch the idea below).

    I have been trying to follow the process, but have failed many times. There were times when I could not see the occurrences of the text, and I could not find a way to search for the text I wanted to look at. I am also having problems with generating reports.

    I hope you can help me in my endeavor.

    If it is okay, this is my email: alliage.morales@gmail.com

    I would appreciate it if you could look at my files first-hand.

    Yours respectfully,
    John Morales
    Researcher
    GMA Network
    Philippines
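    P.S. In Python terms, this is roughly the count I am trying to produce; the file path and the names below are only placeholders:

    ```python
    # For each lawmaker, count in how many of the 84 session records the name
    # appears at least once, then turn that into a participation percentage.
    import glob

    lawmakers = ["Dela Cruz", "Santos", "Reyes"]        # the real list has 200+
    records = glob.glob("congressional_records/*.txt")  # the 84 text files

    sessions_present = {name: 0 for name in lawmakers}
    for path in records:
        with open(path, encoding="utf-8") as f:
            text = f.read().lower()
        for name in lawmakers:
            if name.lower() in text:
                sessions_present[name] += 1

    for name, count in sessions_present.items():
        pct = 100.0 * count / len(records)
        print(f"{name}: {count} of {len(records)} sessions ({pct:.0f}%)")
    ```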

  13. Hi,
    I am working on social media analytics and am very new to RapidMiner. Can you guide me on how to do sentiment analysis in RapidMiner?

  14. Really enjoyed your first five tutorials on text mining using RapidMiner. If you could take a minute to post a zipped sample data file to use with this (6th) tutorial, as you did in a post for your first one, it would be very much appreciated. Thanks.

  15. Hi Neil,

    I have a quick question. Is it also possible to "review" a model, to see which elements it uses to decide which category a text should be assigned to?

    Thank you very much & I have really enjoyed your tutorials on text mining!

    With kind regards,
    Lennart

  16. Amazing tutorials on text mining with RapidMiner! I have a question though: I have two folders, each containing only category1 or category2 text files, and I use those folders as a training sample for the model. Then I apply the model to a third folder with undefined files, everything as learned from your tutorials. But in my undefined folder there might also be files that belong to neither category 1 nor category 2. How do I deal with these files? Is there an option to limit classification to a certain percentage of similarity and abandon the files that don't fit in?
    Thank you very much! I would appreciate a reply, but even if you don't, your tutorials were still a great help!!!
    Please don't stop making them!))

  17. Hello! Thank you very much for the tutorials on text mining, they were super helpful. I've got a question: I have two folders with category1 texts and category2 texts separated, and I use them as a training sample for the model. Then I apply the model to the Undefined folder (with texts of the same format which are not yet categorized), everything as in your tutorial 6. But the files in the Undefined folder don't necessarily belong to category1 or category2; they might fall outside both categories. How can I add a constraint so that the program discards (leaves undefined) the files whose confidence is less than, let's say, 0.6? (I sketch below what I mean.)
    Thank you!
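    In Python terms, the kind of cut-off I mean would look something like this (a scikit-learn sketch, not a RapidMiner process; all the names and documents are made up):

    ```python
    # Keep a prediction only if the highest class confidence is at least 0.6;
    # otherwise leave the document undefined.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    train_docs = ["goal scored in the final minute", "parliament passed the new budget"]
    train_labels = ["category1", "category2"]

    vectorizer = TfidfVectorizer()
    model = LogisticRegression().fit(vectorizer.fit_transform(train_docs), train_labels)

    new_docs = ["the striker scored a late goal", "a recipe for lemon cake"]
    proba = model.predict_proba(vectorizer.transform(new_docs))

    for doc, p in zip(new_docs, proba):
        best = int(np.argmax(p))
        label = model.classes_[best] if p[best] >= 0.6 else "undefined"
        print(f"{doc!r} -> {label} (confidence {p[best]:.2f})")
    ```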
