Saturday, October 8, 2011

Text Analytics with RapidMiner Part 6 of 6 - Applying the Model to New Documents

After my last series, I got a lot of questions about how to apply a model to new data, so here is the real final installment in the series.

I show how to save the wordlist and the model to the repository, then read them back later and apply them to new documents that RapidMiner hasn't seen before. It correctly labels 11 of the 12 new documents.
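For anyone who wants to see the same idea outside RapidMiner, here is a rough Python sketch of the equivalent workflow using scikit-learn and joblib; the file names, labels, and sample documents are purely illustrative and not the process from the video. The fitted vectorizer plays the role of the stored wordlist, and the fitted classifier plays the role of the stored model.

```python
# Rough analogue of the process in the video: store the "wordlist" and the
# model after training, then retrieve both and apply them to unseen documents.
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# --- Training phase ---
train_docs = ["the team won the game", "parliament passed the bill"]
train_labels = ["sports", "politics"]

vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
model = MultinomialNB().fit(X_train, train_labels)

joblib.dump(vectorizer, "wordlist.joblib")   # like storing the wordlist
joblib.dump(model, "model.joblib")           # like storing the model

# --- Scoring phase, later, on documents the model has never seen ---
vectorizer = joblib.load("wordlist.joblib")  # retrieve the wordlist
model = joblib.load("model.joblib")          # retrieve the model

new_docs = ["a great game for the team", "a new bill in parliament"]
X_new = vectorizer.transform(new_docs)       # reuse the SAME wordlist
print(model.predict(X_new))                  # e.g. ['sports' 'politics']
```

The key point, as in the video, is that the new documents must be vectorized with the stored wordlist rather than a freshly built one, so the features line up with what the model was trained on.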



Files from the video.

19 comments:

  1. Hello!
    I'm a big fan of your tutorials and they have been helping me a lot with my text mining thesis.
    I wanted to do the text analysis you do in your second video (tokenize, remove stopwords, filter by length)... but I wanted to keep my sentences as the final result. I only get words and their frequencies. I want to get a clean sentence as the final result (with stop words, etc.).
    Can you help me?

  2. Hi Fernanda,

    I'm glad you liked the tutorials - I enjoy doing them.

    Make sure your Text Processing extension is up to date (top menu > Help > Update RapidMiner).

    In the Tokenize operator, set Mode to "Linguistic Sentences". That will tokenize the document into sentences.

    Hope that helps, and good luck
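    P.S. If you ever need to do this step outside RapidMiner, here is a rough sketch of sentence-level tokenization in Python with NLTK (just an illustration of the idea, not what the operator does internally):

    ```python
    # Split raw text into sentences rather than words, roughly what
    # Tokenize does with Mode set to "Linguistic Sentences".
    import nltk

    nltk.download("punkt", quiet=True)  # the sentence tokenizer models

    text = "RapidMiner is a data mining tool. It has a text processing extension."
    print(nltk.sent_tokenize(text))
    # ['RapidMiner is a data mining tool.', 'It has a text processing extension.']
    ```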

  3. Mr. McGuigan,

    I need some help with RapidMiner on short notice, and I am willing to pay you for your services. Would you be willing to help me? Can we find a way to discuss the matter in private?

  4. Hi Yohel,

    If you google "neil mcguigan sauder", my email address is on the first link. You can contact me there.

    Regards,

    Neil

  5. Hi Neil,

    Thanks a ton for the great video tutorial. I'm trying to repeat your steps in this video, but somehow the "store wordlist" and "get wordlist" operators no longer exist.
    I'm using RapidMiner 5.1.011 and I made sure it's up to date.

  6. It looks like the Write/Read (io type: wordlist) operators can do the trick.

  7. Great example. Now I'm wondering: how would you do that with text coming from a database? I've successfully created the classifier, but I get a per-n-gram list instead of a per-id result, as you do...

  8. Hi Neil,

    I successfully followed your example here. Now: how can I extend this process by adding one more level of categorization and some sentiment analysis? Is it difficult? Thanks!!!

  9. Hi, thank you very much for the great tutorials!

    I followed the steps of this tutorial successfully and it works perfectly. I tried to classify Greek news documents and the results are very promising. I have one question: for the most obvious categories like sports, politics, etc., the algorithms work very well using words and their frequencies. But if I want to do a more complex analysis, is it possible to use the "tokenize" operator successfully (with Mode set to "Linguistic Sentences") for the Greek language? Greek is not available in the list, and if I choose "English" I get an error. Can I customize it?

    Also, I would like to include some word clusters or collocations (and their frequencies) as classification criteria, not only separate words. Is that possible? (I sketch below what I mean.)

    Thanks!

    Giannis
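    P.S. In Python terms, this is roughly what I mean by counting word pairs as features, sketched with scikit-learn rather than RapidMiner (a toy example of my own):

    ```python
    # Count two-word sequences alongside single words, i.e. use simple
    # collocations as extra features for classification.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the prime minister spoke", "the football team won the match"]

    # ngram_range=(1, 2) keeps single words AND word pairs
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())
    # includes entries such as 'prime minister' and 'football team'
    ```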

  10. Hi

    I have a question: where are the (store wordlist) and (store model) operators? I can't find them.

    One more question:
    when I define a new SQL Server connection and press the Test button, the following message is shown:

    SSO failed: native SSPI library not loaded, check the java.library.path system property

    Can you help?


    Thank you

  11. Neil, thanks very much for developing these videos. They are as useful as the tutorials in the Gary Miner, Dursun Delen, et al. textbook I just bought for $100, which mostly covers the commercial packages. Thanks for helping to open up this field to the newbies.

  12. I am John Morales, a journalist based in Manila, Philippines. Currently I am working on a story on the performance of our lawmakers. I would like to investigate how each of our 200-plus lawmakers is performing based on three criteria: legislation, attendance, and participation in parliamentary proceedings.

    I have already generated data on legislation (which looks at the rate at which the bills they file become law or pass the third reading) and on attendance. The last indicator is proving hard to generate.

    To assess whether a lawmaker is participating in plenary sessions, my primary source document is the congressional record, which details the activities of lawmakers in a plenary session. Each record covers one whole-day session.

    I have 84 text files, covering 84 plenary sessions. I have removed the parts where lawmakers' names appear without them necessarily participating in plenary debates, such as an invocation delivered by a lawmaker or the secretary general reading the titles of bills they authored. This is to remove the bias our method would otherwise create.

    I would like to use RapidMiner to process the text so I can look at the total occurrences of the names in the records. So, for instance, if a name occurs in all 84 plenary sessions, that lawmaker is participating 100% of the time in parliamentary proceedings (I sketch the idea below).

    I have been trying to follow the process, but have failed many times. There were times when I could not see the occurrences of the text, and I could not find a way to search for the text I wanted to look at. I am also having problems with generating reports.

    I hope you can help me in my endeavor.

    If it is okay, this is my email: alliage.morales@gmail.com

    I would appreciate it if you could look at my files first-hand.

    Yours respectfully,
    John Morales
    Researcher
    GMA Network
    Philippines
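    P.S. In Python terms, this is roughly the count I am trying to produce; the file path and the names below are only placeholders:

    ```python
    # For each lawmaker, count in how many of the 84 session records the name
    # appears at least once, then turn that into a participation percentage.
    import glob

    lawmakers = ["Dela Cruz", "Santos", "Reyes"]        # the real list has 200+
    records = glob.glob("congressional_records/*.txt")  # the 84 text files

    sessions_present = {name: 0 for name in lawmakers}
    for path in records:
        with open(path, encoding="utf-8") as f:
            text = f.read().lower()
        for name in lawmakers:
            if name.lower() in text:
                sessions_present[name] += 1

    for name, count in sessions_present.items():
        pct = 100.0 * count / len(records)
        print(f"{name}: {count} of {len(records)} sessions ({pct:.0f}%)")
    ```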

  13. Hi,
    I am working on social media analytics and am very new to RapidMiner. Can you guide me on how to do sentiment analysis in RapidMiner?

  14. Really enjoyed your first five tutorials on text mining using RapidMiner. If you could take a minute to post a zipped sample data file to use with this (6th) tutorial, as you did in a post for your first one, it would be very much appreciated. Thanks.

  15. Hi Neil,

    I have a quick question. Is it also possible to "review" a model, to see which elements it uses to decide which category a text should be assigned to?

    Thank you very much & I have really enjoyed your tutorials on text mining!

    With kind regards,
    Lennart

  16. Amazing tutorials on text mining with RapidMiner! I have a question though: I have two folders, each containing only category1 or category2 text files, and I use those folders as a training sample for the model. Then I apply the model to a third folder with undefined files, everything as learned from your tutorials. But in my undefined folder there might also be files that belong to neither category 1 nor category 2. How do I deal with these files? Is there an option to limit classification to a certain percentage of similarity and abandon the files that don't fit in?
    Thank you very much! I would appreciate a reply, but even if you don't, your tutorials were still a great help!!!
    Please don't stop making them!))

  17. Hello! Thank you very much for the tutorials on text mining, they were super helpful. I've got a question: I have two folders with category1 texts and category2 texts separated, and I use them as a training sample for the model. Then I apply the model to the Undefined folder (with texts of the same format which are not yet categorized), everything as in your tutorial 6. But the files in the Undefined folder don't necessarily belong to category1 or category2; they might fall outside both categories. How can I add a constraint so that the program discards (leaves undefined) the files whose confidence is less than, let's say, 0.6? (I sketch below what I mean.)
    Thank you!
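    In Python terms, the kind of cut-off I mean would look something like this (a scikit-learn sketch, not a RapidMiner process; all the names and documents are made up):

    ```python
    # Keep a prediction only if the highest class confidence is at least 0.6;
    # otherwise leave the document undefined.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    train_docs = ["goal scored in the final minute", "parliament passed the new budget"]
    train_labels = ["category1", "category2"]

    vectorizer = TfidfVectorizer()
    model = LogisticRegression().fit(vectorizer.fit_transform(train_docs), train_labels)

    new_docs = ["the striker scored a late goal", "a recipe for lemon cake"]
    proba = model.predict_proba(vectorizer.transform(new_docs))

    for doc, p in zip(new_docs, proba):
        best = int(np.argmax(p))
        label = model.classes_[best] if p[best] >= 0.6 else "undefined"
        print(f"{doc!r} -> {label} (confidence {p[best]:.2f})")
    ```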
