Vancouver Data Blog by Neil McGuigan: Text Analytics With Rapidminer Part 2 of 6

Tuesday, November 9, 2010

Text Analytics With Rapidminer Part 2 of 6 - Processing Text

Wow, several hundred hits yesterday, thanks for watching everyone!

This is part two of a six-part video series on text mining in RapidMiner. This video describes how to process text to get a word frequency table. Topics covered include:

- reading hundreds of documents from a database into RapidMiner
- stripping HTML from content from a popular job posting board
- tokenizing a document (splitting it into words)
- removing overly common words (stopwords)
- finding roots of words (stemming)
- finding phrases in documents (n-grams)
- generating a word frequency table that describes important words in your documents
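
For readers who think in code, here is a rough Python sketch of the same idea outside RapidMiner: strip HTML, tokenize, drop stopwords, stem, and count. It is not the RapidMiner process from the video. The stopword list, the `crude_stem` helper, and the two sample postings are all assumptions made up for illustration, and a real stemmer (Porter, for example) is far more careful than the suffix stripping shown here.

```python
import re
from collections import Counter
from html import unescape

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "with"}


def strip_html(text):
    """Replace tags with spaces, then decode entities such as &amp;."""
    return unescape(re.sub(r"<[^>]+>", " ", text))


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_frequencies(documents):
    """Count stemmed, non-stopword tokens across all documents."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(strip_html(doc))
        counts.update(crude_stem(t) for t in tokens if t not in STOPWORDS)
    return counts


# Hypothetical job postings, invented for this example.
postings = [
    "<p>We are hiring <b>Java developers</b> in Vancouver.</p>",
    "<div>Senior developer with Java &amp; SQL skills</div>",
]
freq = word_frequencies(postings)
for word, count in freq.most_common(5):
    print(word, count)
```

Note how "developers" and "developer" collapse into one stem, so the frequency table counts them together.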

If you're not familiar with RapidMiner, you can see my other videos on my YouTube channel.

Here's the video:

Your feedback and sharing are appreciated!

Up tomorrow, finding association rules in documents.

37 comments:

Anonymous, November 9, 2010 at 4:32 PM
Great job! Thank you!
Anonymous, January 21, 2011 at 10:40 AM
Neil, I have done everything as described in the video from a Create Document and Process Document and I do not get the word list as shown. I followed each of your steps. What am I doing wrong?
Neil McGuigan, January 21, 2011 at 11:58 AM
I am not sure. Try to simplify the process as much as possible. Also use the Debugging facility, where you can put breakpoints before or after each operator. Right click an operator to add a breakpoint, then view the results when you run the process.
Anonymous, January 21, 2011 at 2:06 PM
I am getting no errors. If I start simply with Create Document, I can connect the pass-throughs, run the process, and see the text, so I know it reads. It is when I add "Process Document", connect the pass-throughs, and run the process that no values are generated in the "Word List", and no errors are shown. Here are some process logs. The first one concerns me; is it a possible problem?

Jan 21, 2011 2:00:27 PM INFO: No filename given for result file, using stdout for logging results!
Jan 21, 2011 2:00:27 PM INFO: Loading initial data.
Jan 21, 2011 2:00:27 PM INFO: Process //NewLocalRepository/New Test/New Test Process starts
Jan 21, 2011 2:00:29 PM INFO: Saving results.
Jan 21, 2011 2:00:29 PM INFO: Process //NewLocalRepository/New Test/New Test Process finished successfully after 1 s

Here is Output in the Results Overview.
Contains 0 entries.
This list has been created on 1 documents assigned to 1 labels

Thanks for the videos, they are great.
Anonymous, January 24, 2011 at 5:09 PM
Neil, I found a means of creating a wordlist. I had to add Wordlist to Data after the Process Documents. All is well for now.
Unknown, March 15, 2011 at 9:23 PM
Neil,

Thanks for the great videos. I am a manager of an internal IT help desk. We want to do some trend analysis on the trouble calls that we receive so that we can perform root cause analysis and eliminate the root problems from our environment. I have a single-column Excel file that I exported from our work order database, which contains the work order descriptions. I was hoping that I could use RapidMiner to analyze the data to find patterns. Basically I need to compare the records to themselves to see if any trends exist. Unfortunately, I'm dealing with unstructured data that can differ depending on who entered the work order in the first place. Each row in the Excel sheet represents a separate work order that was submitted. Could RapidMiner be useful in this application? I've made a few quick processes in RapidMiner using your videos, but I'm not exactly sure which operators I need to manipulate my data in a fashion that will be useful to me. Any tips you could give me to point me in the right direction?

Thanks again for the videos!
Neil McGuigan, March 20, 2011 at 2:15 PM
@JD,

Yes, rapidminer can help with this. you need to decide what you want to do, but it sounds like you need some more info before you can decide.

since you have no labeled column (or dependent/Y variable) all you can do for now is descriptive analysis and unsupervised learning, meaning you can describe each factor (which in this case would be a word), describe the relationships between each factor, and do the following unsupervised learning functions:

1. find similar observations (rows in excel)
2. find similar factors (words)
3. find which factors often co-occur.

i cover most of these in my videos. option 3 is the association analysis video. option 1 is the similarity video. the descriptive part is covered in word frequencies. you can also look at correlation among factors but that can take a lot of RAM.

i haven't covered option 2 yet, as it involves pretty heavy matrix algebra, but i will eventually. if you want to read up on principal components analysis (PCA) or singular value decomposition (SVD), then rapidminer can do those.

you will notice that my data is essentially the same as yours. it's just one big column of text with many rows. each cell is split into each word in rapidminer.
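
As a rough illustration of options 1 and 3 above, here is a plain Python sketch (not a RapidMiner process): each row of text becomes a bag of words, rows are compared with cosine similarity, and word pairs are counted by how many rows they share. The work-order rows are hypothetical, and cosine similarity is just one common choice of similarity measure, assumed here for the example.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Split one row of text into lowercase words and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cooccurrence_counts(bags):
    """For every pair of distinct words, count the rows that contain both."""
    pairs = Counter()
    for bag in bags:
        for pair in combinations(sorted(bag), 2):
            pairs[pair] += 1
    return pairs


# Hypothetical work-order descriptions, one per spreadsheet row.
rows = [
    "printer offline after update",
    "printer offline again",
    "password reset request",
]
bags = [bag_of_words(r) for r in rows]

sim_01 = cosine_similarity(bags[0], bags[1])  # rows that share words
sim_02 = cosine_similarity(bags[0], bags[2])  # rows with no words in common
pairs = cooccurrence_counts(bags)

print(round(sim_01, 3), round(sim_02, 3))
print(pairs[("offline", "printer")])
```

The pair counts are the raw material for association analysis: pairs that show up together in many rows are candidates for rules.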

otherwise, you may want to extract labeled data from your dataset, for example, the product name or severity of the issue. then you can use, say, naive bayes to see which factors are important (root causes, key indicators, etc). i'd be happy to help your company with that. feel free to email me, address on the contact page.

good luck

neil
Nuno AACosta, August 30, 2011 at 8:44 AM
Hello. I'm doing classification with RapidMiner. I generate and store the model. Then I open the model and try to classify new, unlabelled text. However, I need the word list to process the new unlabelled text. I know how to write the word list (WordsListToData operator) but I don't know a way to get from data back to a word list!

any help?
thanks
db_girl, November 9, 2011 at 8:23 AM
Hi, thx for the videos, they are really useful to me. I would need this dataset or a similar one, because I would like to learn it and try it myself. Where or how can I download this? Thanks again.
Neil McGuigan, November 9, 2011 at 9:01 AM
@ db_girl. You can make your own sample files just by copying and pasting some of the job postings
Neil McGuigan, November 9, 2011 at 9:02 AM
@db_girl. Actually, for parts 1-5, the data files are available as a download link in the comments on the first post
db_girl, November 10, 2011 at 8:30 AM
Oh, I didn't see it. Thanks a lot. :)
adam_redshaw, February 4, 2012 at 5:54 AM
Hi

Your videos are very helpful and I have managed to make a lot of progress. However, I am having difficulty collecting phrases. I wish to collect two words put together, such as 'happy birthday', rather than their single tokens 'happy' and 'birthday'. I'm not sure how to do this, as everything I have tried so far outputs a blank sheet.
Clare, February 22, 2012 at 2:33 AM
Hi Neil
I am a new user of RapidMiner and am finding your videos and blog really useful, so thank you :). My problem is a bit similar to the above in that I am trying to use RapidMiner to search my documents for a set of about 50 key terms and build a word frequency table based on whether or not the key terms appear in each doc. I have no problems reading the documents and processing them normally to create the frequency table; however, instead of all of the terms, I just want the table to be composed of the terms I am interested in. I had considered using the operator "keep document parts" but I am unsure whether this is the most efficient way of solving my problem! Any help truly appreciated.
Neil McGuigan, February 22, 2012 at 10:56 AM
@adam_redshaw, to find phrases, you will want to use the n-gram operator with a value of n of at least 2; n = 2 produces two-word phrases such as 'happy birthday'.
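
To make the idea concrete, here is a small Python sketch of term n-grams (an illustration only, not RapidMiner's operator): it emits every run of 1 up to `max_length` consecutive tokens. The function name `term_ngrams` and the underscore used to join the words are choices made for this example.

```python
def term_ngrams(tokens, max_length=2):
    """Return every run of 1..max_length consecutive tokens, joined with '_'."""
    grams = []
    for n in range(1, max_length + 1):
        # Slide a window of n tokens across the token list.
        for i in range(len(tokens) - n + 1):
            grams.append("_".join(tokens[i : i + n]))
    return grams


tokens = "happy birthday to you".split()
bigrams = term_ngrams(tokens, max_length=2)
print(bigrams)
```

With `max_length=2` the output keeps the single words and adds the two-word phrases, so 'happy_birthday' appears alongside 'happy' and 'birthday'.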
Gjorge, April 12, 2012 at 6:01 PM
Hi Neil,

Thank you for the videos. I have a question. If I want to mine a PDF or Word doc, which extraction operator can be used? Thank you in advance.
Neil McGuigan, April 12, 2012 at 7:00 PM
@ Gjorge. i just tried the Read Text operator with a PDF and it worked fine. You could also try an external java library to extract text from pdfs, such as http://pdfbox.apache.org/ .

Didn't work with a DOCX file though. For that you could try: http://poi.apache.org/index.html
Gjorge, April 16, 2012 at 8:28 PM
Hi Neil. I'm getting "com.rapidminer.operator.text.Document cannot be cast to com.rapidminer.example.ExampleSet". The sequence is: 1. Read Document (PDF) ---> 2. Process Documents from Data, containing 2a. Tokenize and 2b. Transform Cases. I'm trying to create a word vector. Thank you for your assistance.
Gerrard@Mesin Fotocopy, May 9, 2012 at 2:54 AM
Hey there, You have done a fantastic job. I will certainly digg it and personally suggest to my friends. I am confident they'll be benefited from this website.
hanisster, June 26, 2012 at 8:22 AM
Hi Neil. I want to perform the steps you've shown in one of your videos on processing documents and then do k-means.
I managed to do it in the GUI, but could I break the process into sub-processes (e.g. 1: tokenize, lowercase, stopwords; 2: stemming; 3: TF-IDF; 4: k-means)? If so, can you guide me on how? I want to call it from my own interface, using NetBeans. If possible, can you share the code that calls the .rmp and specifies its parameters, and also code to write the result to a txt file? I really need your help. Please.
Roya, November 8, 2012 at 5:28 PM
Hi, thank you for your great videos. I am a beginner. I understand all the steps but I don't know how I should store my documents.
I have about 30000 documents in XML format. How can ReadDatabase read them?

thank you very much
Unknown, December 1, 2012 at 9:42 AM
Neil,
I am working on some document clustering and have a problem. I have an excel file with a single column of text, and each row is a document. This seems to work well with the data to document operator, however once I have done this I cannot tokenize my data, since the tokenize operator seems to expect only a single document rather than many? I am stuck, and I have no idea how to work around this.
Thanks
Anonymous, December 16, 2012 at 11:48 AM
Hello, great video. It helps a lot. For people who are using Read Excel: you have to use Nominal to Text, otherwise it won't work.
Anonymous, May 7, 2013 at 8:46 PM
Hello, good evening.
Thanks for the text classification videos. What is the structure of the database input? That information would be useful to me.

Thanks for your reply.
Greetings.
Anonymous, July 12, 2013 at 4:12 PM
Hello, I am new to RapidMiner and I am trying to do SVD on some documents. I have one document per line in a file. How can I get RapidMiner to split this file into several documents? I tried using the cut document operator, but it refuses to accept the output of either the create document operator or the read document operator because they do not have any 'attribute's...
Moohebat, August 5, 2013 at 10:17 PM
Hi there. I have a question. I had a similar task and I followed your method in RapidMiner. The difference is that the content I select from the database comes from two categories, and I bring in that field as well. Is there any way to show which category these words belong to? When I look at the word list I cannot tell which word comes from which category! Any ideas?
sandeep, August 22, 2013 at 12:25 PM
Hi Neil,
I have a question regarding the end result of the process you explained above. How do you interpret the word list, the occurrences, and the documents in which each word was found? If I am comparing two websites or documents by their words, how do I find out from the end result which word occurred in which document? For example, can we add labels or anything to the result to display that 'document A' has this word 100 times, 'document B' has this word 200 times, etc.?
Unknown, October 21, 2013 at 5:32 AM
Hello,
I am new to RapidMiner and I am trying to count all the relevant words, term frequencies and so on, using TF-IDF vector creation with Process Documents from Files. I have three different document files in different folders. I added all of them under text directories in the Parameters tab, then did the vector creation using Tokenize and Filter Stopwords (English), and ran the process, but the resulting data shows up as special symbols. I need the output as English words, so what can I do about that?
Please reply.
Unknown, October 29, 2013 at 11:25 AM
Hi Neil,
I'm new to RapidMiner and I am trying to do text association rule mining on some documents. I have a problem with FP-Growth: when the process runs, FP-Growth (which comes before the association rules step) takes a long time even with a small number of files, and then the process fails with a memory error. How much memory is sufficient to run the process, and why is this happening? Please reply.
Unknown, October 29, 2013 at 10:19 PM
Hello, I am new to RapidMiner and I am trying to do text processing on different Word documents. When I count the most frequent words, I have a problem: some noisy data (special characters like ANª ,ANÃ,ANÌ ,ANù , AO ,AOoclóª ,AOã ,AP, etc.) shows up in the word list. Can you help me get rid of this noisy data? I want to do resume sorting and clustering using text mining techniques.
Unknown, October 30, 2013 at 4:56 AM
Hi Neil McGuigan, I am new to RapidMiner and I'm trying to segregate a group of sample resume documents for hiring purposes. In the data view I'm getting some noisy data that is not relevant to my data. Also, how does the plot view work? Can you please help me with this problem?