On this page I provide useful hints and tips (recipes) for working
with text data in WEKA. The information
is organized as a list of blog posts and references, plus additional material
such as code and text collections.
I suggest reading my posts on text classification with WEKA in
publication order:
Text Mining in WEKA: Chaining Filters and Classifiers explains how and why
you should chain filters and classifiers when evaluating your text classifiers
using cross-validation. The explanation uses the Explorer tools, and it serves
as a quick introduction to the process of building a text classifier in WEKA.
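The chaining idea described in that post can be sketched in a few lines of Java. This is a minimal sketch, not the post's exact code: the dataset file name is a placeholder, and the class attribute position depends on your own ARFF file.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ChainedCrossValidation {
  public static void main(String[] args) throws Exception {
    // Hypothetical ARFF file with a string attribute and a nominal class
    Instances data = DataSource.read("spam.arff");
    data.setClassIndex(data.numAttributes() - 1);

    // Chain the filter and the classifier so that, in each fold,
    // the vocabulary is built from the training split only
    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(new StringToWordVector());
    fc.setClassifier(new NaiveBayes());

    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(fc, data, 10, new Random(1));
    System.out.println(eval.toSummaryString());
  }
}
```

The key design point is that FilteredClassifier re-applies the filter inside every fold, avoiding the leakage you would get by vectorizing the whole dataset first.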
Text Mining in WEKA Revisited: Selecting Attributes by Chaining Filters describes
how to complete the life-cycle of the learning process by adding feature
selection to it.
Command Line Functions for Text Mining in WEKA presents how to perform the previous
experiments with the
MultiFilter classes, but in the command line interface instead
of WEKA's Explorer.
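As a rough sketch of what the command-line route looks like, the chained evaluation above can be launched like this (file names and the weka.jar path are placeholders, not taken from the post):

```shell
# 10-fold cross-validation of a chained StringToWordVector + NaiveBayes setup;
# spam.arff is a placeholder for your own dataset
java -cp weka.jar weka.classifiers.meta.FilteredClassifier \
  -t spam.arff -c last -x 10 \
  -F "weka.filters.unsupervised.attribute.StringToWordVector" \
  -W weka.classifiers.bayes.NaiveBayes
```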
Simple Text Classifier in Java with WEKA presents and discusses two small
programs as examples of how to integrate WEKA into your Java code for text
classification.
Text Classification with WEKA, Part 1: Data Analysis shows an application
of text classification to processing URL text as a complement to URL
database-based filtering in Web filters. This first post just explains how
I built the dataset, while an upcoming post will explain my ongoing work.
Vocabulary from Train to Test Datasets in WEKA Text Classifiers discusses
three ways of mapping the set of terms used in the representations of the
training and test sets of a text dataset to enable learning, including
batch filters and the
FilteredClassifier class.
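The batch-filter option can be sketched as follows; the point is that setInputFormat fixes the vocabulary on the training data, so the test set is mapped onto the same dictionary. File names are placeholders and the class index depends on your ARFF layout.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BatchFiltering {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("train.arff"); // placeholder files
    Instances test  = DataSource.read("test.arff");
    train.setClassIndex(train.numAttributes() - 1);
    test.setClassIndex(test.numAttributes() - 1);

    StringToWordVector filter = new StringToWordVector();
    filter.setInputFormat(train); // vocabulary is learned from the training set
    Instances newTrain = Filter.useFilter(train, filter);
    Instances newTest  = Filter.useFilter(test, filter); // same dictionary reused

    System.out.println(newTrain.numAttributes() + " attributes in both sets");
  }
}
```

Words that appear only in the test set simply get no column, which is exactly the behavior you want at classification time.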
Language Identification as Text Classification with WEKA explains how to build
an automated language guesser for texts, as a complete example of a Text Mining
process with WEKA and a demonstration of more advanced usage.
Sentiment Analysis with WEKA shows how to configure and run an experiment
on sentiment analysis and opinion mining using WEKA, especially the
TextDirectoryLoader class.
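For reference, loading a directory of texts with TextDirectoryLoader looks roughly like this; the corpus path is a placeholder, and each subdirectory name is taken as a class label:

```java
import java.io.File;

import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;

public class LoadTextDirectory {
  public static void main(String[] args) throws Exception {
    TextDirectoryLoader loader = new TextDirectoryLoader();
    loader.setDirectory(new File("corpus")); // placeholder: one subfolder per class
    Instances raw = loader.getDataSet();     // one string attribute plus the class
    System.out.println(raw.numInstances() + " documents loaded");
  }
}
```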
Baselines of keyword and learning based sentiment analysis provides a
basic example of using SentiWordNet for a keyword-based approach to sentiment
classification, and compares it with a learning-based approach based on WEKA.
Code for Text Indexing with WEKA shows how to index a text dataset using
your own Java code and the
StringToWordVector filter in WEKA.
Analysis of N-Gram Tokenizer in WEKA analyzes the WEKA class
NGramTokenizer in terms of performance, which depends on the
complexity of the regular expression used during the tokenization step.
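The performance point can be illustrated outside WEKA with plain java.util.regex; the delimiter pattern below is just an illustrative stand-in for the kind of expression NGramTokenizer splits on, not the class's actual default.

```java
import java.util.Arrays;

public class DelimiterSplitDemo {
  public static void main(String[] args) {
    String text = "WEKA makes text mining simple, fast and fun!";
    // Illustrative delimiter regex: split on whitespace and punctuation;
    // more complex patterns make tokenization measurably slower on large corpora
    String[] tokens = text.split("[\\s.,;:'\"()?!]+");
    System.out.println(Arrays.toString(tokens));
    // → [WEKA, makes, text, mining, simple, fast, and, fun]
  }
}
```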
Text Mining Trick: Copying Options from the Explorer to the Command Line discusses how to configure a
Text Mining experiment in the Explorer GUI and obtain the equivalent command line to execute it. This is
very useful because configuring experiments on the command line can be far from trivial.
Do you want me to deal with some specific topic?
Just let me know.
I have written some other posts on WEKA as well.
All my posts related to WEKA can be found on my blog.
Interesting references for working with WEKA include:
Use WEKA in your Java code provides an excellent introduction to using
the Instances, Filter, Classifier, Clusterer, Evaluation and
AttributeSelection classes in your own code.
WEKA programmatic use
describes the learning process life-cycle and, more importantly, it explains
how to deal with attributes in your Java code.
Text Categorization with WEKA deals with transforming a directory structure
of classes (directories) and documents (inside those directories) into ARFF
format for further processing. The code is available at my tmweka Github
repository, together with files from several text collections.
For testing your classifiers and integrating WEKA into your own code, I provide
additional resources.
You will find most of them at my
tmweka Github repository.
Please feel free to contact me
if you have any doubts or special requirements.