Text Mining in WEKA Cookbook

On this page I provide useful hints and tips (recipes) for working with text data in WEKA. The information is organized as a list of blog posts and references, plus additional material such as code and text collections.

I suggest reading my posts on text classification with WEKA in publication order:

I have some other posts on WEKA, like the following ones:

All my posts related to WEKA can be found using the label WEKA.

Interesting references for working with WEKA include:

  • Use WEKA in your Java code provides an excellent introduction to using the classes Instances, Filter, Classifier, Clusterer, Evaluation and AttributeSelection in your own code.
  • WEKA programmatic use describes the life cycle of the learning process and, more importantly, explains how to deal with attributes in your Java code.
  • Text Categorization with WEKA deals with transforming a directory structure of classes (directories) and documents (inside those directories) into ARFF format for further processing. The code is available at ARFF files from Text Collections.
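
To make the directory-to-ARFF idea from the last reference concrete, here is a minimal sketch in plain Java (no WEKA dependency). It is not the code from ARFF files from Text Collections; the class name DirToArff, the attribute names text and class, and the decision to flatten line breaks into spaces are my own assumptions for illustration. It also assumes UTF-8 documents and a root directory whose subdirectories are the classes. WEKA itself ships weka.core.converters.TextDirectoryLoader for the same job, which is probably the better choice in real projects.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: turns root/<class>/<document> into an ARFF string.
class DirToArff {

    // Quote a value for ARFF: escape backslashes and single quotes,
    // and flatten line breaks to a space (a simplification of mine).
    static String quote(String s) {
        return "'" + s.replace("\\", "\\\\")
                      .replace("'", "\\'")
                      .replaceAll("[\\r\\n]+", " ") + "'";
    }

    // List the subdirectories (directories == true) or the other entries
    // (directories == false) of dir, sorted by path for a stable output.
    static List<Path> sortedChildren(Path dir, boolean directories) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.filter(p -> Files.isDirectory(p) == directories)
                          .sorted()
                          .collect(Collectors.toList());
        }
    }

    // Build the ARFF text: one string attribute for the document text,
    // one nominal attribute whose values are the subdirectory names.
    static String toArff(Path root, String relation) throws IOException {
        List<Path> classDirs = sortedChildren(root, true);
        String classValues = classDirs.stream()
                .map(d -> quote(d.getFileName().toString()))
                .collect(Collectors.joining(","));

        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(quote(relation)).append("\n\n");
        sb.append("@attribute text string\n");
        sb.append("@attribute class {").append(classValues).append("}\n\n");
        sb.append("@data\n");
        for (Path dir : classDirs) {
            String label = quote(dir.getFileName().toString());
            for (Path doc : sortedChildren(dir, false)) {
                String text = new String(Files.readAllBytes(doc), StandardCharsets.UTF_8);
                sb.append(quote(text.trim())).append(',').append(label).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DirToArff <collection-root>");
            return;
        }
        System.out.print(toArff(Paths.get(args[0]), "text_collection"));
    }
}
```

Redirecting the output to a file (for example, java DirToArff my_collection > my_collection.arff, where my_collection is a placeholder path) should give a file that can be opened in WEKA and then passed through the StringToWordVector filter.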

For testing your classifiers and integrating WEKA in your own code, I provide the following stuff:

You will find most of this stuff in my tmweka GitHub repository.

Please feel free to contact me if you have any questions or special requirements.

José María Gómez Hidalgo