Big Big Things in my Little Little World: More about Weka

Getting Started

1 Weka Installation

Weka requires Java. If Java is not already installed, the download page lists versions of Weka that include it. The latest version of Weka can be downloaded from http://www.cs.waikato.ac.nz/ml/weka/

2 User Interfaces

The Weka GUI Chooser offers four options:
• Explorer: It is an environment for exploring data with WEKA.
• Experimenter: It is used for performing experiments and conducting statistical tests between learning schemes.
• Knowledge Flow: It supports the same functions as the Explorer but with a drag-and-drop interface.
• Simple CLI: This is a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command line interface.

3 Package Installation

Choose Tools -> Package Manager from the Weka GUI Chooser. The package manager's window is split horizontally into two parts: at the top is a list of packages, and at the bottom is a mini browser that can be used to display information on the currently selected package.

4 Data Sets

Each entry in a dataset is an instance of the Java type weka.core.Instance.
4.1 ARFF
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. It consists of:
• header section is prefixed by @RELATION
• each attribute is indicated by @ATTRIBUTE
• data section is prefixed by @DATA
• data is comma separated, with the class as the last attribute
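To make the layout above concrete, here is a minimal sketch in plain Java. The toy "weather" data is invented for illustration, and the class ArffSketch and its two helper methods are hypothetical: they are a hand-rolled reader, not Weka's ArffLoader, and they ignore quoting, sparse rows and other details of the real format.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; this is not part of Weka. */
final class ArffSketch {

    /** A tiny ARFF document in the layout described above (invented toy data). */
    static final String SAMPLE = String.join("\n",
        "@RELATION weather",
        "",
        "@ATTRIBUTE temperature NUMERIC",
        "@ATTRIBUTE windy {TRUE,FALSE}",
        "@ATTRIBUTE play {yes,no}",
        "",
        "@DATA",
        "85,FALSE,no",
        "70,TRUE,yes",
        "64,TRUE,yes");

    /** Returns the attribute names declared in the header, in order. */
    static List<String> attributeNames(String arff) {
        List<String> names = new ArrayList<>();
        for (String line : arff.split("\n")) {
            String t = line.trim();
            // Case-insensitive match on the @ATTRIBUTE keyword.
            if (t.regionMatches(true, 0, "@ATTRIBUTE", 0, 10)) {
                // The second whitespace-separated token is the attribute name.
                names.add(t.split("\\s+")[1]);
            }
        }
        return names;
    }

    /** Counts the comma-separated rows that follow the @DATA marker. */
    static int countInstances(String arff) {
        boolean inData = false;
        int count = 0;
        for (String line : arff.split("\n")) {
            String t = line.trim();
            if (t.isEmpty() || t.startsWith("%")) {
                continue; // skip blank lines and % comment lines
            }
            if (inData) {
                count++;
            } else if (t.equalsIgnoreCase("@DATA")) {
                inData = true;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(attributeNames(SAMPLE)); // prints [temperature, windy, play]
        System.out.println(countInstances(SAMPLE)); // prints 3
    }
}
```

Note that the class attribute (play) is declared last, matching the convention in the list above.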
4.2 XRFF
The XRFF (XML attribute Relation File Format) represents the data in an XML-based format that can store comments as well as attribute and instance weights. The following file extensions are recognized as XRFF files:
• .xrff: The default extension of XRFF files
• .xrff.gz: The extension for gzip compressed XRFF files
4.3 Converters
Weka contains converters for the following data sources:
• ARFF files (ArffLoader, ArffSaver)
• C4.5 files (C45Loader, C45Saver)
• CSV files (CSVLoader, CSVSaver)
• Files containing serialized instances (SerializedInstancesLoader, SerializedInstancesSaver)
• JDBC databases (DatabaseLoader, DatabaseSaver)
• libsvm files (LibSVMLoader, LibSVMSaver)
• XRFF files (XRFFLoader, XRFFSaver)
• Text directories for text mining (TextDirectoryLoader)
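As a rough sketch of how a loader and a saver from this list can be paired, the following converts a CSV file to ARFF. It assumes Weka 3.7 or later on the classpath; the class name CsvToArffSketch and the file paths passed on the command line are placeholders, not anything prescribed by Weka.

```java
import java.io.File;
import java.io.IOException;

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

/** Hypothetical class name; a sketch, not a complete conversion tool. */
final class CsvToArffSketch {

    /** Reads a CSV file with CSVLoader and writes it back out with ArffSaver. */
    static Instances convert(File csv, File arff) throws IOException {
        CSVLoader loader = new CSVLoader();
        loader.setSource(csv);                // the first CSV row is read as the header
        Instances data = loader.getDataSet(); // load everything into memory

        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(arff);
        saver.writeBatch();                   // write the whole dataset in one go
        return data;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java CsvToArffSketch input.csv output.arff
        Instances data = convert(new File(args[0]), new File(args[1]));
        System.out.println("Converted " + data.numInstances() + " instances");
    }
}
```

The other loader and saver pairs listed above follow the same pattern, though their individual options differ.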
4.4 Databases
Weka comes with example files for the following databases:
• DatabaseUtils.props.hsql - HSQLDB (>= 3.4.1)
• DatabaseUtils.props.msaccess - MS Access (> 3.4.14, > 3.5.8, > 3.6.0)
• DatabaseUtils.props.mssqlserver - MS SQL Server 2000 (>= 3.4.9, >= 3.5.4)
• DatabaseUtils.props.mssqlserver2005 - MS SQL Server 2005 (>= 3.4.11, >= 3.5.6)
• DatabaseUtils.props.mysql - MySQL (>= 3.4.9, >= 3.5.4)
• DatabaseUtils.props.odbc - ODBC access via Sun’s ODBC/JDBC drivers
• DatabaseUtils.props.oracle - Oracle 10g (>= 3.4.9, >= 3.5.4)
• DatabaseUtils.props.postgresql - PostgreSQL 7.4 (>= 3.4.9, >= 3.5.4)
• DatabaseUtils.props.sqlite3 - sqlite 3.x (> 3.4.12, > 3.5.7)
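These example files are ordinary Java properties files. The sketch below shows the two central keys, jdbcDriver and jdbcURL, and reads them with java.util.Properties. The driver class and URL are placeholder values for an assumed local MySQL setup, and DbPropsSketch is a hypothetical class, not part of Weka.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper for illustration only; this is not part of Weka. */
final class DbPropsSketch {

    /** Example file contents; driver class and URL are placeholders. */
    static final String SAMPLE_PROPS = String.join("\n",
        "# JDBC driver class to load",
        "jdbcDriver=com.mysql.jdbc.Driver",
        "# database URL",
        "jdbcURL=jdbc:mysql://localhost:3306/some_database");

    /** Parses properties-file text into a Properties object. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory reader
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE_PROPS);
        System.out.println(props.getProperty("jdbcDriver"));
        System.out.println(props.getProperty("jdbcURL"));
    }
}
```

In practice, the usual approach is to copy the example file that matches the database, rename the copy to DatabaseUtils.props, adjust the driver and URL, and place it where Weka looks for it (typically the home directory or the directory Weka is started from). The matching JDBC driver jar also has to be on the classpath.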

5 Using the API

Weka provides an API that can be directly invoked from Java code. It can be used to embed machine learning algorithms in Java programs. The following are the steps required to embed a classifier:
Step 1: Express the problem with features
Step 2: Train a Classifier
Step 3: Test the classifier
Step 4: Use the classifier
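The four steps can be sketched as follows. This is one possible arrangement, assuming Weka 3.7 or later on the classpath (where DenseInstance is the concrete Instance implementation). The J48 decision tree is just one choice of classifier, and the class name, attribute names and toy data are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;

/** Hypothetical class name; toy data invented for illustration. */
final class EmbedClassifierSketch {

    /** Step 1: express the problem with features (two numeric, one nominal class). */
    static Instances emptyDataset(String name) {
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("length"));
        attrs.add(new Attribute("width"));
        attrs.add(new Attribute("size", Arrays.asList("small", "large")));
        Instances data = new Instances(name, attrs, 0);
        data.setClassIndex(data.numAttributes() - 1); // class is the last attribute
        return data;
    }

    /** Appends one labeled row to the dataset. */
    static void addRow(Instances data, double length, double width, String label) {
        double[] values = {length, width, data.classAttribute().indexOfValue(label)};
        data.add(new DenseInstance(1.0, values)); // instance weight 1.0
    }

    /** Step 2: train a classifier (J48 is Weka's C4.5 decision tree). */
    static Classifier train(Instances trainSet) throws Exception {
        Classifier tree = new J48();
        tree.buildClassifier(trainSet);
        return tree;
    }

    /** Step 3: test the classifier on held-out data; returns percent correct. */
    static double test(Classifier model, Instances trainSet, Instances testSet) throws Exception {
        Evaluation eval = new Evaluation(trainSet);
        eval.evaluateModel(model, testSet);
        return eval.pctCorrect();
    }

    /** Step 4: use the classifier to label a new, unlabeled instance. */
    static String predict(Classifier model, Instances header, double length, double width)
            throws Exception {
        Instance unseen = new DenseInstance(header.numAttributes()); // all values start missing
        unseen.setDataset(header); // gives the instance access to attribute definitions
        unseen.setValue(0, length);
        unseen.setValue(1, width);
        int classIndex = (int) model.classifyInstance(unseen);
        return header.classAttribute().value(classIndex);
    }

    public static void main(String[] args) throws Exception {
        Instances trainSet = emptyDataset("train");
        for (int i = 0; i < 5; i++) {
            addRow(trainSet, 1.0 + i * 0.1, 1.0 + i * 0.1, "small");
            addRow(trainSet, 9.0 + i * 0.1, 9.0 + i * 0.1, "large");
        }
        Instances testSet = emptyDataset("test");
        addRow(testSet, 1.2, 1.1, "small");
        addRow(testSet, 9.3, 9.2, "large");

        Classifier model = train(trainSet);
        System.out.println("Accuracy (%): " + test(model, trainSet, testSet));
        System.out.println("Prediction: " + predict(model, trainSet, 8.8, 9.1));
    }
}
```

In a real application the training data would more likely be loaded from an ARFF file or a database than built by hand, and cross-validation is a common alternative to a separate test set.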

6 Weka Integration to Pentaho Data Integration

Weka can be integrated with Spoon, the graphical designer of the ETL tool Pentaho Data Integration (Kettle), using the Weka Scoring Plugin. To install the plugin:
1. The Weka scoring plugin can be downloaded from: http://wiki.pentaho.com/display/EAI/List+of+Available+Pentaho+Data+Integration+Plug-In
2. Unpack the plugin archive and copy all files in the WekaScoringDeploy directory to a sub-directory in the plugins/steps directory of the Kettle installation.
3. Copy the "weka.jar" file from the Weka installation folder to the same sub-directory in plugins/steps as before.

7 Pros and Cons of Weka

7.1 Advantages
• Open source
• Extensible
• Portable
• Relatively easy to use
• Large collection of Data Mining algorithms
7.2 Disadvantages
• Sequence modelling is not covered by the algorithms included in the Weka distribution
• Not capable of multi-relational data mining
• Memory bound

8 Projects based on Weka

There are many projects that extend or wrap WEKA. Some of these include:
• Systems for natural language processing: GATE is an NLP tool that uses Weka for natural language processing.
• Knowledge discovery in biology: BioWEKA is an extension to WEKA for tasks in biology.
• Distributed and parallel data mining: There are a number of applications that use Weka for distributed data mining. Some of them include Weka-Parallel, Grid Weka, FAEHIM and Weka4WS.
• Open-source data mining systems: Many data mining systems provide plugins to access Weka's algorithms. The R statistical computing environment provides an interface to Weka through the RWeka package.
• Scientific workflow environment: The Kepler Weka project integrates Weka's functionality into the Kepler open-source scientific workflow platform.

9 Alternatives to Weka

The following are the main alternatives to Weka:

• R is a powerful statistical programming language. It is derived from the S language, which was developed at Bell Labs.
• ELKI is a similar project to Weka with a focus on cluster analysis.
• KNIME is machine learning and data mining software implemented in Java.
• MOA is an open-source project for large-scale mining of data streams, also developed at the University of Waikato in New Zealand.
• Neural Designer is data mining software based on deep learning techniques, written in C++.
• Orange is a similar open-source project for data mining, machine learning and visualization, written in Python and C++.
• RapidMiner is a commercial machine learning framework implemented in Java which integrates Weka.

One of the most challenging aspects of open source software is deciding what to include in it, so contributions to the software are controlled, which limits community involvement. Packages are one way to manage this, and Weka's package management system is a good example. The mailing lists of open source software are also easier to maintain when the users are researchers, and Weka is developed and maintained by a team of researchers at the University of Waikato. One of the main advantages of Weka is that it has been incorporated into many open source projects. Hence, for a beginner in data mining, Weka is the best choice among the available open source projects.

Head Up, Stay Strong, Fake a Smile, Move on.....

2 comments:

Unknown, 12 September 2016 at 11:14
Hii Anju,

Hope you know about OpenSubspace...subspace clustering package.In one of the sites, it is written that this package can be installed and used with Weka 3.5.8, but there is no package manager in weka 3.5.8. Nevertheless, I have installed Weka 3.7.3 which has package manager...I have installed the package OpenSubspace..but all the 12 algorithms are inactive because of which I am not able to run any of the algorithms for subspace clustering!
I would like to know what might be the possible cause and solution for the same.

Thanks in advance...

Sunday, 20 December 2015