Classification Based on Predictive Association Rules of Incomplete Data

Background

Classification based on predictive association rules (CPAR) is a widely used associative classification method. Despite its efficiency, the results obtained by CPAR are affected by missing values in the data set, so records are not always classified correctly. The algorithm provided here improves CPAR to deal with missing data, and it showed better classification results on incomplete data sets than conventional CPAR.

This program extends CPAR to improve classification performance on incomplete data sets. It handles missing values using probabilities calculated from the expected frequencies of the different values of an attribute.
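As a rough illustration of the idea, the sketch below estimates value probabilities for a single categorical attribute from its observed entries and spreads the missing records over the values in proportion to those probabilities. It is only one plausible reading of the sentence above, not the procedure from the Yoon and Kim paper or the code in the package. The NaN encoding of missing values and the variable names (col, prob, expected_count) are assumptions made for this example.

```matlab
% Toy column for one categorical attribute. NaN marks a missing value
% (this encoding is an assumption of the sketch, not taken from the package).
col = [1; 0; 1; NaN; 1; 0; NaN; 1];

observed = col(~isnan(col));      % entries that are actually present
vals     = unique(observed);      % distinct attribute values, here [0; 1]

% Relative frequency of each value among the observed entries.
prob = zeros(size(vals));
for i = 1:numel(vals)
    prob(i) = sum(observed == vals(i)) / numel(observed);
end

% Expected count of each value: its observed count plus the missing
% records distributed in proportion to prob.
n_missing = sum(isnan(col));
expected_count = zeros(size(vals));
for i = 1:numel(vals)
    expected_count(i) = sum(observed == vals(i)) + n_missing * prob(i);
end

disp([vals, prob, expected_count]);
```

With this weighting, a record with a missing attribute contributes a fractional count to every candidate value instead of being dropped, so the expected counts still sum to the number of records.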

The main technical ideas behind how this program works appear in these papers:

Jeonghun Yoon and Dae-Won Kim, "Classification Based on Predictive Association Rules of Incomplete Data," IEICE Transactions on Information and Systems, E95-D(5):1531-1535, 2012.

X. Yin and J. Han, "CPAR: Classification based on predictive association rules," Proc. 3rd SIAM International Conference on Data Mining, pp. 331-335, 2003.

This software is a Matlab implementation of the proposed method, highly specialized for the classification of categorical data sets. The original version of this program was written by Jeong-Hun Yoon.


Download

This program is available for download for non-commercial use. It is licensed under the GNU General Public License, which allows its use for research purposes or in other free software projects but does not allow its incorporation into any type of commercial software.

Download CPAR program for Incomplete Data (2011-09-01)

The zipped package includes the source files and a sample input file.


Sample Input and Output

The program predicts the class labels of the categorical input data and returns the results in a matrix assigned to a user-specified variable. The code can be executed from the Matlab command window.

[Usage]
   >> min_sup = 0.05;
   >> min_gain = 0.7;
   >> best_k = 5;
   >> result_class = RCPAR(train_data, train_class, input_data, min_sup, min_gain, best_k);

[Description]
   min_sup, min_gain, best_k - The parameters used in CPAR
   train_data - The categorical training data, which may contain missing values
   train_class - The class labels of train_data
   input_data - The categorical data to be classified by this program

The download includes a demo that runs 30% hold-out validation on the ionosphere data set.
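The following is a minimal sketch of what a 30% hold-out run could look like; it is not the demo script shipped in the package. The small data matrix, the labels, and the variable names data, labels, and true_class are hypothetical stand-ins for the ionosphere data. The RCPAR call uses the signature shown in the usage section and is skipped when the function is not on the Matlab path. Treating result_class as a vector of labels that can be compared directly with the true labels is also an assumption.

```matlab
% Hypothetical stand-in data: 10 records, 3 categorical features.
data   = [1 0 1; 0 1 1; 1 1 0; 0 0 1; 1 0 0; ...
          0 1 0; 1 1 1; 0 0 0; 1 0 1; 0 1 1];
labels = [1; 2; 1; 2; 1; 2; 1; 2; 1; 2];

n      = size(data, 1);
n_test = round(0.3 * n);          % hold out 30% of the records
idx    = randperm(n);             % random order of record indices
test_idx  = idx(1:n_test);
train_idx = idx(n_test+1:end);

train_data  = data(train_idx, :);
train_class = labels(train_idx);
input_data  = data(test_idx, :);
true_class  = labels(test_idx);

% Run the classifier only if the downloaded package is on the path.
if exist('RCPAR', 'file') == 2
    result_class = RCPAR(train_data, train_class, input_data, 0.05, 0.7, 5);
    accuracy = mean(result_class(:) == true_class(:));
    fprintf('Hold-out accuracy: %.3f\n', accuracy);
end
```

Because the split is random, the accuracy varies from run to run; fixing the random seed beforehand makes a run repeatable.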

The output matrix contains the predicted class labels of input_data.
The program also generates a "generated_rules.mat" file containing the predictive association rules.
The file includes three matrices: item_table, rules_body, and rules_info.
The first column of item_table is the feature number and the second column is the value.
Each row of rules_body lists a rule's conditions as indices into item_table.
A zero marks the end of the rule.
rules_info holds information about each rule.
Its columns give the target class, confidence, support, and length of the rule.

-Example-

   item_table
    2    0
    1    1
    3    1
    1    0
    2    1
    3    0


   rules_body
    1    3    0    0    0    0    0
    1    2    4    5    0    0    0
    2    4    5    0    0    0    0


   rules_info
    1    0.98    0.18    2
    2    0.94    0.13    4
    2    0.87    0.07    3
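
To make the encoding concrete, the sketch below turns the example matrices into readable rule strings by looking up each nonzero entry of rules_body in item_table. In practice the three matrices would presumably be read with load('generated_rules.mat'); here they are typed in from the example so the sketch runs on its own. The names rule_text, body, and conds and the "f<feature>=<value>" notation are choices made for this illustration and are not part of the package.

```matlab
% Example matrices copied from the listing above.
item_table = [2 0; 1 1; 3 1; 1 0; 2 1; 3 0];
rules_body = [1 3 0 0 0 0 0; 1 2 4 5 0 0 0; 2 4 5 0 0 0 0];
rules_info = [1 0.98 0.18 2; 2 0.94 0.13 4; 2 0.87 0.07 3];

rule_text = cell(size(rules_body, 1), 1);
for r = 1:size(rules_body, 1)
    % Keep only the item indices; zeros are padding after the rule ends.
    body  = rules_body(r, rules_body(r, :) > 0);
    conds = cell(1, numel(body));
    for j = 1:numel(body)
        % Column 1 of item_table is the feature number, column 2 the value.
        conds{j} = sprintf('f%d=%d', item_table(body(j), 1), item_table(body(j), 2));
    end
    rule_text{r} = sprintf('IF %s THEN class=%d (conf %.2f, sup %.2f)', ...
        strjoin(conds, ' AND '), rules_info(r, 1), rules_info(r, 2), rules_info(r, 3));
    disp(rule_text{r});
end
```

The first rule, for instance, decodes to "IF f2=0 AND f3=1 THEN class=1 (conf 0.98, sup 0.18)". The number of nonzero entries in each row of rules_body matches the length stored in the fourth column of rules_info. The example values are illustrative, so the conditions of a decoded rule need not be mutually consistent.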