CIAO: Clustering Incomplete data using Alternating Optimization

Background

Clustering techniques are used to find groups of genes that show similar expression patterns under multiple experimental conditions. However, the results of cluster analysis are affected by the missing values that commonly arise in microarray experiments. Because a clustering method requires a complete data matrix as input, previous studies have estimated the missing values with an imputation method in a preprocessing step before clustering. A common limitation of these conventional approaches is that once the estimates of missing values are fixed in the preprocessing step, they are not changed during the subsequent clustering process; badly estimated missing values obtained in preprocessing are likely to degrade the quality and reliability of the clustering results.

The CIAO (Clustering Incomplete data using Alternating Optimization) method does not require a prior imputation method. To reduce the influence of imputation in preprocessing, it takes an alternating optimization approach to find better estimates during the iterative clustering process. In each iteration, the method improves the estimates of missing values by exploiting cluster information, such as cluster centroids, together with all available non-missing values.
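
The alternating idea can be illustrated with a minimal Java sketch. This is not the CIAO source code or the published algorithm: it assumes hard, k-means-style assignments instead of fuzzy memberships, ignores the learning rate, marks missing cells with NaN, and simply re-estimates each missing cell from the centroid of the row's current cluster. The class name AlternatingImputationSketch and its methods are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical, simplified illustration; not the CIAO implementation. */
class AlternatingImputationSketch {

    /**
     * Clusters the rows of data (NaN marks a missing cell) into k groups and
     * returns the cluster index of each row. Missing cells are overwritten in
     * place with their final estimates. Assumes data has at least k rows.
     */
    static int[] cluster(double[][] data, int k, int maxIter) {
        int n = data.length, d = data[0].length;
        boolean[][] missing = new boolean[n][d];
        // Initial estimate: column mean of the observed values.
        for (int j = 0; j < d; j++) {
            double sum = 0;
            int count = 0;
            for (int i = 0; i < n; i++) {
                missing[i][j] = Double.isNaN(data[i][j]);
                if (!missing[i][j]) { sum += data[i][j]; count++; }
            }
            double mean = count > 0 ? sum / count : 0.0;
            for (int i = 0; i < n; i++) {
                if (missing[i][j]) data[i][j] = mean;
            }
        }
        // Deterministic start (an assumption): the first k rows are the centroids.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = data[c].clone();
        int[] labels = new int[n];
        Arrays.fill(labels, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            // Step 1: assign every row to its nearest centroid.
            boolean moved = false;
            for (int i = 0; i < n; i++) {
                int best = nearest(data[i], centroids);
                if (best != labels[i]) { labels[i] = best; moved = true; }
            }
            // Step 2: recompute centroids from the current, partly estimated data.
            double[][] sums = new double[k][d];
            int[] sizes = new int[k];
            for (int i = 0; i < n; i++) {
                sizes[labels[i]]++;
                for (int j = 0; j < d; j++) sums[labels[i]][j] += data[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (sizes[c] == 0) continue; // keep an empty cluster's old centroid
                for (int j = 0; j < d; j++) centroids[c][j] = sums[c][j] / sizes[c];
            }
            // Step 3: re-estimate each missing cell from its cluster centroid.
            double change = 0;
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < d; j++) {
                    if (!missing[i][j]) continue;
                    double estimate = centroids[labels[i]][j];
                    change += Math.abs(estimate - data[i][j]);
                    data[i][j] = estimate;
                }
            }
            // Stop once neither the assignments nor the estimates change.
            if (!moved && change < 1e-9) break;
        }
        return labels;
    }

    /** Index of the centroid with the smallest squared Euclidean distance to row. */
    static int nearest(double[] row, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int j = 0; j < row.length; j++) {
                double diff = row[j] - centroids[c][j];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }
}
```

The point of the sketch is only that the estimates are not frozen after an initial imputation: each pass uses the latest cluster structure to revise them, and the revised values in turn influence the next assignment.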

The main technical ideas behind this program appear in the following paper:

Dae-Won Kim, Kwang H. Lee, and Doheon Lee, "Towards clustering of incomplete microarray data with the use of imputation," Bioinformatics 23(1):107-113, 2007.

This software is a Java implementation of the CIAO clustering method, highly specialized for the problem of clustering incomplete microarray data. The original version of this program was written by Dae-Won Kim.


Download

This program is available for download for non-commercial use, licensed under the GNU General Public License, which allows its use for research purposes or other free software projects but does not allow its incorporation into any type of commercial software.

Download CIAO Program (2008-01-08)

The zipped package includes the source files and a sample input file.


Sample Input and Output

The program finds clusters in the incomplete gene expression data and writes the results to a file named after the input file, but with a .ciao extension. The Jama libraries for matrix computation must be on your Java classpath.

[Usage]:
    $ java CIAO (input_file) (num_clusters) (fuzziness) (learning_rate)

[Description]:
    input_file - an input data file
    num_clusters - the number of clusters to find
    fuzziness - the degree of fuzziness of each datum
    learning_rate - the learning parameter (tau)

[Example]:
    $ java CIAO data/Yeast.sporulation.missing 5 2.5 100

Currently, CIAO reads tab-delimited text files in a particular format, described below. By convention, in CIAO input files rows represent data (e.g. genes) and columns represent samples or observations (e.g. a single microarray hybridization). The first column and row contain the labels for genes and samples respectively, and the remaining cells contain incomplete data for the appropriate gene and sample. For a simple time-course, an example input file might look like this (e.g. Yeast.sporulation.missing):

    ORF   ORF   t0   t1   t2
    YHR007C   YHR007C   0.27   0.61   1.44
    YOL109C   YOL109C   0.86            0.67
    YAL059C   YAL059C            1.60   0.51
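
As an illustration of this layout, the following Java sketch reads such a table into a numeric matrix. It is not part of CIAO. It assumes that missing values appear as empty tab-delimited cells, as the sample above suggests, and it stores them as NaN. The class IncompleteMatrixReader is hypothetical, and the number of leading label columns is passed as a parameter because the sample shows two of them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader for the tab-delimited layout above; not part of CIAO. */
class IncompleteMatrixReader {

    /**
     * Parses the lines of a tab-delimited table. The first line is the header,
     * and the first labelColumns fields of every line are labels. Empty or
     * absent cells become Double.NaN.
     */
    static double[][] parse(List<String> lines, int labelColumns) {
        // The limit -1 keeps trailing empty fields, so a missing last cell is not dropped.
        int width = lines.get(0).split("\t", -1).length - labelColumns;
        List<double[]> rows = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            if (line.trim().isEmpty()) continue; // skip blank lines
            String[] fields = line.split("\t", -1);
            double[] row = new double[width];
            for (int j = 0; j < width; j++) {
                int idx = labelColumns + j;
                String cell = idx < fields.length ? fields[idx].trim() : "";
                row[j] = cell.isEmpty() ? Double.NaN : Double.parseDouble(cell);
            }
            rows.add(row);
        }
        return rows.toArray(new double[0][]);
    }
}
```

The lines could come from java.nio.file.Files.readAllLines; for the sample above, parse(lines, 2) would yield a 3 x 3 matrix with two NaN entries.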

The output file contains the clustering result of the CIAO program. Each row contains a gene name and its assigned cluster number, followed by the complete measurements with the missing values estimated. Cluster numbers start at 0. An example output file when clustering yeast genes into 3 clusters might look like this (e.g. Yeast.sporulation.gk):

    YHR007C    0   (...missing data have been estimated)
    YOL109C    1   
    YAL059C    2