GK: Gustafson-Kessel Clustering Program

Background

Clustering is a popular technique for finding groups of genes that show similar expression patterns under multiple experimental conditions. Many clustering methods have been proposed for gene-expression data, including hierarchical clustering, k-means clustering, and self-organizing maps (SOM). However, these conventional methods are limited in their ability to identify clusters of different shapes, because they use a fixed distance norm when calculating the distance between genes.

This program detects clusters of different geometrical shapes in a data set by exploiting an adaptive distance norm. The adaptive norm is derived from the fuzzy covariance matrix of each cluster, and the eigenstructure of that covariance matrix serves as an indicator of the cluster's shape.
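To make the adaptive norm concrete, here is a minimal Java sketch of the squared distance as it is usually written in the GK literature cited below: d^2 = (x - v)^T A (x - v), with A = det(F)^(1/n) F^-1 and the cluster volume fixed to 1. It is restricted to two dimensions so that it needs no matrix library. The class and method names are hypothetical and are not taken from this program's source.

```java
// Illustrative sketch only; not taken from the GK program's source.
class GkDistanceSketch {

    /**
     * Squared Gustafson-Kessel distance between a 2-D point x and a cluster
     * center v, given the cluster's 2x2 fuzzy covariance matrix f.
     * Uses A = det(F)^(1/n) * F^-1 with n = 2 (cluster volume fixed to 1).
     */
    static double squaredDistance(double[] x, double[] v, double[][] f) {
        double det = f[0][0] * f[1][1] - f[0][1] * f[1][0];
        if (det <= 0.0) {
            throw new IllegalArgumentException("covariance matrix must have a positive determinant");
        }
        double scale = Math.sqrt(det); // det(F)^(1/2) because n = 2

        // Entries of A: the 2x2 inverse of F, multiplied by the scale factor.
        double a00 =  f[1][1] / det * scale;
        double a01 = -f[0][1] / det * scale;
        double a10 = -f[1][0] / det * scale;
        double a11 =  f[0][0] / det * scale;

        double dx = x[0] - v[0];
        double dy = x[1] - v[1];

        // Quadratic form (x - v)^T A (x - v).
        return dx * (a00 * dx + a01 * dy) + dy * (a10 * dx + a11 * dy);
    }
}
```

With an identity covariance matrix this reduces to the squared Euclidean distance. With a covariance matrix elongated along one axis, a point lying along that axis is treated as closer to the center than a point at the same Euclidean distance along the short axis, which is how the norm adapts to elongated clusters.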

The main technical ideas behind this program appear in the following references:

Dae-Won Kim, Kwang H. Lee, and Doheon Lee, "Detecting Clusters of Different Geometrical Shapes in Microarray Gene Expression Data," Bioinformatics 21(9):1927-1934, 2005.

R. Babuska, Fuzzy Modeling For Control. Kluwer Academic Publishers, Boston, 1998.

D.E. Gustafson and W.C. Kessel, "Fuzzy clustering with a fuzzy covariance matrix," Proc. IEEE Conf. on Decision and Control, San Diego, IEEE Press, Piscataway, NJ, pp. 761-766, 1979.

This software is a Java implementation of the GK clustering method, highly specialized for problems in bioinformatics. The original version of this program was written by Dae-Won Kim.


Download

This program is available for download for non-commercial use under the GNU General Public License, which allows its use for research purposes or in other free software projects but does not allow its incorporation into any type of commercial software.

Download GK Clustering Program (2005-09-01)

The zipped package includes the source files and a sample input file.


Sample Input and Output

The program finds clusters in the gene expression data and writes the results to a file named after the input file, with a .gk extension appended. The JAMA library, which is used for the matrix computation, must be on your Java classpath.

[Usage]:
    $ java GK (input_file) (num_clusters) (fuzziness)

[Description]:
    input_file - an input data file
    num_clusters - the number of clusters to find
    fuzziness - the degree of fuzziness of the cluster memberships (the fuzzy weighting exponent, a value greater than 1)

[Example]:
    $ java GK data/Yeast.sporulation 3 2.5
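To give a feel for what the fuzziness argument controls, the following Java sketch computes the membership degrees of a single datum from its squared distances to each cluster, using the standard fuzzy-clustering update u_i = 1 / sum_j (d_i^2 / d_j^2)^(1/(m-1)). This is the textbook formula rather than code from this program; the class name is hypothetical, and all distances are assumed to be strictly positive.

```java
// Illustrative sketch only; not taken from the GK program's source.
class FuzzyMembershipSketch {

    /**
     * Membership degrees of one datum in each cluster.
     *
     * @param squaredDistances squared distance of the datum to each cluster (all > 0)
     * @param m                fuzziness exponent, must be greater than 1
     */
    static double[] memberships(double[] squaredDistances, double m) {
        if (m <= 1.0) {
            throw new IllegalArgumentException("fuzziness must be greater than 1");
        }
        int c = squaredDistances.length;
        double exponent = 1.0 / (m - 1.0);
        double[] u = new double[c];
        for (int i = 0; i < c; i++) {
            double sum = 0.0;
            for (int j = 0; j < c; j++) {
                sum += Math.pow(squaredDistances[i] / squaredDistances[j], exponent);
            }
            u[i] = 1.0 / sum; // memberships over all clusters sum to 1
        }
        return u;
    }
}
```

For squared distances of 1 and 4 to two clusters, a fuzziness of 2 yields memberships of 0.8 and 0.2. Raising the fuzziness to 2.5 pulls the two values closer together, so larger values produce softer cluster boundaries.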

Currently, GK reads tab-delimited text files in a particular format, described below. By convention, in GK input files rows represent data (e.g. genes) and columns represent samples or observations (e.g. a single microarray hybridization). The first column and row contain the labels for genes and samples respectively, and the remaining cells contain data for the appropriate gene and sample. For a simple time-course, an example input file might look like this (e.g. Yeast.sporulation):

    ORF   t0   t1   t2
    YHR007C   0.27   0.61   1.44
    YOL109C   0.86   0.21   0.67
    YAL059C   0.35   1.60   0.51
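As a rough illustration of this layout, the Java sketch below parses such a file from a list of lines into gene labels, sample labels, and a numeric matrix. It assumes exactly one tab between cells and the same number of cells in every row. The class and field names are hypothetical, and this is not the parser used by the program itself.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not taken from the GK program's source.
class GkInputSketch {
    final List<String> sampleLabels = new ArrayList<>();
    final List<String> geneLabels = new ArrayList<>();
    final List<double[]> rows = new ArrayList<>();

    /** Parses a header line followed by one tab-delimited line per gene. */
    static GkInputSketch parse(List<String> lines) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("input has no header line");
        }
        GkInputSketch data = new GkInputSketch();

        // First row: a corner label (e.g. "ORF"), then one label per sample.
        String[] header = lines.get(0).split("\t");
        for (int j = 1; j < header.length; j++) {
            data.sampleLabels.add(header[j].trim());
        }

        // Remaining rows: gene label, then one value per sample.
        for (int i = 1; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = line.split("\t");
            if (cells.length != header.length) {
                throw new IllegalArgumentException("line " + (i + 1) + " has " + cells.length
                        + " cells, expected " + header.length);
            }
            double[] values = new double[cells.length - 1];
            for (int j = 1; j < cells.length; j++) {
                values[j - 1] = Double.parseDouble(cells[j].trim());
            }
            data.geneLabels.add(cells[0].trim());
            data.rows.add(values);
        }
        return data;
    }
}
```

The lines could be obtained with java.nio.file.Files.readAllLines before being passed to parse.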

The output file contains the clustering result of the GK program. Each row contains a gene name and its assigned cluster number. Cluster numbers start at 1. An example output file when clustering yeast genes into two clusters might look like this (e.g. Yeast.sporulation.gk):

    YHR007C    1
    YOL109C    2
    YAL059C    2
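For downstream analysis it is often convenient to group the genes by cluster. The Java sketch below reads lines in the output format shown above and builds a map from cluster number to gene names. It assumes that each line holds a gene name and an integer separated by whitespace; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; not part of the GK program.
class GkOutputSketch {

    /** Groups gene names by their assigned cluster number (sorted by cluster). */
    static Map<Integer, List<String>> groupByCluster(List<String> lines) {
        Map<Integer, List<String>> clusters = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("expected 'gene cluster' but got: " + line);
            }
            int cluster = Integer.parseInt(parts[1]);
            clusters.computeIfAbsent(cluster, k -> new ArrayList<>()).add(parts[0]);
        }
        return clusters;
    }
}
```

Applied to the three example lines above, the map holds YHR007C under cluster 1 and YOL109C and YAL059C under cluster 2.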