COLLOQUIUM
Computer Science Department, Boston University

Speaker: Jennifer Dy, Northeastern University
Title: Clustering High-Dimensional Data
Date: November 3, 2004
Time: 3pm
Place: MCS 135 (for directions, see www.cs.bu.edu/colloquium)

Abstract:

Creating effective algorithms for unsupervised learning is important because vast amounts of data preclude humans from manually labeling the category of each instance. In addition, human labeling is expensive and subjective. Therefore, the majority of existing data is unlabeled. The goal of unsupervised learning, or cluster analysis, is to group "similar" objects together. "Similarity" is typically defined by a metric or a probability model. These measures are highly dependent on the features representing the data. Many clustering algorithms assume that the relevant features have been determined by domain experts. However, not all features are important. Moreover, many clustering algorithms fail in high dimensions.

In this talk, I will present three approaches to clustering in high-dimensional spaces:

  1. Feature subset selection for unsupervised learning;
  2. Feature selection and clustering in an interactive visualization environment; and
  3. Hierarchical feature transformation and clustering.

We first explore the problem of feature subset selection for unsupervised learning. We investigate the problem through our algorithm called FSSEM (Feature Subset Selection wrapped around Expectation-Maximization clustering) and through two different performance criteria for evaluating candidate feature subsets: maximum likelihood and scatter separability. We identify two issues: the need to select the number of clusters, and the need to normalize the bias of feature selection criteria with respect to dimension. We present theoretical proofs of the dimensionality biases, and a normalization scheme that can be applied to any criterion to ameliorate these biases.
In addition to our automated algorithm, we developed Visual-FSSEM, which incorporates visualization and feature selection in an interactive environment. Finally, we present an automated approach for building hierarchical mixtures of probabilistic principal component analyzers. Before discussing these algorithms, I will describe the application area where I first encountered the need to perform feature selection and clustering simultaneously: medical content-based image retrieval.

Host: Simon Kasif (sullivan.bu.edu/kasif)