Date of Award
Doctor of Philosophy
Multivariate and high-dimensional datasets typically contain subgroups that may not be immediately apparent. Cluster analysis, an unsupervised machine learning technique, is commonly used to reveal these subgroups by partitioning a dataset into distinct categories referred to as clusters. The k-means algorithm is a prominent distance-based clustering method. Despite its popularity, the algorithm is not invariant under non-singular affine transformations and is not robust, i.e., it can be unduly influenced by outliers. To address these deficiencies, we propose an alternative model-based clustering procedure that minimizes a “trimmed” variant of the negative log-likelihood function. We develop a “concentration step”, vaguely reminiscent of the classical Lloyd’s algorithm, that iteratively reduces the objective function and converges to a local minimum in a finite number of steps. Being a local optimization technique, our algorithm depends on the choice of “warmstart.” We develop a new sampling procedure to select appropriate warmstarts. For high-dimensional or sparse datasets, cluster covariances become ill-conditioned. Consequently, we equip our proposed method with high-dimensional capabilities by using a regularization technique that replaces ill-conditioned covariances with well-conditioned counterparts. For n > p, we prove that the objective function is invariant under non-singular affine transformations, which renders the procedure affine invariant. Extensive experiments on synthetic and real-world datasets are conducted to assess the performance of our algorithm with respect to multiple cluster quality metrics. Compared to such state-of-the-art competitors as k-means (or trimmed k-means) and tclust, empirical studies indicate competitiveness and oftentimes superiority of our algorithm.
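To make the idea of a concentration step on a trimmed negative log-likelihood concrete, the following is a minimal sketch and not the dissertation's actual procedure. It assumes Gaussian clusters, hard assignments, and no mixing weights. The function names (`gaussian_logpdf`, `concentration_step`, `fit_trimmed`) and the `ridge` parameter are hypothetical. The ridge term is only a simple stand-in for the regularization technique mentioned in the abstract, and the warmstart is supplied by the caller rather than produced by the proposed sampling procedure.

```python
import numpy as np


def gaussian_logpdf(X, mean, cov):
    """Row-wise log density of N(mean, cov), computed via a Cholesky factor."""
    p = X.shape[1]
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, (X - mean).T)  # whitened residuals, shape (p, n)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (p * np.log(2.0 * np.pi) + log_det + np.sum(z ** 2, axis=0))


def concentration_step(X, means, covs, n_keep, ridge=1e-6):
    """One assign / trim / refit pass (illustrative assumption, see lead-in)."""
    n, p = X.shape
    k = len(means)
    log_dens = np.column_stack(
        [gaussian_logpdf(X, means[j], covs[j]) for j in range(k)]
    )
    labels = np.argmax(log_dens, axis=1)        # most likely cluster per point
    best = log_dens[np.arange(n), labels]
    keep = np.sort(np.argsort(best)[-n_keep:])  # retain the n_keep best-fitting points
    objective = float(-best[keep].sum())        # trimmed negative log-likelihood

    new_means, new_covs = [], []
    for j in range(k):
        members = X[keep][labels[keep] == j]
        if len(members) <= p:                   # too few points: keep old estimates
            new_means.append(means[j])
            new_covs.append(covs[j])
            continue
        mu = members.mean(axis=0)
        centered = members - mu
        # Ridge term: a simple stand-in that keeps the covariance well-conditioned.
        cov = centered.T @ centered / len(members) + ridge * np.eye(p)
        new_means.append(mu)
        new_covs.append(cov)
    return new_means, new_covs, labels, keep, objective


def fit_trimmed(X, means, covs, n_keep, max_iter=100):
    """Repeat concentration steps until the retained set and labels stop changing."""
    history, prev = [], None
    for _ in range(max_iter):
        means, covs, labels, keep, obj = concentration_step(X, means, covs, n_keep)
        history.append(obj)
        state = (keep.tobytes(), labels[keep].tobytes())
        if state == prev:
            break
        prev = state
    return means, covs, labels, keep, history
```

In this sketch, `n_keep` plays the role of the number of untrimmed observations, and `history` records the objective at each pass so that its decrease can be inspected. Because only finitely many retained sets and labelings exist, a loop of this kind stops once the partition repeats.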
Received from ProQuest
Andrews Tawiah Anum
Anum, Andrews Tawiah, "Theoretical And Computational Aspects Of Robust Cluster Analysis For Multivariate And High-Dimensional Datasets" (2023). Open Access Theses & Dissertations. 3758.