Constrained Clustering: Taxonomy, New Optimization Models, and Hybridizations with Singular Problems of Machine Learning

  1. González Almagro, Germán
Supervised by:
  1. Salvador García López Co-director
  2. José Ramón Cano de Amo Co-director

Defence university: Universidad de Granada

Fecha de defensa: 22 May 2023

  1. David Camacho Fernández Chair
  2. Natalia Díaz Rodríguez Secretary
  3. Alberto Cano Committee member

Type: Thesis


In the golden era of information, vast amounts of data are available to perform analysis on and extract valuable insight from. The area of science devoted to this problem is known as Knowledge Discovery in Databases; particularly, it is its branch of Data Science where this thesis is framed in. Specifically, this thesis focuses on clustering techniques that are capable of including relational information into the clustering process. This type of information does not fit into the classic supervised and unsupervised learning paradigms. However, the Semi- Supervised Learning (SSL) paradigm does provide us with the tools to perform clustering in the presence of relational information. This task is known as Constrained Clustering (CC). Four objectives shape this thesis: 1. A comprehensive study in the area of CC, from an SSL standpoint. The goal of this objective is to produce the first comprehensive analysis of the CC state-of-the-art, including a standardization of the experimental procedures and a ranking of all CC methods proposed so far. 2. The development of metaheuristic-based proposals for the CC problem, from both single-objective and multi-objective optimization perspectives. Two metaheuristicbased methods are designed to achieve this objective. Both are first proposed in this thesis and have been specifically designed to obtain quality solutions for the CC problem. An empirical study compares both methods against the state-of-the art in their specific areas, demonstrating their superiority. 3. The proposal of hybrid models for the CC problem. Motivated by the high quality results that hybrid methods usually obtain in the field of CC, this thesis introduces a new hybrid model that combines the two broadest categories in the area: partitional CC and constrained distance metric learning. This proposal also includes a procedure to automatically determine the relevance of the pieces that make up the set of relational information. The empirical study carried out provides evidence of the proposal being superior to the state-of-the-art. 4. Finally, motivated by the existence of real-world problems where multiple types of background knowledge are available, this thesis tackles the issue of combining relational and monotonicity information. This combination gives rise to the monotonic constrained clustering paradigm. The suitability of the problem is proved, and a baseline algorithm is proposed. The experimental study shows the superiority of monotonic constrained clustering algorithm over both purely constrained and purely monotonic clustering algorithms. This finding is backed by positive results in a classic set of benchmarks and a real-world monotonic constrained clustering problem. The four objectives have been successfully addressed, and the thesis has made significant contributions to the field of CC from the point of view of SSL. The comprehensive study carried out in the first objective provides a solid basis for understanding the state-of-the-art in CC and enables the standardization of experimental procedures, which is crucial for a scientifically sound comparison of different methods. The development of metaheuristicbased proposals in the second objective provides new and efficient techniques to solve the CC problem, while the proposed hybrid models in the third objective demonstrate the potential of combining different approaches to further improve the quality of the results. Finally, the proposed monotonic constrained clustering paradigm in the fourth objective addresses the problem of combining multiple types of background knowledge and achieves superior results over purely constrained and purely monotonic clustering algorithms.