K-means is an unsupervised clustering algorithm that partitions a dataset into k clusters, with each point belonging to exactly one cluster. Its purpose is to group data into categories, which can help provide structure to otherwise complex datasets. Although K-means is simple to implement and generally effective at grouping data, there is no guarantee that objects will be grouped correctly. This poster proposes a new supervised clustering algorithm, ClusterCat, that utilizes K-means. Supervised classification algorithms use labeled training items and categorize test points based on that training. Unsupervised algorithms instead generate clusters based on feature characteristics alone. ClusterCat is distinctive in that it is a supervised algorithm that leverages an unsupervised technique. ClusterCat first divides the dataset by known category label (supervised categorization) and then runs K-means within each category (unsupervised categorization). This process creates smaller subcategories, which are then used to make classification decisions. A test point is placed in a subcategory based on feature similarity, and its overarching category is deduced from the category that subcategory belongs to. ClusterCat shows promise for improving classification decisions and could help demystify complex data structures.
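The two-stage procedure described above can be illustrated with a minimal sketch. It is not the poster's implementation: the abstract does not say how k is chosen, which similarity measure is used, or how a test point is matched to a subcategory. The sketch assumes a fixed k per category, Euclidean distance, and assignment to the nearest subcategory centroid. The names `kmeans`, `fit_clustercat`, and `predict_clustercat`, and the toy data, are hypothetical.

```python
import math
import random
from collections import defaultdict


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means; returns a list of centroids (tuples)."""
    rng = random.Random(seed)
    k = min(k, len(points))
    centroids = rng.sample(points, k)  # initialise from k data points
    for _ in range(iters):
        # Assignment step: group each point with its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its previous centroid).
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def fit_clustercat(points, labels, k=2, seed=0):
    """Supervised split by label, then K-means inside each category.

    Returns a list of (centroid, label) pairs, one per subcategory.
    Assumption: the same k is used for every category.
    """
    by_label = defaultdict(list)
    for p, y in zip(points, labels):
        by_label[y].append(tuple(p))
    return [(c, y)
            for y, pts in by_label.items()
            for c in kmeans(pts, k, seed=seed)]


def predict_clustercat(subcategories, point):
    """Assign the point to its nearest subcategory centroid (assumed rule)
    and return the category that subcategory belongs to."""
    _, label = min(subcategories, key=lambda s: math.dist(point, s[0]))
    return label


# Toy data: category "A" forms two distant groups, "B" sits between them.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (5, 5), (5, 6), (6, 5), (6, 6)]
labels = ["A"] * 6 + ["B"] * 4

subcategories = fit_clustercat(points, labels, k=2)
print(predict_clustercat(subcategories, (0.5, 0.5)))
print(predict_clustercat(subcategories, (5.5, 5.5)))
```

The toy data shows why subcategories can matter. The overall mean of category "A" falls in the middle of category "B", so a single centroid per category would confuse the two. Two subcategory centroids for "A" keep its two groups separate.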
DiStefano, Paul, "ClusterCat Algorithm: Supervised Subcategory K-Means Clustering" (2021). Research Days Posters 2021. 81.