Abstract
Unifying image clustering across different clustering scenarios remains challenging due to fundamental gaps among tasks. We introduce a Guideline-Driven Image Clustering Agent, the first universal framework that bridges these gaps through textual guidelines. To incorporate complex guidelines without task-specific training, we propose Generative Concept Proxy Modeling (GCPM), which generates guideline-aware embeddings via concept proxy extraction. For scenarios requiring automatic cluster discovery, we introduce an LLM Traversal based on a Minimum Spanning Tree (MST) that applies LLM reasoning selectively, only for complex semantic judgments. The method generalizes across scenarios ranging from general to fine-grained categorization, from global to local criteria, and from balanced to long-tail distributions, and it consistently outperforms specialized methods on these tasks.
Method
Our framework consists of two main components:
Generative Concept Proxy Modeling (GCPM) uses a multimodal LLM to extract guideline-aware textual descriptions from images, then encodes those descriptions into embeddings for efficient clustering. This training-free approach enables cross-domain generalization while outperforming supervised methods on public benchmarks.
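The describe-embed-cluster flow above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `describe` is a hypothetical stand-in for the multimodal LLM call (here an "image" is just a dict of attributes), `embed_all` is a toy bag-of-words encoder rather than a real text-embedding model, and the small `kmeans` helper replaces whatever clustering back end is actually used. It is not the paper's implementation.

```python
import re
from collections import Counter

def describe(image, guideline):
    """Hypothetical stand-in for a multimodal LLM: report only what the guideline asks for."""
    return f"{guideline}: {image[guideline]}"

def embed_all(texts):
    """Toy text encoder (assumption): L2-normalised bag-of-words over the corpus vocabulary."""
    tokenised = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    vocab = {tok: i for i, tok in enumerate(sorted({t for toks in tokenised for t in toks}))}
    vectors = []
    for toks in tokenised:
        vec = [0.0] * len(vocab)
        for tok, n in Counter(toks).items():
            vec[vocab[tok]] = float(n)
        norm = sum(x * x for x in vec) ** 0.5 or 1.0
        vectors.append([x / norm for x in vec])
    return vectors

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20):
    # Farthest-point initialisation keeps this toy example deterministic.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)))
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_by_guideline(images, guideline, k):
    descriptions = [describe(img, guideline) for img in images]
    return kmeans(embed_all(descriptions), k)

# Toy data: the same four items grouped under two different guidelines.
images = [
    {"color": "red", "species": "apple"},
    {"color": "yellow", "species": "banana"},
    {"color": "red", "species": "cherry"},
    {"color": "yellow", "species": "apple"},
]
by_color = cluster_by_guideline(images, "color", k=2)
by_species = cluster_by_guideline(images, "species", k=3)
```

Because the guideline only changes the text that gets embedded, the same images yield different partitions for "color" and "species" without retraining anything, which is the property the training-free design relies on.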
MST-based LLM Traversal refines initial clusters by constructing a Minimum Spanning Tree and selectively querying the LLM for semantic merging decisions. This hybrid design leverages embedding-based efficiency for routine clustering decisions while applying selective LLM reasoning only where semantic complexity demands it. The expected number of LLM calls is O(M log M), compared to O(M²) for naive pairwise comparison methods.
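To make the hybrid idea concrete, here is a deliberately simplified sketch: build an MST over the initial cluster centroids, decide clear-cut edges by distance alone, and ask an oracle only about ambiguous edges. The thresholds `auto_merge` and `auto_split`, the `same_concept` callback (a hypothetical stand-in for the LLM), and the toy centroids are all assumptions. This simplification inspects each MST edge at most once and does not reproduce the paper's traversal procedure or its call-count analysis.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (weight, i, j) edges."""
    best = {j: (math.dist(points[0], points[j]), 0) for j in range(1, len(points))}
    edges = []
    while best:
        j = min(best, key=lambda x: best[x][0])
        w, i = best.pop(j)
        edges.append((w, i, j))
        for m in best:
            d = math.dist(points[j], points[m])
            if d < best[m][0]:
                best[m] = (d, j)
    return edges

def merge_clusters(centroids, same_concept, auto_merge=0.5, auto_split=2.0):
    """Merge initial clusters along MST edges, querying the oracle only when unsure."""
    parent = list(range(len(centroids)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    llm_calls = 0
    for w, i, j in sorted(minimum_spanning_tree(centroids)):
        if w <= auto_merge:        # clearly close: embeddings decide
            merge = True
        elif w >= auto_split:      # clearly far: embeddings decide
            merge = False
        else:                      # ambiguous: defer to the (stubbed) LLM
            llm_calls += 1
            merge = same_concept(i, j)
        if merge:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(centroids))], llm_calls

# Toy centroids on a line; the concept list plays the role of the LLM's judgment.
centroids = [(0.0, 0.0), (0.2, 0.0), (1.2, 0.0), (2.2, 0.0), (6.0, 0.0)]
concepts = ["dog", "dog", "dog", "cat", "truck"]
labels, llm_calls = merge_clusters(centroids, lambda i, j: concepts[i] == concepts[j])
all_pairs = len(centroids) * (len(centroids) - 1) // 2
```

In this toy run only the two mid-distance edges reach the oracle, whereas comparing every pair of the five initial clusters would require `all_pairs` queries.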
Results
Our framework achieves state-of-the-art results across diverse clustering tasks without task-specific training:
- General Clustering: 94.1% ACC on CIFAR-10, 98.8% on STL-10, 98.8% on ImageNet-10 (with K-Means)
- Multiple Clustering: 99.9 NMI on Fruit (color+species average), 90.0 NMI on Cards (number+suits average)
- Long-tail Clustering (ABO-LC): 55.7% ACC, 93.3 NMI with MST Traversal, significantly outperforming the IC|TC baseline (5.5% ACC)
- MST Traversal improvements: ARI gains of up to 72.1 on ImageNet-10 when the number of clusters is unknown
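For readers unfamiliar with the ACC metric reported above, the sketch below assumes the conventional definition of clustering accuracy: the fraction of samples labeled correctly under the best one-to-one mapping from cluster IDs to ground-truth classes. The source does not spell out its evaluation code, so treat this as an illustration of the usual convention. The brute-force search over mappings is only practical for a handful of clusters; larger problems typically use the Hungarian algorithm instead.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids):
    """Best-case accuracy over one-to-one assignments of clusters to classes."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    # Pad with None so surplus clusters can be left unmatched.
    slots = classes + [None] * max(0, len(clusters) - len(classes))
    best = 0
    for perm in permutations(slots, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
        best = max(best, hits)
    return best / len(true_labels)

# Toy labels: cluster 1 mostly holds cats, cluster 0 holds dogs.
acc = clustering_accuracy(["cat", "cat", "dog", "dog", "dog"], [1, 1, 0, 0, 1])
```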
BibTeX
@inproceedings{zhong2026universal,
  title={Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent},
  author={Zhong, Wenliang and Barton, Rob and Goncalves, Lucas and Kumar, Kushal and Jiang, Feng and Ma, Hehuan and Guo, Yuzhi and Bansal, Vidit and Bouyarmane, Karim and Huang, Junzhou},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}