Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent

Wenliang Zhong1, Rob Barton2, Lucas Goncalves2, Kushal Kumar2, Feng Jiang1, Hehuan Ma1, Yuzhi Guo1, Vidit Bansal2, Karim Bouyarmane2, Junzhou Huang1
1The University of Texas at Arlington    2Amazon
CVPR 2026
Figure: Overview of the Guideline-Driven Clustering Agent.

We introduce the first universal clustering framework that handles diverse image clustering scenarios through textual guidelines, spanning general to fine-grained tasks, global to local criteria, and balanced to long-tail distributions.

Abstract

Unifying image clustering across different clustering scenarios remains challenging due to fundamental gaps among tasks. We introduce a Guideline-Driven Image Clustering Agent, the first universal framework that bridges these gaps through textual guidelines. To incorporate complex guidelines without task-specific training, we propose Generative Concept Proxy Modeling (GCPM), which generates guideline-aware embeddings via concept proxy extraction. For scenarios requiring automatic cluster discovery, we introduce an LLM Traversal based on a Minimum Spanning Tree (MST) that applies LLM reasoning selectively, only for complex semantic judgments. Our method generalizes across clustering scenarios spanning general to fine-grained categorization, global to local criteria, and balanced to long-tail distributions, and it consistently outperforms specialized methods across these tasks.

Method

Our framework consists of two main components:

Generative Concept Proxy Modeling (GCPM) extracts guideline-aware textual descriptions from images via a multimodal LLM and encodes them into embeddings for efficient clustering. This training-free approach enables cross-domain generalization while outperforming supervised methods on public benchmarks.
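The describe, embed, then cluster flow of GCPM can be illustrated with a toy sketch. This is not the authors' implementation: `toy_describe` is a hypothetical stand-in for the multimodal LLM call (here it only looks up an attribute of a hand-written record), `toy_embed` is a hypothetical bag-of-words stand-in for a text encoder, and the small K-Means routine and all names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence

Vector = List[float]


def toy_describe(image: Dict[str, str], guideline: str) -> str:
    """Hypothetical stand-in for a multimodal LLM: return the concept proxy,
    i.e. a short text about the aspect of the image the guideline asks for."""
    return image[guideline]


def toy_embed(text: str, vocab: Sequence[str]) -> Vector:
    """Hypothetical stand-in for a text encoder: bag-of-words counts."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iters: int = 20) -> List[int]:
    """Plain K-Means with deterministic farthest-point initialisation."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Move each non-empty center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_by_guideline(images: List[Dict[str, str]], guideline: str, k: int) -> List[int]:
    """Describe each image under the guideline, embed the text, cluster."""
    descriptions = [toy_describe(img, guideline) for img in images]
    vocab = sorted({w for d in descriptions for w in d.lower().split()})
    embeddings = [toy_embed(d, vocab) for d in descriptions]
    return kmeans(embeddings, k)


# Toy "images": the same items grouped under two different guidelines.
images = [
    {"color": "red", "species": "apple"},
    {"color": "green", "species": "apple"},
    {"color": "red", "species": "cherry"},
    {"color": "green", "species": "grape"},
]
by_color = cluster_by_guideline(images, "color", k=2)      # [0, 1, 0, 1]
by_species = cluster_by_guideline(images, "species", k=3)  # [0, 0, 1, 2]
```

The point of the sketch is that changing only the guideline changes the partition, while the embedding and clustering steps stay the same and require no training.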

MST-based LLM Traversal refines initial clusters by constructing a Minimum Spanning Tree and selectively querying the LLM for semantic merging decisions. This hybrid design leverages embedding-based efficiency for routine clustering decisions while applying selective LLM reasoning only where semantic complexity demands it. The expected number of LLM calls is O(M log M), compared to O(M²) for naive pairwise comparison methods.
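A minimal sketch of the idea of restricting LLM judgments to MST edges follows. It is one simplified reading, not the traversal described in the paper: it builds the MST over cluster centroids with Prim's algorithm, visits every tree edge from shortest to longest, and asks a `should_merge` callback (a hypothetical stand-in for the LLM) about each one. It therefore makes exactly M - 1 calls on M initial clusters and does not reproduce the paper's traversal order or its expected call count. All function names and the toy data are assumptions.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Edge = Tuple[float, int, int]  # (weight, u, v)


def minimum_spanning_tree(centroids: Sequence[Vector]) -> List[Edge]:
    """Prim's algorithm on the complete Euclidean graph over the centroids."""
    m = len(centroids)
    in_tree = [False] * m
    best = [math.inf] * m   # cheapest known connection of each node to the tree
    parent = [-1] * m
    if m:
        best[0] = 0.0
    edges: List[Edge] = []
    for _ in range(m):
        u = min((i for i in range(m) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(m):
            if not in_tree[v]:
                d = math.dist(centroids[u], centroids[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def merge_along_mst(
    centroids: Sequence[Vector],
    should_merge: Callable[[int, int], bool],
) -> Tuple[List[int], int]:
    """Ask `should_merge` only about MST edges; return (labels, number of calls)."""
    root = list(range(len(centroids)))

    def find(i: int) -> int:
        while root[i] != i:
            root[i] = root[root[i]]  # path halving
            i = root[i]
        return i

    calls = 0
    for _, u, v in sorted(minimum_spanning_tree(centroids)):
        calls += 1
        if should_merge(u, v):
            root[find(u)] = find(v)
    roots = [find(i) for i in range(len(centroids))]
    relabel = {r: n for n, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots], calls


# Toy initial clusters: two dog breeds, two cat breeds, one vehicle.
centroids = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
category = ["dog", "dog", "cat", "cat", "vehicle"]
labels, calls = merge_along_mst(centroids, lambda u, v: category[u] == category[v])
# labels == [0, 0, 1, 1, 2]; calls == 4, versus 10 for all pairs of 5 clusters
```

In this toy run the callback is a table lookup; in practice it would wrap a prompt that shows the LLM the two clusters and the guideline and asks whether they belong together.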

Figure: Method overview showing GCPM and MST-based LLM Traversal.

Results

Our framework achieves state-of-the-art results across diverse clustering tasks without task-specific training:

  • General Clustering: 94.1% ACC on CIFAR-10, 98.8% on STL-10, 98.8% on ImageNet-10 (with K-Means)
  • Multiple Clustering: 99.9 NMI on Fruit (color+species average), 90.0 NMI on Cards (number+suits average)
  • Long-tail Clustering (ABO-LC): 55.7% ACC and 93.3 NMI with MST Traversal, significantly outperforming the IC|TC baseline (5.5% ACC)
  • MST Traversal improvements: ARI gain of up to 72.1 on ImageNet-10 when the number of clusters is unknown
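For readers unfamiliar with the ACC metric, clustering accuracy is conventionally computed under the best one-to-one matching between predicted clusters and ground-truth classes, since cluster IDs are arbitrary. The sketch below assumes that convention (this page does not spell out its evaluation protocol) and uses brute-force matching, which is only practical for a handful of clusters; real evaluations typically use the Hungarian algorithm instead. The function name is an assumption.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, Sequence


def clustering_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of samples that agree under the best one-to-one
    cluster-to-class matching (brute force over all matchings)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    overlap = Counter(zip(y_true, y_pred))   # (class, cluster) -> count
    classes = list(dict.fromkeys(y_true))
    clusters = list(dict.fromkeys(y_pred))
    if len(classes) <= len(clusters):
        # Give every class its own distinct cluster.
        best = max(
            sum(overlap[(c, k)] for c, k in zip(classes, perm))
            for perm in permutations(clusters, len(classes))
        )
    else:
        # Fewer clusters than classes: give every cluster its own class.
        best = max(
            sum(overlap[(c, k)] for c, k in zip(perm, clusters))
            for perm in permutations(classes, len(clusters))
        )
    return best / len(y_true)


truth = ["cat", "cat", "dog", "dog", "dog"]
pred = [1, 1, 0, 0, 1]
acc = clustering_accuracy(truth, pred)  # 0.8: cat<->1 and dog<->0 match 4 of 5
```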

BibTeX

@inproceedings{zhong2026universal,
  title={Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent},
  author={Zhong, Wenliang and Barton, Rob and Goncalves, Lucas and Kumar, Kushal and Jiang, Feng and Ma, Hehuan and Guo, Yuzhi and Bansal, Vidit and Bouyarmane, Karim and Huang, Junzhou},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}