Unique Concept Vectors through Latent Space Decomposition
To improve interpretability through concept vectors, this reverse-engineering approach automates concept identification by analyzing the latent space of a deep neural network with Singular Value Decomposition (SVD). The framework combines factorization, latent space clustering, and output-sensitivity analysis to isolate directions that correspond to unique concepts.
This study introduces a novel post-hoc, unsupervised method for interpreting deep learning models by automatically uncovering the concepts they learn during training. The method decomposes the latent space of a chosen layer and refines the result through unsupervised clustering, yielding concept vectors aligned with directions of high variance. These vectors represent semantically distinct concepts that are relevant to the model's predictions. Experiments demonstrate that the concepts are interpretable, coherent, and relevant to the task. The method is also shown to identify outlier training samples affected by various confounding factors, which supports bias detection and the discovery of error sources in training data across different data types and model architectures.
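The decomposition and sensitivity steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `svd_directions` and `output_sensitivity`, the synthetic activations, and the linear `head` are all hypothetical, and the clustering refinement is omitted because the text does not specify it. The sketch assumes that candidate concept directions are the top right singular vectors of mean-centered layer activations, and that output sensitivity can be approximated by a finite difference along each direction.

```python
import numpy as np


def svd_directions(activations, k):
    """Return the top-k right singular vectors (unit-norm rows) of the
    mean-centered activations, together with their singular values."""
    centered = activations - activations.mean(axis=0, keepdims=True)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k], singular_values[:k]


def output_sensitivity(head, activations, direction, eps=1e-2):
    """Finite-difference estimate of how much the head's output changes
    when every activation is shifted by eps along `direction`."""
    shifted = activations + eps * direction
    return float(np.mean(np.abs(head(shifted) - head(activations))) / eps)


# Synthetic stand-in for a layer's activations (assumption): 200 samples,
# 8 latent units, with most variance along units 0 and 1.
rng = np.random.default_rng(0)
scales = np.array([5.0, 2.0] + [0.1] * 6)
latents = rng.normal(size=(200, 8)) * scales

# Hypothetical linear head that reads only latent unit 0.
weights = np.zeros(8)
weights[0] = 1.0


def head(z):
    return z @ weights


directions, svals = svd_directions(latents, k=2)
scores = [output_sensitivity(head, latents, v) for v in directions]

for i, (s, score) in enumerate(zip(svals, scores)):
    print(f"direction {i}: singular value {s:.2f}, sensitivity {score:.3f}")
```

In this toy setup the highest-variance direction is also the one the head depends on, so it receives the larger sensitivity score. In a real model the two rankings can differ, which is presumably why a sensitivity analysis is combined with the factorization.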