Humans generalize using abstract concepts rather than hard features. Recent interpretability research has focused on human-centered concept explanations of neural networks. We present Concept Distillation, a novel method and framework for concept-sensitive training that induces human-centered knowledge into a model. We use Concept Activation Vectors (CAVs) to estimate a model's sensitivity and possible biases toward a given concept, and we extend CAVs from post-hoc analysis to ante-hoc training. We distill conceptual knowledge from a pretrained, knowledgeable teacher to a student model focused on a single downstream task. Our method can sensitize or desensitize the student model towards concepts. We show applications of concept-sensitive training to debiasing classification and to inducing prior knowledge into a reconstruction problem. We also introduce the TextureMNIST dataset to evaluate the presence of complex texture biases. We show that concept-sensitive training can improve model interpretability, reduce biases, and induce prior knowledge.
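To make the CAV step concrete, here is a minimal sketch of how a CAV and a TCAV-style concept sensitivity can be computed: a linear classifier is fit between concept and random activations, and the directional derivative of a class logit along the resulting direction is taken. The activation shapes, helper names, and layer choice are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (PyTorch + scikit-learn); names and shapes are illustrative.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def compute_cav(concept_acts: np.ndarray, random_acts: np.ndarray) -> np.ndarray:
    """Fit a linear classifier separating concept vs. random activations
    (each of shape (N, D)); the CAV is the unit normal of its boundary."""
    X = np.concatenate([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_[0]
    return cav / np.linalg.norm(cav)

def concept_sensitivity(logit: torch.Tensor, acts: torch.Tensor,
                        cav: torch.Tensor) -> torch.Tensor:
    """TCAV-style directional derivative: change in the class logit when the
    intermediate activations `acts` (a non-leaf tensor kept in the autograd
    graph during the forward pass) move along the concept direction `cav`."""
    grads = torch.autograd.grad(logit.sum(), acts, retain_graph=True)[0]
    return (grads.flatten(1) @ cav).mean()
```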
Concept Loss controls concept sensitivity (see the sketch after this list).
Prototypes enable sensitivity computation at intermediate layers.
A teacher model helps the student avoid spurious concept associations caused by bias.
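The sketch below illustrates one way such a concept loss could steer sensitivity during training: the gradient at an intermediate layer is pushed orthogonal to a CAV (to desensitize) or aligned with it (to sensitize), where the CAV may come from a teacher model. The loss form, weighting, and training-step names are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def concept_loss(output: torch.Tensor, acts: torch.Tensor,
                 cav: torch.Tensor, sensitize: bool = False) -> torch.Tensor:
    """Align (or de-align) the gradient of the task output w.r.t. intermediate
    activations with the concept direction `cav` (e.g. from a teacher model)."""
    grads = torch.autograd.grad(output.sum(), acts, create_graph=True)[0]
    cos = F.cosine_similarity(grads.flatten(1), cav.unsqueeze(0), dim=1)
    # Desensitize: drive alignment toward zero; sensitize: drive it toward one.
    return (1.0 - cos).mean() if sensitize else cos.abs().mean()

# Hypothetical training step: mix the task loss with the concept loss computed
# at an intermediate layer; `lam` is an illustrative weighting hyperparameter.
# total_loss = task_loss + lam * concept_loss(logits[:, target_class], acts, cav)
# total_loss.backward()
```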
@inproceedings{gupta2023concept,
  title={Concept Distillation: Leveraging Human-Centered Explanations for Model Improvement},
  author={Gupta, Avani and Saini, Saurabh and Narayanan, PJ},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023}
}