Concept Distillation: Leveraging Human-Centered Explanations for Model Improvement

NeurIPS 2023

¹M42, UAE  ²CVIT, KCIS, IIIT Hyderabad

Abstract

Humans use abstract concepts, rather than hard features, for generalization. Recent interpretability research has focused on human-centered concept explanations of neural networks. We present Concept Distillation, a novel method and framework for concept-sensitive training that induces human-centered knowledge into the model. We use Concept Activation Vectors (CAVs) to estimate the model's sensitivity and possible biases toward a given concept, and we extend CAVs from post-hoc analysis to ante-hoc training. We distill the conceptual knowledge of a pretrained knowledgeable teacher into a student model focused on a single downstream task. Our method can sensitize or desensitize the student model toward concepts. We show applications of concept-sensitive training to debias classification and to induce prior knowledge into a reconstruction problem. We also introduce the TextureMNIST dataset to evaluate the presence of complex texture biases. We show that concept-sensitive training can improve model interpretability, reduce biases, and induce prior knowledge.
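To make the CAV idea concrete, below is a minimal sketch of how a CAV and a model's concept sensitivity are typically computed, following the standard TCAV recipe (a linear classifier separating concept activations from random activations, and a directional derivative along its normal). The function names, the logistic-regression fitting, and the activation-capturing details are our illustrative assumptions, not the authors' released code:

```python
import torch
from sklearn.linear_model import LogisticRegression

def compute_cav(concept_acts: torch.Tensor, random_acts: torch.Tensor) -> torch.Tensor:
    """Fit a linear classifier separating concept-example activations from
    random-example activations (both of shape [N, D]); the CAV is the unit
    normal to its decision boundary, as in the standard TCAV construction."""
    X = torch.cat([concept_acts, random_acts]).detach().cpu().numpy()
    y = [1] * len(concept_acts) + [0] * len(random_acts)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = torch.tensor(clf.coef_[0], dtype=torch.float32)
    return cav / cav.norm()

def concept_sensitivity(logit: torch.Tensor, layer_act: torch.Tensor,
                        cav: torch.Tensor) -> torch.Tensor:
    """Directional derivative of a class logit along the CAV: how much the
    prediction moves when the layer activation (which must be part of the
    autograd graph) shifts in the concept direction."""
    grad, = torch.autograd.grad(logit.sum(), layer_act, retain_graph=True)
    return (grad.flatten(1) @ cav).mean()
```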



Our main contributions

Concept Loss controls the model's sensitivity to a concept (see the training sketch below).

Prototypes enable sensitivity calculation at intermediate layers.

A pretrained teacher helps the model avoid spurious concept associations caused by bias.
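As a rough illustration of how a concept loss of this kind could be combined with the task loss during ante-hoc training: the sketch below penalizes the projection of the logit gradient onto the CAV so that minimizing it desensitizes the model to the concept. The exact loss form, the `train_step` wiring, and the `lambda_c` weighting are illustrative assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def concept_loss(logits, layer_act, cav, desensitize=True):
    """Hypothetical concept loss: the absolute projection of the logit
    gradient onto the CAV. Minimizing it pushes concept sensitivity toward
    zero; negating it would instead encourage sensitivity to the concept."""
    grad, = torch.autograd.grad(logits.sum(), layer_act, create_graph=True)
    sens = (grad.flatten(1) @ cav).abs().mean()
    return sens if desensitize else -sens

def train_step(model, optimizer, x, y, cav, lambda_c=0.1):
    """One concept-sensitive training step: task loss plus a weighted
    concept term (lambda_c is a hypothetical weighting hyperparameter)."""
    optimizer.zero_grad()
    layer_act, logits = model(x)  # assumes the model also returns the
                                  # intermediate activation the CAV lives in
    loss = F.cross_entropy(logits, y) + lambda_c * concept_loss(logits, layer_act, cav)
    loss.backward()
    optimizer.step()
    return loss.item()
```

`create_graph=True` is what makes the gradient projection itself differentiable, so the concept term can be backpropagated into the model's parameters alongside the task loss.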

Please refer to our paper for results.

BibTeX

@inproceedings{gupta2023concept,
  title={Concept Distillation: Leveraging Human-Centered Explanations for Model Improvement},
  author={Gupta, Avani and Saini, Saurabh and Narayanan, PJ},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023}
}