Probing Deep Network Behavior with Internal Influence

In recent years, deep neural networks have become increasingly powerful at tasks that previously only humans had mastered. Deep learning is now widely used, but even among its many practitioners, its inner workings are far from well understood. As the application of ML has grown, so has the need for algorithmic transparency: the ability to understand why algorithms deployed in the real world make the decisions they do. In recent work [1] with my colleagues at CMU, appearing in ITC 2018, we focus on bringing greater transparency to these mysterious, yet effective, machine learning techniques.

A Case for Algorithmic Transparency

Machine learning has become increasingly powerful and ubiquitous. A 2018 study [2] found that a contemporary deep network could more accurately diagnose retinal disease than retina specialists and optometrists. However, despite their expanding capability and proliferation into sensitive domains, deep networks are still found to discriminate, overfit, and be vulnerable to attacks. For example, recent work out of CMU [3] has shown that specially-crafted 3D-printed eyeglasses can fool state-of-the-art face recognition models.

There is therefore a growing need to address these problems through a deeper understanding of modern machine learning methods, which remain worryingly opaque.

What Do We Want to Explain?

When a deep neural network makes a decision, how do we know we can trust it? We often use measures like generalization error to gain a degree of confidence in our models’ reliability. Of course, this doesn’t give us certainty about individual predictions; furthermore, when a network is deployed in the real world, it may behave unexpectedly if the training and test data don’t capture the distribution in the wild. Another approach to increasing trust is to explain key aspects of the predictions made by trained models. For example, we might wonder: does the model make decisions for the right reasons? When we can reveal the inner workings behind a given prediction, we are better informed about whether that prediction is trustworthy.

Similarly, we might want to know, did our model learn a pattern that we overlooked but might find useful? If our model made a mistake, why? How can we teach the model not to make the same kind of mistake in the future?

Figure 1: (left) an example of an explanation for the query, "why was the image classified as a sports car?" (right) an example of an explanation for the query, "why was the image classified as a sports car rather than a convertible?"

Answering these high-level questions requires the ability to pose a rich set of queries about a model’s predictive behavior. This is precisely what an explanation framework is meant to provide: its utility comes from its ability to express and accurately answer such queries. Relative to the growing body of prior work on DNN explanations, our work provides greater utility by generalizing to a larger space of explanations that admit a more diverse set of queries. In particular, we consider the following axes:

Quantity of interest | The quantity of interest lets us specify what we want to explain. Often, this is the output of the network corresponding to a particular class, addressing, e.g., why did the model classify the image in Figure 1 as a sports car? However, we could also consider various combinations of outputs, allowing us to ask more specific questions, such as, why did the model classify this image as a sports car and not a convertible? As shown in Figure 1, the former may highlight general “car features,” such as tires, while the latter (called a comparative explanation) might focus on the roof of the car, a “car feature” not shared by convertibles. Our work is the first to provide support for comparative explanations.

Distribution of interest | The distribution of interest lets us specify the set of instances over which we want our explanations to be faithful. In some cases, we may want to explain the model’s behavior on a particular instance, whereas other times we may be interested in a more general behavior over a class of instances. Previous explanation frameworks have tended to work only for individual instances.

Internal slice | The slice, or layer, of the network provides flexibility over the level of abstraction for the explanation. In a low layer, an explanation may highlight the edges that were most important in identifying an object like a face, while in a higher layer, the explanation might highlight high-level features such as a nose or mouth. We found that by raising the level of abstraction, we were able to produce explanations that generalized over larger sets of instances.
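
To make these axes concrete, here is a minimal sketch of how they might parameterize an influence computation, assuming a PyTorch model; `model`, `layer`, and the class indices are hypothetical placeholders for illustration, not part of our framework’s API.

```python
import torch

# A minimal sketch (PyTorch assumed) of how the three axes above might
# parameterize an influence computation. `model`, `layer`, and the class
# indices below are hypothetical placeholders.

def internal_influence(model, layer, inputs, quantity_of_interest):
    """Average gradient of the quantity of interest with respect to the
    activations at an internal slice, over a distribution of interest.

    model                -- a torch.nn.Module
    layer                -- the internal slice: a sub-module of `model`
    inputs               -- a batch drawn from the distribution of interest
    quantity_of_interest -- maps output logits to one scalar per example
    """
    captured = []

    def hook(module, inp, out):
        out.retain_grad()          # keep gradients for this intermediate tensor
        captured.append(out)

    handle = layer.register_forward_hook(hook)
    logits = model(inputs)
    handle.remove()

    z = captured[0]
    quantity_of_interest(logits).sum().backward()

    # Average over the distribution of interest (here, the batch dimension).
    return z.grad.mean(dim=0)

# Example quantities of interest (class indices are illustrative only):
SPORTS_CAR, CONVERTIBLE = 817, 511
qoi_sports_car = lambda logits: logits[:, SPORTS_CAR]
# Comparative: "why a sports car rather than a convertible?"
qoi_comparative = lambda logits: logits[:, SPORTS_CAR] - logits[:, CONVERTIBLE]
```

Swapping the quantity of interest changes the question being asked, widening or narrowing the batch changes the set of instances the explanation is faithful to, and choosing an earlier or later layer changes the level of abstraction.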

Influence and Interpretation

One clear use case for explanations is human consumption. To be fully leveraged by humans, explanations need to be interpretable: a large vector of numbers does not, in general, make us more confident that we understand what a network is doing. Previously, there have been two roughly separate lines of work within the subfield of DNN transparency: interpretation techniques and influence measures.

Interpretation | At a high level, deep learning simply computes higher- and higher-level features that make data points increasingly separable by class. Zeiler et al. [4] demonstrated that some of the features learned by CNNs trained on object-recognition tasks appeal to human intuition: edges, shapes, and eventually high-level concepts like eyes or car tires. The method of Zeiler et al. can be considered an interpretation technique; namely, it gives us a better understanding of what a feature learned by a deep network represents. Interpretation techniques give us a partial view of what a network has learned, but they don’t inform us of the causal relationships between the features and the outcome of the model, i.e., they give us an idea of what the high-level features are, but not how they are used.
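
As a rough illustration of what an interpretation technique provides (using a far simpler heuristic than Zeiler et al.’s deconvolutional visualizations; `model`, `layer`, and `dataset` are hypothetical placeholders), one can ask which inputs most strongly activate a given internal channel:

```python
import torch

# A simple interpretation heuristic (not Zeiler et al.'s method): rank inputs
# by how strongly they activate a chosen internal channel.
# `model`, `layer`, and `dataset` are hypothetical placeholders.

@torch.no_grad()
def top_activating_inputs(model, layer, dataset, channel, k=9):
    """Return the k inputs whose mean activation of `channel` at `layer` is largest."""
    captured = {}

    def hook(module, inp, out):
        captured["act"] = out

    handle = layer.register_forward_hook(hook)
    scores = []
    for x, _ in dataset:
        model(x.unsqueeze(0))
        # Average the channel's activation over its spatial extent.
        scores.append(captured["act"][0, channel].mean().item())
    handle.remove()

    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    return [dataset[i][0] for i in top]
```

If the returned images all contain, say, car tires, we have some evidence of what the channel represents, but still no evidence of how the rest of the network uses it.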

Influence | On the other hand, influence gives us an idea of how features are used, but doesn’t tell us what the features represent. This has been a severe limitation: prior work on influence measures, e.g., [5], [6], [7], has accordingly focused only on readily interpretable features, essentially the inputs.
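
For instance, the saliency maps of [5] measure influence at the input by taking the gradient of the class score with respect to the pixels. A minimal sketch, again assuming PyTorch, with `model` and `target_class` as hypothetical placeholders:

```python
import torch

# A minimal sketch of an input-level influence measure in the spirit of
# saliency maps [5]: the gradient of the class score with respect to the
# input pixels. `model` and `target_class` are hypothetical placeholders.

def input_saliency(model, image, target_class):
    """Per-pixel influence of the input on the score of `target_class`."""
    x = image.clone().unsqueeze(0).requires_grad_(True)
    score = model(x)[0, target_class]
    score.backward()
    # Collapse color channels so there is one importance value per pixel.
    return x.grad[0].abs().max(dim=0).values
```

The resulting heat map tells us which pixels mattered, but not which learned concepts they fed into.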

Our work unifies these two ideas, allowing us to make the influence of internal features meaningful. We view an explanation as comprising both an influence measure and an interpretation technique: the influence of a (possibly internal) feature is measured, and then the feature is interpreted, providing a meaningful explanation of what the feature represented and how it was used. This can be thought of as using the importance of each feature to guide our choice of which features to interpret. Practically, this means that we can take a large explanation, as could be produced by a technique like Integrated Gradients [6] or Grad-CAM [7], and decompose it into its components, as viewed by the network. This gives us a finer understanding of how the network is operating. Without this decomposition, explanations tend to be overly vague, and are therefore far less useful than the more nuanced explanations our framework is uniquely capable of producing. An example of this decomposition is shown in Figure 2; the boxed image on the far right of Figure 2 shows a typical explanation produced by Saliency Maps [5], which is notably less informative.

Figure 2: (left) image to be explained; (center) decomposed explanation given by Internal Influence [1]; (right) explanation produced by Saliency Maps [5].
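
The sketch below gives a simplified reading of this recipe rather than our exact implementation; `model`, `layer`, and `quantity_of_interest` are hypothetical placeholders, and the slice is assumed to be a convolutional layer. It measures the influence of each channel at an internal slice, then interprets the most influential channel by attributing its activation back to the input pixels.

```python
import torch

# A simplified sketch of the influence-then-interpret recipe described above;
# `model`, `layer`, and `quantity_of_interest` are hypothetical placeholders,
# and `layer` is assumed to be convolutional (activations shaped [1, C, H, W]).

def influence_directed_explanation(model, layer, image, quantity_of_interest):
    x = image.clone().unsqueeze(0).requires_grad_(True)
    captured = {}

    def hook(module, inp, out):
        out.retain_grad()
        captured["z"] = out

    handle = layer.register_forward_hook(hook)
    logits = model(x)
    handle.remove()
    z = captured["z"]

    # Step 1 (influence): gradient of the quantity of interest with respect to
    # each channel's activations, summed over the channel's spatial extent.
    quantity_of_interest(logits).sum().backward(retain_graph=True)
    channel_influence = z.grad[0].sum(dim=(1, 2))
    top_channel = channel_influence.argmax().item()

    # Step 2 (interpretation): attribute the most influential channel's
    # activation back to the input pixels, giving one panel of a decomposed
    # explanation like the center of Figure 2.
    x.grad = None
    z[0, top_channel].sum().backward()
    saliency = x.grad[0].abs().max(dim=0).values
    return top_channel, saliency
```

Repeating the interpretation step for the next few most influential channels produces the remaining panels of a decomposed explanation like the one shown in the center of Figure 2.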

Explanations at Work

Explanations are a key stepping stone on the path towards algorithmic transparency. They have direct applications in auditing models that are increasingly deployed in critical or sensitive areas, e.g., medical diagnoses and credit and loan decisions. Furthermore, we anticipate that explanations will aid in debugging arbitrary model behavior, and in combining machine and human efforts to solve complicated problems.

Finally, while human consumption is perhaps the most natural use for explanations, explanations can be used by algorithms as well. In upcoming work appearing at ICLR 2019 [8], my colleagues and I demonstrate that influence can be used to create a post-hoc regularizer that reduces undesirable bias in model predictions. My ongoing research also suggests that explanations have applications in areas such as security, data privacy, and fairness.


  1. Leino et al. "Influence-directed Explanations for Deep Convolutional Networks." arXiv.
  2. De Fauw et al. "Clinically Applicable Deep Learning for Diagnosis and Referral in Retinal Disease." Nature Medicine.
  3. Sharif et al. "Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-art Face Recognition." ACM CCS.
  4. Zeiler et al. "Visualizing and Understanding Convolutional Networks." arXiv.
  5. Simonyan et al. "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps." arXiv.
  6. Sundararajan et al. "Axiomatic Attribution for Deep Networks." arXiv.
  7. Selvaraju et al. "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization." arXiv.
  8. Leino et al. "Feature-wise Bias Amplification." arXiv.