Machine learning models are increasingly used to determine and control real-world decisions, and their ubiquity and power present a challenge. D&S fellow Sorelle Friedler and a team of researchers have developed a technique for black-box auditing of machine-learning classification models, offering a deeper understanding of these models' complex and opaque behavior.
Abstract: Data-trained predictive models are widely used to assist in decision making. But they are used as black boxes that output a prediction or score. It is therefore hard to acquire a deeper understanding of model behavior, and in particular of how different attributes influence the model prediction. This is very important when trying to interpret the behavior of complex models, or to ensure that certain problematic attributes (like race or gender) are not unduly influencing decisions. In this paper, we present a technique for auditing black-box models: we can study the extent to which existing models take advantage of particular features in the dataset without knowing how the models work. We show how a class of techniques originally developed for the detection and repair of disparate impact in classification models can be used to study the sensitivity of any model with respect to any feature subsets. Our approach does not require the black-box model to be retrained. This is important if (for example) the model is only accessible via an API, and it distinguishes our work from other methods that investigate feature influence, such as feature selection. We present experimental evidence for the effectiveness of our procedure using a variety of publicly available datasets and models. We also validate our procedure using techniques from interpretable learning and feature selection.
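The abstract describes the procedure only at a high level. Below is a minimal sketch of the general idea, not the authors' implementation. It assumes the "repair" step is a simple full quantile repair of the kind used in the disparate-impact literature, that the audited feature is categorical, and that the black box is any callable `black_box(row)` standing in for a model reachable only through its predictions. The helper names (`repair_column`, `obscure`, `influence`) and the toy data are hypothetical. To obscure a feature, every other numeric column is repaired so that its distribution is the same in each group defined by that feature; the feature itself is then overwritten with a constant. Influence is scored here as the accuracy lost when the unchanged model is queried on the obscured data, with no retraining.

```python
from collections import Counter
from statistics import median
from typing import Any, Callable, Dict, List, Sequence

Row = Dict[str, Any]


def repair_column(rows: List[Row], col: str, group_col: str) -> List[Row]:
    """Full quantile repair: give numeric `col` the same distribution in every
    group of `group_col` by mapping each value to the median, across groups,
    of the value at the same within-group quantile."""
    groups: Dict[Any, List[int]] = {}
    for i, r in enumerate(rows):
        groups.setdefault(r[group_col], []).append(i)
    # Each group's sorted values act as its empirical quantile function.
    sorted_vals = [sorted(rows[i][col] for i in idx) for idx in groups.values()]

    def quantile(vals: List[float], u: float) -> float:
        return vals[min(int(u * len(vals)), len(vals) - 1)]

    repaired = [dict(r) for r in rows]  # copy; never mutate the input
    for idx in groups.values():
        order = sorted(idx, key=lambda i: rows[i][col])  # rank within own group
        for rank, i in enumerate(order):
            u = rank / len(order)
            repaired[i][col] = median([quantile(v, u) for v in sorted_vals])
    return repaired


def obscure(rows: List[Row], feature: str, numeric_cols: Sequence[str]) -> List[Row]:
    """Hide categorical `feature` (hypothetical helper): repair every other
    numeric column with respect to it, then overwrite the feature with its
    most common value (an assumption of this sketch)."""
    out = rows
    for col in numeric_cols:
        if col != feature:
            out = repair_column(out, col, group_col=feature)
    constant = Counter(r[feature] for r in rows).most_common(1)[0][0]
    return [{**r, feature: constant} for r in out]


def accuracy(model: Callable[[Row], int], rows: List[Row], labels: List[int]) -> float:
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def influence(model, rows, labels, feature, numeric_cols) -> float:
    """Accuracy lost when `feature` is obscured; the model is only queried."""
    return accuracy(model, rows, labels) - accuracy(
        model, obscure(rows, feature, numeric_cols), labels)


# Toy data: x is a proxy for group g; c is unrelated to the label.
rows = [{"g": "a" if x <= 4 else "b", "c": "p" if x % 2 else "q", "x": float(x)}
        for x in range(1, 9)]
labels = [1 if r["g"] == "b" else 0 for r in rows]


def black_box(row: Row) -> int:
    # Hypothetical stand-in for a model reachable only through predictions.
    return 1 if row["x"] > 4.5 else 0


print(influence(black_box, rows, labels, "g", ["x"]))  # 0.5
print(influence(black_box, rows, labels, "c", ["x"]))  # 0.0
```

The toy black box never reads `g` directly, yet obscuring `g` costs half its accuracy because `x` acts as a proxy for `g`; obscuring the unrelated `c` costs nothing. This is the kind of indirect influence the abstract refers to. How the paper handles numeric audited features, and which score it reports, may differ from the choices made here.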