Fairness in Machine Learning in Health

Meeting Summary

Dr. Kadija Ferryman
Dr. Marzyeh Ghassemi

On Friday, October 11, 2019, Dr. Marzyeh Ghassemi (University of Toronto) and Dr. Kadija Ferryman (NYU Tandon School of Engineering) hosted an interdisciplinary gathering of scholars focused on the topic of fairness in machine learning in health at Data & Society Research Institute.

Fairness in machine learning in health is a growing area of academic research, which includes identifying problems as well as proposing remedies from technical, ethical, and social scientific standpoints. The overarching goal of this meeting was to bring social scientific and technical perspectives together to build an interdisciplinary community of researchers on fairness and machine learning in health.

Additional goals for this meeting included:
1) Gathering evidence for a white paper that details a set of interdisciplinary priorities and interventions for machine learning and health;
2) Creating one or more conference proposals on this topic.

The outcomes are intended to catalyze the ML, fairness, and health community across disciplines and set an agenda that will guide future work.

Session Summaries

Opening Remarks: Dr. Kadija Ferryman, NYU
Together, we have reservations, concerns, and optimism about the increasing use of machine learning (ML) in health. We hope to identify and address questions of justice and fairness today. Movements like the one formerly known as FAT*ML help focus the conversation, but we need particular attention on Health + ML issues. Thinking of health as a social institution: what kinds of computational + technological rules need to be employed for justice and fairness? We need to look at the social, legal, and ethical dimensions of this computational field, and we need to build trusted interdisciplinary thought partnerships.

Session 1: Evaluating Fairness

Presenter 1.1: Shalmali Joshi, Vector Institute | slides
Title: Evaluating Fairness in ML for Health

Summary: For safer deployment of machine learning in healthcare practice, ML researchers have to think carefully about evaluating ML models for trust, robustness, and fairness. We demonstrate one way in which unfairness can creep into causal effect estimation from observational data and suggest ways to evaluate whether a model is ‘failing gracefully’.


  • Hardt, Price, and Srebro. “Equality of opportunity in supervised learning.” NeurIPS 2016.
  • Kallus, Zhou. “Residual unfairness in fair machine learning from prejudiced data.” ICML 2018.
  • Kusner, Matt J., et al. “Counterfactual fairness.” Advances in Neural Information Processing Systems. 2017.

Presenter 1.2: Berk Ustun, Harvard University | slides
Title: Fairness without Harm: Decoupled Classifiers with Preference Guarantees

Summary: In many medical applications, it is acceptable for machine learning models to make use of group attributes like gender and ethnicity. In this talk, Berk Ustun discusses why “fair” machine learning in these settings should aim to train the best model for each group without harming any group. He introduces preference-based notions of fairness that ensure the “fair use” of group attributes, and outlines methods to learn classification models that satisfy these conditions for real-world problems with a large number of intersectional subgroups.


  • Ustun, Berk, Yang Liu, and David Parkes. “Fairness without harm: Decoupled classifiers with preference guarantees.” International Conference on Machine Learning. 2019.

Presenter 1.3: David Madras, University of Toronto | slides
Title: “Predict Responsibly: Improving Fairness and Accuracy by Learning to Defer” by David Madras, Toniann Pitassi, and Richard Zemel

Summary: In many machine learning applications, there are multiple decision-makers involved, both automated and human. In this talk, David Madras explores how to model the interaction between the decision-makers, and the biases of the resulting decisions.

Presenter 1.4: Rajesh Ranganath, NYU | slides
Title: Black Box Model

Summary: This talk explains how to peer inside a black box model using interpretations that are meaningful in the context of the populations it seeks to represent. By testing population distributions, employing general information measures, and measuring marginal and conditional independence with conditional randomization tests in finite samples, the study examines whether length of stay in hospital is independent of race, gender, and other select variables, measured alongside vitals.
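
To make the idea of testing independence in finite samples concrete, here is a minimal sketch of a permutation test. It is an assumption-laden illustration, not the method from the talk: it checks only marginal independence between a binary group attribute and an outcome such as length of stay, whereas the conditional randomization tests mentioned above also condition on other variables. The function names (`mean_gap`, `permutation_p_value`) and the synthetic data are hypothetical.

```python
import random
from statistics import mean


def mean_gap(values, groups):
    """Absolute difference in mean outcome between group 1 and group 0."""
    g1 = [v for v, g in zip(values, groups) if g == 1]
    g0 = [v for v, g in zip(values, groups) if g == 0]
    return abs(mean(g1) - mean(g0))


def permutation_p_value(values, groups, n_permutations=2000, seed=0):
    """Estimate how often shuffled group labels give a gap at least as
    large as the observed one. Small values suggest the outcome is not
    independent of the group attribute."""
    rng = random.Random(seed)
    observed = mean_gap(values, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any link between group and outcome
        if mean_gap(values, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Synthetic example: length of stay shifted by 1.5 days for group 1.
    data_rng = random.Random(1)
    groups = [i % 2 for i in range(200)]
    length_of_stay = [data_rng.gauss(5.0 + 1.5 * g, 1.0) for g in groups]
    print(permutation_p_value(length_of_stay, groups))
```

A marginal test like this can flag a dependence that disappears once clinical covariates are accounted for, which is one reason conditional tests are of interest in this setting.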

Session 2: The Impacts of Fairness

Presenter 2.1: Harini Suresh, MIT | slides
Title: Deploying decision-aids: real-world considerations

Summary: This talk highlights the sources of downstream harm that arise in ML systems, from dataset creation through model building and deployment. Emphasizing that test performance is not the same as real-world implications, Harini Suresh shares early findings from research measuring human-ML trust among radiologists. The group was shown chest x-rays with recommendations from other radiologists alongside machine learning recommendations, and asked which were more credible, and therefore which diagnoses they trusted and followed. The hope is that this study can help shape the design of better tools and make healthcare decisions more efficient and accurate.

Presenter 2.2: Melissa McCradden, SickKids | slides
Title: Bioethics of Healthcare ML and Precision Medicine: Fairness concerns

Summary: This presentation reviews the bioethical concerns of ML-driven precision medicine initiatives and discusses potential mitigation strategies to promote more beneficial applications of healthcare ML. Melissa McCradden brings a transdisciplinary bioethics lens to AI ethics in healthcare and is interested in explainability: how we deliver on explainability goals and evidence goals, and how we communicate with affected groups, is essential to building trust across cultures with modifiable and non-modifiable risk factors.


  • Ferryman K, Pitcan M. Fairness in precision medicine. Data & Society. 2018 Feb.
  • Char DS, Shah NH, Magnus D. Implementing machine learning in health care—addressing ethical challenges. The New England journal of medicine. 2018 Mar 15;378(11):981.
  • Zorc JJ, Chamberlain JM, Bajaj L. Machine Learning at the Clinical Bedside—The Ghost in the Machine. JAMA pediatrics. 2019 May 13.
  • Lipton ZC. The mythos of model interpretability. arXiv preprint arXiv:1606.03490. 2016 Jun 10.
  • Páez A. The Pragmatic Turn in Explainable Artificial Intelligence (XAI). Minds and Machines. 2019:1-9.
  • Tonekaboni S, Mazwi M, Laussen P, Eytan D, Greer R, Goodfellow SD, Goodwin A, Brudno M, Goldenberg A. Prediction of cardiac arrest from physiological signals in the pediatric ICU. In Machine Learning for Healthcare Conference 2018 Nov 29 (pp. 534-550).
  • Saria S, Butte A, Sheikh A. Better medicine through machine learning: What’s real, and what’s artificial?.
  • Emanuel EJ, Wachter RM. Artificial Intelligence in Health Care: Will the Value Match the Hype?. Jama. 2019 Jun 18;321(23):2281-2.

Presenter 2.3: Irene Chen, MIT | slides
Title: Fixing Disparities in Health with Machine Learning

Summary: The existing healthcare system is rife with health disparities. Machine learning based on observational data has the potential to reproduce and amplify a flawed and unjust system, but it also offers an opportunity for meaningful change. Irene Chen and her team outline steps for machine learning researchers to address algorithmic and systemic bias in health.


  • Need and Goldstein, “Next generation disparities in human genomics: concerns and remedies.” Cell 2009
  • Chen and Wong, “Black Patients Miss Out on Promising Cancer Drugs.” ProPublica 2018.
  • Chen, Szolovits, Ghassemi. “Can AI Help Reduce Disparities in General Medical and Mental Health Care?”. AMA Journal of Ethics 2019.
  • Chen, Johansson, Sontag. “Why is My Classifier Discriminatory?” NeurIPS 2018.
  • Deliu et al, “Identification of Asthma Subtypes Using Clustering Methodologies”, Pulmonary Therapy 2016.

Presenter 2.4: Stephen Pfohl, Stanford | slides
Title: Fair Machine Learning with Electronic Health Records

Summary: Stephen Pfohl reviews two of his recent papers on using techniques from fair machine learning to constrain clinical risk scores so that they satisfy statistical fairness criteria. He further discusses how meaningfully fairness can be assessed in the clinical setting, given the limitations of current approaches.


  • Stephen Pfohl, Ben Marafino, Adrien Coulet, Fatima Rodriguez, Latha Palaniappan, Nigam H. Shah. Creating Fair Models of Atherosclerotic Cardiovascular Disease Risk. AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society, 2019.
  • Stephen Pfohl, Tony Duan, Daisy Yi Ding, Nigam H. Shah. Counterfactual Reasoning for Fair Clinical Risk Prediction. Machine Learning for Healthcare, 2019.
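
As a rough illustration of what a statistical fairness criterion for a thresholded risk score can look like, the sketch below computes per-group true positive and false positive rates and the largest gaps between groups, in the spirit of equalized odds (Hardt, Price, and Srebro). This is an assumed example, not the approach from the papers above; the function names and the toy labels are hypothetical.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Return {group: {"tpr": ..., "fpr": ...}} for binary labels and predictions."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth:
            counts[group]["tp" if pred else "fn"] += 1
        else:
            counts[group]["fp" if pred else "tn"] += 1
    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        rates[group] = {
            # NaN marks a rate that is undefined for this group.
            "tpr": c["tp"] / positives if positives else float("nan"),
            "fpr": c["fp"] / negatives if negatives else float("nan"),
        }
    return rates


def equalized_odds_gaps(rates):
    """Largest between-group difference in TPR and in FPR."""
    tprs = [r["tpr"] for r in rates.values()]
    fprs = [r["fpr"] for r in rates.values()]
    return max(tprs) - min(tprs), max(fprs) - min(fprs)


if __name__ == "__main__":
    # Toy data: predictions are perfect for group "a" but not for group "b".
    y_true = [1, 1, 0, 0, 1, 1, 0, 0]
    y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
    rates = group_rates(y_true, y_pred, groups)
    print(rates)
    print(equalized_odds_gaps(rates))
```

Gaps near zero indicate that the thresholded score errs at similar rates across groups. Whether that is the right criterion for a given clinical use is exactly the kind of question the talk raises.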

Session 3: Intersectional Work: Fair ML in Health

Presenter 3.1: Marzyeh Ghassemi, University of Toronto/Vector Institute | slides
Title: Learning Healthy Models for Health Care

Summary: Should we even be applying ML to health? Ghassemi argues that there is value in using complex models to understand complex data, and that doing so can help improve healthcare for all. ML can do well on specific tasks in several domains. Let’s not dismiss it! Medical professionals show biases that reflect those of society, so in building ML research we must incorporate ways to mitigate bias. Joined by William Boag, she shares findings from their research on disparities in ICU end-of-life care among immigrants in Ontario. Using an ML model trained on hospital notes coded for interpersonal relationships to track racial disparities in treatment, they found differences in treatment when the patient’s gender matches that of the doctor in charge.

Presenter 3.2: Emma Pierson, Stanford
Title: Using machine learning to explain racial and socioeconomic differences in pain.

Summary: Emma Pierson and her team use a machine learning approach to explain the higher levels of knee osteoarthritis pain experienced by black patients and other disadvantaged groups.

Presenter 3.3: Sarah Tan, Cornell | slides
Title: A Tale of Two Risk Scoring Models

Summary: Many risk scoring models are trained, validated, and tested on data that differ from the populations where they are eventually used. How does this affect their performance? Using secondary datasets rather than the primary dataset used to train each model, this research examines accuracy when the models are applied to different populations. It also reports surprising results from comparing two strategies: training on broader data and then personalizing for each country, versus training on a more specific population and applying the resulting risk scoring model broadly.


  • S Tan, R Caruana, G Hooker, Y Lou. “Distill-and-Compare: Auditing Black-Box Models Using Transparent Model Distillation.” AAAI/ACM AI, Ethics, and Society Conference (2018).
  • SAPS-II paper: Le Gall et al., “A New Simplified Acute Physiology Score (SAPS II) Based on a European/North American Multicenter Study”, JAMA (1993)
  • FRAX paper (one of many): Kanis et al., “The use of clinical risk factors enhances the performance of BMD in the prediction of hip and osteoporotic fractures in men and women”, Osteoporos. Int. (2007)

Presenter 3.4: Charles Esenwa, Montefiore Hospital | slides
Title: ‘Stroke Identification in Vulnerable Populations using Claims Data’

Summary: Stroke research and quality improvement using administrative data depend on accurate identification of stroke cases. Despite recent advances in clinical informatics, stroke identification algorithms that use administrative claims datasets vary in accuracy across healthcare systems and patient populations. In the presentation, Esenwa describes two machine learning models that were applied to claims data coded using the International Classification of Diseases (ICD) system to study how accurately patients with acute ischemic stroke can be identified retrospectively.


  • Esenwa, C., Ilunga Tshiswaka, D., Gebregziabher, M. & Ovbiagele, B. Historical Slavery and Modern-Day Stroke Mortality in the United States Stroke Belt. Stroke; a journal of cerebral circulation 49, 465-469, doi:10.1161/STROKEAHA.117.020169 (2018)

Closing Remarks: Dr. Marzyeh Ghassemi, University of Toronto/Vector Institute

Our goals after today are to continue this big conversation about fairness in health and how machine learning plays a part. Looking ahead, we’ll be forming small working groups for future plans.