On Friday, October 11, 2019, Dr. Marzyeh Ghassemi (University of Toronto) and Dr. Kadija Ferryman (NYU Tandon School of Engineering) hosted an interdisciplinary gathering of scholars focused on the topic of fairness in machine learning in health at Data & Society Research Institute.
Fairness in machine learning in health has become a growing area of academic research, both in identifying problems and in proposing remedies from technical, ethical, and social scientific standpoints. The overarching goal of this meeting was to bring social, scientific, and technical perspectives together to build an interdisciplinary community of researchers working on fairness and machine learning in health.
Additional goals for this meeting included:
1) Gathering evidence for a white paper that details a set of interdisciplinary priorities and interventions for machine learning and health;
2) Creating one or more conference proposals on this topic.
The outcomes are intended to catalyze the ML, fairness, and health community across disciplines and set an agenda that will guide future work.
Opening Remarks: Dr. Kadija Ferryman, NYU
Together, we have reservations, concerns, and optimism about the increasing use of machine learning (ML) in health. We hope to identify and address questions of justice and fairness today. Movements like the one formerly known as FAT*ML help focus the conversation, but we need particular attention on health + ML issues. Thinking of health as a social institution: what kinds of computational and technological rules need to be employed for justice and fairness? We need to look at the social, legal, and ethical dimensions of this computational field, and we need to build interdisciplinary networks of trusted thought partners.
Summary: For safer deployment of machine learning in healthcare practice, ML researchers must think carefully about evaluating ML models for trust, robustness, and fairness. We demonstrate a potential way in which unfairness creeps into causal effect estimation from observational data and suggest means of evaluating whether models "fail gracefully".
Summary: In many medical applications, it is acceptable for machine learning models to make use of group attributes like gender and ethnicity. In this talk, Berk Ustun discusses why "fair" machine learning in these settings should aim to train the best model for each group without harming any group. He introduces preference-based notions of fairness that ensure the "fair use" of group attributes, and outlines methods to learn classification models that satisfy these conditions for real-world problems with large numbers of intersectional subgroups.
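A rough sketch of the intuition behind "training the best model for each group without harming any group" compares a pooled classifier against per-group classifiers on synthetic data. This is illustrative only, not Ustun's actual method; all data and names below are assumptions.

```python
# Hypothetical sketch: compare a pooled classifier against per-group
# ("decoupled") classifiers, checking that using the group attribute
# leaves no group worse off. Synthetic data; not the talk's actual method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)          # hypothetical binary group attribute
X = rng.normal(size=(n, 3))
# The outcome depends on the second feature with opposite signs per group,
# so a single pooled model cannot fit both groups well.
logits = X[:, 0] + np.where(group == 1, 2.0, -2.0) * X[:, 1]
y = (logits + rng.normal(size=n) > 0).astype(int)

Xtr, Xte, ytr, yte, gtr, gte = train_test_split(X, y, group, random_state=0)

pooled = LogisticRegression().fit(Xtr, ytr)
accs = {}
for g in (0, 1):
    per_group = LogisticRegression().fit(Xtr[gtr == g], ytr[gtr == g])
    accs[(g, "pooled")] = pooled.score(Xte[gte == g], yte[gte == g])
    accs[(g, "per_group")] = per_group.score(Xte[gte == g], yte[gte == g])
    # "Fair use" check: the per-group model should not leave either
    # group worse off than the pooled model.
    print(f"group {g}: pooled={accs[(g, 'pooled')]:.2f}, "
          f"per-group={accs[(g, 'per_group')]:.2f}")
```

On data like this, where the groups have genuinely different feature-outcome relationships, each group prefers its own model, which is the condition the preference-based notions formalize.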
Summary: In many machine learning applications, there are multiple decision-makers involved, both automated and human. In this talk, David Madras explores how to model the interaction between the decision-makers, and the biases of the resulting decisions.
Summary: This talk explains how to peer inside a black-box model using interpretations that are meaningful in the context of the populations it seeks to represent. By testing population distributions, employing general information measures, and measuring marginal and conditional independence with conditional randomization tests in finite samples, the study examines whether length of stay in hospitals is independent of race, gender, and other select variables, measured alongside vitals.
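The flavor of such a conditional randomization test can be sketched as a within-stratum permutation test. The sketch below uses synthetic data and assumed variable names (it is not the study's actual procedure): to test whether length of stay is independent of a group attribute given severity, it permutes the group labels within severity strata to build a null distribution.

```python
# Hedged sketch of a permutation-style conditional independence test.
# Synthetic data; variable names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
severity = rng.integers(0, 3, n)     # conditioning variable (strata)
group = rng.integers(0, 2, n)        # attribute whose effect is tested
# Length of stay depends on severity only, so it is conditionally
# independent of `group` given `severity` by construction.
los = 2.0 + 1.5 * severity + rng.exponential(1.0, n)

def stat(group, los):
    # Test statistic: absolute difference in mean length of stay.
    return abs(los[group == 1].mean() - los[group == 0].mean())

observed = stat(group, los)

# Null distribution: permuting `group` *within* severity strata preserves
# the group-given-severity distribution while breaking any residual
# association between group and length of stay.
null = []
for _ in range(500):
    perm = group.copy()
    for s in np.unique(severity):
        idx = np.where(severity == s)[0]
        perm[idx] = rng.permutation(perm[idx])
    null.append(stat(perm, los))

p_value = (1 + sum(t >= observed for t in null)) / (1 + len(null))
```

A small p-value would indicate that length of stay varies with the group attribute even after conditioning, which is the kind of dependence the talk's tests are designed to surface.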
Summary: This talk highlights the sources of downstream harm that arise in ML systems, from dataset creation through model building and deployment. Noting that test-set performance is not the same as real-world impact, Harini Suresh shares early findings from research measuring human-ML trust among radiologists. The group was shown chest x-rays with recommendations from other radiologists alongside machine learning recommendations, and asked which they found more credible and which diagnoses they would trust and follow. The hope is that this study can help shape the design of better tools and support more efficient and accurate healthcare decisions.
Summary: This presentation reviews the bioethical concerns raised by ML-driven precision medicine initiatives and discusses potential mitigation strategies to promote more beneficial applications of healthcare ML. Melissa brings a transdisciplinary bioethics lens to AI ethics in healthcare and is interested in explainability: how we deliver explainability and evidence goals, and how we communicate with affected groups, is essential to building trust across cultures with modifiable and non-modifiable risk factors.
Summary: The existing healthcare system is rife with health disparities. Machine learning based on observational data has the potential to create and amplify a flawed and unjust system, but it also has the opportunity for meaningful change. Irene Chen and her team outline steps for machine learning researchers to address algorithmic and systemic bias in health.
Summary: Stephen Pfohl reviews two of his recent papers on the use of techniques from fair machine learning to constrain clinical risk scores to satisfy statistical fairness criteria. He further discusses the capability for meaningful assessment of fairness in the clinical setting in the context of limitations of current approaches.
Summary: Should we even be applying ML to health? Ghassemi argues that there is value in using complex models to understand complex data in ways that can help improve healthcare for all. ML can do well on specific tasks in several domains, so let's not dismiss it. Medical professionals show biases that reflect those of society, and in building ML research we must incorporate ways to mitigate that bias. Joined by William Boag, she shares findings from their research on disparities in end-of-life ICU care among immigrants in Ontario. Using an ML model trained on hospital notes coded for interpersonal relationships to track racial disparities in treatment, they found differences in treatment related to whether patients' gender matched that of the doctors in charge.
Presenter 3.2: Emma Pierson, Stanford
Title: Using machine learning to explain racial and socioeconomic differences in pain.
Summary: Emma Pierson and her team use a machine learning approach to explain the higher levels of knee osteoarthritis pain experienced by black patients and other disadvantaged groups.
Summary: Many risk scoring models are trained, validated, and tested on data different from the data on which they are eventually used. How does this affect their performance? Using secondary datasets rather than the primary set used to train a model, this research examines accuracy when models are applied to different populations, and reports the surprising results of training on broader data and then personalizing for each country versus training on a more specific population and applying the resulting risk scores broadly.
Summary: Stroke research and quality improvement using administrative data depend on accurate identification of stroke cases. Despite recent advances in clinical informatics, stroke identification algorithms that use administrative claims datasets have varying accuracy across healthcare systems and patient populations. In the presentation, Esenwa describes two machine learning models applied to claims data coded with the International Classification of Diseases (ICD) system to study the retrospective identification accuracy for patients with acute ischemic stroke.
Our goal after today is to continue this big conversation about fairness in health and the role machine learning plays in it. Looking ahead, we will form small working groups to plan future efforts.