Hospitals have begun using “decision support tools” powered by artificial intelligence that can diagnose disease, suggest treatment, or predict a surgery’s outcome. But no algorithm is correct all the time, so how do doctors know when to trust the AI’s recommendation?
A new study led by Qian Yang, assistant professor of information science in the Cornell Ann S. Bowers College of Computing and Information Science, suggests that if AI tools can counsel the doctor like a colleague – pointing out relevant biomedical research that supports the decision – then doctors can better weigh the merits of the recommendation.
The researchers will present the new study, “Harnessing Biomedical Literature to Calibrate Clinicians’ Trust in AI Decision Support Systems,” at the Association for Computing Machinery CHI Conference on Human Factors in Computing Systems.
Previously, most AI researchers have tried to help doctors evaluate suggestions from decision support tools by explaining how the underlying algorithm works, or what data was used to train the AI. But an education in how AI makes its predictions wasn’t sufficient, Yang said. Many doctors wanted to know if the AI had been validated in clinical trials, which typically does not happen with these tools.
“A doctor’s primary job is not to learn how AI works,” Yang said. “If we can build systems that help validate AI suggestions based on clinical trial results and journal articles, which are trustworthy information for doctors, then we can help them understand whether the AI is likely to be right or wrong for each specific case.”
To develop this system, the researchers first interviewed nine doctors across a range of specialties, and three clinical librarians. They discovered that when doctors disagree on the right course of action, they track down results from relevant biomedical research and case studies, taking into account the quality of each study and how closely it applies to the case at hand.
Yang and her colleagues built a prototype of their clinical decision tool that mimics this process by presenting biomedical evidence alongside the AI’s recommendation. They used GPT-3, a pre-trained large language model, to find and summarize relevant research. ChatGPT, the better-known offshoot of GPT-3, is tailored for human dialogue.
“We built a system that basically tries to recreate the interpersonal communication that we observed when the doctors give suggestions to each other, and fetches the same kind of evidence from clinical literature to support the AI’s suggestion,” Yang said.
The interface for the decision support tool lists patient information, medical history, and lab test results on one side, with the AI’s personalized diagnosis or treatment suggestion on the other, followed by relevant biomedical studies. In response to doctor feedback, the researchers added a short summary for each study, highlighting details of the patient population, the medical intervention, and the patient outcomes, so doctors can quickly absorb the most important information.
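For readers who think in code, the layout described above can be sketched as a small data structure. This is only an illustration, not the researchers’ implementation: the class names `StudySummary` and `DecisionPanel`, their fields, and the `render` method are all hypothetical, and the sample values are placeholders rather than real patient or study data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StudySummary:
    """Hypothetical record for one retrieved study (names are assumptions)."""
    title: str
    population: str    # who was studied
    intervention: str  # what was done
    outcome: str       # what happened to the patients


@dataclass
class DecisionPanel:
    """Hypothetical container mirroring the interface described in the article."""
    patient_info: str
    ai_suggestion: str
    studies: List[StudySummary] = field(default_factory=list)

    def render(self) -> str:
        # Patient details first, then the AI suggestion, then the evidence.
        lines = [
            f"Patient: {self.patient_info}",
            f"AI suggestion: {self.ai_suggestion}",
            "Relevant studies:",
        ]
        for number, study in enumerate(self.studies, start=1):
            lines.append(f"  {number}. {study.title}")
            lines.append(f"     Population: {study.population}")
            lines.append(f"     Intervention: {study.intervention}")
            lines.append(f"     Outcome: {study.outcome}")
        return "\n".join(lines)


# Placeholder values only; no real clinical data.
panel = DecisionPanel(
    patient_info="<patient information, history, lab results>",
    ai_suggestion="<AI diagnosis or treatment suggestion>",
    studies=[
        StudySummary(
            title="<study title>",
            population="<patient population>",
            intervention="<medical intervention>",
            outcome="<patient outcomes>",
        )
    ],
)
print(panel.render())
```

Keeping the population, intervention, and outcome as separate fields reflects the article’s point that doctors wanted those three details surfaced for each study.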
The research team developed prototype decision support tools for three specialties – neurology, psychiatry, and palliative care – and asked three doctors from each specialty to test out the prototype by evaluating sample cases.
In interviews, doctors said they appreciated the clinical evidence, finding it intuitive and easy to understand, and preferred it to an explanation of the AI’s inner workings.
“It’s a highly generalizable method,” Yang said. This type of approach could work for all medical specialties and other applications where scientific evidence is needed, such as Q&A platforms to answer patient questions or even automated fact checking of health-related news stories. “I would hope to see it embedded in different kinds of AI systems that are being developed, so we can make them useful for clinical practice,” Yang said.
Co-authors on the study include doctoral students Yiran Zhao and Stephen Yang in the field of information science, and Yuexing Hao in the field of human behavior design. Volodymyr Kuleshov, assistant professor at the Jacobs Technion-Cornell Institute at Cornell Tech and in computer science in Cornell Bowers CIS, Fei Wang, associate professor of population health sciences at Weill Cornell Medicine, and Kexin Quan of the University of California, San Diego also contributed to the study.
The researchers received support from the AI2050 Early Career Fellowship and the Cornell and Weill Cornell Medicine’s Multi-Investigator Seed Grants.
By Patricia Waldron, a writer for the Cornell Ann S. Bowers College of Computing and Information Science.