Evaluating Large Language Models on Medical Evidence Summarization

August 24, 2023

The recent success of large language models (LLMs) has led to a shift in natural language processing (NLP) research. Models like GPT-3.5 and ChatGPT show potential for zero-shot medical evidence summarization, in which a model follows human instructions to summarize medical evidence. Zero-shot refers to a setup in which a model performs a task without having been trained on, or shown, examples of that specific task.

Previous studies have only analyzed this capability as it applies to news summarization and biomedical literature abstract generation. In an article in npj Digital Medicine, Dr. Yifan Peng, assistant professor of population health sciences, and colleagues assess the capabilities and limitations of large language models in performing zero-shot medical evidence summarization across six clinical domains.

To measure LLM performance, researchers used the abstracts of a series of Cochrane Reviews, which are systematic reviews of research in health care and health policy. Each abstract presents the key methods, results, and conclusions of the review. Models were tested in two setups. In the first, they were given the entire abstract, excluding the Authors' Conclusions section, to summarize. In the second, they were given only the Objectives and Main Results sections. Researchers compared the model-generated summaries with human-written ones, and human evaluators assessed them for factual consistency, medical harmfulness, comprehensiveness, and coherence.
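The two input setups described above can be illustrated with a minimal Python sketch. It assumes an abstract is already split into a dictionary of named sections. The section labels, function names, and instruction wording are hypothetical and chosen for illustration; they are not the prompts or code used in the study.

```python
from typing import Dict

# Assumed section labels for a structured review abstract (illustrative only).
CONCLUSIONS = "Authors' conclusions"
FOCUSED_SECTIONS = ("Objectives", "Main results")


def full_abstract_input(sections: Dict[str, str]) -> str:
    """Setup 1: every section of the abstract except the conclusions."""
    return "\n\n".join(
        f"{name}: {text}"
        for name, text in sections.items()
        if name != CONCLUSIONS
    )


def focused_input(sections: Dict[str, str]) -> str:
    """Setup 2: only the Objectives and Main Results sections."""
    return "\n\n".join(
        f"{name}: {sections[name]}"
        for name in FOCUSED_SECTIONS
        if name in sections
    )


def build_prompt(source_text: str) -> str:
    """Append a zero-shot instruction (hypothetical wording) to the source text."""
    return f"{source_text}\n\nSummarize the medical evidence above in a few sentences."
```

Either input string would then be passed through `build_prompt` and sent to the model under test. Because the conclusions section is withheld from both inputs, it can serve as a human-written point of comparison.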

Results indicate that summaries generated by LLMs can contain misinterpretation errors and can omit important information, either of which could lead to medical harm. Identifying the most important information in long contexts also remains a challenge for LLMs. For now, human evaluation remains essential for properly assessing the quality and factuality of medical evidence summaries generated by LLMs. The researchers conclude that more effective automatic evaluation methods need to be developed for this field.
