A. Dutta
Texas State University,
United States
Keywords: large language models, biomedical LLMs, LLM evaluation
Summary:
In this era of large language models (LLMs), medical and biomedical LLMs have enormous potential to augment and transform the healthcare industry. However, task-specific evaluation of these models is needed, and evaluation criteria and standards must be established before such models are deployed on real-life cases. Our work focuses on three specific tasks for evaluating three recent biomedical LLMs: MedLM, BioMedGPT, and BioMedLM. We chose three use cases in the healthcare industry: generating easily explainable pathology report summaries for patients, generating easy-to-understand explanations of prescribed drugs/medicines for patients, and serving as a conversational mental-health therapist assistant. Because human evaluators are considered the gold standard, we involved human domain experts in this evaluation. We will ultimately propose a rubric and framework for evaluating these three tasks, and we will also include AutoRater evaluations. The Open Medical-LLM Leaderboard is also considered for the initial evaluation. We additionally explored zero-shot learning, few-shot learning, and fine-tuning approaches with these models.