The U.S. Department of Defense's Chief Digital and Artificial Intelligence Office (CDAO), working with the technology nonprofit Humane Intelligence, has completed the pilot of the Crowdsourced Artificial Intelligence Red-Teaming (CAIRT) Assurance Program. The pilot tested large language model (LLM) chatbots used in military medical services, with the goal of improving military medical care by adhering to the risk management practices required for AI use.
The program's red-team test, which involved more than 200 clinical providers and healthcare analysts, surfaced over 800 potential vulnerabilities and biases in the LLMs under evaluation. These findings will inform policies and best practices for the responsible use of generative AI in military healthcare settings.
The CAIRT Assurance Program aims to build a community of practice around algorithmic evaluations, in collaboration with the Defense Health Agency and the Program Executive Office, Defense Healthcare Management Systems. In addition to the recent red-team test, the program launched an AI bias bounty in 2024 focused on identifying unknown risks in LLMs, beginning with open-source chatbots.
By crowdsourcing insights from a broad set of stakeholders, the program has generated a large body of data that will inform the development of AI policies and practices within the Department of Defense. Continued testing of LLMs and AI systems through the CAIRT Assurance Program will be essential for accelerating AI capabilities and building confidence in AI use cases across the DoD.
Trust is a key factor in the successful integration of AI in clinical care. To ensure that LLMs meet the necessary performance standards and are viewed as useful, transparent, explainable, and secure by healthcare providers, collaboration between clinicians and developers is crucial. This collaborative approach can help identify and eliminate algorithmic bias, ultimately leading to more equitable healthcare delivery.
Dr. Matthew Johnson, the CAIRT program lead, highlighted the initiative's role in paving the way for future research, development, and assurance of generative AI systems within the Department of Defense. By gathering testing data, identifying areas for improvement, and validating mitigation options, the program is laying the groundwork for deploying AI technologies in military healthcare settings.
As we continue to explore the potential of AI in healthcare delivery, it is essential to remain vigilant in addressing bias and ensuring the safe and equitable use of AI algorithms. By working together to predict and mitigate potential areas of bias, clinicians and developers can create AI solutions that enhance patient care while upholding the highest standards of ethical and responsible AI use.