Why Are Results Reported in Aggregate, Without Naming Individual Chatbots?
We do not disclose individual scores for specific chatbots because this is an industry-wide challenge. All major chatbots tested have been found to be vulnerable to repeating false claims. Reporting only aggregated results focuses on the broader systemic challenge rather than singling out a “worst offender.” Disclosing individual results could skew understanding of the problem by drawing undue attention to the lowest-performing model, rather than highlighting the flaws across the AI industry.
Indeed, there is a parallel here with earlier efforts to evaluate social media platforms. Rather than being isolated culprits, social media companies rely on the same deep-rooted, industry-wide design choice: engagement-driven algorithms that incentivize and amplify divisive, sensational, or false content because it maximizes clicks. “Naming and shaming” one tech company at a time distracts from those structural incentives and creates the illusion that the problem could be solved by a single platform, when the vulnerabilities are baked into the entire system.
This same dynamic now applies to generative AI. Our audits have consistently found that the models share common underlying flaws, such as difficulty handling breaking news and data voids, a tendency to cite unreliable sources that present themselves as local news outlets, and limited guardrails against misuse by malign actors. Some models perform better than others, but every model has consistently shown weaknesses. Just as online hoaxes were never unique to, say, Facebook, AI information vulnerabilities are not confined to any one model.
However, as the industry matures and individual chatbots become widely used consumer products, NewsGuard may begin to disclose individual scores.
Why Use Innocent, Leading, and Malign Actor Prompts?
Some observers have suggested that our “leading” or “malign actor” prompts are misleading and designed to provoke falsehoods. However, that is a feature, not a bug. These prompts are designed to reflect real-world use cases. The fact is, users do not always approach chatbots as neutral fact-seekers. Some arrive with agendas or assumptions and turn to chatbots to validate their underlying beliefs, and some bad actors deliberately seek ways to produce and disseminate false claims at scale in service of propaganda. For example, in 2024 the Pravda Network published 3.6 million articles, many of them AI-generated, spreading just 207 false claims and infecting AI models with those frequently repeated falsehoods.
Indeed, NewsGuard’s ongoing monitoring of the information ecosystem has identified these three scenarios in practice: innocent users being misled by inaccurate information, ideological actors citing chatbot outputs in response to their leading prompts to lend credibility to false claims, and foreign actors weaponizing AI to generate propaganda at scale.
Testing across these three personas aims to reflect the full spectrum of actual user behavior relating to news and information. Though the personas vary, they all ultimately expose the same core problem: inadequate protections that fail under real-world conditions.
While there are many audits of generative AI systems, this is the only recurring audit focused specifically on how chatbots handle provably false claims about breaking news, elections, and international conflicts. These are precisely the areas where users — whether curious citizens, confused readers, or malign actors — are most likely to turn to AI systems. For that reason, the use of leading and malign actor prompts is meant to test models under the conditions where the stakes are highest and where bad actors are already trying to exploit them.
Why Do Replications Sometimes Produce Different Results?
Attempting to replicate our audit weeks or months later will likely produce different outputs. This variability is not evidence that the audit was flawed, but rather evidence of how chatbots evolve in real time.
Chatbot responses are shaped by many factors, including:
- Timing: As new debunks and reliable reporting enter the information ecosystem, models begin to cite them. NewsGuard’s audits test each claim during the month it was spreading.
- User factors: Subscription tier, location, language, and prior conversation history can all affect results.
- System updates: AI companies regularly adjust models and guardrails. Our audits capture how chatbots respond at the time false claims are spreading online.
To support transparency and independent verification, NewsGuard has, upon request, provided researchers with more recent false narratives and prompts they can use to verify these vulnerabilities for themselves. NewsGuard invites researchers, journalists, policymakers, and other stakeholders who engage with our audits to send any questions they may have about our methodology, as well as any recommendations, feedback, or opportunities for collaboration. For any such inquiries, or to request access to prompts and examples for verification, contact McKenzie Sadeghi, NewsGuard’s editor for AI and Foreign Influence, at [email protected].