
UC San Diego's D&D Simulation Benchmarks LLMs, Finds Claude 3.5 Leads

UC San Diego researchers used Dungeons & Dragons combat simulations to benchmark LLMs on long-term decision making, finding Claude 3.5 Haiku the top performer ahead of GPT-4 and DeepSeek-V3.

By Jamie Taylor
Source: today.ucsd.edu

A UC San Diego research team built a tool-grounded Dungeons & Dragons simulation to test how large language models handle long-horizon, multiagent decision making in roleplaying combat. By coupling LLMs to a D&D rules engine, the researchers reduced hallucination and enforced mechanics so models had to follow initiative, track hit points and resources, and choose valid actions across multi-step encounters.
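The article doesn't detail the team's actual interface, but the tool-grounding idea can be sketched roughly as follows: the model only proposes an action, while a separate rules engine enumerates the legal options and applies the chosen one to authoritative game state. A minimal sketch in Python, with all names (RulesEngine, run_turn, the placeholder damage roll) hypothetical and for illustration only:

```python
from dataclasses import dataclass, field

@dataclass
class Combatant:
    name: str
    hp: int
    spell_slots: dict = field(default_factory=dict)  # e.g. {1: 2} -> two 1st-level slots

@dataclass
class RulesEngine:
    combatants: dict  # name -> Combatant

    def legal_actions(self, actor: str) -> list[str]:
        """Enumerate mechanically valid actions for this actor's turn."""
        actions = ["attack", "dodge", "disengage"]
        if any(n > 0 for n in self.combatants[actor].spell_slots.values()):
            actions.append("cast_spell")
        return actions

    def apply(self, actor: str, action: str, target: str | None = None) -> None:
        """Mutate game state only for actions the engine has validated."""
        if action == "cast_spell":
            caster = self.combatants[actor]
            level = min(l for l, n in caster.spell_slots.items() if n > 0)
            caster.spell_slots[level] -= 1          # spend the lowest available slot
        if action == "attack" and target is not None:
            self.combatants[target].hp -= 5          # placeholder damage roll

def run_turn(engine: RulesEngine, actor: str, llm_propose) -> str:
    """Ask the model for an action, but reject anything the rules forbid."""
    legal = engine.legal_actions(actor)
    proposal = llm_propose(actor, legal)                  # model sees only legal options
    action = proposal if proposal in legal else legal[0]  # fall back to a safe default
    engine.apply(actor, action, target="goblin")          # placeholder target
    return action

# Minimal usage with a stub "model" that always tries to cast a spell.
engine = RulesEngine(combatants={
    "wizard": Combatant("wizard", hp=10, spell_slots={1: 2}),
    "goblin": Combatant("goblin", hp=7),
})
print(run_turn(engine, "wizard", lambda actor, legal: "cast_spell"))  # cast_spell
print(engine.combatants["wizard"].spell_slots)                        # {1: 1}
```

The point of the split is that even if the model hallucinates an illegal move, the engine never applies it, which is how the simulation keeps mechanics consistent across long encounters.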

The study used canonical D&D combat scenarios pulled from published modules, including Goblin Ambush, the Kennel in Cragmaw Hideout, and Klarg’s Cave, and measured model performance on staying in character, correct action selection, and resource tracking. Over 2,000 experienced D&D players were recruited to play against the model agents, supplying human-versus-AI interaction data and real-world playtesting of emergent behaviors. The work was presented at NeurIPS 2025 and the team published an open review and paper on January 20, 2026.

Among the three evaluated models, Claude 3.5 Haiku emerged as the best performer, with GPT-4 close behind and DeepSeek-V3 trailing. Performance gaps centered on multi-turn planning and bookkeeping: Claude 3.5 Haiku more reliably selected mechanically valid actions and kept better track of persistent quantities such as HP and spell slots across extended fights. GPT-4 showed strong tactical sense but was slightly more prone to small bookkeeping errors, while DeepSeek-V3 produced more inconsistent action choices in complex multiagent exchanges.
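The article doesn't give the exact scoring formula, but one plausible way to quantify such bookkeeping errors is to compare the quantities the model reports each turn against the engine's ground-truth state and count mismatches. A hypothetical sketch, with the log structure assumed rather than taken from the paper:

```python
def bookkeeping_accuracy(turn_logs: list[dict]) -> float:
    """Fraction of turns where the model's tracked HP and spell slots
    match the rules engine's authoritative state."""
    correct = 0
    for log in turn_logs:
        model_state = log["model_state"]    # e.g. {"hp": 9, "spell_slots": {1: 1}}
        engine_state = log["engine_state"]  # same shape, taken from the rules engine
        if (model_state["hp"] == engine_state["hp"]
                and model_state["spell_slots"] == engine_state["spell_slots"]):
            correct += 1
    return correct / len(turn_logs) if turn_logs else 0.0

# Example: a model that drifts by one hit point on the second turn scores 0.5.
logs = [
    {"model_state": {"hp": 9, "spell_slots": {1: 1}},
     "engine_state": {"hp": 9, "spell_slots": {1: 1}}},
    {"model_state": {"hp": 8, "spell_slots": {1: 1}},
     "engine_state": {"hp": 7, "spell_slots": {1: 1}}},
]
print(bookkeeping_accuracy(logs))  # 0.5
```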

Beyond raw scores, the simulations produced personality-like behaviors from AI agents that matter at the tabletop. Players reported emergent taunting and roleplay from goblin agents, and the environment captured distinctions between opportunistic monster tactics and more cautious, resource-aware foes. Because the environment enforces rules via a separate engine, these behaviors did not rely on model hallucination of mechanics, making them more stable and predictable for use as NPCs or DM assistants.


For players and DMs, this research offers practical signals. Tool-grounding shows a way to get AI to respect action economy and resource limits, which helps preserve challenge balance and prevents absurd rule-breaking. For designers of AI-powered campaign tools and automated encounter suites, the benchmark provides a measurable testbed for long-term planning and multiagent coordination. The large human playtest pool strengthens the case that these results generalize beyond toy examples.

Next steps outlined by the team include expanding the environment toward full campaign simulation and applying the method to other long-horizon multiagent tasks. That means you can expect progressively more competent AI NPCs and co-DM tools that actually track HP, conditions, and spell resources across sessions, while still needing a human eye on tricky rulings and campaign-level continuity. The open review and paper contain methodology details and example runs for anyone who wants to dig into the benchmarks or try similar experiments at their table.
