
UC San Diego's D&D Simulation Benchmarks LLMs, Finds Claude 3.5 Leads

UC San Diego researchers used Dungeons & Dragons combat simulations to benchmark LLMs on long-term decision making, finding Claude 3.5 Haiku the top performer ahead of GPT-4 and DeepSeek-V3.

By Jamie Taylor
Source: today.ucsd.edu

A UC San Diego research team built a tool-grounded Dungeons & Dragons simulation to test how large language models handle long-horizon, multiagent decision making in roleplaying combat. By coupling LLMs to a D&D rules engine, the researchers reduced hallucination and enforced mechanics so models had to follow initiative, track hit points and resources, and choose valid actions across multi-step encounters.
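The article doesn't detail the team's actual interface, but the tool-grounding idea can be sketched roughly as follows: the model only proposes an action, while a separate rules engine enumerates the legal options and applies the chosen one to authoritative game state. A minimal sketch in Python, with all names (RulesEngine, run_turn, the placeholder damage roll) hypothetical and for illustration only:

```python
from dataclasses import dataclass, field

@dataclass
class Combatant:
    name: str
    hp: int
    spell_slots: dict = field(default_factory=dict)  # e.g. {1: 2} -> two 1st-level slots

@dataclass
class RulesEngine:
    combatants: dict  # name -> Combatant

    def legal_actions(self, actor: str) -> list[str]:
        """Enumerate mechanically valid actions for this actor's turn."""
        actions = ["attack", "dodge", "disengage"]
        if any(n > 0 for n in self.combatants[actor].spell_slots.values()):
            actions.append("cast_spell")
        return actions

    def apply(self, actor: str, action: str, target: str | None = None) -> None:
        """Mutate game state only for actions the engine has validated."""
        if action == "cast_spell":
            caster = self.combatants[actor]
            level = min(l for l, n in caster.spell_slots.items() if n > 0)
            caster.spell_slots[level] -= 1          # spend the lowest available slot
        if action == "attack" and target is not None:
            self.combatants[target].hp -= 5          # placeholder damage roll

def run_turn(engine: RulesEngine, actor: str, llm_propose) -> str:
    """Ask the model for an action, but reject anything the rules forbid."""
    legal = engine.legal_actions(actor)
    proposal = llm_propose(actor, legal)                  # model sees only legal options
    action = proposal if proposal in legal else legal[0]  # fall back to a safe default
    engine.apply(actor, action, target="goblin")          # placeholder target
    return action

# Minimal usage with a stub "model" that always tries to cast a spell.
engine = RulesEngine(combatants={
    "wizard": Combatant("wizard", hp=10, spell_slots={1: 2}),
    "goblin": Combatant("goblin", hp=7),
})
print(run_turn(engine, "wizard", lambda actor, legal: "cast_spell"))  # cast_spell
print(engine.combatants["wizard"].spell_slots)                        # {1: 1}
```

The point of the split is that even if the model hallucinates an illegal move, the engine never applies it, which is how the simulation keeps mechanics consistent across long encounters.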

The study used canonical D&D combat scenarios pulled from published modules, including Goblin Ambush, the Kennel in Cragmaw Hideout, and Klarg’s Cave, and measured model performance on staying in character, correct action selection, and resource tracking. Over 2,000 experienced D&D players were recruited to play against the model agents, supplying human-versus-AI interaction data and real-world playtesting of emergent behaviors. The work was presented at NeurIPS 2025 and the team published an open review and paper on January 20, 2026.

Among the three evaluated models, Claude 3.5 Haiku emerged as the best performer, with GPT-4 close behind and DeepSeek-V3 trailing. Performance gaps centered on multi-turn planning and bookkeeping: Claude 3.5 Haiku more reliably selected mechanically valid actions and kept better track of persistent quantities such as HP and spell slots across extended fights. GPT-4 showed strong tactical sense but was slightly more prone to small bookkeeping errors, while DeepSeek-V3 produced more inconsistent action choices in complex multiagent exchanges.
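The article doesn't give the exact scoring formula, but one plausible way to quantify such bookkeeping errors is to compare the quantities the model reports each turn against the engine's ground-truth state and count mismatches. A hypothetical sketch, with the log structure assumed rather than taken from the paper:

```python
def bookkeeping_accuracy(turn_logs: list[dict]) -> float:
    """Fraction of turns where the model's tracked HP and spell slots
    match the rules engine's authoritative state."""
    correct = 0
    for log in turn_logs:
        model_state = log["model_state"]    # e.g. {"hp": 9, "spell_slots": {1: 1}}
        engine_state = log["engine_state"]  # same shape, taken from the rules engine
        if (model_state["hp"] == engine_state["hp"]
                and model_state["spell_slots"] == engine_state["spell_slots"]):
            correct += 1
    return correct / len(turn_logs) if turn_logs else 0.0

# Example: a model that drifts by one hit point on the second turn scores 0.5.
logs = [
    {"model_state": {"hp": 9, "spell_slots": {1: 1}},
     "engine_state": {"hp": 9, "spell_slots": {1: 1}}},
    {"model_state": {"hp": 8, "spell_slots": {1: 1}},
     "engine_state": {"hp": 7, "spell_slots": {1: 1}}},
]
print(bookkeeping_accuracy(logs))  # 0.5
```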

Beyond raw scores, the simulations produced personality-like behaviors from AI agents that matter at the tabletop. Players reported emergent taunting and roleplay from goblin agents, and the environment captured distinctions between opportunistic monster tactics and more cautious, resource-aware foes. Because the environment enforces rules via a separate engine, these behaviors did not rely on model hallucination of mechanics, making them more stable and predictable for use as NPCs or DM assistants.


For players and DMs, this research offers practical signals. Tool-grounding shows a way to get AI to respect action economy and resource limits, which helps preserve challenge balance and prevents absurd rule-breaking. For designers of AI-powered campaign tools and automated encounter suites, the benchmark provides a measurable testbed for long-term planning and multiagent coordination. The large human playtest pool strengthens the case that these results generalize beyond toy examples.

Next steps outlined by the team include expanding the environment toward full campaign simulation and applying the method to other long-horizon multiagent tasks. That means you can expect progressively more competent AI NPCs and co-DM tools that actually track HP, conditions, and spell resources across sessions, while still needing a human eye on tricky rulings and campaign-level continuity. The open review and paper contain methodology details and example runs for anyone who wants to dig into the benchmarks or try similar experiments at their table.
