monday.com cuts AI-agent test cycles 8.7x after LangSmith integration
monday.com says integrating LangSmith cut AI-agent evaluation feedback loops 8.7x, from 162 seconds to 18 seconds, and expanded testing to hundreds of examples in minutes instead of hours.

monday.com reports a sharp acceleration in how its engineering teams evaluate AI agents after integrating LangSmith into monday Service, its AI Native Enterprise Service Management platform. The guest post on the LangChain Blog, describing work led by Group Tech Lead Gal Ben Arieh, says the change produced "Speed: 8.7x faster evaluation feedback loops (from 162 seconds to 18 seconds)" and "Coverage: Comprehensive testing across hundreds of examples in minutes instead of hours."
The details appear in the LangChain Blog case study titled "monday Service + LangSmith: Building a Code-First Evaluation Strategy from Day 1," published as a guest post credited to monday.com on Feb. 18, 2026. The post frames the work as an engineering effort across monday.com’s global R&D org and credits Gal Ben Arieh for driving the eval strategy for the customer-facing AI service agents.
monday Service is described in the post as "an AI Native Enterprise Service Management (ESM) platform" and the core automation is presented as an "AI service workforce" — a "customizable, LangGraph-based, ReAct agent" intended to "automate and resolve inquiries across any enterprise service management use case." The guest post lists IT, HR, and Legal as example departments where customers can "tailor the agent to drive execution within any service department, by utilizing their own KB articles and tools."
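For readers unfamiliar with the stack, an agent of this kind can be assembled with LangGraph's prebuilt ReAct helper. The sketch below is illustrative only: the KB-search tool, the model choice, and the sample question are assumptions, not monday.com's implementation.

```python
# Minimal sketch of a LangGraph ReAct agent wired to a hypothetical KB-search tool.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent


@tool
def search_kb(query: str) -> str:
    """Search the customer's knowledge-base articles for relevant passages."""
    # Placeholder: a real deployment would query the tenant's KB index here.
    return f"Top KB passages for: {query}"


# create_react_agent builds the standard tool-calling ReAct loop as a LangGraph graph.
agent = create_react_agent(ChatOpenAI(model="gpt-4o-mini"), tools=[search_kb])

result = agent.invoke({"messages": [("user", "How do I reset my VPN access?")]})
print(result["messages"][-1].content)
```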
The engineering process change at the center of the post is explicit: the team "embedded evaluations into the development cycle from the start" and declared that "Many teams treat evaluation as a last-mile check, but we made it a Day 0 requirement." That shift is paired with an "Evals as code" practice: "Evaluation logic managed as version-controlled production code with GitOps-style CI/CD deployment," the post states, meaning evaluation logic lives in the same repositories and deployment pipelines as the application code.
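In practice, "evals as code" usually means the dataset references, evaluators, and scoring logic sit in the repository and run from CI like any other test suite. The sketch below shows one way that can look with the LangSmith Python SDK; the dataset name, the stubbed target, and the scoring rule are assumptions, not monday.com's pipeline.

```python
# Hedged sketch of an evals-as-code run executed from CI against a LangSmith dataset.
from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run


def target(inputs: dict) -> dict:
    # In CI this would call the deployed service agent; stubbed for illustration.
    return {"answer": f"Resolved: {inputs['question']}"}


def resolves_ticket(run: Run, example: Example) -> dict:
    # Programmatic check, versioned in the repo next to the agent code.
    expected = example.outputs["expected"]
    answer = run.outputs["answer"]
    return {"key": "resolves_ticket", "score": int(expected.lower() in answer.lower())}


# Each CI run becomes a named experiment against a dataset stored in LangSmith.
results = evaluate(
    target,
    data="service-agent-regression",  # hypothetical dataset name
    evaluators=[resolves_ticket],
    experiment_prefix="ci-eval",
)
```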
Observability in production is also emphasized. The guest post describes "Agent observability: Real-time, end-to-end quality monitoring on production traces, using Multi-Turn Evaluators." The post positions Multi-Turn Evaluators as the mechanism to watch agent behavior on live traces rather than waiting for user-reported failures during Alpha or later testing.
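The post does not detail how those evaluators are configured, but the underlying idea, scoring a whole conversation from a production trace rather than a single response, can be illustrated with a custom LangSmith evaluator. Everything below, including the message shape and the escalation heuristic, is a hypothetical example rather than LangSmith's built-in Multi-Turn Evaluator feature.

```python
# Illustration only: a custom evaluator that scores an entire traced conversation.
from langsmith.schemas import Example, Run


def no_unresolved_escalation(run: Run, example: Example | None = None) -> dict:
    """Flag traces where the user asked for a human and the agent never escalated."""
    messages = (run.outputs or {}).get("messages", [])
    asked_for_human = any(
        m.get("role") == "user" and "human" in m.get("content", "").lower()
        for m in messages
    )
    escalated = any(
        m.get("role") == "assistant" and "escalat" in m.get("content", "").lower()
        for m in messages
    )
    score = 0 if (asked_for_human and not escalated) else 1
    return {"key": "no_unresolved_escalation", "score": score}
```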
While the post provides concrete timing and coverage figures, it does not name customers, disclose repository URLs, or quantify downstream effects on release cadence. It also does not document the timing methodology behind the "162 seconds to 18 seconds" comparison, so the sample sizes and exactly what was measured remain unverified.
If the reported 8.7x reduction in evaluation feedback loops and the shift to evals-as-code hold up under scrutiny, monday.com’s Day 0 requirement and LangSmith integration position monday Service to detect AI quality issues earlier and run far broader automated tests per iteration. The guest post concludes that the framework was designed "to catch AI quality issues before our users do," a claim engineers and product managers at monday.com will now need to operationalize and validate across customer deployments.