
Definity Embeds AI Agents in Spark Pipelines for Real-Time Intervention

Data pipeline failures don't just break dashboards anymore - they break AI systems. Definity embeds agents inside Spark pipelines to catch problems during execution, not after.


Why Are Data Pipeline Failures Now an AI Problem?


Data pipeline failures used to mean delayed reports and frustrated analysts. Today, they mean broken AI systems and direct business impact. As enterprises deploy agentic AI that depends on fresh, accurate data, the stakes for pipeline reliability have fundamentally changed.

Definity, a Chicago-based data pipeline operations startup, is addressing this shift by embedding agents directly inside Spark and dbt drivers to act during pipeline runs, not after they complete. The company announced Wednesday that it raised $12 million in Series A financing led by GreatPoint Ventures, with participation from Dynatrace and existing investors StageOne Ventures and Hyde Park Venture Partners.

One enterprise customer identified 33% of its optimization opportunities in the first week of deployment and cut troubleshooting effort by 70%. The company also claims customers resolve complex Spark issues up to 10x faster than traditional monitoring approaches.

What Monitoring Gap Can't Traditional Tools Close?

Existing pipeline monitoring tools operate from outside the execution layer. Platforms like Datadog, Databricks system tables, Unravel Data, and Acceldata read metrics after a job completes. By that time, the failure, wasted compute, or corrupted data has already propagated downstream.

"It's always after the fact," Roy Daniel, CEO and co-founder of Definity, told VentureBeat in an exclusive interview. "By the time you know something happened, it already happened."

For data engineering teams, this reactive approach creates a familiar pattern: wait for an alert, manually trace failures across distributed jobs and clusters, then fix problems that have already impacted the business. Agentic AI systems that depend on timely, clean data can't tolerate this lag.

A pipeline that fails silently or delivers stale data doesn't just break a dashboard. It breaks the AI system depending on it.

What Makes Agentic Data Operations Different?

Daniel identifies three requirements for effective agentic data operations that traditional monitoring can't deliver:

  • Full-stack context that is real-time and production-aware
  • Direct control of the pipeline during execution
  • Ability to validate changes in a feedback loop

"Without that, you can be outside looking in and read only," Daniel explained.

How Do In-Execution Agents Work Inside Your Pipelines?


Definity's architectural approach differs fundamentally from external monitoring platforms. The agent sits inside the pipeline rather than watching from outside it.

Inline Instrumentation


The Definity system installs a JVM agent directly inside the pipeline execution layer via a single line of code. It runs below the platform layer and pulls execution data directly from Spark while the job runs.
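
In Spark terms, attaching a JVM agent typically means passing a -javaagent flag to the driver and executor JVMs. Here is a minimal sketch of what that looks like; the agent jar name and path are illustrative placeholders, not Definity's actual packaging, while the two Spark configs are standard:

```python
from pyspark.sql import SparkSession

# Hypothetical example: attach an in-execution agent to the Spark JVMs.
# "definity-agent.jar" and its path are illustrative placeholders.
# spark.driver.extraJavaOptions / spark.executor.extraJavaOptions are
# standard Spark configs that pass raw JVM flags, including -javaagent.
AGENT = "-javaagent:/opt/agents/definity-agent.jar"

spark = (
    SparkSession.builder
    .appName("orders_daily")
    .config("spark.driver.extraJavaOptions", AGENT)
    .config("spark.executor.extraJavaOptions", AGENT)
    .getOrCreate()
)
```

The same flag can be supplied through spark-submit --conf, which is what makes a "single line of code" deployment plausible: nothing in the pipeline's own logic changes.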

Execution Context During the Run

The agent captures query execution behavior, memory pressure, data skew, shuffle patterns, and infrastructure utilization in real time. It also infers lineage between pipelines and tables dynamically, eliminating the need for predefined data catalogs.
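
As a rough illustration of the kind of signal available at this layer, Spark already exposes per-stage task metrics through its built-in monitoring REST API; an embedded agent sees the same data in-process, without HTTP polling. A sketch of skew detection over that public API follows, where the endpoint paths are Spark's own and the 5x ratio is an arbitrary threshold chosen for the example:

```python
import requests

# Illustrative only: read Spark's monitoring REST API (served by the
# driver UI, default port 4040) to flag skewed stages.
DRIVER_API = "http://localhost:4040/api/v1"

def skewed_stages(app_id: str, factor: float = 5.0):
    """Return (stageId, p95/median run-time ratio) for suspiciously skewed stages."""
    stages = requests.get(f"{DRIVER_API}/applications/{app_id}/stages",
                          params={"status": "complete"}, timeout=10).json()
    flagged = []
    for st in stages:
        summary = requests.get(
            f"{DRIVER_API}/applications/{app_id}/stages/"
            f"{st['stageId']}/{st['attemptId']}/taskSummary", timeout=10).json()
        # Default quantiles are [0.05, 0.25, 0.5, 0.75, 0.95]
        median, p95 = summary["executorRunTime"][2], summary["executorRunTime"][4]
        if median > 0 and p95 / median >= factor:
            flagged.append((st["stageId"], p95 / median))
    return flagged
```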

Intervention, Not Just Observation

This is where the approach diverges most sharply from traditional monitoring. The agent can modify resource allocation mid-run, stop a job before bad data propagates, or preempt a pipeline based on upstream data conditions.

Daniel described one production deployment where the agent detected that an upstream job had been preempted, leaving the input table it was supposed to write stale. The system stopped the downstream pipeline before it started, preventing bad data from reaching any dependent system.
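
A pre-flight freshness gate of that kind can be sketched in a few lines. This is not Definity's implementation; the table name, the two-hour SLA, and the use of Delta Lake's DESCRIBE DETAIL command are all assumptions for illustration:

```python
from datetime import datetime, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("freshness_gate").getOrCreate()

def is_stale(table: str, max_age: timedelta) -> bool:
    # DESCRIBE DETAIL is a Delta Lake command; its "lastModified" column
    # holds the table's last write time (in the session time zone, which
    # this sketch assumes matches the driver's local clock).
    last_modified = spark.sql(f"DESCRIBE DETAIL {table}").first()["lastModified"]
    return datetime.now() - last_modified > max_age

# Hypothetical table and SLA for illustration.
if is_stale("warehouse.orders_raw", timedelta(hours=2)):
    raise SystemExit("Upstream input is stale; aborting before bad data propagates.")
```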

Performance and Data Residency

The agent adds approximately one second of compute overhead on an hour-long run. Only metadata is transmitted externally. Full on-premises deployment is available for environments where no metadata can leave the perimeter.

Detection and prevention operate in real time. Root cause analysis and optimization recommendations run on demand when an engineer queries the assistant, with full execution context already assembled.

How Did Nexxen Cut Troubleshooting by 70%?

Nexxen, an ad tech platform running large-scale Spark pipelines for mission-critical advertising workloads, deployed Definity to address a different problem than most monitoring tools target.

"The main challenge wasn't about pipelines breaking, but about managing an increasingly complex and large-scale environment," Dennis Meyer, Director of Data Engineering at Nexxen, told VentureBeat. "Because we operate on-prem, we don't have the flexibility of instant elasticity, so inefficiencies have a direct cost impact."

Existing monitoring tools gave Nexxen partial visibility but not enough to act on systematically. "We had existing monitoring tools in place, but needed full-stack visibility to understand workload behavior holistically and to systematically prioritize optimizations," Meyer said.

Nexxen deployed Definity with no pipeline code changes. The team identified 33% of its optimization opportunities within the first week.

Engineering effort on troubleshooting and optimization dropped by 70%. The platform freed infrastructure capacity, allowing the team to support workload growth without additional hardware investment.

"The key shift was moving from reactive troubleshooting to proactive, continuous optimization," Meyer said. "At scale, the biggest gap often isn't tooling - it's actionable visibility."

What Does This Mean for Enterprise Data Strategy?

The shift from reactive monitoring to in-execution intelligence carries implications beyond the technical architecture.

Pipeline Operations Is Now AI Infrastructure

Data pipelines that previously supported analytics now carry AI workloads with direct business dependencies. Failures that were once an inconvenience now block production AI delivery. As enterprises scale agentic AI deployments, pipeline reliability becomes a competitive differentiator.

Troubleshooting Time Is a Recoverable Cost

For teams running lean, the 70% reduction in troubleshooting and optimization effort that Nexxen achieved represents time that can return to the roadmap. This is the most direct near-term business case for evaluating this category of tooling.

The Economics of On-Premises Efficiency

For organizations operating on-premises infrastructure, inefficiencies have direct cost impact without the buffer of elastic cloud capacity. The ability to identify optimization opportunities systematically changes the economics of scaling data workloads.

From Reactive to Proactive Operations

The organizational shift matters as much as the technical one. Teams move from firefighting to continuous optimization.

Engineers spend less time tracing failures across distributed systems and more time building capabilities that advance business objectives.

How Should You Evaluate In-Execution Monitoring for Your Team?

If you're considering this approach for your data infrastructure, start with these questions:

What Percentage of Your Engineering Time Goes to Pipeline Troubleshooting?

If senior engineers spend significant time tracing failures or optimizing underperforming jobs, calculate that cost. Compare it to the investment in tooling that could automate that work.
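
A back-of-envelope version of that calculation, with placeholder numbers: the team size, loaded cost, and time share below are assumptions, and only the 70% figure comes from the Nexxen case above.

```python
# All inputs are illustrative assumptions except the 70% reduction,
# which is the figure Nexxen reported.
engineers = 6
loaded_cost_per_engineer = 200_000   # USD/year, fully loaded (assumption)
troubleshooting_share = 0.25         # fraction of time spent firefighting (assumption)

annual_troubleshooting_cost = engineers * loaded_cost_per_engineer * troubleshooting_share
recoverable = annual_troubleshooting_cost * 0.70

print(f"Troubleshooting cost: ${annual_troubleshooting_cost:,.0f}/yr; "
      f"potentially recoverable: ${recoverable:,.0f}/yr")
```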

Do You Have AI Systems Depending on Pipeline Reliability?

If you're deploying or planning agentic AI that requires fresh, accurate data, evaluate your current ability to prevent failures before they propagate. Traditional monitoring may not be sufficient.

Are You Capacity-Constrained?

For on-premises environments or teams hitting cloud budget limits, systematic optimization that frees infrastructure capacity has direct financial impact.

Can You Tolerate Deployment Friction?

Solutions requiring extensive pipeline code changes face adoption barriers. Single-line deployment reduces implementation risk.

Where Does Definity Fit in the Competitive Landscape?

Definity enters a market with established players like Datadog (which acquired Metaplane), Databricks, Unravel Data, and Acceldata. Dynatrace's participation in the Series A is notable given its own monitoring capabilities.

The differentiation centers on architectural positioning: inside the execution layer versus observing from outside. Whether that architectural difference translates to sustainable competitive advantage depends on whether in-execution intervention becomes table stakes for AI-dependent pipelines.

The timing aligns with enterprise AI adoption. As more organizations deploy production AI systems with hard dependencies on data pipelines, the tolerance for reactive monitoring decreases. The market opportunity expands as the downstream cost of pipeline failures increases.

What Should Data Engineering Leaders Remember?

The shift to in-execution intelligence represents a fundamental change in how enterprises approach pipeline reliability. Traditional monitoring tools tell you what went wrong after it happened. In-execution agents prevent failures before they propagate.

For organizations deploying agentic AI, this capability moves from nice-to-have to essential. The business impact of pipeline failures has changed. The tooling needs to change with it.

The early results reported by Definity and its customers are compelling: a 70% reduction in troubleshooting effort, 33% of optimization opportunities identified in the first week, and up to 10x faster resolution of complex Spark issues. These metrics suggest that in-execution monitoring can deliver measurable ROI for enterprise data teams.



As AI systems become more deeply embedded in business operations, the infrastructure supporting them requires the same level of reliability and observability. Data pipeline operations is no longer just about keeping dashboards current. It's about keeping AI systems running.
