
SLS Copilot: Building Flexible Data Infrastructure for LLMs

Learn how to optimize LLM applications with a robust data infrastructure using SLS Copilot, addressing key observability challenges.


Kevin Liu

November 21, 2025

Introduction

As large language model (LLM) applications evolve, the spotlight often lands on refining and adding features. Yet a vital question is frequently missed: how do you effectively monitor, diagnose, and optimize a live LLM application? This article delves into the engineering practices and insights we've gathered from developing the SLS SQL Copilot, and shows how to build a solid data infrastructure for LLM applications with Alibaba Cloud SLS (Simple Log Service).

What Challenges Do We Face in Observability for LLM Application Development?

The Rise and Limitations of the Dify Platform

Dify has become a favored platform for building LLM applications, thanks to its visual workflow design and rich plugin ecosystem. Our team chose Dify to create our SQL Copilot application, which helps users generate and analyze SQL queries intelligently. However, we quickly hit a major hurdle: Dify's lack of advanced observability features. As an observability team, the irony wasn't lost on us that we struggled to use the very tools we had developed, which led us to establish a more comprehensive data infrastructure for our LLM applications.

The Business Complexity of SQL Copilot

  • Collaboration among multiple subsystems: Our architecture integrates subsystems for analyzing requirements, generating SQL, validating quality, and diagnosing SQL.
  • Complex workflows: The layered, nested Dify workflows mean a single user request can initiate multiple child processes.
  • Dynamic content embedding: The extensive embedding of context and retrieval-augmented generation (RAG) adds complexity to system prompts, incorporating large volumes of dynamic content from the knowledge base.
  • High concurrency: The application must handle real-time query generation requests from many users simultaneously.

These complexities rendered Dify's built-in observability tools insufficient for our needs.

What Are the Three Major Observability Shortcomings of Dify?

1. Limited Query Capabilities

Dify's built-in query capabilities fall short, allowing searches only by user ID or session ID. Our requirements included:

  • Keyword search: Quickly finding relevant sessions using keywords.
  • Multidimensional queries: Combining queries across dimensions like time range, error type, etc.
  • Fuzzy matching: Searching for SQL statement fragments and error messages with fuzzy logic.

This limitation made pinpointing production issues a time-consuming task, often taking minutes just to find specific records.

2. Lack of Trace Analysis

Tracing nested workflows and data flow was a hassle due to:

  • Challenges in tracing nested workflows: It was hard to grasp the details of child workflows from the parent run.
  • Tedious tracing process: Reconstructing a request required running separate queries for each step, complicating and lengthening troubleshooting.

A single user request could trigger multiple steps, and if the output was unexpected, inspecting each step could take over 30 minutes; the sketch below shows the kind of correlation we were missing.
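
A minimal sketch of the idea, assuming each workflow step is logged as a flat JSON record: a shared trace ID is propagated into every child workflow, so one query over trace_id reconstructs the whole chain. The helpers (start_trace, emit_step_log) and field names are illustrative, not Dify or SLS APIs.

```python
import json
import time
import uuid
from contextvars import ContextVar

# Trace ID shared by a user request and every nested child workflow it spawns.
current_trace_id: ContextVar[str] = ContextVar("current_trace_id", default="")

def start_trace() -> str:
    """Call once at the entry workflow; child workflows inherit the same ID."""
    trace_id = uuid.uuid4().hex
    current_trace_id.set(trace_id)
    return trace_id

def emit_step_log(workflow: str, step: str, payload: dict) -> str:
    """Serialize one workflow step as a flat, queryable log line keyed by trace_id."""
    record = {
        "trace_id": current_trace_id.get(),
        "workflow": workflow,
        "step": step,
        "timestamp": int(time.time()),
        **payload,
    }
    return json.dumps(record, ensure_ascii=False)

# Parent and child steps share one trace_id, so a single
# "trace_id: <value>" query returns the full nested execution.
start_trace()
print(emit_step_log("sql_copilot_main", "analyze_requirement", {"status": "ok"}))
print(emit_step_log("sql_generation_child", "generate_sql", {"status": "ok"}))
```

With records like these collected centrally, reconstructing a nested run becomes a single lookup instead of one query per child workflow.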

3. Unformatted Content Display

Dify's interface had several drawbacks:

  • Poor prompt readability: Context embeddings and RAG content appeared as unwieldy text blocks.
  • No smart format parsing: Input and output data in various formats were shown as raw strings.

These issues slowed down reading and comprehension, making problem diagnosis less efficient; the helper below sketches the kind of format-aware rendering we wanted.
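
A minimal sketch of the idea, not Dify functionality: detect JSON payloads and pretty-print them, and fall back to the raw text for everything else.

```python
import json

def render_payload(raw: str) -> str:
    """Pretty-print JSON payloads for the diagnosis view; fall back to raw text."""
    try:
        return json.dumps(json.loads(raw), indent=2, ensure_ascii=False)
    except (json.JSONDecodeError, TypeError):
        return raw

print(render_payload('{"sql": "SELECT 1", "status": "ok"}'))  # rendered as indented JSON
print(render_payload("plain prompt text stays untouched"))     # returned as-is
```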

What Infrastructure Challenges Lie Behind These Observability Pain Points?

1. Architectural Scalability

The increase in data volume and traffic spikes posed significant challenges:

  • Resource elasticity: PostgreSQL struggled with high concurrency, leading to latency issues.
  • Upgrade complexity: Scaling out online was difficult and risky, impacting performance.

2. Data Processing Capability

PostgreSQL faced bottlenecks with multidimensional queries and real-time analysis, particularly over large volumes of natural language text. We needed efficient and flexible ad hoc queries, which the Dify-plus-PostgreSQL stack couldn't provide.

3. Diverse Data Requirements

As LLM applications grew, different teams needed various data access, which Dify's limited capabilities couldn't satisfy.

How Did We Upgrade with an SLS-based Data Infrastructure?

Realizing that mere feature enhancements wouldn't cut it, we opted for a complete overhaul of our data infrastructure using SLS. The key benefits included:

  • Architectural scalability: SLS offers cloud-native, elastic scaling without interrupting business operations.
  • Advanced data processing: SLS excels in managing log-like data, supporting full-text search, multidimensional queries, and flexible analyses.
  • Diverse data access: SLS enables various data query and processing methods, catering to the needs of different teams.

Architectural Redesign

We implemented a dual-write strategy, sketched in the example after this list:

  • Business security: PostgreSQL remains our primary datastore, safeguarding core functions.
  • Data plane decoupling: Asynchronous writing to SLS boosts observability without affecting performance.
  • Functional load separation: PG handles online execution, while SLS takes care of offline analysis.
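
Below is a minimal sketch of that dual-write path, assuming the aliyun-log-python-sdk for the SLS side; the endpoint, credentials, project, logstore, and record fields are placeholders, and the PostgreSQL write stands in for our existing data access layer.

```python
from concurrent.futures import ThreadPoolExecutor

from aliyun.log import LogClient, LogItem, PutLogsRequest  # pip install aliyun-log-python-sdk

# Placeholder endpoint, credentials, and resource names.
client = LogClient("cn-hangzhou.log.aliyuncs.com", "ACCESS_KEY_ID", "ACCESS_KEY_SECRET")
PROJECT, LOGSTORE = "sql-copilot", "workflow-logs"
executor = ThreadPoolExecutor(max_workers=4)

def write_to_postgres(record: dict) -> None:
    """Primary, synchronous write; business correctness depends on this path."""
    ...  # INSERT via the existing PostgreSQL access layer

def write_to_sls(record: dict) -> None:
    """Secondary, asynchronous write used only for observability and analysis."""
    item = LogItem()
    item.set_contents([(k, str(v)) for k, v in record.items()])
    try:
        client.put_logs(PutLogsRequest(PROJECT, LOGSTORE, "", "", [item]))
    except Exception:
        pass  # the observability copy must never fail the business path

def save_workflow_record(record: dict) -> None:
    write_to_postgres(record)                # data plane of record stays on PG
    executor.submit(write_to_sls, record)    # fire-and-forget analytics copy to SLS
```

Keeping the SLS write asynchronous and best-effort is what lets observability improve without touching the latency or reliability of the online path.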

Capability Upgrade: Core SLS Features

SLS accommodates different data formats, enabling:

  • Fast full-text search
  • Multi-dimensional queries
  • Real-time insights and analysis

With SLS, we've built operational dashboards for essential metrics, such as interaction rates and user satisfaction; the snippet below shows the kinds of searches this unlocks.
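
This is a minimal sketch assuming the aliyun-log-python-sdk; the endpoint, credentials, project, logstore, field names, and query strings are illustrative placeholders rather than our production setup.

```python
import time

from aliyun.log import GetLogsRequest, LogClient

client = LogClient("cn-hangzhou.log.aliyuncs.com", "ACCESS_KEY_ID", "ACCESS_KEY_SECRET")

# Full-text keyword search across the workflow logs.
KEYWORD_QUERY = "timeout and sql_copilot"

# Multidimensional filter combined with a fuzzy (prefix) match on the error text.
FUZZY_QUERY = "status: error and error_message: syntax*"

now = int(time.time())
request = GetLogsRequest(
    project="sql-copilot",
    logstore="workflow-logs",
    fromTime=now - 24 * 3600,
    toTime=now,
    query=FUZZY_QUERY,
)
client.get_logs(request).log_print()  # print matching records for inspection
```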

Scenario Implementation: Streamlined Tracing and Problem Diagnosis

Leveraging OneDay, our AI coding platform, we crafted a streamlined diagnosis system built on SLS's strengths. It provides a clean interface for tracing queries and displaying dynamic content in a readable form. Impressively, we completed it in just one day.

What Does Our Production Practice Look Like? A Data-driven Loop for Quality Optimization

Issue Diagnosis and Quality Optimization Flow

Establishing a feedback loop is crucial for the development of SQL Copilot. We integrated SLS with DingTalk AI Table to organize errors and identify recurring patterns efficiently. This loop ensures production data feeds ongoing learning and improvement.
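
One concrete piece of that loop can be sketched as an aggregation over the failure logs; the query and field names below are illustrative assumptions, not our exact definitions.

```python
# Illustrative SLS query behind the error-triage loop: group recent failures by
# error type so recurring patterns stand out (field names are assumptions).
ERROR_PATTERN_QUERY = (
    "status: error | "
    "SELECT error_type, count(*) AS occurrences, arbitrary(error_message) AS example "
    "GROUP BY error_type ORDER BY occurrences DESC LIMIT 20"
)
```

A query like this runs the same way as the earlier GetLogsRequest example, and the grouped rows are what we import into DingTalk AI Table for review.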

Key Metrics and Evaluation System

Essential metrics include:

  • SQL success rate
  • SQL data return rate
  • Response time
  • User satisfaction

This evaluation framework steers future iterations, ensuring we derive actionable insights; the queries sketched below show how such metrics translate into SLS analysis statements.
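
These are illustrative only: field names such as status, returned_rows, latency_ms, and rating are assumptions about our log schema, and the exact statements behind our dashboards differ.

```python
# SQL success rate: share of generations whose execution succeeded.
SQL_SUCCESS_RATE = (
    "* | SELECT sum(CASE WHEN status = 'success' THEN 1 ELSE 0 END) * 100.0 / count(*) "
    "AS success_rate_pct"
)

# SQL data return rate: share of successful queries that actually returned rows.
DATA_RETURN_RATE = (
    "status: success | SELECT sum(CASE WHEN returned_rows > 0 THEN 1 ELSE 0 END) * 100.0 / count(*) "
    "AS data_return_rate_pct"
)

# Response time: 95th percentile end-to-end latency.
RESPONSE_TIME_P95 = "* | SELECT approx_percentile(latency_ms, 0.95) AS p95_latency_ms"

# User satisfaction: average rating over sessions that received feedback.
AVG_SATISFACTION = "* | SELECT avg(rating) AS avg_rating"
```

Each of these runs against the same workflow logstore and feeds the evaluation dashboards described above.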

Actual Performance Improvements

  • Root cause analysis accuracy jumped from 60% to 90%.
  • Issue resolution time dropped from three days to one.
  • User satisfaction rose from 3.2 to 4.3 on a five-point scale.

Key Takeaways

  • Simplified architecture: Managed data infrastructure lets teams concentrate on innovation.
  • Toolchain integration: AI-assisted tooling such as OneDay streamlines the development process and boosts efficiency.
  • Data-driven quality loops: Ongoing improvement leads to superior outcomes.

Future Outlook

We aim to harness LLM technology for automatic diagnostic reports and intelligent root cause analysis. Collaborating with leading platforms will further advance cloud-native observability.

Conclusion

This post shared our journey in creating a robust data infrastructure for LLM applications with SLS. As AI technology progresses, focusing on infrastructure is as crucial as innovating features. A solid foundation paves the way for consistent advancement in the intelligence domain. We believe comprehensive observability is key to the success of any LLM application.
