Together AI's ATLAS: 400% Inference Speedup with Adaptive Learning
Together AI's ATLAS delivers a groundbreaking 400% inference speedup by learning from real-time workloads, revolutionizing AI performance for enterprises.

Breaking Through the AI Inference Performance Barrier
Enterprises expanding their AI deployments often hit a snag: static speculators, whose predictions degrade as workloads change, dragging down inference speed. Together AI has launched ATLAS (AdapTive-LeArning Speculator System), a self-learning inference optimization system built to overcome this problem. With ATLAS, Together AI reports up to a 400% increase in inference performance.
Why Do We Need Adaptive Speculators?
Static speculators are small models paired with large language models during inference to predict several tokens ahead. However, as an enterprise's AI usage evolves, a speculator trained once on fixed data becomes less accurate. Tri Dao, chief scientist at Together AI, points out, "As companies scale, their workloads shift, and the benefits of speculative execution diminish." This drop in performance, known as workload drift, can lead to significant inefficiencies.
The Hidden Costs of Workload Drift
Workload drift comes with severe consequences. Enterprises face two main choices:
- Settle for poorer performance and longer latency
- Invest in retraining custom speculators, which quickly become obsolete
This hidden cost can hinder AI scaling, growth, and innovation.
How ATLAS Enhances Performance
ATLAS employs a dual-speculator architecture to tackle these issues effectively:
- Static Speculator: A model trained on a broad corpus that sets the baseline performance.
- Adaptive Speculator: A nimble model that learns from ongoing traffic, adapting to new domains and usage patterns in real time.
- Confidence-Aware Controller: This layer decides which speculator to use, optimizing performance by adjusting the speculation based on confidence scores, all without user input.
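To make the routing idea concrete, here is a minimal sketch of a confidence-aware controller choosing between two speculators. All class and method names are hypothetical illustrations of the concept described above, not Together AI's actual API; the confidence values are placeholder numbers.

```python
# Hypothetical sketch: a controller routes each speculation step to the
# static or adaptive speculator and scales speculation depth with the
# chosen speculator's confidence. Names and numbers are illustrative.

class Speculator:
    """Stand-in draft model: proposes tokens with a confidence score."""
    def __init__(self, name, confidence):
        self.name = name
        self.confidence = confidence  # running estimate of acceptance rate

    def propose(self, context, k):
        # A real speculator would run a small draft model here.
        return [f"{self.name}-tok{i}" for i in range(k)], self.confidence


class ConfidenceAwareController:
    """Picks whichever speculator currently has higher confidence and
    speculates deeper when that confidence is high."""
    def __init__(self, static, adaptive, max_depth=8):
        self.static = static
        self.adaptive = adaptive
        self.max_depth = max_depth

    def step(self, context):
        spec = (self.adaptive
                if self.adaptive.confidence >= self.static.confidence
                else self.static)
        depth = max(1, int(self.max_depth * spec.confidence))
        return spec.propose(context, depth)


static = Speculator("static", confidence=0.6)
adaptive = Speculator("adaptive", confidence=0.3)  # starts cold, learns online
ctrl = ConfidenceAwareController(static, adaptive)

tokens, conf = ctrl.step(context="...")
print(len(tokens), conf)  # static wins early: 4 draft tokens at confidence 0.6
```

As the adaptive speculator's confidence estimate rises past the static one's, the same routing rule would begin selecting it automatically, matching the hand-off Athiwaratkun describes below.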
Ben Athiwaratkun, staff AI scientist at Together AI, explains, "The static speculator provides an initial speed boost until the adaptive speculator starts learning." As the adaptive model becomes more confident, its performance steadily improves, leading to substantial speed gains.
Unprecedented Performance Improvements
ATLAS has shown extraordinary performance, reaching 500 tokens per second on DeepSeek-V3.1 when fully adapted. These results challenge specialized inference chips, proving that software advancements can bridge the gap to custom hardware. "Software and algorithmic improvements can match specialized hardware capabilities," Dao remarks. This is crucial for businesses aiming to optimize costs without compromising on performance.
Key Performance Metrics:
- 400% speed increase over traditional static speculators.
- 500 tokens per second on top-tier GPU hardware.
- Additional benefits from Together's Turbo optimization suite, including an 80% speedup from FP4 quantization.
The Memory-Compute Tradeoff Explained
A key inefficiency in modern inference is the underutilization of compute capacity. During inference, much of the compute power is idle. "You're mostly using the memory subsystem during inference," Dao states. Speculative decoding allows models to generate multiple tokens at once, maximizing compute use while reducing memory access.
This smart strategy works like caching systems but with a unique twist. Instead of storing exact outputs, adaptive speculators learn model behavior patterns. Recognizing common sequences in token generation, they enhance prediction accuracy over time.
Real-World Applications of ATLAS
ATLAS benefits various enterprise scenarios, notably:
- Reinforcement Learning Training: Static speculators falter as policy distributions change. ATLAS adapts in real time, boosting training efficiency.
- Shifting Workloads: Enterprises often change their AI applications, which alters workload demands. For example, a business might move from using chatbots to employing AI for coding, necessitating adaptive speculators.
The Future of Inference Optimization
ATLAS is now accessible on Together AI’s dedicated endpoints at no extra charge, available to over 800,000 developers. This move towards adaptive optimization marks a significant shift in inference platform operations. As enterprises apply AI across various fields, transitioning from one-time trained models to systems that learn and adapt continuously will be crucial.
Together AI's focus on innovation and collaboration could reshape the inference landscape, indicating that adaptive algorithms on standard hardware might surpass specialized silicon at a lower cost. As this shift occurs, software optimization will likely become more important than hardware specialization.
Conclusion
Together AI's ATLAS system is setting a new standard for AI inference by tackling the issue of static speculators head-on. By harnessing adaptive learning, enterprises can achieve unmatched speed and efficiency, staying ahead in a fast-paced market. As this technology evolves, embracing these advancements will be key for businesses aiming to maximize their AI initiatives.