Three Ways AI Is Learning to Understand the Physical World
Large language models excel at abstract reasoning but fail at physical tasks. Discover how JEPA, Gaussian splats, and end-to-end generation are solving this critical limitation.

Why AI Struggles with Physical Reality
Large language models have transformed how machines process language and abstract knowledge. Yet these same systems fail spectacularly when asked to predict what happens when you push a coffee cup near the edge of a table.
This fundamental gap is pushing investors and researchers toward world models. AMI Labs raised $1.03 billion and World Labs secured $1 billion in recent funding rounds.
Why Can't LLMs Understand Physical Reality?
The problem runs deeper than simple programming limitations. LLMs learn by predicting the next token in a sequence, a method that works brilliantly for language but provides no grounding in physical causality. They cannot reliably forecast the physical consequences of real-world actions, which makes them unsuitable for robotics, autonomous driving, and manufacturing applications.
Turing Award recipient Richard Sutton captured this limitation in an interview with podcaster Dwarkesh Patel. He warned that LLMs just mimic what people say instead of modeling the world, which limits their capacity to learn from experience and adjust to changes in physical environments.
Google DeepMind CEO Demis Hassabis calls this phenomenon "jagged intelligence": AI systems that can solve olympiad-level mathematics problems yet fail at basic physics.
Three distinct architectural approaches have emerged to solve this challenge. Each offers different tradeoffs between computational efficiency, spatial accuracy, and real-time performance.
What Is JEPA and Why Does It Matter?
How Does JEPA Mimic Human Cognition?
The Joint Embedding Predictive Architecture (JEPA) takes a fundamentally different approach from traditional video prediction models. Instead of trying to predict every pixel in the next frame, JEPA learns compact, abstract representations that capture only the essential dynamics of a scene.
Consider watching a car drive down a street. Your brain tracks the vehicle's trajectory and speed without calculating the exact reflection of light on every leaf in the background.
JEPA models reproduce this cognitive shortcut by learning a smaller set of latent features. These features represent core interaction rules while discarding irrelevant details. This design makes JEPA models remarkably robust against background noise and small input changes that break other architectures.
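The contrast between pixel-space and latent-space prediction can be sketched in a few lines of NumPy. This is a toy illustration, not AMI's actual implementation: the random projection stands in for a learned encoder, the identity matrix for a learned predictor, and all dimensions are arbitrary. Real JEPA training updates encoder and predictor jointly, typically against a slowly updated target encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "frames": 64x64 grayscale images flattened to 4096-dim vectors.
frame_t = rng.normal(size=4096)
frame_t1 = frame_t + 0.01 * rng.normal(size=4096)  # next frame, slightly changed

# Encoder: a fixed random projection standing in for a learned network.
# It maps each 4096-dim frame to a compact 32-dim latent representation.
W_enc = rng.normal(size=(32, 4096)) / np.sqrt(4096)

def encode(frame):
    return W_enc @ frame

# Predictor: forecasts the next latent from the current one.
W_pred = np.eye(32)  # identity as a stand-in for a learned predictor

z_t = encode(frame_t)
z_t1 = encode(frame_t1)
z_pred = W_pred @ z_t

# JEPA's key idea: the prediction error lives in latent space (32 numbers),
# not pixel space (4096 numbers), so irrelevant detail is never modeled.
latent_loss = float(np.mean((z_pred - z_t1) ** 2))
```

The point of the sketch is the shape of the loss, not the numbers: because the objective compares 32-dimensional embeddings rather than 4096 pixels, background detail the encoder discards can never dominate training.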
AMI Labs, which raised over $1 billion to develop this approach, has built its entire platform around JEPA's efficiency advantages.
Why Do Enterprises Choose JEPA for Mission-Critical Applications?
JEPA's computational efficiency translates directly into business value. The architecture requires significantly fewer training examples and runs with lower latency than pixel-level prediction models. These characteristics make it ideal for applications where split-second decisions matter.
Key advantages include:
- Minimal memory footprint for edge deployment
- Real-time inference capabilities for robotics and autonomous vehicles
- Reduced training costs through efficient learning
- Robust performance in noisy, unpredictable environments
AMI Labs is partnering with healthcare company Nabla to deploy JEPA-based models in fast-paced clinical settings. The system simulates operational complexity and reduces cognitive load for medical professionals who need instant, reliable predictions without computational delays.
Yann LeCun, JEPA pioneer and AMI co-founder, explained that these world models are "controllable in the sense that you can give them goals, and by construction, the only thing they can do is accomplish those goals." This goal-oriented design provides the safety guarantees enterprises demand for physical world applications.
How Do Gaussian Splats Build Interactive Spatial Environments?
What Makes Gaussian Splats Different from Other World Models?
World Labs has taken a radically different path by focusing on generative models that create complete 3D spatial environments from scratch. Their approach uses Gaussian splatting, a technique that represents 3D scenes using millions of tiny mathematical particles defining geometry and lighting.
Unlike flat video generation, these 3D representations can be imported directly into standard physics engines like Unreal Engine. Users and AI agents can freely navigate and interact with these environments from any angle, creating truly immersive spatial experiences.
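To make the representation concrete, here is a minimal NumPy sketch of the idea: a scene stored as an array of Gaussian particles, composited onto an image plane back-to-front. Production Gaussian splatting uses full anisotropic covariances, spherical-harmonic colors, and a tile-based GPU rasterizer; everything below (particle counts, isotropic scales, the orthographic camera) is a simplified stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy scene: N 3D Gaussians, each with a position, an isotropic scale,
# an RGB color, and an opacity.
N = 200
positions = rng.uniform(-1, 1, size=(N, 3))
scales = rng.uniform(0.05, 0.15, size=N)
colors = rng.uniform(0, 1, size=(N, 3))
opacities = rng.uniform(0.2, 0.8, size=N)

def render_orthographic(res=32):
    """Splat each Gaussian onto the x-y plane (viewing along -z),
    compositing back-to-front with alpha blending."""
    image = np.zeros((res, res, 3))
    ys, xs = np.mgrid[0:res, 0:res]
    px = xs / (res - 1) * 2 - 1  # pixel centers mapped to [-1, 1]
    py = ys / (res - 1) * 2 - 1
    for i in np.argsort(positions[:, 2]):  # back-to-front along z
        d2 = (px - positions[i, 0]) ** 2 + (py - positions[i, 1]) ** 2
        alpha = opacities[i] * np.exp(-d2 / (2 * scales[i] ** 2))
        image = image * (1 - alpha[..., None]) + colors[i] * alpha[..., None]
    return image

img = render_orthographic()
```

Because the scene is just an explicit list of particles with geometry and appearance, it can be re-rendered from any viewpoint or handed to a physics engine, which is what distinguishes this representation from flat video frames.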
World Labs founder Fei-Fei Li identified the core problem this approach solves. She noted that LLMs are "wordsmiths in the dark," possessing flowery language but lacking spatial intelligence and physical experience. The company's Marble model gives AI systems that missing spatial awareness.
What Is the Business Case for Spatial AI?
The primary value proposition centers on drastically reducing the time and cost required to create complex interactive 3D environments. Traditional 3D modeling requires teams of artists working for months to build detailed scenes. Gaussian splat models can generate comparable environments from simple prompts in minutes.
This capability unlocks several high-value applications:
- Rapid prototyping for industrial design and architecture
- Scalable training environment generation for robotics
- Interactive entertainment and spatial computing experiences
- Virtual product demonstrations and customer experiences
Autodesk has backed World Labs heavily to integrate these models into their industrial design applications. The partnership signals strong enterprise demand for tools that accelerate 3D content creation while maintaining professional quality standards.
Gaussian splat models are not designed for split-second real-time execution. However, their ability to generate static, explorable environments makes them invaluable for spatial computing and design workflows.
How Does End-to-End Generation Create Synthetic Data?
How Does Continuous Generation Work?
The third approach uses end-to-end generative models that act as complete physics engines. Rather than exporting static 3D files to external simulators, these models continuously generate scenes, physical dynamics, and reactions on the fly.
DeepMind's Genie 3 and Nvidia's Cosmos exemplify this architecture. They ingest an initial prompt alongside a continuous stream of user actions, then generate subsequent frames with native physics calculations, lighting, and object reactions. The model itself becomes the simulation engine.
DeepMind demonstrated Genie 3's capabilities by showcasing strict object permanence and consistent physics at 24 frames per second. The system operates without relying on separate memory modules. This integrated approach simplifies the technical stack while providing unprecedented flexibility.
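The rollout interface these systems expose can be sketched abstractly: a prompt seeds an initial state, and each incoming user action conditions the generation of the next frame. In Genie 3 or Cosmos the step function is a large learned video model; the hand-written ball physics below is purely a stand-in to show the loop structure, and every constant in it is illustrative.

```python
import numpy as np

def world_model_step(state, action):
    """One generation step: the 'model' maps (state, action) to the next
    state. Here gravity and the floor bounce are hand-coded; in an
    end-to-end world model these dynamics would be learned, not written."""
    pos, vel = state
    vel = vel + np.array(action, dtype=float) * 0.1  # action nudges velocity
    vel[1] -= 0.05                                   # gravity
    pos = pos + vel
    if pos[1] < 0:                                   # bounce off the floor
        pos[1], vel[1] = 0.0, -0.8 * vel[1]
    return (pos, vel)

def rollout(initial_state, actions):
    """Generate a trajectory from a prompt (initial state) and an
    action stream, one frame per action."""
    states = [initial_state]
    for a in actions:
        states.append(world_model_step(states[-1], a))
    return states

start = (np.array([0.0, 1.0]), np.array([0.1, 0.0]))
traj = rollout(start, [(0, 0)] * 24)            # 24 steps, one second at 24 fps
floor_ok = all(s[0][1] >= 0 for s in traj)      # nothing falls through the floor
```

The loop makes the architectural claim tangible: there is no external physics engine in the stack, only the step function, so whatever consistency the rollout exhibits (here, the floor constraint) must come from the model itself.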
Why Does Synthetic Data Matter for Physical AI?
The killer application for end-to-end generation is synthetic data production at scale. Training autonomous vehicles and robots requires exposure to millions of scenarios, including rare edge cases that are dangerous or expensive to recreate physically.
Nvidia Cosmos uses this architecture to create synthetic data factories for physical AI reasoning. Autonomous vehicle developers can synthesize rare, dangerous conditions without the cost or risk of physical testing. A self-driving car can experience thousands of near-miss scenarios in simulation before encountering one on real roads.
Waymo, the Alphabet subsidiary developing autonomous vehicles, built its world model on top of Genie 3. The system generates training scenarios that would take years to collect from real-world driving, dramatically accelerating development cycles.
The tradeoff is computational cost. Continuously rendering physics and pixels simultaneously requires substantial processing power. However, the investment is necessary to achieve what Hassabis calls a "deep, internal understanding of physical causality" that current AI systems lack.
What Comes Next for World Models?
Will Hybrid Architectures Dominate the Market?
The three architectural approaches are not mutually exclusive. As the technology matures, hybrid systems are emerging that combine strengths from multiple methods.
Cybersecurity startup DeepTempo recently developed LogLM, a model integrating elements from both LLMs and JEPA to detect anomalies and cyber threats from security logs. LLMs will continue serving as the reasoning and communication interface, but world models are positioning themselves as foundational infrastructure for physical and spatial data pipelines.
This division of labor plays to each architecture's strengths.
What Do These Developments Mean for Your Business?
The massive funding rounds for AMI Labs and World Labs signal a fundamental shift in AI investment priorities. Businesses should consider several strategic questions:
- Which physical world tasks could world models automate in your operations?
- Do your use cases require real-time performance (JEPA), spatial creation (Gaussian splats), or synthetic data generation (end-to-end)?
- How will world models integrate with your existing AI infrastructure?
- What competitive advantages could early adoption provide?
Companies in robotics, manufacturing, logistics, and autonomous systems stand to gain the most immediate benefits. However, the technology's implications extend to any industry where AI needs to interact with or reason about physical spaces and objects.
From Words to Worlds: The Future of AI
The evolution from language models to world models represents AI's next major frontier. JEPA architectures offer unmatched efficiency for real-time applications. Gaussian splat models excel at creating explorable spatial environments. End-to-end generation provides the synthetic data factories needed to train physical AI at scale.
Each approach addresses different aspects of the same fundamental challenge: giving AI systems the physical understanding they need to operate safely and effectively in the real world. The billion-dollar investments flowing into this space reflect both the magnitude of the problem and the enormous market opportunity for solutions.
Businesses that understand these architectural differences and align them with specific use cases will gain significant competitive advantages as AI moves from web browsers into physical spaces. The question is no longer whether AI will understand the physical world, but which approach will dominate which applications.