
How Expanso Cloud Runs: Kubernetes, Cell-Based Architecture, and Practicing What We Preach

Earlier this week we announced self-hosted Expanso on Kubernetes. This post is the technical companion - how we actually built it, what the architecture looks like inside, and why we made the decisions we did.

We tell customers to process data where it lives - on edge devices, behind firewalls, across distributed infrastructure. Compute should be portable, resilient, and close to the source. That’s the whole pitch. So when we re-architected Expanso Cloud, we held ourselves to the same standard. It now runs on Kubernetes with a cell-based architecture that spins up fully functional orchestrators in under ten seconds.

Quick refresher: what Expanso Cloud actually does

If you already know the product, skip ahead. Otherwise - the way we built Expanso Cloud follows directly from how the product works, so it’s worth a quick rundown.

The foundation is Expanso Edge - a lightweight agent that runs on your hardware. It sits on-prem, on IoT devices, inside corporate networks, wherever your data lives. It does the actual data processing (transforming, filtering, routing) locally, without sending raw data anywhere it doesn’t need to go. This isn’t a toggle you flip. It’s baked into the architecture. Your data stays with you.

Each Edge node connects to an Expanso Orchestrator - the control plane that coordinates what runs where. You deploy a logging pipeline or a data transformation job, and the orchestrator persists that intent and distributes it to the edge nodes that match your criteria. It handles versioning, gradual rollouts, rollbacks, monitoring. A single orchestrator can coordinate tens of thousands of edge nodes - but it never sees, touches, or processes your data.
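As a rough sketch of the coordination model, here is how an orchestrator might match a deployment to edge nodes by labels. The selector shape and label names are illustrative, not Expanso's actual job spec:

```python
# Illustrative sketch: matching a job's target criteria against edge-node
# labels. The selector syntax and label names are hypothetical, not
# Expanso's real deployment spec.

def matches(selector: dict, node_labels: dict) -> bool:
    """A node matches when every selector key/value is present on the node."""
    return all(node_labels.get(k) == v for k, v in selector.items())

def target_nodes(selector: dict, fleet: dict) -> list:
    """Return the IDs of nodes whose labels satisfy the selector."""
    return sorted(nid for nid, labels in fleet.items()
                  if matches(selector, labels))

fleet = {
    "edge-001": {"region": "us-east", "role": "logging"},
    "edge-002": {"region": "eu-west", "role": "logging"},
    "edge-003": {"region": "us-east", "role": "etl"},
}

print(target_nodes({"region": "us-east", "role": "logging"}, fleet))
# -> ['edge-001']
```

The orchestrator persists and distributes intent like this selector; the data those nodes process never flows through it.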

The orchestrator talks to edge nodes over NATS - a lightweight messaging system built for high-throughput bidirectional communication. Edge nodes connect inbound; the orchestrator pushes configuration outbound. That’s the entire data flow through the control plane.

Expanso Cloud is the fully managed experience on top of all this. You create an account, set up an org, provision orchestrators through the web UI. It’s the front door to the platform - and the piece we just re-architected.

From minutes to seconds: the architecture behind fast provisioning

Spinning up a new orchestrator - one that’s ready to accept edge node connections and deploy pipelines - used to take minutes. Now it takes under ten seconds. Getting there meant rethinking how we provision and manage orchestrators at the infrastructure level.

Cells: independent, isolated Kubernetes clusters

Expanso Cloud uses a cell-based architecture. Each cell is a fully independent, isolated Kubernetes cluster. When you create a new network (that’s our term for a managed orchestrator instance), the platform picks the right cell and provisions your orchestrator there.

Cells aren’t single-tenant. Each one hosts orchestrators for multiple customers, but with strict isolation at the Kubernetes level - separate namespaces, dedicated service accounts, independent secrets, isolated network routing. If something goes wrong in one cell, it doesn’t cascade to others. Blast radius is contained by design.

For us, cells give us the operational boundaries we need to scale, update, and maintain infrastructure without coordinated global deployments. They’re also the foundation for geographic distribution - we can place orchestrators closer to the infrastructure they’re coordinating.

The provisioning itself is Kubernetes-native. We built a custom operator around a CRD called ExpansoCloudNetworkInstance. When the platform creates this resource in a cell, the operator’s reconciliation loop kicks in and stands up everything the orchestrator needs:

  • A Kubernetes Deployment running the orchestrator with an embedded NATS broker
  • A ClusterIP Service
  • Traefik IngressRoutes for HTTP API access and raw TCP for NATS connections
  • Secrets for credentials and platform configuration
  • Persistent storage for orchestrator state

The operator handles the full lifecycle - creation, updates, health monitoring, teardown. It’s level-triggered, so it continuously reconciles toward the desired state. Pod crashes, it comes back. Secret rotates, the deployment rolls forward.

Provisioning is fast because creating a network is just creating a Kubernetes custom resource and letting the operator converge.
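The level-triggered loop itself can be sketched in a few lines. Here `actual` is a toy in-memory stand-in for the cluster's live state; the real operator watches the Kubernetes API and works through a queue:

```python
# A toy level-triggered reconcile loop in the spirit described above.
# `actual` stands in for live cluster state; this is a sketch, not the
# operator's real implementation.

def reconcile(desired: dict, actual: dict) -> list:
    """Converge `actual` toward `desired`; return the kinds that changed."""
    changed = []
    for kind, obj in desired.items():
        if actual.get(kind) != obj:    # missing, deleted, or drifted
            actual[kind] = obj         # create-or-update toward desired state
            changed.append(kind)
    return changed

desired = {"Deployment": {"replicas": 1}, "Service": {"type": "ClusterIP"}}
actual = {}
reconcile(desired, actual)           # first pass creates everything
del actual["Deployment"]             # simulate a deleted deployment
print(reconcile(desired, actual))    # -> ['Deployment']  (only the gap is fixed)
```

The key property is idempotence: running the loop against an already-converged state changes nothing, so it's safe to trigger it as often as events arrive.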

The controller: orchestrating the orchestrators

Between the Expanso Cloud web app and the cells sits the controller. It’s the piece of this re-architecture we’re most proud of, and it solves problems that don’t seem like problems until you’re managing hundreds of orchestrators.

The controller is an async service that manages orchestrator lifecycles across all cells. When you click “Create Network” in the UI, we don’t call a cell directly. Instead, we record the intent - your network config, capacity requirements, target version - and the controller picks it up. It selects the right cell, talks to the cell’s provisioning API, and monitors the result. If something fails, it retries with backoff.

This decoupling is what makes provisioning reliable, not just fast. The web app declares what should exist; the controller makes sure it does.
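The controller's side of that contract might look like the sketch below: pick a cell, call its provisioning API, retry with exponential backoff. The least-loaded placement policy and all field names are assumptions for illustration:

```python
# Sketch of the controller's decoupled provisioning. Placement policy
# (least-loaded cell) and field names are hypothetical.
import time

def select_cell(cells: dict) -> str:
    """Pick the cell hosting the fewest networks (illustrative policy)."""
    return min(cells, key=lambda c: cells[c])

def provision(intent: dict, cells: dict, provision_in_cell,
              retries: int = 5, base_delay: float = 0.01) -> str:
    """Drive recorded intent to completion, retrying with backoff."""
    cell = select_cell(cells)
    for attempt in range(retries):
        try:
            return provision_in_cell(cell, intent)
        except RuntimeError:
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"giving up on {intent['networkId']} after {retries} tries")
```

Because the web app only records intent, a transient cell-API failure never surfaces to the user - the controller just keeps converging.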

But the real value shows up beyond initial provisioning. The controller continuously monitors every orchestrator across every cell - health status, node counts, endpoint availability. And it handles the thing that gets genuinely painful at scale: rolling updates across hundreds of orchestrators.

When we ship a new orchestrator version, we don’t need to coordinate a synchronized deployment across every cell. We tell the controller: update from version X to version Y. It handles the rollout declaratively - same philosophy as Kubernetes Deployments - rolling through networks at a controlled pace, checking health at each step. Same thing when you update your orchestrator config through the platform. Whether it’s our ops team pushing a version or you changing a setting, the controller reconciles the desired state.
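A controlled rollout of that shape reduces to a small loop: update a batch, check health, halt if anything regresses. The batch size and health-check interface here are illustrative, not the controller's real internals:

```python
# Sketch of a declarative rolling update across many orchestrators:
# roll in batches, verify health before proceeding, halt on failure.
# Batch size and the callback interfaces are assumptions.

def rolling_update(networks: list, target: str, set_version, healthy,
                   batch_size: int = 10):
    """Returns (updated_networks, completed_without_halting)."""
    updated = []
    for i in range(0, len(networks), batch_size):
        batch = networks[i:i + batch_size]
        for net in batch:
            set_version(net, target)      # converge this network's version
        if not all(healthy(net) for net in batch):
            return updated, False         # halt: leave remaining networks alone
        updated += batch
    return updated, True
```

Halting on an unhealthy batch is what bounds the blast radius of a bad version: the networks after the failing batch are never touched.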

It's the same reconciliation pattern that makes Kubernetes operators work - declare desired state, let the system converge, handle failures gracefully - applied at the platform level, not just the infrastructure.

Direct-to-cell: no global routing overhead

Once your network is provisioned, all traffic goes directly to the cell. Your orchestrator’s HTTP API endpoint and NATS endpoint point straight to the cell’s ingress - no global proxy, no centralized routing layer, no extra hop.

When you interact with your orchestrator - through the CLI, the API, or the Cloud UI - traffic goes directly to the cell. When your edge nodes connect over NATS, they connect directly to the cell. The global control plane is only involved during provisioning and lifecycle management.
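To make the addressing concrete, both endpoints for a network resolve to its cell's ingress. The hostname scheme below is invented for this sketch, not Expanso's real DNS layout:

```python
# Illustration of direct-to-cell addressing: no global proxy hop sits
# between a client and the cell. The naming scheme is hypothetical.

def endpoints(network_id: str, cell: str) -> dict:
    host = f"{network_id}.{cell}.cells.example.com"   # invented scheme
    return {
        "api":  f"https://{host}",        # orchestrator HTTP API (Traefik)
        "nats": f"nats://{host}:4222",    # raw TCP for edge-node connections
    }

print(endpoints("acme", "cell-eu-1")["nats"])
# -> nats://acme.cell-eu-1.cells.example.com:4222
```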

The upshot: our global control plane can go down for maintenance and your running orchestrators don’t notice. Edge nodes keep processing, pipelines keep running, API calls keep working. Each cell is self-sufficient once provisioned.

Preparing for on-prem: the same architecture, everywhere

Building on Kubernetes and cells has a payoff beyond our own SaaS: we’re getting ready to offer the full Expanso Cloud experience as an on-prem deployment.

The orchestrator already doesn’t process your data - it’s purely coordination, and most users are fine with a managed orchestrator for that reason. But if you have strict data sovereignty requirements or air-gapped environments, even the control plane needs to run inside your perimeter.

Because every cell is a self-contained Kubernetes deployment - operator, CRDs, Helm charts, the full stack - offering on-prem means packaging what we already run. The same operator that provisions orchestrators in our managed cells works in your cluster. Same Helm charts, same reconciliation loops, same operational model. We’re not building a separate “enterprise edition.” We’re shipping you the same thing we use.

When we tell customers to deploy on portable, resilient infrastructure, we're running on that same stack ourselves. The Kubernetes environment we recommend for on-prem is the one we develop against every day.

Every developer on the team runs a full Expanso Cloud environment locally - not a simplified mock, not a Docker Compose approximation, but the actual cell-based architecture with the operator, provisioning, and monitoring. We test against the same topology that runs in production.

What this means for you

If you’re already running pipelines on Expanso: faster provisioning, better reliability through cell isolation, and orchestrator updates that don’t touch your running workloads. Direct-to-cell means your day-to-day doesn’t depend on global infrastructure.

If you’re evaluating us: we built Expanso Cloud on the same Kubernetes primitives and reconciliation patterns we recommend to customers, and we’re working toward shipping the same deployment artifacts for on-prem. Book a demo if you want to see it in action.

We’ll be at KubeCon talking about this in more detail - edge-to-orchestrator networking over NATS, managing thousands of network instances across cells, on-prem deployment model. Come find us, or just try Expanso Cloud and kick the tires yourself.
