Why 80% of Your Data Should Never Hit the Cloud
Let’s be honest: your pipelines are probably costing too much—in dollars, operational drag, and fragility. The default pattern forces teams to ship and store vast quantities of low-value data because the tooling offers no alternative.
We see it across industries:
- Security teams drowning in terabytes of logs they can’t analyze.
- Observability platforms racking up invoices for metrics nobody reads.
- Data lakes swelling with redundant, noisy, or useless data.
If you’re trying to cut cloud costs while keeping visibility, the math is against you: you’re paying egress, ingestion, storage, and query fees on data with little or zero ROI. It’s time to rethink the model with Compute-Over-Data.
What this article covers:
- Escalating costs of “ship everything” pipelines
- Why most of this data shouldn’t be in hot storage (or the cloud at all)
- The real reason teams ship everything
- What a smarter model looks like
- The payoff
- The solution: Bacalhau
The Escalating Costs of Inefficient “Ship Everything” Pipelines
Centralizing everything into a warehouse, observability stack, or log platform triggers cascading costs on every byte:
- Data egress costs moving data out of its source.
- Data ingestion costs charged by downstream platforms.
- Data storage costs in hot/warm tiers.
- Query costs for compute time and scaling hardware.
Industry analyses suggest that only around 20% of stored data is ever queried or used; the rest sits idle, yet you still pay full freight. Beyond money, centralizing creates latency, long transfer times, pipeline complexity, and higher security and compliance risk. Paying all that for unused data is a broken model.
Why Raw Data Overwhelms Hot Storage and Cloud Infrastructure
Hot tiers carry premium pricing (e.g., roughly $0.018/GB/month for Azure Blob hot storage at entry-tier volumes). Even cheaper object storage is wasted on data nobody analyzes. If data is redundant or needs transformation before it is useful, forcing it through a ship-and-store pipeline makes little sense. But teams often lack compute near the data, so they ship everything anyway.
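To make the waste concrete, here is a rough back-of-the-envelope sketch using the ~$0.018/GB/month hot-tier rate above. The 100 TB volume is hypothetical, and the 80% unused share follows the article's thesis rather than any specific measurement:

```python
# Rough hot-storage cost sketch (illustrative numbers, not a price quote).
HOT_RATE_PER_GB_MONTH = 0.018   # approx. entry-tier hot object storage rate
STORED_TB = 100                  # hypothetical log volume (assumption)
UNUSED_FRACTION = 0.8            # share of data never queried (assumption)

stored_gb = STORED_TB * 1024
total_monthly = stored_gb * HOT_RATE_PER_GB_MONTH
wasted_monthly = total_monthly * UNUSED_FRACTION

print(f"Total hot-storage bill:  ${total_monthly:,.2f}/month")
print(f"Spent on unqueried data: ${wasted_monthly:,.2f}/month")
```

At these assumptions, that is roughly $1,843/month in hot storage, of which about $1,475 buys nothing but idle bytes, before egress, ingestion, or query fees are counted.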
The Root Cause: Why Teams Ship It All
Most tooling assumes centralization, abundant bandwidth, and big budgets:
- Log shippers forward raw/processed data upstream.
- Message queues buffer but don’t transform.
- Observability platforms price on ingest, nudging you to send more.
- Data warehouses/lakes require landing before processing.
When you can’t compute at the edge (no resources, no tooling), you accept higher cost and complexity.
A Smarter Architecture: Compute-Over-Data
Invert the flow: bring compute to the data source. Run filtering, transforms, enrichment, compression, even analytics at the node/edge before data enters expensive paths.
What you can do at the source:
- Filter verbose logs; keep only critical errors/security events.
- Compress with data-specific algorithms.
- Enrich with local context (instance IDs, geo, device metadata).
- Route based on content or edge decisions.
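The filter/enrich/compress steps above can be sketched in a few lines. This is a minimal illustration, not a production shipper; the JSON log shape, the `level` field, and the severity threshold are all assumptions:

```python
import gzip
import json
import socket

# Assumption: only high-severity events are worth shipping upstream.
KEEP_LEVELS = {"ERROR", "CRITICAL", "SECURITY"}

def process_batch(raw_lines):
    """Filter, enrich, and compress a batch of JSON log lines at the source."""
    kept = []
    for line in raw_lines:
        event = json.loads(line)
        if event.get("level") not in KEEP_LEVELS:
            continue  # drop noise before it incurs any egress or ingest cost
        # Enrich with local context that is only cheaply available at the edge.
        event["host"] = socket.gethostname()
        kept.append(event)
    # Compress the surviving records before they leave the node.
    payload = gzip.compress(json.dumps(kept).encode())
    return kept, payload

logs = [
    '{"level": "DEBUG", "msg": "cache hit"}',
    '{"level": "ERROR", "msg": "disk failure"}',
]
kept, payload = process_batch(logs)
print(f"kept {len(kept)} of {len(logs)} events, {len(payload)} bytes compressed")
```

Only the one error event survives the filter; the debug noise never touches the network.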
Benefits:
- Reduced data costs by not shipping low-value data.
- Better signal visibility by stripping noise early.
- Security & compliance by keeping sensitive data local and only sending what’s required.
Bacalhau: Making Compute-Over-Data Practical
Bacalhau is open source and runs compute where data lives—edge, cloud, or on-prem—via Docker or WASM jobs.
How it helps:
- Edge processing before egress/ingest fees.
- Data volume reduction: filter/aggregate/downsample before shipping.
- Flexible job types: batch, service, ops, daemon—run the right pattern for the workload.
- Disconnected-friendly: built for intermittent/edge environments; no need for constant central control.
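As one example of the volume reduction mentioned above, a downsampling step like the following could be packaged in a container and scheduled by Bacalhau next to the data. The one-minute bucket size and the per-second metric shape are assumptions for illustration:

```python
from collections import defaultdict
from statistics import mean

def downsample(samples, bucket_seconds=60):
    """Collapse (timestamp, value) samples into per-bucket averages before shipping."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its time bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return {bucket: mean(values) for bucket, values in sorted(buckets.items())}

# 120 one-second CPU samples collapse to 2 one-minute averages.
samples = [(t, 50.0 + (t % 2)) for t in range(120)]
print(downsample(samples))  # → {0: 50.5, 60: 50.5}
```

Two points ship instead of 120, a 60x reduction before any egress or ingest fee applies, while the per-minute trend is preserved.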
Outcomes:
- Lower data and infra costs by moving less.
- Higher signal-to-noise for observability and security.
- Stronger compliance posture with data locality and selective movement.
Conclusion
“Ship everything” is unsustainable. Compute-Over-Data cuts costs and boosts agility by deciding what moves—and what doesn’t—at the source. Bacalhau gives you the orchestration layer to run containerized or WASM jobs wherever the data is, so 80% of low-value data never needs to hit the cloud.
What’s next?
- Install Bacalhau: Quick start or full installation.
- Try it hosted: Expanso Cloud.
- Set up networks: Network guides.
- Talk to us: Contact sales.
Commercial support
Bacalhau is open source; binaries are built, signed, and supported by Expanso. For commercial support or pre-built binaries, contact us or get a license via Expanso Cloud.
