Distributed Compute Working Group

Bacalhau
OVERVIEW

Bacalhau is the open-source platform that powers Expanso, and it reflects our commitment to open-source, community-driven development. It is designed for fast, cost-effective, and secure computation. Bacalhau executes tasks where data is generated and stored, and it integrates with existing Docker and WebAssembly (WASM) workflows. This approach, called Compute Over Data (CoD), processes large-scale datasets without first moving them to a central location.

Problems enterprises face today.

Large-scale data challenges:

VOLUME

A single device generating or storing over 100 GB might seem manageable. But scale this up to 10, 100, or even thousands of devices, and the data volume can quickly reach terabytes or petabytes. Traditional storage and processing systems can struggle to handle this volume.
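To make the scaling concrete, here is a minimal arithmetic sketch. It assumes every device holds exactly 100 GB and uses decimal units (1 TB = 1,000 GB); the function names are illustrative only and are not part of Bacalhau.

```python
def fleet_volume_gb(per_device_gb: float, devices: int) -> float:
    """Total data held across a fleet, in gigabytes."""
    return per_device_gb * devices


def human_readable(gb: float) -> str:
    """Format a gigabyte count using decimal units (1 TB = 1000 GB)."""
    if gb >= 1_000_000:
        return f"{gb / 1_000_000:g} PB"
    if gb >= 1_000:
        return f"{gb / 1_000:g} TB"
    return f"{gb:g} GB"


# Assumed fleet sizes, each device holding 100 GB.
for devices in (10, 100, 1_000, 10_000):
    print(devices, "devices:", human_readable(fleet_volume_gb(100, devices)))
```

Under these assumptions, 10 devices already hold 1 TB, and 10,000 devices hold 1 PB.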

VELOCITY

Devices, especially in the Internet of Things (IoT) landscape, can generate data at an incredibly fast rate. This rapid data generation can strain the ingestion capabilities of traditional systems.

VARIETY

Data from different devices can come in various formats. Some might produce structured data, while others might produce unstructured or semi-structured data. Managing this variety can be complex.

VERACITY

With the increase in the number of devices, ensuring the accuracy and trustworthiness of data becomes a challenge.

Implications

BOTTLENECKS
Traditional computing systems, which are not designed to handle such vast amounts of data, can experience bottlenecks in both storage and processing. This can lead to slow query performance, lag in data analysis, and overall system inefficiency.
INCREASED COSTS
Managing vast amounts of data requires more storage, more processing power, and more bandwidth. This can lead to increased infrastructure and operational costs.
SECURITY CONCERNS
With more devices connected and generating data, there’s an expanded surface area for potential security breaches.
DATA LOSS
Traditional systems might not be equipped with the necessary backup and recovery solutions for such vast amounts of data, leading to potential data loss.

How Bacalhau solves those challenges:

Distributed Computing Solution

Bacalhau addresses these data-intensive problems with distributed computing. Processing data close to where it’s generated or stored improves speed, agility, and security. Find out more in our documentation.

EXAMPLE USE CASES

Bacalhau excels in situations demanding high levels of data processing.

LOG PROCESSING

Sanitize and process application logs at source before centralizing, resulting in reduced transport costs, quicker insights, and strict data privacy compliance.
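As an illustration of the kind of work that could run at the source, here is a minimal sketch of a log sanitizer. It is plain Python, not Bacalhau's API: the function names, the log format, and the choice to mask emails and IPv4 addresses and keep only WARN and ERROR lines are all assumptions made for this example.

```python
import re

# Simple illustrative patterns; real deployments would need stricter rules.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def sanitize_line(line: str) -> str:
    """Mask email addresses and IPv4 addresses in one log line."""
    line = EMAIL.sub("<email>", line)
    return IPV4.sub("<ip>", line)


def process_logs(lines, keep_levels=("WARN", "ERROR")):
    """Keep only lines at the given levels and sanitize the ones kept."""
    kept = []
    for line in lines:
        if any(f" {level} " in line for level in keep_levels):
            kept.append(sanitize_line(line))
    return kept


# Hypothetical log lines in an assumed "<timestamp> <LEVEL> <message>" format.
raw = [
    "2024-01-01T00:00:00Z INFO user alice@example.com logged in from 10.0.0.7",
    "2024-01-01T00:00:05Z ERROR payment failed for bob@example.com from 192.168.1.20",
]
print(process_logs(raw))
```

Only the filtered, masked lines would then leave the machine, which is where the reduced transport cost and the privacy benefit come from.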

EDGE ML TRAINING
Distributed ML training across remote devices without central data aggregation, leading to enhanced security, lower latency, and maintained model accuracy.
DISTRIBUTED DATA WAREHOUSE
Virtual data warehouse with federated queries across distributed sources, enabling faster insights, cost savings, and access to real-time data.
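The idea behind a federated query can be sketched as follows: each source computes a partial aggregate locally, and only those small partial results are merged centrally. This example uses in-memory SQLite databases as stand-ins for remote sources. The `orders` table, its columns, and all function names are hypothetical, and the code does not show how Bacalhau itself schedules or runs queries.

```python
import sqlite3


def make_source(rows):
    """Build an in-memory SQLite database standing in for one remote source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


def local_aggregate(conn):
    """Runs where the data lives: return (region, sum, count) partial results."""
    return conn.execute(
        "SELECT region, SUM(amount), COUNT(*) FROM orders GROUP BY region"
    ).fetchall()


def merge_partials(partials):
    """Combine per-source partial aggregates into global totals and averages."""
    totals = {}
    for rows in partials:
        for region, total, count in rows:
            s, c = totals.get(region, (0.0, 0))
            totals[region] = (s + total, c + count)
    return {region: {"total": s, "avg": s / c} for region, (s, c) in totals.items()}


# Two assumed sources with a few rows each.
sources = [
    make_source([("eu", 10.0), ("us", 20.0)]),
    make_source([("eu", 30.0), ("us", 40.0), ("us", 60.0)]),
]
result = merge_partials(local_aggregate(c) for c in sources)
print(result)
```

Shipping sums and counts rather than raw rows keeps the global average exact while moving only a handful of values per source.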
DISTRIBUTED FLEET MANAGEMENT
Query device fleets instantly using Bacalhau’s OSQuery with a SQL-like engine, all without data centralization. Container support lets you swiftly audit, configure, and monitor fleet health. Benefit from enhanced uptime, faster issue resolution, and reduced engineering effort.
PROCESSING OVER GEOGRAPHICALLY DISTRIBUTED FILES
Data processing across distributed storage and varied regions, resulting in significant cost savings, faster data processing, and minimized compliance risks.
RUNNING JOBS OVER UNRELIABLE NETWORKS
Decentralized job coordination for resilient execution across unstable networks, ensuring reliable job execution, fewer failures, and support for geo-distributed queues.
FEDERATED LEARNING WITH ISOLATED DIVISIONS
Jointly train ML models without sharing raw data across divisions. Enjoy superior model accuracy and minimized regulatory risks.
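One common way to combine models trained in isolation is federated averaging, where each division shares only its model weights and example count, never its raw data. The sketch below shows that aggregation step in plain Python. Federated averaging is used here as an illustrative technique; the source does not say which algorithm is used in practice, and the function name and data layout are assumptions.

```python
def federated_average(updates):
    """Weighted average of model weights.

    `updates` is a list of (weights, num_examples) pairs, one per division.
    Each division's weights count in proportion to its number of examples.
    """
    total_examples = sum(n for _, n in updates)
    if total_examples == 0:
        raise ValueError("no training examples reported")
    dim = len(updates[0][0])
    averaged = [0.0] * dim
    for weights, n in updates:
        if len(weights) != dim:
            raise ValueError("all weight vectors must have the same length")
        for i, w in enumerate(weights):
            averaged[i] += w * n / total_examples
    return averaged


# Two hypothetical divisions: one trained on 100 examples, one on 300.
updates = [([1.0, 2.0], 100), ([3.0, 6.0], 300)]
print(federated_average(updates))
```

With these inputs the result is `[2.5, 5.0]`: the second division contributes three quarters of each averaged weight because it reported three quarters of the examples.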
SHARED MACHINES
Coordinate experiments across shared computing resources without transferring data. Benefit from optimal hardware usage, accelerated research, and uncompromised data security.

Industry-wide applications.

SECURITY
For real-time threat analysis and response.
WEB SERVING
Manage high traffic loads and content distribution.
FINANCIAL SERVICES
Handle large datasets and analytics efficiently.
IOT & EDGE COMPUTING
Efficiently process data from numerous connected devices.
FOG & MULTI-CLOUD
Optimize data movement & processing in hybrid environments.

For enterprises navigating the complexities of vast data, Bacalhau is the definitive answer.

For organizations seeking enhanced benefits, including robust binaries, SLAs, and dedicated support, Expanso offers a commercially supported version of Bacalhau tailored to meet enterprise-grade requirements.

For more information, visit our Documentation.

For an in-depth exploration, visit our Getting Started Tutorial.