AWS operates one of the largest deployments of compute services in the world. Like any organization at that scale, maintaining quality of service and reliability means capturing as much signal as possible from its services, most often in the form of logs. Delivering highly reliable services meant handling petabytes of logs generated every hour across Amazon’s massive global infrastructure, which supports millions of customers.
Their users had diverse needs – real-time access to logs, long-term storage for compliance, and efficient storage for large-scale batch processing – all without breaking the bank. The traditional approach, a centralized data lake, would have been prohibitively expensive, less efficient, and a nightmare to operate. So they went distributed: processing logs where they were generated and orchestrating multiple delivery streams tailored to specific needs, whether real-time access, optimized storage for batch processing, or long-term retention.
Thinking of achieving this level of efficiency and cost-saving outside of AWS? Our open-source product Bacalhau is the solution.
Challenges in Traditional Log Vending
Organizations today juggle the need for detailed logs with the challenges of managing massive data volumes. Platforms like Splunk, Datadog, and others offer rich features, but costs spike with increased data intake. Key log management challenges across industries include the following.
VOLUME VS. VALUE
Most log ingestion involves write-intensive operations, but only a minor fraction of these logs are accessed or deemed valuable.
REAL-TIME METRICS
Critical applications like threat detection and system health checks rely on specific metrics. Real-time metric solutions tend to be costlier and harder to operate, so early aggregation and filtering of these metrics are essential for both cost savings and operational scalability.
ON-DEMAND TROUBLESHOOTING
At times, operators need access to application and server logs for immediate troubleshooting. This access should be possible without connecting directly to the numerous production servers producing these logs.
ARCHIVAL AND COMPLIANCE
Preserving raw logs is crucial for compliance, audits, and various analytical and operational improvement processes.
To address these varied needs, especially in real-time scenarios, many organizations resort to streaming all logs to a centralized platform or data warehouse. While this method can work, it often leads to spiraling costs and slower insights.
Solution: Distributed Log Orchestration
Bacalhau is a distributed compute framework that offers efficient log processing, complementing existing platforms rather than replacing them. Its strength lies in its adaptable job orchestration: daemon, service, batch, and ops jobs can each be matched to a different stage of the log pipeline.
Bacalhau ensures log management is not only efficient but also responsive to changing business needs.
Bacalhau's Global Edge
Bacalhau is designed for global reach and reliability. Here’s a snapshot of its worldwide log solution:
LOCAL LOG TRANSFERS
Daemon jobs continuously ship logs to nearby storage, such as a regional S3 bucket or MinIO deployment. They stay active even without a connection to the Bacalhau orchestrator, safeguarding data during outages.
REGIONAL LOG HANDLING
Autonomous service jobs in each region channel logs to local or global platforms, preserving metrics even when the network is down.
SMART BATCH OPERATIONS
Bacalhau guides batch jobs to nearby data sources, cutting network costs and streamlining global tasks.
OPS JOBS FLEXIBILITY
Based on permissions, operators can target specific hosts, regions, or the entire network for queries.
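As a concrete sketch, the daemon jobs described above can be declared in a Bacalhau v1.x YAML job spec and submitted with the CLI. The job name, label selector, and Docker image below are illustrative assumptions, not the actual spec used in AWS's deployment:

```shell
# Sketch of a daemon-style job spec (Bacalhau v1.x YAML format).
# The job name, label selector, and image are illustrative placeholders.
cat > log-shipper.yaml <<'EOF'
Name: log-shipper
Type: daemon                  # runs continuously on every matching node
Constraints:
  - Key: service
    Operator: "="
    Values: ["web-server"]    # target only the web-server fleet
Tasks:
  - Name: ship
    Engine:
      Type: docker
      Params:
        Image: my-org/log-shipper:latest   # placeholder image
EOF
bacalhau job run log-shipper.yaml
```

Because the job type is `daemon`, the orchestrator schedules it on every node matching the constraints, and each instance keeps running locally even if connectivity to the orchestrator is lost.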
With Bacalhau, global log management is both efficient and user-friendly, marrying the perks of decentralization with centralized clarity.
Below, we compare the costs of two scenarios. In the first, logs are uploaded directly to Splunk and processed there. In the second, logs are preprocessed on-device with Bacalhau, with aggregated metrics streamed through Kinesis and compressed raw logs archived to an S3 bucket, all coordinated by EC2 instances in AWS. The savings exceed 99%.
| Assumption | Value | Unit |
| --- | --- | --- |
| TPS per host | 10,000 | r/s |
| Number of hosts | 30 | |
| Average Nginx access log size | 325 | bytes |
| Per-host hourly logs | 11 | GB |
| Per-host hourly compressed logs (zstd) | 0.8 | GB |
| Fleet-wide daily raw logs | 7,845 | GB |
| Fleet-wide daily compressed raw logs | 541 | GB |
| Average aggregated event size | 170 | bytes |
| Metrics per aggregation window | 100 | |
| Per-host hourly aggregated logs | 0.006 | GB |
| Fleet-wide daily aggregated logs | 4 | GB |
| Fleet-wide TPS | 300 | events |
| Splunk Cloud ingestion prices (ref) | Annual cost | Cost per GB |
| --- | --- | --- |
| Splunk annual 5 GB/day | $8,100 | $4.4 |
| Splunk annual 10 GB/day | $13,800 | $3.8 |
| Splunk annual 20 GB/day | $24,000 | $3.3 |
| Splunk annual 50 GB/day | $50,000 | $2.7 |
| Splunk annual 100 GB/day | $80,000 | $2.2 |
| Assuming 60% discount for higher volume | $32,000 | $0.9 |
| | Direct to Splunk | Bacalhau pre-processing |
| --- | --- | --- |
| Splunk daily ingestion (GB) | 7,845 | 4 |
| Splunk annual cost | $2,510,548 | $6,648 |
| S3 monthly new storage (GB) | | 16,232 |
| S3 monthly PUTs | | 21,600 |
| S3 monthly GETs | | 64,800 |
| S3 annual Standard storage (first month) | | $373 |
| S3 annual Infrequent Access (next 2 months) | | $406 |
| S3 annual Glacier Instant Retrieval (next 9 months) | | $584 |
| S3 annual PUTs cost | | $1.30 |
| S3 annual GETs cost | | $0.31 |
| S3 annual cost | | $1,365 |
| EC2 monthly orchestrator m7g.medium instances (x3) | | $90 |
| EC2 monthly compute m7g.large instances (x3) | | $251 |
| EC2 annual cost | | $4,091 |
| Kinesis monthly shard cost (x3 shards) | | $33 |
| Kinesis monthly PUT units cost | | $11 |
| Kinesis annual cost | | $527 |
| Total annual cost | $2,510,548 | $12,104 |
Bacalhau cut log-transfer bandwidth from 11 GB to 0.8 GB per host per hour – a ~93% reduction – while maintaining data integrity and real-time metrics.
Sending logs directly to a service like Splunk would cost roughly $2.5 million annually. Bacalhau’s approach? A mere $12k per year, a cost reduction of over 99%.
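These headline figures follow directly from the assumptions table. A quick sanity check, using only shell arithmetic (GB here means GiB, i.e. 1024^3 bytes):

```shell
# Sanity-check of the headline numbers, using only inputs from the tables above
tps=10000            # requests per second per host
hosts=30
event_bytes=325      # average Nginx access log line size
gib=1073741824       # bytes per GB (GiB)

fleet_daily_gb=$(( tps * event_bytes * 3600 * 24 * hosts / gib ))
echo "Fleet-wide daily raw logs: ${fleet_daily_gb} GB"   # 7845 GB, matching the table

# Savings ratio in tenths of a percent: (direct - bacalhau) / direct
savings=$(( (2510548 - 12104) * 1000 / 2510548 ))
echo "Savings: ${savings} per mille"                     # 995, i.e. ~99.5%
```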
Using aggregated data, we crafted detailed dashboards for monitoring web services, tracking traffic shifts, and spotting suspicious users.
With Bacalhau, benefits span faster threat detection using Kinesis, intricate batch tasks with raw logs in S3, and long-term storage in Glacier. Plus, you can keep using your go-to log visualization and alert tools.
Step 0 - Prerequisites
- Bacalhau CLI installed. If you haven’t yet, follow this guide.
- AWS CDK CLI. You can find more info here.
- An active AWS account (any other cloud provider works too, but the commands will be different).
In this example, your cluster will include:
- A Bacalhau orchestrator EC2 instance
- Three EC2 instances as web servers running Bacalhau agents
- An S3 bucket for raw log storage
- An OpenSearch cluster with a pre-configured dashboard and visualizations
Step 1 - Cluster deployment with CDK
First, clone the GitHub repository:
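The placeholders below stand in for the repository linked in this guide; substitute the actual URL and directory name:

```shell
# Substitute the repository URL linked in this guide
git clone <repository-url>
cd <repository-directory>
```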
Now, install the required Node.js packages:
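Run this from the repository root (the directory containing package.json):

```shell
# Installs the CDK app's Node.js dependencies
npm install
```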
Bootstrap your AWS account if you haven’t used AWS CDK on your account already:
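Bootstrapping is a one-time step per AWS account and region, and it requires valid AWS credentials configured for the CLI:

```shell
# One-time setup per account/region; provisions resources CDK needs to deploy
cdk bootstrap
```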
To deploy your stack without SSH access, run:
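With the defaults, deployment is a single command (append a stack name if the app defines more than one):

```shell
cdk deploy
```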
Need SSH access to your hosts? Use this instead:
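One common pattern is passing the key pair name as CDK context. The context key below is an illustrative assumption; check the repository’s README for the exact name:

```shell
# keyPairName is a hypothetical context key -- verify it in the repo's README
cdk deploy -c keyPairName=<your-key-pair-name>
```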
Note: If you don’t have an SSH key pair, follow these steps. Deployment will take a few minutes. Perfect time for a quick coffee break!
Step 2 - CDK Outputs
- OrchestratorPublicIp: Connect Bacalhau CLI to this IP.
- OpenSearchEndpoint: The endpoint for aggregated logs.
- BucketName: The S3 bucket for raw logs.
- OpenSearchDashboard: Access your OpenSearch dashboard here.
- OpenSearchPasswordRetriever: Retrieve OpenSearch master password with this command.
Step 3 - Access Bacalhau network
To configure your Bacalhau CLI, execute:
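One way to do this is via the API-host environment variable, substituting the OrchestratorPublicIp value from the CDK outputs (recent CLI versions also support setting this through `bacalhau config`):

```shell
# Point the CLI at the orchestrator node from the CDK outputs
export BACALHAU_API_HOST=<OrchestratorPublicIp>
```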
Verify your setup with:
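Listing the nodes confirms the CLI can reach the orchestrator:

```shell
# Shows every node the orchestrator knows about, with their labels
bacalhau node list
```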
You should see three compute nodes labeled service=web-server along with the orchestrator node.
Step 4 - Accessing OpenSearch dashboard
After your CDK stack is up and running, the OpenSearch dashboard URL will pop up in your console, courtesy of the CDK outputs. You’ll hit a login page the first time you try to access the dashboard. No sweat, just use admin as the username.
To get your password, you don’t have to hunt; CDK outputs include a handy command tailored for this. Just fire up your terminal and run:
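The command resembles the following; `--query SecretString` prints the stored secret as plain text:

```shell
# Reads the OpenSearch master password from AWS Secrets Manager
aws secretsmanager get-secret-value \
  --secret-id <Secret ARN> \
  --query SecretString \
  --output text
```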
Swap <Secret ARN> with the actual ARN displayed in your CDK outputs.
Logged in successfully? Fantastic, let’s proceed!
Step 5 - Deploy log generator
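The repository ships a job spec for the synthetic log generator; submitting it looks roughly like this (the file name below is an illustrative assumption):

```shell
# File name is illustrative -- use the generator spec shipped in the repository
bacalhau job run jobs/log-generator.yaml
```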
Check the job’s status:
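Use the job ID printed when you submitted the job:

```shell
# Substitute the job ID printed by the submit command
bacalhau job describe <job-id>
```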
Step 6 - Deploy logging agent
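Deploying the Logstash-based agent follows the same pattern as the generator (again, the file name is an illustrative assumption):

```shell
# File name is illustrative -- use the agent spec shipped in the repository
bacalhau job run jobs/logging-agent.yaml
```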
Logstash might need a few moments to get up and running. To keep tabs on its start-up progress, you can use:
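Recent Bacalhau CLI versions can stream a job’s container logs directly:

```shell
# Streams the agent's logs; substitute the job ID from the previous step
bacalhau job logs <job-id>
```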
Step 7 - How the logging agent works
- Raw Logs to S3: Every hour, compressed raw logs are sent to your specified S3 bucket. This is great for archival and deep-dive analysis.
- Real-time Metrics to OpenSearch: The agent pushes aggregated metrics to OpenSearch every AGGREGATE_DURATION seconds (e.g., every 10 seconds), including:
- Request Counts: Grouped by HTTP status codes.
- Top IPs: Top 10 source IPs by request count.
- Geo Sources: Top 10 geographic locations by request count.
- User Agents: Top 10 user agents by request count.
- Popular APIs & Pages: Top 10 most-hit APIs and pages.
- Gone Pages: Top 10 requested but non-existent pages.
- Unauthorized IPs: Top 10 IPs failing authentication.
- Throttled IPs: Top 10 IPs getting rate-limited.
- Data Volume: Total data transmitted in bytes.
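To make the aggregation concrete, here is a toy version of the status-code roll-up over a few fabricated Nginx combined-format log lines. The real agent does this in Logstash; this awk one-liner just illustrates the idea:

```shell
# Fabricated sample of Nginx combined-format access log lines
cat > sample.log <<'EOF'
1.2.3.4 - - [01/Jan/2024:00:00:01 +0000] "GET /api HTTP/1.1" 200 512
5.6.7.8 - - [01/Jan/2024:00:00:02 +0000] "GET /missing HTTP/1.1" 404 128
1.2.3.4 - - [01/Jan/2024:00:00:03 +0000] "GET /api HTTP/1.1" 200 512
EOF
# Request counts grouped by HTTP status code (field 9 in combined format)
awk '{counts[$9]++} END {for (s in counts) print s, counts[s]}' sample.log | sort
# prints:
# 200 2
# 404 1
```

Only these small aggregates cross the network each window; the full log lines stay on the host until the hourly compressed upload to S3.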
Your mileage may vary depending on the specific types of logs, the volume, and your existing infrastructure. You can use this calculator to get an estimate of your cost savings. Compression and decompression do carry CPU costs, but given the significant savings on data transfer and storage, they are usually well worth incurring.
So there you have it. With Bacalhau, setting up a robust log management system is pretty much a walk in the park. But hey, we’re just scratching the surface here. The framework’s adaptability and resilience make it a must-have tool for any enterprise aiming to keep their log data in check.
By decentralizing the log processing system, Bacalhau not only vastly reduces operational costs – as evidenced by our benchmarking results – but also ensures real-time data processing and compliance with archival needs. Its seamless integration across various cloud platforms, including AWS, Azure, and Google Cloud, demonstrates our commitment to versatile, cross-platform solutions.
While Bacalhau is open source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. You can read more about the difference between open source Bacalhau and commercially supported Bacalhau in our FAQ. If you would like to use our pre-built binaries and receive commercial support, please contact us.