Log Vending
Introduction
AWS operates one of the largest compute fleets in the world. Like any organization at that scale, maintaining quality of service and reliability means extracting as much signal as possible from its services, most often in the form of logs. Delivering highly reliable services means handling petabytes of logs generated every hour across Amazon’s massive global infrastructure, which supports millions of customers.
Their users had diverse needs: real-time access to logs, long-term storage for compliance, and efficient storage for large-scale batch processing, all without breaking the budget. The traditional approach, a centralized data lake, would have been prohibitively expensive, less efficient, and a nightmare to operate. So they went distributed: logs are processed where they are generated, and multiple delivery streams are orchestrated for specific needs, whether that is real-time access, optimized storage for batch processing, or long-term retention.
Want to achieve this level of efficiency and cost savings outside of AWS? Our open-source product, Bacalhau, is the solution.
Challenges in traditional log vending.
Organizations today juggle the need for detailed logs with the challenges of managing massive data volumes. Platforms like Splunk, Datadog, and others offer rich features, but costs spike as data intake grows. Key log management challenges across industries include the following.
VOLUME VS. VALUE
Most log ingestion is write-intensive, yet only a small fraction of these logs is ever accessed or deemed valuable.
REAL-TIME NEEDS
Critical applications like threat detection and system health checks rely on specific metrics. Real-time metric solutions tend to be costlier and harder to oversee. Hence, early aggregation and filtering of these metrics are essential for both cost savings and operational scalability.
OPERATIONAL INSIGHTS
At times, operators require access to application and server logs for immediate troubleshooting. This access should be facilitated without connecting directly to the numerous production servers producing these logs.
ARCHIVAL NEEDS
Preserving raw logs is crucial for compliance, audits, and various analytical and operational improvement processes.
To address these varied needs, especially in real-time scenarios, many organizations resort to streaming all logs to a centralized platform or data warehouse. While this method can work, it often leads to spiraling costs and slower insights.
Solution: Distributed Log Orchestration
Bacalhau is a distributed compute framework that enables efficient log processing while complementing your existing platforms. Its strength lies in its flexible job orchestration. Let’s explore its multi-job approach (a sample ops job spec is sketched at the end of this section):
Daemon Jobs:
- Purpose: Run continuously on selected nodes, deployed there automatically by the orchestrator.
- Function: Handle logs at the source, aggregating and compressing them. Aggregated logs are periodically sent to platforms like Kafka or Kinesis, or to a suitable logging service, while compressed raw logs destined for archiving or batch processing are shipped to storage like S3 every hour.
Service Jobs:
- Purpose: Handle continuous intermediate processing such as log aggregation, basic statistics, deduplication, and issue detection. They run on a defined set of nodes, with Bacalhau keeping them scheduled and running.
- Function: Continuous log processing and integration with logging services for instant insights, e.g., Splunk.
Batch Jobs:
- Purpose: Run on-demand on chosen nodes, with Bacalhau managing node choice and task monitoring.
- Function: Operates intermittently on data in S3, focusing on in-depth investigations without moving the data, turning nodes into a distributed data warehouse.
Ops Jobs:
- Purpose: Similar to batch jobs, but running across all matching nodes rather than a selected subset.
- Function: Execute ad-hoc queries across every matching node in the network, giving operators limited access to logs on the host machines while avoiding S3 transfer delays and delivering rapid insights.
Bacalhau ensures log management is not only efficient but also responsive to changing business needs.
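To make the ops-job idea concrete, here is a minimal, hypothetical sketch of an ops job that counts rate-limited requests in the local access logs on every node labeled service=web-server. The label, host path, and container pattern mirror the specs used later in this guide; the job name, image, and query itself are purely illustrative.

```yaml
# Hypothetical ops job: run an ad-hoc query on every node labeled service=web-server.
Name: CountThrottledRequests
Type: ops
Namespace: logging
Constraints:
  - Key: service
    Operator: ==
    Values:
      - web-server
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:22.04
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          # Count HTTP 429 responses in the host's access log
          - grep -c ' 429 ' /app/logs/application.log
    InputSources:
      - Target: /app/logs
        Source:
          Type: localDirectory
          Params:
            SourcePath: /data/log-orchestration/logs
```

You submit it like any other spec with bacalhau job run, and each node returns its own count, so operators get answers without copying raw logs off the hosts.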
Bacalhau's global edge.
Bacalhau is designed for global reach and reliability. Here’s a snapshot of its worldwide log solution:
LOCAL LOG TRANSFERS
Daemon jobs send logs to nearby storage, such as regional S3 buckets or MinIO. They stay active even if they lose their connection to the Bacalhau orchestrator, safeguarding data during outages.
REGIONAL LOG HANDLING
Service jobs in each region operate autonomously, channeling logs to local or global platforms and preserving metrics even when cross-region connectivity is down.
SMART BATCH OPERATIONS
Bacalhau guides batch jobs to nearby data sources, cutting network costs and streamlining global tasks.
OPS JOBS FLEXIBILITY
Based on permissions, operators can target specific hosts, regions, or the entire network for queries (see the constraint sketch below).
With Bacalhau, global log management is both efficient and user-friendly, marrying the perks of decentralization with centralized clarity.
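As a sketch of how that targeting might look, assuming your nodes carry a hypothetical region label, a job’s Constraints block can pin execution to a single region:

```yaml
# Hypothetical: run only on nodes labeled region=eu-west-1
Constraints:
  - Key: region
    Operator: ==
    Values:
      - eu-west-1
```

Dropping the constraint lets the orchestrator fan the same job out across every region.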
BENCHMARKING
In the following, we compare two scenarios. In the first, logs are shipped directly to Splunk and processed there. In the second, logs are pre-processed on the hosts with Bacalhau: only aggregated metrics are ingested into Splunk, while compressed raw logs are archived to an S3 bucket, with a handful of EC2 instances running the Bacalhau cluster and Kinesis carrying the aggregated stream. The savings exceed 99%. (The arithmetic behind the volume figures is sketched after the first table below.)
Environment

| Parameter | Value | Unit |
|---|---|---|
| **Fleet** | | |
| TPS Per Host | 10,000 | req/s |
| Number of Hosts | 30 | |
| **Raw Logs** | | |
| Average Nginx Access Log Size | 325 | bytes |
| Per Host Hourly Logs | 11 | GB |
| Per Host Hourly Compressed Logs (zstd) | 0.8 | GB |
| Fleet Wide Daily Raw Logs | 7,845 | GB |
| Fleet Wide Daily Compressed Raw Logs | 541 | GB |
| **Aggregated Logs** | | |
| Average Aggregated Event Size | 170 | bytes |
| Aggregation Window | 10 | seconds |
| Metrics Per Aggregation Window | 100 | |
| Per Host Hourly Aggregated Logs | 0.006 | GB |
| Fleet Wide Daily Aggregated Logs | 4 | GB |
| Fleet Wide TPS | 300 | events/s |
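As a sanity check, the volume figures above follow directly from the fleet parameters (the table appears to round to binary gigabytes):

$$10{,}000\ \text{req/s} \times 325\ \text{B} \times 3{,}600\ \text{s} \approx 11\ \text{GB of raw logs per host per hour}$$

$$11\ \text{GB} \times 24\ \text{h} \times 30\ \text{hosts} \approx 7{,}845\ \text{GB of raw logs per day, fleet wide}$$

$$100\ \text{events} \times 170\ \text{B} \times \tfrac{3{,}600\ \text{s}}{10\ \text{s}} \approx 0.006\ \text{GB per host per hour} \;\Rightarrow\; 0.006 \times 24 \times 30 \approx 4\ \text{GB per day, fleet wide}$$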
Splunk Cloud Ingestion Prices (Ref)

| Plan | Annual Total | Effective Cost per GB |
|---|---|---|
| Splunk annual 5GB/Day | $8,100 | $4.4 |
| Splunk annual 10GB/Day | $13,800 | $3.8 |
| Splunk annual 20GB/Day | $24,000 | $3.3 |
| Splunk annual 50GB/Day | $50,000 | $2.7 |
| Splunk annual 100GB/Day | $80,000 | $2.2 |
| Assuming 60% Discount For Higher Volume | $32,000 | $0.9 |
| | Direct to Splunk | Bacalhau Pre-Processing |
|---|---|---|
| Splunk Daily Ingestion (GB) | 7,845 | 4 |
| Splunk Annual Cost | $2,510,548 | $6,648 |
| S3 Monthly New Storage (GB) | | 16,232 |
| S3 Monthly PUTs | | 21,600 |
| S3 Monthly GETs | | 64,800 |
| S3 Annual Standard Storage (first month) | | $373 |
| S3 Annual Infrequent Access (next 2 months) | | $406 |
| S3 Annual Glacier Instant Retrieval (next 9 months) | | $584 |
| S3 Annual PUTs Cost | | $1.30 |
| S3 Annual GETs Cost | | $0.31 |
| S3 Annual Cost | | $1,365 |
| EC2 Monthly Orchestrator m7g.medium Instances (x3) | | $90 |
| EC2 Monthly Compute m7g.large Instances (x3) | | $251 |
| EC2 Annual Cost | | $4,091 |
| Kinesis Monthly Shard Cost (x3 Shards) | | $33 |
| Kinesis Monthly PUT Units Cost | | $11 |
| Kinesis Annual Cost | | $527 |
| Total Annual Cost | $2,510,548 | $12,104 |
| Cost Saving | | 99.52% |
Log Context: We benchmarked using web server access logs, a common source of signal for monitoring server health, user activity, and security. Our goal was to mirror real-world situations for wide applicability.
Setup and Capacity: We simulated 30 web servers across multiple zones, each handling about 10k TPS (Transactions Per Second) of web requests. This provides a solid foundation for gauging Bacalhau's log processing prowess.
Platform Considerations: Our tests were on AWS, but Bacalhau's flexibility means similar results on Google Cloud, Azure, or on-premises systems.
MAIN TAKEAWAYS
BANDWIDTH EFFICIENCY
Bacalhau cut the bandwidth for log transfers from 11 GB to 800 MB per host per hour, a reduction of roughly 93%, while preserving data integrity and real-time metrics.
COST SAVINGS
Sending logs directly to a service like Splunk would cost roughly $2.5 million annually. Bacalhau’s approach? A mere $12k per year, a cost reduction of over 99%.
REAL-TIME INSIGHTS
Using aggregated data, we crafted detailed dashboards for monitoring web services, tracking traffic shifts, and spotting suspicious users.
ADDED PERKS
With Bacalhau, benefits span faster threat detection using Kinesis, intricate batch tasks with raw logs in S3, and long-term storage in Glacier. Plus, you can keep using your go-to log visualization and alert tools.
Step 0 - Prerequisites
- Bacalhau CLI installed. If you haven’t yet, follow this guide (typical install commands are sketched after this list).
- AWS CDK CLI. You can find more info here.
- An active AWS account (any other cloud provider works too, but the commands will be different).
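If you still need to install either tool, the commands typically look like the following; treat the linked guides as the authoritative, up-to-date instructions:

```bash
# Install the Bacalhau CLI (install-script URL per the Bacalhau docs at the time of writing)
curl -sL https://get.bacalhau.org/install.sh | bash

# Install the AWS CDK CLI globally via npm
npm install -g aws-cdk
```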
In this example, your cluster will include:
- A Bacalhau orchestrator EC2 instance
- Three EC2 instances as web servers running Bacalhau agents
- An S3 bucket for raw log storage
- An OpenSearch cluster with a pre-configured dashboard and visualizations
Step 1 - Cluster deployment with CDK
Step-by-Step Deployment
First, clone the GitHub repository:
git clone
cd log-orchestration/cdk
Now, install the required Node.js packages:
npm install
Bootstrap your AWS account if you haven’t used AWS CDK on your account already:
cdk bootstrap
To deploy your stack without SSH access, run:
cdk deploy
Need SSH access to your hosts? Use this instead:
cdk deploy --context keyName=<your-key-pair-name>
Note: If you don’t have an SSH key pair, follow these steps. Deployment will take a few minutes. Perfect time for a quick coffee break!
Step 2 - CDK Outputs
- OrchestratorPublicIp: Connect Bacalhau CLI to this IP.
- OpenSearchEndpoint: The endpoint for aggregated logs.
- BucketName: The S3 bucket for raw logs.
- OpenSearchDashboard: Access your OpenSearch dashboard here.
- OpenSearchPasswordRetriever: Retrieve OpenSearch master password with this command.
Step 3 - Access Bacalhau network
To configure your Bacalhau CLI, execute:
export BACALHAU_NODE_CLIENTAPI_HOST=<OrchestratorPublicIp>
Verify your setup with:
bacalhau node list
You should see three compute nodes labeled service=web-server along with the orchestrator node.
```
ID        TYPE       LABELS                                               CPU        MEMORY           DISK               GPU
QmUWYeTV  Compute    Architecture=amd64 Operating-System=linux            1.6 / 1.6  1.5 GB / 1.5 GB  78.3 GB / 78.3 GB  0 / 0
                     git-lfs=False name=web-server-1 service=web-server
QmVBdXFW  Compute    Architecture=amd64 Operating-System=linux            1.6 / 1.6  1.5 GB / 1.5 GB  78.3 GB / 78.3 GB  0 / 0
                     git-lfs=False name=web-server-3 service=web-server
QmWaRH4X  Requester  Architecture=amd64 Operating-System=linux
                     git-lfs=False
QmXVWfVT  Compute    Architecture=amd64 Operating-System=linux            1.6 / 1.6  1.5 GB / 1.5 GB  78.3 GB / 78.3 GB  0 / 0
                     git-lfs=False name=web-server-2 service=web-server
```
Step 4 - Accessing OpenSearch dashboard
After your CDK stack is up and running, the OpenSearch dashboard URL will pop up in your console, courtesy of the CDK outputs. You’ll hit a login page the first time you try to access the dashboard. No sweat, just use admin as the username.
To get your password, you don’t have to hunt; CDK outputs include a handy command tailored for this. Just fire up your terminal and run:
aws secretsmanager get-secret-value --secret-id "<Secret ARN>" --query 'SecretString' --output text
Swap <Secret ARN> with the actual ARN displayed in your CDK outputs.
Logged in successfully? Fantastic, let’s proceed!
Step 5 - Deploy log generator
Save the following daemon job spec as log-generator.yaml. It runs a container on every node labeled service=web-server, continuously writing synthetic Nginx access logs to /data/log-orchestration/logs on the host:
```yaml
Name: LogGenerator
Type: daemon
Namespace: logging
Constraints:
  - Key: service
    Operator: ==
    Values:
      - web-server
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: expanso/nginx-access-log-generator:1.0.0
        Parameters:
          - --rate
          - "10"
          - --output-log-file
          - /app/output/application.log
    Resources:
      CPU: 0.1
      Memory: 512MB
    InputSources:
      - Target: /app/output
        Source:
          Type: localDirectory
          Params:
            SourcePath: /data/log-orchestration/logs
            ReadWrite: true
```
bacalhau job run log-generator.yaml
Check the job’s status by listing its executions (swap <job-id> for the ID printed by bacalhau job run):
bacalhau job executions <job-id>
```
CREATED   MODIFIED  ID          NODE ID   REV.  COMPUTE STATE  DESIRED STATE  COMMENT
12:43:31  12:43:31  e-d5433aec  QmUWYeTV  2     BidAccepted    Running
12:43:31  12:43:31  e-a7410744  QmVBdXFW  2     BidAccepted    Running
12:43:31  12:43:31  e-d6868bd2  QmXVWfVT  2     BidAccepted    Running
```
Step 6 - Deploy logging agent
Save the following daemon job spec as logstash.yaml, filling in OPENSEARCH_ENDPOINT, S3_BUCKET, and AWS_REGION with the values from your CDK outputs:
```yaml
Name: Logstash
Type: daemon
Namespace: logging
Constraints:
  - Key: service
    Operator: ==
    Values:
      - web-server
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: expanso/nginx-access-log-agent:1.0.0
        EnvironmentVariables:
          - OPENSEARCH_ENDPOINT=
          - S3_BUCKET=
          - AWS_REGION=
          - AGGREGATE_DURATION=10
    Resources:
      CPU: 0.5
      Memory: 1Gi
    Network:
      Type: Full
    InputSources:
      - Target: /app/logs
        Source:
          Type: localDirectory
          Params:
            SourcePath: /data/log-orchestration/logs
      - Target: /app/state
        Source:
          Type: localDirectory
          Params:
            SourcePath: /data/log-orchestration/state
            ReadWrite: true
```
bacalhau job run logstash.yaml
Logstash might need a few moments to get up and running. To keep tabs on its start-up progress, you can stream its logs (again using the job ID printed when you submitted it):
bacalhau logs --follow <job-id>
Step 7 - How the logging agent works
- Raw Logs to S3: Every hour, compressed raw logs are sent to your specified S3 bucket. This is great for archival and deep-dive analysis (a sample batch job over these archives is sketched after this list).
- Real-time Metrics to OpenSearch: The agent pushes aggregated metrics to OpenSearch every AGGREGATE_DURATION seconds (e.g., every 10 seconds), including:
  - Request Counts: Grouped by HTTP status codes.
  - Top IPs: Top 10 source IPs by request count.
  - Geo Sources: Top 10 geographic locations by request count.
  - User Agents: Top 10 user agents by request count.
  - Popular APIs & Pages: Top 10 most-hit APIs and pages.
  - Gone Pages: Top 10 requested but non-existent pages.
  - Unauthorized IPs: Top 10 IPs failing authentication.
  - Throttled IPs: Top 10 IPs getting rate-limited.
  - Data Volume: Total data transmitted in bytes.
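To illustrate how those S3 archives stay useful, here is a hypothetical batch job sketch that stages the raw log objects from the bucket onto a compute node. The bucket name, key prefix, region, and image are placeholders rather than values from this deployment, and the command merely lists what was staged; a real job would decompress and query the archives instead.

```yaml
# Hypothetical batch job: stage archived raw logs from S3 and inspect them on a compute node.
Name: InspectRawLogArchive
Type: batch
Namespace: logging
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:22.04
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          # List the staged archive objects; replace with real decompression/analysis
          - ls -lh /inputs
    InputSources:
      - Target: /inputs
        Source:
          Type: s3
          Params:
            Bucket: my-raw-log-archive   # placeholder; use the BucketName CDK output
            Key: "raw-logs/*"            # placeholder key prefix
            Region: us-east-1            # placeholder region
```

Because the job runs next to the data, only the small query results travel back to you, not the archives themselves.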
FURTHER INFORMATION
CAVEAT
Your mileage may vary depending on the specific types of logs, their volume, and your existing infrastructure. You can use this calculator to estimate your cost savings. Compression and decompression have CPU costs, but given the significant savings on data transfer and storage, they are usually well worth incurring.
Conclusion
With Bacalhau, setting up a robust log management system is pretty much a walk in the park. But, we’re just scratching the surface here. The framework’s adaptability and resilience make it a must-have tool for any enterprise aiming to keep their log data in check.
By decentralizing the log processing system, Bacalhau not only vastly reduces operational costs – as evidenced by our benchmarking results – but also ensures real-time data processing and compliance with archival needs. Its seamless integration across various cloud platforms, including AWS, Azure, and Google Cloud, demonstrates our commitment to versatile, cross-platform solutions.
While Bacalhau is open source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. You can read more about the difference between open source Bacalhau and commercially supported Bacalhau in our FAQ. If you would like to use our pre-built binaries and receive commercial support, please contact us.