Log Vending

Introduction

AWS operates one of the largest fleets of compute services in the world. Like any organization of significant size, maintaining quality of service and reliability means extracting as much signal as possible from its services, most often in the form of logs. To deliver highly reliable services, AWS had to handle petabytes of logs generated every hour across Amazon's massive global infrastructure supporting millions of customers.

Their users had diverse needs – real-time access to logs, long-term storage for compliance, and efficient storage for large-scale batch processing – all without burning a hole in their pocket. The traditional approach, a centralized data lake, would have been prohibitively expensive, less efficient, and a nightmare to operate. So they went distributed: process logs where they are generated and orchestrate multiple delivery streams tailored to specific needs, whether that is real-time access, optimized storage for batch processing, or long-term retention.

Thinking of achieving this level of efficiency and cost-saving outside of AWS? Our open-source product Bacalhau is the solution.

Challenges in Traditional Log Vending

Organizations today juggle the need for detailed logs with the challenges of managing massive data volumes. Platforms like Splunk, Datadog, and others offer rich features, but costs spike as data intake grows. Key log management challenges across industries include the following.

VOLUME VS. VALUE
Log ingestion is overwhelmingly write-intensive, yet only a small fraction of the ingested logs is ever accessed or proves valuable.

REAL-TIME NEEDS
Critical applications like threat detection and system health checks rely on specific metrics. Real-time metric solutions tend to be costlier and harder to operate, so aggregating and filtering these metrics early is essential for both cost savings and operational scalability.

OPERATIONAL INSIGHTS
At times, operators require access to application and server logs for immediate troubleshooting. This access should be facilitated without connecting directly to the numerous production servers producing these logs.

ARCHIVAL NEEDS
Preserving raw logs is crucial for compliance, audits, and various analytical and operational improvement processes.

To address these varied needs, especially in real-time scenarios, many organizations resort to streaming all logs to a centralized platform or data warehouse. While this method can work, it often leads to spiraling costs and slower insights.

Solution: Distributed Log Orchestration

Bacalhau is a distributed compute framework that processes logs efficiently while complementing existing platforms. Its strength lies in its adaptable job orchestration. Let's explore its multi-job approach:

[Diagram: Expanso distributed log orchestration architecture]

Daemon Jobs:

  • Purpose: Long-running jobs that the orchestrator automatically deploys to every selected node, where they run continuously.
  • Function: These jobs handle logs at the source, aggregating and compressing them. They periodically send the aggregated logs to platforms like Kafka or Kinesis, or to a suitable logging service. Every hour, raw logs intended for archival or batch processing are moved to object storage such as S3.

Service Jobs:

  • Purpose: Handle continuous intermediate processing like log aggregation, basic statistics, deduplication, and issue detection. They run on specific nodes, with Bacalhau ensuring their optimal performance.
  • Function: Continuous log processing and integration with logging services for instant insights, e.g., Splunk.

Batch Jobs:

  • Purpose: Run on-demand on chosen nodes, with Bacalhau managing node choice and task monitoring.
  • Function: Operates intermittently on data in S3, focusing on in-depth investigations without moving the data, turning nodes into a distributed data warehouse.

Ops Jobs:

  • Purpose: Similar to batch jobs but spanning all fitting nodes.
  • Function: Executes ad-hoc queries across all matching nodes in the network, giving end-users limited access to logs on the host machines, avoiding S3 transfer delays and ensuring rapid insights.

Bacalhau ensures log management is not only efficient but also responsive to changing business needs.

Bacalhau's Global Edge

Bacalhau is designed for global reach and reliability. Here’s a snapshot of its worldwide log solution:

LOCAL LOG TRANSFERS
Daemon jobs swiftly ship logs to nearby storage such as a regional S3 bucket or MinIO. They stay active even if the connection to the Bacalhau orchestrator is lost, safeguarding data during outages.

REGIONAL LOG HANDLING
Autonomous service jobs in each region channel logs to local or global platforms, preserving metrics even when the network is down.

SMART BATCH OPERATIONS
Bacalhau guides batch jobs to nearby data sources, cutting network costs and streamlining global tasks.

OPS JOBS FLEXIBILITY
Based on permissions, operators can target specific hosts, regions, or the entire network for queries.

With Bacalhau, global log management is both efficient and user-friendly, marrying the perks of decentralization with centralized clarity.

BENCHMARKING
Log vending cost

Below, we compare the cost of two scenarios. In the first, data is uploaded directly to Splunk and processed there. In the second, data is preprocessed on each host, with aggregated logs sent to Splunk and compressed raw logs archived to an S3 bucket, using a small fleet of EC2 instances in AWS. The resulting savings exceed 99%.

Environment

| Fleet | |
|---|---|
| TPS per host | 10,000 r/s |
| Number of hosts | 30 |

| Raw Logs | |
|---|---|
| Average Nginx access log size | 325 bytes |
| Per-host hourly logs | 11 GB |
| Per-host hourly compressed logs (zstd) | 0.8 GB |
| Fleet-wide daily raw logs | 7,845 GB |
| Fleet-wide daily compressed raw logs | 541 GB |

| Aggregated Logs | |
|---|---|
| Average aggregated event size | 170 bytes |
| Aggregation window | 10 seconds |
| Metrics per aggregation window | 100 |
| Per-host hourly aggregated logs | 0.006 GB |
| Fleet-wide daily aggregated logs | 4 GB |
| Fleet-wide TPS | 300 events |

Splunk Cloud Ingestion Prices (Ref)

| Plan | Annual Total | Effective $/GB |
|---|---|---|
| Splunk annual 5 GB/day | $8,100 | $4.4 |
| Splunk annual 10 GB/day | $13,800 | $3.8 |
| Splunk annual 20 GB/day | $24,000 | $3.3 |
| Splunk annual 50 GB/day | $50,000 | $2.7 |
| Splunk annual 100 GB/day | $80,000 | $2.2 |
| Assuming 60% discount for higher volume | $32,000 | $0.9 |

Cost Comparison

| | Direct to Splunk | Bacalhau Pre-Processing |
|---|---|---|
| Splunk daily ingestion (GB) | 7,845 | 4 |
| Splunk annual cost | $2,510,548 | $6,648 |
| S3 monthly new storage (GB) | – | 16,232 |
| S3 monthly PUTs | – | 21,600 |
| S3 monthly GETs | – | 64,800 |
| S3 annual Standard storage (first month) | – | $373 |
| S3 annual Infrequent Access (next 2 months) | – | $406 |
| S3 annual Glacier Instant Retrieval (next 9 months) | – | $584 |
| S3 annual PUTs cost | – | $1.30 |
| S3 annual GETs cost | – | $0.31 |
| S3 annual cost | – | $1,365 |
| EC2 monthly orchestrator m7g.medium instances (x3) | – | $90 |
| EC2 monthly compute m7g.large instances (x3) | – | $251 |
| EC2 annual cost | – | $4,091 |
| Kinesis monthly shard cost (x3 shards) | – | $33 |
| Kinesis monthly PUT units cost | – | $11 |
| Kinesis annual cost | – | $527 |
| Total annual cost | $2,510,548 | $12,104 |
| Cost saving | – | 99.52% |

Log Context: We benchmarked with web server access logs, a common source for monitoring server health, user activity, and security. Our goal was to mirror real-world situations for wide applicability.

Setup and Capacity: We simulated 30 web servers across multiple zones, each handling about 10k TPS (Transactions Per Second) of web requests. This provides a solid foundation for gauging Bacalhau's log processing prowess.

Platform Considerations: Our tests were on AWS, but Bacalhau's flexibility means similar results on Google Cloud, Azure, or on-premises systems.

MAIN TAKEAWAYS

BANDWIDTH EFFICIENCY
Bacalhau cut log-transfer bandwidth from 11 GB to 0.8 GB per host per hour – a ~93% reduction – while maintaining data integrity and real-time metrics.

COST SAVINGS
Direct logs to a service like Splunk would cost roughly $2.5 million annually. Bacalhau’s approach? A mere $12k per year, marking a cost slash of over 99%.

REAL-TIME INSIGHTS
Using aggregated data, we crafted detailed dashboards for monitoring web services, tracking traffic shifts, and spotting suspicious users.

ADDED PERKS
With Bacalhau, benefits span faster threat detection using Kinesis, intricate batch tasks with raw logs in S3, and long-term storage in Glacier. Plus, you can keep using your go-to log visualization and alert tools.

Step 0 - Prerequisites
  • Bacalhau CLI installed. If you haven’t yet, follow this guide.
  • AWS CDK CLI. You can find more info here.
  • An active AWS account (any other cloud provider works too, but the commands will be different).

In this example, your cluster will include:

  • A Bacalhau orchestrator EC2 instance
  • Three EC2 instances as web servers running Bacalhau agents
  • An S3 bucket for raw log storage
  • An OpenSearch cluster with a pre-configured dashboard and visualizations

Step-by-Step Deployment

First, clone the GitHub repository:
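The commands below assume the example lives in the log-orchestration folder of the examples repository (adjust the path if the CDK app sits in a subdirectory):

```bash
# Clone the Bacalhau examples repository and enter the log orchestration example
git clone https://github.com/bacalhau-project/examples.git
cd examples/log-orchestration
```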

Now, install the required Node.js packages:
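Something like the following, run from the directory that contains the CDK app's package.json:

```bash
# Install the CDK app's Node.js dependencies
npm install
```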

Bootstrap your AWS account if you haven’t used AWS CDK on your account already:
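For example:

```bash
# One-time setup of CDK deployment resources in your target AWS account and region
cdk bootstrap
```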

To deploy your stack without SSH access, run:
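A typical invocation looks like this:

```bash
# Synthesize and deploy the stack with its default settings (no SSH key pair)
cdk deploy
```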

Need SSH access to your hosts? Use this instead:
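The exact mechanism depends on the stack definition; a common pattern is passing an existing EC2 key pair name as CDK context (the keyName context key below is illustrative):

```bash
# Deploy and pass an existing EC2 key pair so the instances allow SSH access
cdk deploy -c keyName=<your-key-pair-name>
```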

Note: If you don’t have an SSH key pair, follow these steps. Deployment will take a few minutes. Perfect time for a quick coffee break!

Once the stack is deployed, you’ll receive these outputs:
  • OrchestratorPublicIp: Connect Bacalhau CLI to this IP.
  • OpenSearchEndpoint: The endpoint for aggregated logs.
  • BucketName: The S3 bucket for raw logs.
  • OpenSearchDashboard: Access your OpenSearch dashboard here.
  • OpenSearchPasswordRetriever: Retrieve OpenSearch master password with this command.

To configure your Bacalhau CLI, execute:
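One common way, depending on your Bacalhau version, is to point the CLI at the orchestrator IP from the CDK outputs:

```bash
# Tell the Bacalhau CLI which orchestrator endpoint to talk to
export BACALHAU_API_HOST=<OrchestratorPublicIp>
```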

Verify your setup with:
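For example:

```bash
# List the nodes that have joined the cluster, along with their labels
bacalhau node list
```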

You should see three compute nodes labeled service=web-server along with the orchestrator node.

After your CDK stack is up and running, the OpenSearch dashboard URL will pop up in your console, courtesy of the CDK outputs. You’ll hit a login page the first time you try to access the dashboard. No sweat, just use admin as the username.

To get your password, you don’t have to hunt; CDK outputs include a handy command tailored for this. Just fire up your terminal and run:
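The CDK output contains the exact command; it will look roughly like the following (the query flags here are illustrative):

```bash
# Retrieve the OpenSearch master password from AWS Secrets Manager
aws secretsmanager get-secret-value --secret-id <Secret ARN> --query SecretString --output text
```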

Swap <Secret ARN> with the actual ARN displayed in your CDK outputs.

Logged in successfully? Fantastic, let’s proceed!

We’ve prepared a [log-generator.yaml](<https://github.com/bacalhau-project/examples/blob/main/log-orchestration/jobs/log-generator.yaml>) file for deploying a daemon job to simulate web access logs.
What’s Happening: The daemon job is deployed on all nodes with the label service=web-server. The job employs a Docker image that mimics Nginx access logs and stores them locally. Deploy it like so:
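Assuming you are in the log-orchestration directory of the examples repository:

```bash
# Deploy the daemon job that simulates Nginx access logs on every web-server node
bacalhau job run jobs/log-generator.yaml
```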

Check the job’s status:
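Using the job ID returned by the previous command, something like:

```bash
# Show the job and its per-node executions
bacalhau job describe <job-id>
```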

You should see three executions, one on each of your three web servers, in the Running state.

Bacalhau's versatility means you can deploy any type of job, use any existing logging agent, or implement your own. We'll use Logstash for this example. Take a look at [logstash.yaml](<https://github.com/bacalhau-project/examples/blob/main/log-orchestration/jobs/logstash.yaml>).
Don’t Forget: Update <OpenSearchEndpoint>, <S3BucketName>, and <AWSRegion> according to your CDK outputs. Deploy your logging agent:
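Again assuming the log-orchestration directory of the examples repository:

```bash
# Deploy the Logstash service job (after filling in the placeholders from your CDK outputs)
bacalhau job run jobs/logstash.yaml
```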

Logstash might need a few moments to get up and running. To keep tabs on its start-up progress, you can use:
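Something along these lines, with the job ID from the previous step:

```bash
# Stream the job's output to watch Logstash start up
bacalhau job logs <job-id>
```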

In our setup, we’re using the expanso/nginx-access-log-agent Docker image. Here’s a quick rundown of what you get out of the box with this choice:
  1. Raw Logs to S3: Every hour, compressed raw logs are sent to your specified S3 bucket. This is great for archival and deep-dive analysis.
  2. Real-time Metrics to OpenSearch: The agent pushes aggregated metrics to OpenSearch every AGGREGATE_DURATION seconds (e.g., every 10 seconds).
You can learn more about our Logstash pipeline configuration and aggregation implementation here. These are a subset of the aggregated metrics published to OpenSearch:
  • Request Counts: Grouped by HTTP status codes.
  • Top IPs: Top 10 source IPs by request count.
  • Geo Sources: Top 10 geographic locations by request count.
  • User Agents: Top 10 user agents by request count.
  • Popular APIs & Pages: Top 10 most-hit APIs and pages.
  • Gone Pages: Top 10 requested but non-existent pages.
  • Unauthorized IPs: Top 10 IPs failing authentication.
  • Throttled IPs: Top 10 IPs getting rate-limited.
  • Data Volume: Total data transmitted in bytes.
This gives you real-time insight into traffic patterns, performance issues, and potential security risks, all visible through your OpenSearch dashboards. After a while, the dashboard will be bursting with insights: metrics on user behavior, traffic hotspots, error rates, you name it!
FURTHER INFORMATION

CAVEAT
Your mileage may vary depending on the specific types of logs, the volume, and your existing infrastructure. You can use this calculator to get an estimate of your cost saving. Compression and decompression have CPU costs. But given that we’re saving significantly on data transfer and storage, it’s often a cost well worth incurring.


Conclusion

With Bacalhau, setting up a robust log management system is pretty much a walk in the park. But we're just scratching the surface here. The framework's adaptability and resilience make it a must-have tool for any enterprise aiming to keep their log data in check.

By decentralizing the log processing system, Bacalhau not only vastly reduces operational costs – as evidenced by our benchmarking results – but also ensures real-time data processing and compliance with archival needs. Its seamless integration across various cloud platforms, including AWS, Azure, and Google Cloud, demonstrates our commitment to versatile, cross-platform solutions.

While Bacalhau is open source software, the Bacalhau binaries go through the security, verification, and signing build process lovingly crafted by Expanso. You can read more about the difference between open source Bacalhau and commercially supported Bacalhau in our FAQ. If you would like to use our pre-built binaries and receive commercial support, please contact us.

Feeling inspired to get started?

For more information, read more Use Cases or visit our Documentation. For an in-depth exploration, visit our Getting Started Tutorial.