12 Logging Best Practices to Save You from 3 AM Debugging

A Clear Logging Game Plan

Effective logging starts with a plan. Don’t haphazardly scatter log statements; instead, define clear objectives. Before writing a single log line, ask yourself:

  • What are my application’s main goals?
  • Which critical operations need monitoring?
  • What KPIs truly matter?

For error logging, the goal isn’t just to signal an error, but to provide enough context for quick resolution. Consider what your future self (at 3:00 AM) will need. It’s easier to remove excessive logs than to add missing information later. Regularly review your logging strategy to identify and remove unnecessary noise. The best logging strategy focuses on capturing the right information, not everything.

Understanding Log Levels

Log levels provide a structured approach to logging. Four common levels are:

  1. INFO: Business-as-usual events, such as successful logins or completed transactions. Example: user completed checkout (order #12345).
  2. WARNING: Potential issues; things aren’t quite right, but not critical failures. Example: payment processing taking longer than usual.
  3. ERROR: Actual problems; failures, exceptions. Example: database connection failed.
  4. FATAL: Catastrophic failures; application or system crash. Example: system out of memory, shutting down.

In production, most applications default to the INFO level. However, during debugging, increase the verbosity temporarily to capture more detailed information. Many tools allow you to adjust verbosity; incorporate this into your application’s logging configuration.
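
As a minimal sketch of this in Go's standard log/slog package (the messages and field values are illustrative), a LevelVar keeps INFO as the production default while letting you raise verbosity at runtime:

package main

import (
    "log/slog"
    "os"
)

func main() {
    // A LevelVar can be adjusted at runtime, e.g. while debugging in production.
    var level slog.LevelVar
    level.Set(slog.LevelInfo) // the usual production default

    logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
        Level: &level,
    }))

    logger.Info("user completed checkout", "order_id", 12345)
    logger.Warn("payment processing slower than usual", "latency_ms", 2500)
    logger.Error("database connection failed", "host", "db-primary")
    // slog has no FATAL level; log at Error and exit explicitly for
    // catastrophic failures.

    // Temporarily increase verbosity while investigating an issue.
    level.Set(slog.LevelDebug)
    logger.Debug("retrying database connection", "attempt", 2)
}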

Structured Logging: The Power of Organization

Avoid unstructured logging, which resembles a wall of text. Instead, use structured logging. Each piece of information should reside in its own field, resulting in a more readable and analyzable log format (e.g., JSON).

Unstructured Log: Error processing order #123 - database timeout

Structured Log (JSON Example):

{
  "event": "order_processing_error",
  "order_id": 123,
  "error": "database_timeout",
  "timestamp": "2024-08-08T14:20:00Z"
}

Structured logs are easily searchable and analyzable. Most logging frameworks support structured output natively; use one rather than hand-formatting strings. Tools like Vector can transform unstructured logs into structured formats.
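
To make the contrast concrete, here is a small sketch in Go, whose standard log/slog package emits structured JSON directly (the event and field names mirror the example above):

package main

import (
    "log"
    "log/slog"
    "os"
)

func main() {
    // Unstructured: everything is squeezed into one opaque string.
    log.Printf("Error processing order #%d - %s", 123, "database timeout")

    // Structured: each piece of information gets its own field; the JSON
    // handler adds the timestamp and level automatically.
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
    logger.Error("order_processing_error",
        "order_id", 123,
        "error", "database_timeout",
    )
}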

What to Log: Context is King

A simple “something went wrong” log entry is unhelpful. Include context:

  • Who: User ID, session ID
  • What: Action performed, request details
  • Where: Location, service, endpoint
  • Why: Error message, stack trace

Example:

Poor Log: Error processing payment

Good Log:

{
  "event": "payment_processing_error",
  "user_id": "12345",
  "payment_method": "credit_card",
  "amount": 100.00,
  "error": "invalid_credit_card",
  "timestamp": "2024-08-08T14:25:00Z"
}

Include request IDs for tracing across microservices, system state data (database/cache status), and full error context (stack traces). Logs are your system’s black box recorder; make them detailed enough to reconstruct events.
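
One way to keep that context consistent, sketched in Go (the request ID and field values are hypothetical): derive a request-scoped logger once, so the who/what/where fields ride along on every entry.

package main

import (
    "log/slog"
    "os"
)

func main() {
    base := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    // With copies these fields onto every subsequent entry, which is what
    // makes correlating a request across log lines (and services) possible.
    reqLogger := base.With(
        "request_id", "req-9f2c", // hypothetical; usually taken from a header
        "user_id", "12345",
        "endpoint", "/api/payments",
    )

    reqLogger.Error("payment_processing_error",
        "payment_method", "credit_card",
        "amount", 100.00,
        "error", "invalid_credit_card",
    )
}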

Log Sampling: Cost Optimization

High-traffic systems generate massive log volumes. Storing every log is expensive and often unnecessary. Log sampling addresses this: store a representative sample of logs instead of all logs.

For example, a 20% sampling rate for authentication logs means storing only 2 out of 10 identical login events. You can be selective: keep all error logs, but sample success logs. Sample aggressively for less critical endpoints, but maintain full logs for critical sections. Logging and observability frameworks (like OpenTelemetry) provide built-in sampling capabilities. Sampling can significantly reduce logging costs while retaining valuable insights.
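
As an illustration (this wrapper is not a built-in slog feature), a thin handler can keep every WARN-and-above record while passing through only a fraction of the rest; the 20% rate matches the example above:

package main

import (
    "context"
    "log/slog"
    "math/rand"
    "os"
)

// samplingHandler drops a share of low-severity records. A complete
// implementation would also wrap WithAttrs and WithGroup.
type samplingHandler struct {
    slog.Handler
    rate float64 // fraction of sub-WARN records to keep
}

func (h samplingHandler) Handle(ctx context.Context, r slog.Record) error {
    if r.Level < slog.LevelWarn && rand.Float64() >= h.rate {
        return nil // drop this record
    }
    return h.Handler.Handle(ctx, r)
}

func main() {
    logger := slog.New(samplingHandler{
        Handler: slog.NewJSONHandler(os.Stdout, nil),
        rate:    0.2,
    })

    for i := 0; i < 10; i++ {
        logger.Info("user login succeeded", "attempt", i) // roughly 2 of 10 kept
    }
    logger.Error("login backend unreachable") // always kept
}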

Canonical Log Lines: The Whole Story in One Entry

Instead of scattered log entries, create canonical log lines—single entries summarizing entire events. For instance, at the end of each request, log a summary: user action, user identity, outcome, duration, database time. This simplifies debugging; instead of searching through many entries, you have a single, comprehensive record. Distributed tracing (e.g., with OpenTelemetry) provides a superior alternative, allowing you to trace requests across multiple services.
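
A minimal sketch of a canonical log line in Go: HTTP middleware that emits one summary entry per request (the field names and X-User-ID header are illustrative; capturing the response status would additionally require wrapping the ResponseWriter):

package main

import (
    "log/slog"
    "net/http"
    "os"
    "time"
)

var logger = slog.New(slog.NewJSONHandler(os.Stdout, nil))

// canonical wraps a handler and logs one comprehensive record per request.
func canonical(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        next.ServeHTTP(w, r)

        logger.Info("request_completed",
            "method", r.Method,
            "path", r.URL.Path,
            "duration_ms", time.Since(start).Milliseconds(),
            "user_id", r.Header.Get("X-User-ID"), // hypothetical header
        )
    })
}

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/checkout", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })
    http.ListenAndServe(":8080", canonical(mux))
}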

Centralized Logging: Aggregation for Efficiency

Aggregate and centralize logs from various services into a single location. This enables unified search, analysis of inter-service impacts, and team-wide access to the same data. Avoid the frustration of debugging across numerous disparate log sources. Correlating events becomes easy; a user reporting a checkout failure can be traced to specific service slowdowns or timeouts.

Retention Policies: Managing Log Storage

Establish a retention policy to manage log storage costs. Different log types have varying retention needs:

  • Error logs: 90 days
  • Debug logs: 7 days
  • Security audit logs: 1 year

Set up the policy before excessive costs arise.

Securing Your Logs: Encryption and Access Control

Logs often contain sensitive data (user IDs, IP addresses, database queries). Protect them using:

  1. Encryption in transit: Protect logs during transfer.
  2. Encryption at rest: Protect logs while stored.
  3. Access control: Restrict access based on roles (e.g., junior developers see basic logs, security teams have full access). Some log managers provide audit logging to track access.

What NOT to Log: Protecting Sensitive Data

Never log sensitive data (passwords, credit card numbers, API keys) in plain text. Use techniques like:

  • Redaction: Remove or replace sensitive information.
  • Tokenization: Replace sensitive data with non-sensitive tokens.
  • Data masking: Partially obscure sensitive data.

Filtering and redaction in your logging pipeline can prevent sensitive data from ever reaching storage; the OpenTelemetry Collector, for example, can apply such transformations as logs pass through it.
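
Redaction can also happen in the application itself, before logs ever leave the process. A sketch using slog's LogValuer interface, where a wrapper type controls its own rendering (the key value is made up):

package main

import (
    "log/slog"
    "os"
)

// Secret wraps a sensitive string; LogValue controls how it renders,
// so the raw value never reaches the log output.
type Secret string

func (Secret) LogValue() slog.Value {
    return slog.StringValue("[REDACTED]")
}

func main() {
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    apiKey := Secret("sk-live-abc123") // illustrative value
    logger.Info("calling payment provider",
        "provider", "example-payments",
        "api_key", apiKey, // rendered as "[REDACTED]"
    )
}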

Performance Impact of Logging

Logging adds overhead. To minimize performance impact:

  1. Use efficient logging libraries (e.g., Go’s slog); see the sketch after this list.
  2. Use log sampling in high-traffic paths.
  3. Log to a separate disk partition.
  4. Conduct load tests to identify and address logging bottlenecks.
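
One hedged sketch of the first point: pair slog with a buffered writer so a hot loop doesn't pay a syscall per log line (the buffer size is illustrative; note that buffered entries can be lost in a crash, so flush deliberately on shutdown):

package main

import (
    "bufio"
    "log/slog"
    "os"
)

func main() {
    // Buffer writes so each entry does not trigger its own syscall.
    buf := bufio.NewWriterSize(os.Stdout, 64*1024) // 64 KiB, illustrative
    defer buf.Flush()                              // flush remaining entries on shutdown

    logger := slog.New(slog.NewJSONHandler(buf, nil))

    for i := 0; i < 1000; i++ {
        logger.Info("hot path event", "iteration", i)
    }
}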

Logs vs. Metrics: The Right Tool for the Job

Logs describe what happened; metrics describe how often things happen. Use logs for debugging, and metrics for real-time monitoring and alerting. Metrics allow you to spot trends and proactively address issues before they escalate into incidents.
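
To make the split concrete, here is a hedged sketch pairing slog with the widely used Prometheus Go client (an external module; the metric name and port are illustrative):

package main

import (
    "log/slog"
    "net/http"
    "os"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter answers "how often", which is what dashboards and alerts need.
var paymentFailures = promauto.NewCounter(prometheus.CounterOpts{
    Name: "payment_failures_total",
    Help: "Total number of failed payment attempts.",
})

func main() {
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    // The log captures the full story of one event, for debugging.
    logger.Error("payment_processing_error",
        "user_id", "12345",
        "error", "invalid_credit_card",
    )

    // The metric captures the rate of such events, for alerting.
    paymentFailures.Inc()

    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":2112", nil)
}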

This post is licensed under CC BY 4.0 by the author.