Linux Troubleshooting Checklist - A Professional Guide

A comprehensive guide to systematic Linux troubleshooting, covering system health checks, performance analysis, and standardized logging practices.

“Effective troubleshooting isn’t about knowing solutions—it’s about following a systematic process to discover them.”

🧠 Systematic Approach to Troubleshooting

When a system fails, how you approach the problem reveals your level of experience and professionalism.

| Beginner | Professional |
| --- | --- |
| Randomly tries solutions | Follows a methodical process |
| Relies on memorized fixes | Understands the underlying system |
| Gets frustrated easily | Remains calm and systematic |
| Reinstalls as first resort | Identifies root cause before fixing |
| Focuses only on symptoms | Understands cause-effect relationships |
| Overlooks documentation | Consults logs and documentation first |

The Professional Troubleshooting Framework

  1. Identify and Isolate
    • Define the problem precisely
    • Determine when it started
    • Identify affected components
  2. Gather Information
    • Check logs (system, application, security)
    • Review recent changes
    • Verify resource availability (CPU, memory, disk)
  3. Form Hypothesis
    • Based on evidence, not guesses
    • Consider multiple potential causes
    • Prioritize by likelihood and impact
  4. Test Hypothesis
    • Make one change at a time
    • Document each test and result
    • Use reversal tests when appropriate
  5. Implement Solution
    • Apply the proven fix
    • Verify full functionality
    • Document the resolution
  6. Review and Learn
    • Analyze the root cause
    • Improve monitoring for early detection
    • Share knowledge with team

Tip: Always ask, “What changed?” Most problems occur after a system alteration, whether obvious (like a software update) or subtle (like a configuration change).
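
In practice, the "what changed?" question can often be answered in seconds from the shell. A minimal sketch, assuming a Debian/Ubuntu-style system with systemd (log paths differ on other distributions):

```shell
#!/bin/bash
# what-changed.sh -- quick survey of recent changes (illustrative paths)

echo "== Files modified in /etc in the last 24h =="
find /etc -type f -mtime -1 2>/dev/null

echo "== Packages installed/upgraded recently (dpkg-based systems) =="
grep -hE ' (install|upgrade) ' /var/log/dpkg.log 2>/dev/null | tail -10

echo "== Recent logins =="
last -n 5 2>/dev/null

echo "== Errors since last boot =="
journalctl -p err -b --no-pager 2>/dev/null | tail -5
```

Run it as the first step of any investigation; each section points at a different class of "something changed".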

🔍 System Health Checks

Before diving into specific problems, professional Linux administrators perform comprehensive health checks to establish a baseline and identify issues.

Essential System Status Commands

# Overall system status
uptime
vmstat 1 5
top -b -n 1

# Memory usage
free -h
cat /proc/meminfo

# Disk usage
df -h
du -sh /* 2>/dev/null | sort -hr

# CPU information
cat /proc/cpuinfo
lscpu

# Process status
ps aux --sort=-%cpu | head -10
ps aux --sort=-%mem | head -10

Quick Health Check Script

#!/bin/bash
# quick-health-check.sh
# Usage: ./quick-health-check.sh [output_file]

OUTPUT=${1:-health_$(hostname)_$(date +%Y%m%d_%H%M%S).log}

{
  echo "========== SYSTEM INFO =========="
  echo "Date: $(date)"
  echo "Hostname: $(hostname)"
  echo "Kernel: $(uname -r)"
  echo "Uptime: $(uptime)"
  
  echo -e "\n========== CPU USAGE =========="
  echo "Load average:"
  uptime
  echo -e "\nTop CPU processes:"
  ps aux --sort=-%cpu | head -6
  
  echo -e "\n========== MEMORY USAGE =========="
  free -h
  echo -e "\nTop memory processes:"
  ps aux --sort=-%mem | head -6
  
  echo -e "\n========== DISK USAGE =========="
  df -h
  echo -e "\nLargest directories:"
  du -sh /* 2>/dev/null | sort -hr | head -5
  
  echo -e "\n========== NETWORK STATUS =========="
  echo "Network interfaces:"
  ip -br addr
  echo -e "\nOpen connections:"
  ss -tuln
  
  echo -e "\n========== RECENT ERRORS =========="
  echo "Last 5 system errors:"
  journalctl -p err -n 5 --no-pager
} | tee "$OUTPUT"

echo "Health check complete. Results saved to $OUTPUT"

Advanced System Monitoring Commands

# System activity reports
sar -u 1 5         # CPU usage
sar -r 1 5         # Memory usage
sar -b 1 5         # I/O statistics
sar -n DEV 1 5     # Network statistics

# Process monitoring
htop               # Interactive process viewer
atop               # For system bottleneck analysis
lsof               # List open files

# Disk I/O monitoring
iostat -xz 1 5     # Extended disk statistics
iotop              # I/O monitoring by process

Warning: Some monitoring tools like htop, atop, and iotop may need to be installed separately as they’re not always included in minimal installations.

🛠️ Common Problems and Solutions

CPU Issues

| Symptom | Possible Causes | Investigation Commands | Common Solutions |
| --- | --- | --- | --- |
| High CPU usage | Runaway process | `ps aux --sort=-%cpu \| head` | Kill/restart process |
| | Malware/cryptominer | `ps aux \| grep -iE "crypto\|mine\|coin"` | Identify and remove |
| | Service misconfiguration | `systemctl status <service>` | Reconfigure service |
| System slowness | Too many processes | `pstree -p` | Optimize startup services |
| | Resource contention | `nice`, `ionice`, `cpulimit` | Apply resource constraints |
| | Kernel issues | `dmesg \| grep -i error` | Update kernel |

Example: CPU Troubleshooting

# Find CPU-hungry processes
ps aux --sort=-%cpu | head -10

# Check if a specific process is using too much CPU
top -p $(pgrep -d ',' apache2)

# Limit CPU usage for a process
cpulimit -p 1234 -l 50  # Limit PID 1234 to 50% CPU

# Set process niceness (lower priority)
renice -n 10 -p 1234

Memory Issues

| Symptom | Possible Causes | Investigation Commands | Common Solutions |
| --- | --- | --- | --- |
| Out of memory | Memory leak | `ps aux --sort=-%mem` | Restart leaking service |
| | Swap misconfiguration | `swapon --show`, `cat /proc/swaps` | Adjust swappiness |
| | Too many processes | `pmap -x <pid>` | Optimize processes |
| High RAM usage | Caching | `free -h`, `cat /proc/meminfo` | Understand cache behavior |
| | Database issues | `mysqltuner` | Tune database settings |
| | Memory fragmentation | `cat /proc/buddyinfo` | Restart service or system |

Example: Memory Troubleshooting

# Check memory usage and cache
free -h
cat /proc/meminfo | grep -E 'Mem|Cache|Swap'

# Find memory-consuming processes
ps aux --sort=-%mem | head -10

# Examine detailed memory usage of a process
pmap -x $(pgrep mysql)

# Drop the page cache if needed (requires root; sync dirty pages first)
sync && echo 1 > /proc/sys/vm/drop_caches

# Monitor memory over time
watch -n 1 'free -h'

Disk Issues

| Symptom | Possible Causes | Investigation Commands | Common Solutions |
| --- | --- | --- | --- |
| Disk full | Large files | `du -sh /* \| sort -hr` | Clean up, compress, or archive |
| | Temp files | `find /tmp -type f -size +100M` | Remove temp files |
| | Log files | `find /var/log -type f -size +100M` | Rotate and compress logs |
| | Orphaned files | `lsof +L1` | Remove orphaned files |
| High I/O | Inefficient process | `iotop` | Optimize process I/O |
| | RAID issues | `cat /proc/mdstat` | Check RAID status |
| | Filesystem fragmentation | `e2freefrag /dev/sda1` | Defragment if needed |

Example: Disk Troubleshooting

# Find largest files
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null | sort -k5hr

# Find largest directories
du -sh /* 2>/dev/null | sort -hr | head -10

# Check inodes usage
df -i

# Check for open deleted files still consuming space
lsof +L1 | grep 'deleted'

# Check disk I/O by process
iotop -o

# Find recently modified large files
find / -type f -size +10M -mtime -7 -ls 2>/dev/null

Service Issues

| Symptom | Possible Causes | Investigation Commands | Common Solutions |
| --- | --- | --- | --- |
| Service won’t start | Misconfiguration | `systemctl status <service>` | Fix configuration |
| | Dependencies | `systemctl list-dependencies <service>` | Ensure dependencies run |
| | Permission issues | `journalctl -u <service>` | Fix permissions |
| Service crashes | Resource limits | `ulimit -a` | Adjust ulimit |
| | Bugs | `journalctl -u <service> -p err` | Update software |
| | Incompatibilities | `ldd $(which <binary>)` | Check library dependencies |

Example: Service Troubleshooting

# Check service status
systemctl status nginx

# View service logs
journalctl -u nginx --since "1 hour ago"

# Check configuration syntax
nginx -t

# Examine service dependencies
systemctl list-dependencies nginx

# Check process limits
cat /proc/$(pidof nginx)/limits

# Check file descriptor usage
lsof -p $(pidof nginx) | wc -l

📊 Log Analysis Techniques

Logs are the first place professionals look when troubleshooting. Knowing how to effectively analyze logs is a core skill.

Key Log Files and Their Purpose

| Log File | Content | Common Issues |
| --- | --- | --- |
| `/var/log/syslog` or `/var/log/messages` | General system messages | System-wide issues |
| `/var/log/auth.log` or `/var/log/secure` | Authentication events | Login failures, security |
| `/var/log/kern.log` | Kernel messages | Hardware, driver issues |
| `/var/log/dmesg` | Boot-time messages | Boot problems, hardware detection |
| `/var/log/apache2/` or `/var/log/httpd/` | Web server logs | Web application issues |
| `/var/log/mysql/` | Database logs | Database performance, errors |
| `/var/log/apt/` or `/var/log/yum.log` | Package management | Installation problems |
| `/var/log/fail2ban.log` | Intrusion prevention | Security violations |

Effective Log Filtering

# Find error messages
grep -i error /var/log/syslog

# Show context around errors
grep -i -A3 -B2 "fatal" /var/log/apache2/error.log

# Filter by timestamp
grep "Apr 18 10:" /var/log/auth.log

# Filter by multiple patterns
grep -E "error|warning|critical" /var/log/syslog

# Filter out noise
grep -v "irrelevant pattern" /var/log/application.log

# Find recent authentication failures
grep "Failed password" /var/log/auth.log | tail -20

Advanced Log Analysis Tools

# Use journalctl for systemd logs
journalctl -u nginx --since "1 hour ago"
journalctl -p emerg..err --since today

# Use logwatch for summary reports
logwatch --service apache --range yesterday --detail high

# Use multitail to watch multiple logs
multitail /var/log/nginx/error.log /var/log/mysql/error.log

# Use lnav for interactive log viewing
lnav /var/log/syslog /var/log/auth.log

Tip: Regularly reviewing logs even when there are no apparent issues helps you understand what “normal” looks like. This baseline knowledge is invaluable when troubleshooting.
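
One low-effort way to build that baseline is to record message counts per severity every day. A sketch for systemd-journald systems (the chosen priorities are illustrative):

```shell
# Count yesterday's journal messages at each severity level
for p in err warning notice; do
    count=$(journalctl -p "$p..$p" --since yesterday --until today -q 2>/dev/null | wc -l)
    printf '%-8s %s\n' "$p" "$count"
done
```

Appending this output to a dated file gives a trend line: a sudden jump in warnings is often the earliest sign of trouble.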

📈 Performance Troubleshooting

System performance issues require a structured approach to identify bottlenecks.

Performance Analysis Workflow

  1. Establish a baseline
    • Document normal performance metrics
    • Use historical data if available
  2. Identify symptoms
    • Slowness during specific operations
    • High load averages
    • Poor response times
  3. Check the four key resources
    • CPU
    • Memory
    • Disk I/O
    • Network
  4. Analyze processes
    • Which processes are consuming resources
    • Are there unexpected processes
    • Process relationships (parent-child)
  5. Examine detailed metrics
    • System calls
    • Context switches
    • File descriptors
    • Thread counts
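
Step 3 can be done in one screenful before reaching for heavier tools; a sketch using only commonly available commands (the 80% disk threshold is an arbitrary example):

```shell
# One-screen snapshot of CPU, memory, disk, and network
echo "== CPU (load) =="
uptime

echo "== Memory =="
free -m | awk 'NR==2 {printf "used: %d MiB of %d MiB\n", $3, $2}'

echo "== Disk =="
df -h | awk 'NR==1 || $5+0 >= 80'    # header plus filesystems at >=80% use

echo "== Network =="
ss -s | head -4                       # socket summary
```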

Performance Monitoring Tools

Tool Purpose Example Usage
top/htop Real-time process monitoring htop -d 5
vmstat Virtual memory statistics vmstat 5 10
iostat I/O statistics iostat -xz 5 10
mpstat Multi-processor statistics mpstat -P ALL 5 10
sar System activity reporter sar -n DEV 5 10
strace Trace system calls strace -p <pid>
perf Performance analysis perf top -p <pid>
netstat/ss Network statistics ss -tunapl
iotop I/O monitoring iotop -oPa
nmon Performance monitoring nmon

CPU Profiling

# Basic CPU profiling
perf record -g -p <pid> -- sleep 30
perf report

# CPU flame graph (requires FlameGraph tools)
perf record -g -p <pid> -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu-flame.svg

# Process CPU time breakdown
pidstat -t -p <pid> 1 10

Memory Profiling

# Memory usage patterns
valgrind --tool=massif --massif-out-file=massif.out <program>
ms_print massif.out

# Process memory maps
pmap -x <pid>

# Track memory allocations
mtrace <program>

Disk I/O Profiling

# Disk I/O by process
iotop -aoP

# File system latency
ioping -c 10 /path/to/directory

# Block I/O tracing
blktrace -d /dev/sda -o - | blkparse -i -

Network Profiling

# Network traffic by process
nethogs

# Packet capture and analysis
tcpdump -i eth0 port 80 -w capture.pcap
wireshark capture.pcap

# Network connections by process
ss -tp

Note: Many of these specialized tools may need to be installed separately with your package manager.

🌐 Network Diagnostics

Network issues are common and can be challenging to troubleshoot without a systematic approach.

Network Troubleshooting Checklist

  1. Verify physical connectivity
    • Check cables, link lights, etc.
    • Verify interface status with ip link
  2. Check IP configuration
    • Proper IP address, subnet, gateway
    • DNS server configuration
  3. Test basic connectivity
    • Local network with ping
    • External networks with traceroute/mtr
  4. Check network services
    • Service status and ports
    • Firewall rules
  5. Analyze packet flow
    • Capture and analyze traffic
    • Check for packet loss or latency
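
The checklist above can be scripted bottom-up. A sketch in which the interface name and test targets are examples to adapt:

```shell
#!/bin/bash
# net-check.sh -- bottom-up connectivity test (interface/targets are examples)
IFACE=${1:-eth0}
GATEWAY=$(ip route 2>/dev/null | awk '/^default/ {print $3; exit}')

ip link show "$IFACE" 2>/dev/null | grep -q 'state UP' \
    && echo "link:     UP" || echo "link:     DOWN or missing"

[ -n "$GATEWAY" ] && ping -c1 -W2 "$GATEWAY" >/dev/null 2>&1 \
    && echo "gateway:  reachable ($GATEWAY)" || echo "gateway:  unreachable"

ping -c1 -W2 8.8.8.8 >/dev/null 2>&1 \
    && echo "internet: reachable" || echo "internet: unreachable"

getent hosts google.com >/dev/null 2>&1 \
    && echo "dns:      resolving" || echo "dns:      failing"
```

The first failing line tells you which layer to investigate next.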

Essential Network Diagnostic Commands

# Check network interfaces
ip addr
ip link

# Check routing
ip route
route -n

# DNS resolution
nslookup google.com
dig google.com

# Connectivity testing
ping -c 4 8.8.8.8
traceroute google.com
mtr google.com

# Port checking
nc -zv google.com 443
telnet google.com 443

# Open connections
ss -tunapl
netstat -tunapl

# Packet capture
tcpdump -i eth0 host 8.8.8.8

Common Network Issues and Solutions

| Symptom | Possible Causes | Investigation Commands | Common Solutions |
| --- | --- | --- | --- |
| No connectivity | Wrong IP/subnet | `ip addr`, `ip route` | Fix configuration |
| | Interface down | `ip link` | Bring up interface |
| | Routing issue | `traceroute`, `mtr` | Check/fix routing table |
| DNS problems | Wrong DNS servers | `cat /etc/resolv.conf` | Fix DNS configuration |
| | DNS service issues | `dig @8.8.8.8 google.com` | Try alternative DNS |
| | Caching issues | `systemctl restart systemd-resolved` | Restart DNS service |
| Can’t connect to service | Service down | `systemctl status <service>` | Start the service |
| | Firewall blocking | `iptables -L`, `ufw status` | Adjust firewall rules |
| | Wrong port | `ss -tunapl \| grep <service>` | Configure correct port |

Network Troubleshooting Examples

# Troubleshoot DNS
dig +trace google.com
host -v google.com
resolvectl status

# Troubleshoot routing
ip route get 8.8.8.8
traceroute -n 8.8.8.8
mtr -n 8.8.8.8

# Troubleshoot network performance
iperf3 -c iperf.server.com
ping -c 100 -i 0.2 8.8.8.8 | tail -1 | awk -F'/' '{print $5}'   # average RTT in ms

# Check for packet loss
ping -c 100 8.8.8.8 | grep -oP '\d+(?=% packet loss)'

# Find which process is using a port
lsof -i :80
ss -tunapl | grep :80

Tip: When troubleshooting network issues, work from the lowest layer up: physical → link → network → transport → application.

📝 Documentation Practices

Professional troubleshooters document their processes and findings methodically.

Why Document?

  1. Future reference - Similar issues can recur
  2. Knowledge sharing - Help others learn from your experience
  3. Audit trail - Record what changed and why
  4. Process improvement - Identify patterns and systemic issues
  5. Handover - Enable others to continue your work

What to Document

| Beginner | Professional |
| --- | --- |
| Just the solution | Problem description, diagnosis steps, and solution |
| Commands used | Commands with expected and actual outputs |
| Simplified steps | Detailed process with reasoning |
| Personal notes | Shareable, searchable documentation |
| No context | Timestamps and system state included |

Documentation Template

# Incident Report: [Brief Description]

## Summary
- **Date/Time**: [When the issue occurred]
- **System(s) Affected**: [Specific hosts/services]
- **Impact**: [User/business impact]
- **Root Cause**: [Brief cause statement]

## Timeline
- **[Time]**: Issue detected [how it was detected]
- **[Time]**: Initial investigation began
- **[Time]**: [Investigation step]
- **[Time]**: Root cause identified
- **[Time]**: Solution implemented
- **[Time]**: Service restored

## Investigation Details
1. **Initial symptoms observed**:

   [Output of initial commands/logs]

2. **Diagnostic steps**:
   - Checked [system component] using:
     ```
     [command and output]
     ```
   - Verified [condition] by:
     ```
     [command and output]
     ```

3. **Root cause analysis**:
   [Explanation of what caused the issue]

## Solution
1. **Immediate fix**:

   [Commands/actions taken to resolve]

2. **Verification**:

   [Commands/checks performed to verify fix]

## Prevention
- **Monitoring**: [New monitoring/alerts implemented]
- **Automation**: [Automation to prevent recurrence]
- **Process**: [Process changes recommended]

## References
- [Relevant documentation links]
- [Knowledge base articles]
- [Similar past incidents]

Documentation Best Practices

  1. Use version control for configuration files and scripts
  2. Maintain a systems journal to track changes
  3. Create a knowledge base of common issues and solutions
  4. Use standard templates for different types of documentation
  5. Include context (what, why, when, how) in all documentation
  6. Make documentation searchable with tags and categories
  7. Review and update documentation regularly
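
Point 1 can be as simple as a git repository over /etc; tools like etckeeper automate this pattern, but a manual sketch shows the idea:

```shell
# Put /etc under version control (etckeeper automates this pattern)
cd /etc
git init -q
git add -A
git commit -qm "baseline: $(hostname) $(date +%F)"

# During troubleshooting: what changed since the baseline?
git status --short      # modified/added files at a glance
git diff                # exact configuration changes
```

A versioned /etc turns "what changed?" from guesswork into a one-command answer.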

Tip: Good documentation transforms individual knowledge into team knowledge, reducing “bus factor” risk.

🔄 Recovery Tools and Techniques

When things go seriously wrong, professionals have a toolkit ready to recover systems.

Rescue Environments

| Tool | Description | When to Use |
| --- | --- | --- |
| SystemRescue | Live Linux environment | File recovery, system repair |
| GParted Live | Partition management | Disk partitioning issues |
| Boot-Repair-Disk | Boot loader repair | GRUB/boot problems |
| Clonezilla | Disk cloning | System backup/restore |
| DBAN | Disk wiping | Secure data destruction |

Common Recovery Commands

# File system check (run only on unmounted filesystems)
fsck -f /dev/sda1

# Repair boot loader
grub-install /dev/sda
update-grub

# Recover deleted files
extundelete /dev/sda1 --restore-file /path/to/file

# Rescue data from failing disk
ddrescue /dev/sda /dev/sdb rescue.log

# Mount filesystem read-only for inspection
mount -o ro /dev/sda1 /mnt/recovery

# Check disk for bad sectors
badblocks -v /dev/sda

Recovery Strategies

  1. Boot issues
    • Use rescue disk to boot
    • Check and repair boot loader
    • Examine boot logs
  2. File system corruption
    • Mount read-only if possible
    • Run fsck on unmounted filesystems
    • Check SMART status of disks
  3. Database recovery
    • Use database-specific tools (e.g., mysqlcheck)
    • Apply transaction logs
    • Restore from backups
  4. Compromised systems
    • Isolate the system
    • Collect forensic evidence
    • Scan for malware
    • Rebuild from known good state
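
For the disk-health step, SMART data often reveals a failing drive before the filesystem does. A sketch using smartmontools (the device name is an example, and the commands require root):

```shell
# Overall SMART verdict (requires the smartmontools package)
smartctl -H /dev/sda

# Attributes that most often predict imminent failure
smartctl -A /dev/sda | grep -E 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'
```

Any nonzero raw value on those three attributes is a strong signal to back up and replace the disk.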

Recovery Plan Template

# 1. Boot from rescue media
# [Instructions for specific rescue disk]

# 2. Mount filesystem read-only
mount -o ro /dev/sda1 /mnt/recover

# 3. Back up critical data
mkdir /mnt/backup
rsync -av /mnt/recover/home/ /mnt/backup/

# 4. Check filesystem
umount /mnt/recover
fsck -f /dev/sda1

# 5. Repair system files
mount /dev/sda1 /mnt/recover
chroot /mnt/recover

# 6. Repair boot loader
grub-install /dev/sda
update-grub

# 7. Verify critical services
systemctl list-unit-files --state=enabled

Warning: Always back up data before attempting system recovery. Test recovery procedures on non-critical systems before applying to production.

🔎 Preventative Measures

Professionals know that preventing problems is more efficient than fixing them.

Monitoring Tools

| Tool | Description | What to Monitor |
| --- | --- | --- |
| Nagios | Network/service monitoring | Service availability, performance |
| Prometheus | Metrics and alerting | System metrics, custom metrics |
| Grafana | Metrics visualization | Dashboards for all metrics |
| ELK Stack | Log aggregation/analysis | Centralized log management |
| Zabbix | Enterprise monitoring | Infrastructure monitoring |
| Netdata | Real-time monitoring | Low-level system metrics |

Preventative Checks

# Scheduled disk checks
tune2fs -l /dev/sda1 | grep 'Mount count'

# Check for updates
apt update && apt list --upgradable

# Check disk space growth trends
df -h >> /var/log/disk_usage.log

# Check for failed services
systemctl --failed

# Verify backups
find /backup/ -name "*.tar.gz" -mtime -1 | wc -l

# Check for compromised packages
debsums -c

Configuration Management

Using configuration management tools helps prevent configuration drift:

  1. Ansible - Agentless configuration management
  2. Puppet - Declarative configuration management
  3. Chef - Infrastructure as code
  4. Salt - Event-driven automation
  5. Terraform - Infrastructure provisioning

Regular Maintenance Schedule

| Interval | Task | Command Example |
| --- | --- | --- |
| Daily | Check disk space | `df -h` |
| | Review logs | `journalctl -p err --since yesterday` |
| | Verify backups | `ls -l /backup/daily/` |
| Weekly | Update packages | `apt update && apt upgrade` |
| | Check service status | `systemctl list-units --state=failed` |
| | Performance baseline | `sar -A > /var/log/performance/weekly.log` |
| Monthly | Full security scan | `lynis audit system` |
| | User account audit | `getent passwd \| sort` |
| | Filesystem check | `tune2fs -C 999 /dev/sda1` |
| Quarterly | Disaster recovery test | [Restore test procedure] |
| | Performance review | Compare baseline metrics |
| | Security audit | [Security checklist] |
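
The daily tasks lend themselves to a small cron-driven script; a sketch with illustrative paths:

```shell
#!/bin/bash
# daily-maintenance.sh -- illustrative automation of the daily tasks above
# (schedule e.g. via: 0 6 * * * root /usr/local/sbin/daily-maintenance.sh)
LOG="/var/log/maintenance/daily_$(date +%F).log"
mkdir -p "$(dirname "$LOG")"
{
    echo "== Disk space =="
    df -h
    echo "== Errors since yesterday =="
    journalctl -p err --since yesterday --no-pager | tail -20
    echo "== Backups newer than 24h =="
    find /backup/daily -mtime -1 2>/dev/null | wc -l
} > "$LOG" 2>&1
```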

Tip: Automate as many maintenance tasks as possible. Human-performed maintenance should focus on reviewing automated reports and addressing exceptions.

📋 Log Management and Standardization

Effective log management goes beyond just viewing logs when problems occur.

Log Rotation - Beyond Default Settings

| Beginner | Professional |
| --- | --- |
| Uses default log rotation settings | Customizes rotation based on system purpose |
| Ignores log files until disk space issues | Proactively manages log growth |
| Manually deletes old logs | Configures automated rotation policies |
| One-size-fits-all approach | Service-specific rotation strategies |

Key Log Rotation Parameters

# Example logrotate configuration
/var/log/myapp/*.log {
    daily                   # Rotate frequency
    rotate 14               # Keep 14 rotated logs
    compress                # Compress rotated logs
    delaycompress           # Delay compression by one cycle
    missingok               # Don't error if log is missing
    notifempty              # Don't rotate empty logs
    create 0640 www-data www-data  # New log's mode, owner, group
    sharedscripts           # Run scripts once per rotation
    postrotate
        systemctl reload myapp
    endscript
}
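
A configuration like this can be verified without touching any logs: `logrotate -d` performs a dry run and `-f` forces an immediate rotation for testing (the path assumes the file was saved under /etc/logrotate.d/):

```shell
# Dry run: print what logrotate would do, rotate nothing
logrotate -d /etc/logrotate.d/myapp

# Force one real rotation to verify permissions and postrotate hooks
logrotate -f /etc/logrotate.d/myapp
```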

Standardizing Log Formats

Professional logging includes standardized formats with these elements:

  1. Timestamps - ISO 8601 format (YYYY-MM-DDTHH:MM:SS.sss±ZZZZ)
  2. Severity levels - Use standard levels (ERROR, WARN, INFO, DEBUG)
  3. Source identification - Service, module, or function
  4. Correlation IDs - To track related events across services
  5. Structured data - JSON or similar for machine parsing
  6. Context - User IDs, request IDs, or other relevant context
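
From a shell script, a line meeting these conventions can be produced with `date` and `logger`; the tag `myapp` and the field names are illustrative:

```shell
# ISO 8601 timestamp + severity + source + correlation ID + key=value context
ts=$(date +%Y-%m-%dT%H:%M:%S.%3N%z)    # %3N (milliseconds) is a GNU date extension
printf '%s INFO [myapp] [request-id: %s] user_id=%s result=%s\n' \
    "$ts" "1a2b3c4d" "123" "success"

# Hand the same message to syslog with a tag and severity
logger -t myapp -p user.info "request-id=1a2b3c4d user_id=123 result=success"
```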

Example Standardized Log Format

2025-04-18T10:30:45.123+0530 INFO [web-server] [request-id: 1a2b3c4d] User authenticated: user_id=123, source_ip=192.168.1.100, result=success

Application Logging Best Practices

Configure applications to use these logging best practices:

# Example rsyslog configuration
# /etc/rsyslog.d/myapp.conf

template(name="JsonFormat" type="list") {
    constant(value="{")
    constant(value="\"timestamp\":\"")     property(name="timereported" dateFormat="rfc3339")
    constant(value="\",\"host\":\"")       property(name="hostname")
    constant(value="\",\"severity\":\"")   property(name="syslogseverity-text")
    constant(value="\",\"facility\":\"")   property(name="syslogfacility-text")
    constant(value="\",\"tag\":\"")        property(name="syslogtag" format="json")
    constant(value="\",\"message\":\"")    property(name="msg" format="json")
    constant(value="\"}")
}

# Send myapp logs to a dedicated file in JSON format
if $programname == 'myapp' then {
    action(type="omfile" file="/var/log/myapp/application.log" template="JsonFormat")
}

Centralized Logging

Professionals often implement centralized logging for easier analysis:

  1. Collection - Filebeat, Fluentd, Logstash
  2. Storage - Elasticsearch, Loki, Graylog
  3. Analysis - Kibana, Grafana
  4. Alerting - ElastAlert, Grafana alerts

Example Filebeat Configuration

# /etc/filebeat/filebeat.yml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/myapp/*.log
  json.keys_under_root: true
  json.add_error_key: true
  json.message_key: message

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "myapp-%{+YYYY.MM.dd}"

Tip: Design logs for both human readability and machine parsing. Structured logs are easier to search, filter, and analyze at scale.

📌 Final Thought

“Problems are inevitable; being unprepared is optional.”

Effective troubleshooting is not just a technical skill—it’s a mindset. By developing a systematic approach to diagnosing and resolving issues, you transform from someone who fights fires to someone who manages systems proactively.

The difference between a beginner and a professional is not just knowledge of commands or tools, but the discipline to follow methodical processes, document thoroughly, and learn from each incident.

Remember that the best troubleshooters are those who:

  1. Understand their systems deeply
  2. Follow structured approaches
  3. Document everything
  4. Learn from every incident
  5. Implement preventative measures

This checklist is your roadmap to developing that professional troubleshooting mindset.

This post is licensed under CC BY 4.0 by the author.