
Linux Troubleshooting Checklist - A Professional Guide

A comprehensive guide to systematic Linux troubleshooting, covering system health checks, performance analysis, and standardized logging practices.

“Effective troubleshooting isn’t about knowing solutions—it’s about following a systematic process to discover them.”

🧠 Systematic Approach to Troubleshooting

When a system fails, how you approach the problem reveals your level of experience and professionalism.

| Beginner | Professional |
|---|---|
| Randomly tries solutions | Follows a methodical process |
| Relies on memorized fixes | Understands the underlying system |
| Gets frustrated easily | Remains calm and systematic |
| Reinstalls as first resort | Identifies root cause before fixing |
| Focuses only on symptoms | Understands cause-effect relationships |
| Overlooks documentation | Consults logs and documentation first |

The Professional Troubleshooting Framework

  1. Identify and Isolate
    • Define the problem precisely
    • Determine when it started
    • Identify affected components
  2. Gather Information
    • Check logs (system, application, security)
    • Review recent changes
    • Verify resource availability (CPU, memory, disk)
  3. Form Hypothesis
    • Based on evidence, not guesses
    • Consider multiple potential causes
    • Prioritize by likelihood and impact
  4. Test Hypothesis
    • Make one change at a time
    • Document each test and result
    • Use reversal tests when appropriate
  5. Implement Solution
    • Apply the proven fix
    • Verify full functionality
    • Document the resolution
  6. Review and Learn
    • Analyze the root cause
    • Improve monitoring for early detection
    • Share knowledge with team

Tip: Always ask, “What changed?” Most problems occur after a system alteration, whether obvious (like a software update) or subtle (like a configuration change).
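
A quick way to answer that question is a short "recent changes" sweep. A minimal sketch (the package log path assumes a Debian/Ubuntu-style system; adjust for your distribution):

```bash
# Configuration files changed in the last 2 days
find /etc -type f -mtime -2 -ls 2>/dev/null

# Recent package installs/upgrades (Debian/Ubuntu dpkg log; use dnf/yum history elsewhere)
grep -E " install | upgrade " /var/log/dpkg.log | tail -20

# Recent reboots and logins
last -n 10 reboot
last -n 10

# Boots known to the journal
journalctl --list-boots | tail -3
```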

🔍 System Health Checks

Before diving into specific problems, professional Linux administrators perform comprehensive health checks to establish a baseline and identify issues.

Essential System Status Commands

```bash
# Overall system status
uptime
vmstat 1 5
top -b -n 1

# Memory usage
free -h
cat /proc/meminfo

# Disk usage
df -h
du -sh /* 2>/dev/null | sort -hr

# CPU information
cat /proc/cpuinfo
lscpu

# Process status
ps aux --sort=-%cpu | head -10
ps aux --sort=-%mem | head -10
```

Quick Health Check Script

```bash
#!/bin/bash
# quick-health-check.sh
# Usage: ./quick-health-check.sh [output_file]

OUTPUT=${1:-health_$(hostname)_$(date +%Y%m%d_%H%M%S).log}

{
  echo "========== SYSTEM INFO =========="
  echo "Date: $(date)"
  echo "Hostname: $(hostname)"
  echo "Kernel: $(uname -r)"
  echo "Uptime: $(uptime)"

  echo -e "\n========== CPU USAGE =========="
  echo "Load average:"
  uptime
  echo -e "\nTop CPU processes:"
  ps aux --sort=-%cpu | head -6

  echo -e "\n========== MEMORY USAGE =========="
  free -h
  echo -e "\nTop memory processes:"
  ps aux --sort=-%mem | head -6

  echo -e "\n========== DISK USAGE =========="
  df -h
  echo -e "\nLargest directories:"
  du -sh /* 2>/dev/null | sort -hr | head -5

  echo -e "\n========== NETWORK STATUS =========="
  echo "Network interfaces:"
  ip -br addr
  echo -e "\nOpen connections:"
  ss -tuln

  echo -e "\n========== RECENT ERRORS =========="
  echo "Last 5 system errors:"
  journalctl -p err -n 5 --no-pager
} | tee "$OUTPUT"

echo "Health check complete. Results saved to $OUTPUT"
```

Advanced System Monitoring Commands

```bash
# System activity reports
sar -u 1 5         # CPU usage
sar -r 1 5         # Memory usage
sar -b 1 5         # I/O statistics
sar -n DEV 1 5     # Network statistics

# Process monitoring
htop               # Interactive process viewer
atop               # System bottleneck analysis
lsof               # List open files

# Disk I/O monitoring
iostat -xz 1 5     # Extended disk statistics
iotop              # I/O monitoring by process
```

Warning: Some monitoring tools like htop, atop, and iotop may need to be installed separately as they’re not always included in minimal installations.
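
For example, on common distributions the following should pull in these tools along with the sysstat suite (sar, iostat, mpstat); package names are the usual ones but may vary:

```bash
# Debian/Ubuntu
sudo apt install htop atop iotop sysstat

# Fedora/RHEL
sudo dnf install htop atop iotop sysstat
```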

🛠️ Common Problems and Solutions

CPU Issues

| Symptom | Possible Causes | Investigation Commands | Common Solutions |
|---|---|---|---|
| High CPU usage | Runaway process | `ps aux --sort=-%cpu \| head` | Kill/restart process |
|  | Malware/cryptominer | `ps aux \| grep -iE "crypto\|mine\|coin"` | Identify and remove |
|  | Service misconfiguration | `systemctl status <service>` | Reconfigure service |
| System slowness | Too many processes | `pstree -p` | Optimize startup services |
|  | Resource contention | `nice`, `ionice`, `cpulimit` | Apply resource constraints |
|  | Kernel issues | `dmesg \| grep -i error` | Update kernel |

Example: CPU Troubleshooting

```bash
# Find CPU-hungry processes
ps aux --sort=-%cpu | head -10

# Check if a specific process is using too much CPU
top -p $(pgrep -d ',' apache2)

# Limit CPU usage for a process
cpulimit -p 1234 -l 50  # Limit PID 1234 to 50% CPU

# Set process niceness (lower priority)
renice +10 -p 1234
```

Memory Issues

| Symptom | Possible Causes | Investigation Commands | Common Solutions |
|---|---|---|---|
| Out of memory | Memory leak | `ps aux --sort=-%mem` | Restart leaking service |
|  | Swap misconfiguration | `swapon --show`, `cat /proc/swaps` | Adjust swappiness |
|  | Too many processes | `pmap -x <pid>` | Optimize processes |
| High RAM usage | Caching | `free -h`, `cat /proc/meminfo` | Understand cache behavior |
|  | Database issues | `mysqltuner` | Tune database settings |
|  | Memory fragmentation | `cat /proc/buddyinfo` | Restart service or system |

Example: Memory Troubleshooting

```bash
# Check memory usage and cache
free -h
grep -E 'Mem|Cache|Swap' /proc/meminfo

# Find memory-consuming processes
ps aux --sort=-%mem | head -10

# Examine detailed memory usage of a process
pmap -x $(pgrep mysql)

# Empty page cache if needed (requires root; flush dirty pages first)
sync; echo 1 > /proc/sys/vm/drop_caches

# Monitor memory over time
watch -n 1 'free -h'
```

Disk Issues

| Symptom | Possible Causes | Investigation Commands | Common Solutions |
|---|---|---|---|
| Disk full | Large files | `du -sh /* \| sort -hr` | Clean up, compress, or archive |
|  | Temp files | `find /tmp -type f -size +100M` | Remove temp files |
|  | Log files | `find /var/log -type f -size +100M` | Rotate logs, compress logs |
|  | Orphaned files | `lsof +L1` | Remove orphaned files |
| High I/O | Inefficient process | `iotop` | Optimize process I/O |
|  | RAID issues | `cat /proc/mdstat` | Check RAID status |
|  | Filesystem fragmentation | `e2freefrag /dev/sda1` | Defragment if needed |

Example: Disk Troubleshooting

```bash
# Find largest files
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null | sort -k5hr

# Find largest directories
du -sh /* 2>/dev/null | sort -hr | head -10

# Check inode usage
df -i

# Check for open deleted files still consuming space
lsof +L1 | grep 'deleted'

# Check disk I/O by process
iotop -o

# Find recently modified large files
find / -type f -size +10M -mtime -7 -ls 2>/dev/null
```

Service Issues

| Symptom | Possible Causes | Investigation Commands | Common Solutions |
|---|---|---|---|
| Service won’t start | Misconfiguration | `systemctl status <service>` | Fix configuration |
|  | Dependencies | `systemctl list-dependencies <service>` | Ensure dependencies run |
|  | Permission issues | `journalctl -u <service>` | Fix permissions |
| Service crashes | Resource limits | `ulimit -a` | Adjust ulimit |
|  | Bugs | `journalctl -u <service> -p err` | Update software |
|  | Incompatibilities | `ldd $(which <binary>)` | Check library dependencies |

Example: Service Troubleshooting

```bash
# Check service status
systemctl status nginx

# View service logs
journalctl -u nginx --since "1 hour ago"

# Check configuration syntax
nginx -t

# Examine service dependencies
systemctl list-dependencies nginx

# Check process limits (master process)
cat /proc/$(pgrep -o nginx)/limits

# Check file descriptor usage (all nginx processes)
lsof -p "$(pgrep -d, nginx)" | wc -l
```

📊 Log Analysis Techniques

Logs are the first place professionals look when troubleshooting. Knowing how to effectively analyze logs is a core skill.

Key Log Files and Their Purpose

| Log File | Content | Common Issues |
|---|---|---|
| `/var/log/syslog` or `/var/log/messages` | General system messages | System-wide issues |
| `/var/log/auth.log` or `/var/log/secure` | Authentication events | Login failures, security |
| `/var/log/kern.log` | Kernel messages | Hardware, driver issues |
| `/var/log/dmesg` | Boot-time messages | Boot problems, hardware detection |
| `/var/log/apache2/` or `/var/log/httpd/` | Web server logs | Web application issues |
| `/var/log/mysql/` | Database logs | Database performance, errors |
| `/var/log/apt/` or `/var/log/yum.log` | Package management | Installation problems |
| `/var/log/fail2ban.log` | Intrusion prevention | Security violations |

Effective Log Filtering

```bash
# Find error messages
grep -i error /var/log/syslog

# Show context around errors
grep -i -A3 -B2 "fatal" /var/log/apache2/error.log

# Filter by timestamp
grep "Apr 18 10:" /var/log/auth.log

# Filter by multiple patterns
grep -E "error|warning|critical" /var/log/syslog

# Filter out noise
grep -v "irrelevant pattern" /var/log/application.log

# Find recent authentication failures
grep "Failed password" /var/log/auth.log | tail -20
```

Advanced Log Analysis Tools

```bash
# Use journalctl for systemd logs
journalctl -u nginx --since "1 hour ago"
journalctl -p err..emerg --since today

# Use logwatch for summary reports
logwatch --service apache --range yesterday --detail high

# Use multitail to watch multiple logs
multitail /var/log/nginx/error.log /var/log/mysql/error.log

# Use lnav for interactive log viewing
lnav /var/log/syslog /var/log/auth.log
```

Tip: Regularly reviewing logs even when there are no apparent issues helps you understand what “normal” looks like. This baseline knowledge is invaluable when troubleshooting.

📈 Performance Troubleshooting

System performance issues require a structured approach to identify bottlenecks.

Performance Analysis Workflow

  1. Establish a baseline
    • Document normal performance metrics
    • Use historical data if available
  2. Identify symptoms
    • Slowness during specific operations
    • High load averages
    • Poor response times
  3. Check the four key resources
    • CPU
    • Memory
    • Disk I/O
    • Network
  4. Analyze processes
    • Which processes are consuming resources
    • Are there unexpected processes
    • Process relationships (parent-child)
  5. Examine detailed metrics
    • System calls
    • Context switches
    • File descriptors
    • Thread counts
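
The first three steps of this workflow can be bootstrapped with a small snapshot script. A minimal sketch (the script name and output path are illustrative; the sysstat tools are assumed to be installed):

```bash
#!/bin/bash
# perf-snapshot.sh - capture a baseline of the four key resources
OUT=perf_$(hostname)_$(date +%Y%m%d_%H%M%S).log
{
  echo "== CPU ==";      mpstat -P ALL 1 3
  echo "== Memory ==";   free -h; vmstat 1 3
  echo "== Disk I/O =="; iostat -xz 1 3
  echo "== Network ==";  sar -n DEV 1 3
} > "$OUT"
echo "Baseline snapshot written to $OUT"
```

Run it during normal operation to establish the baseline, then again while the problem is occurring and compare the two outputs.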

Performance Monitoring Tools

| Tool | Purpose | Example Usage |
|---|---|---|
| `top`/`htop` | Real-time process monitoring | `htop -d 5` |
| `vmstat` | Virtual memory statistics | `vmstat 5 10` |
| `iostat` | I/O statistics | `iostat -xz 5 10` |
| `mpstat` | Multi-processor statistics | `mpstat -P ALL 5 10` |
| `sar` | System activity reporter | `sar -n DEV 5 10` |
| `strace` | Trace system calls | `strace -p <pid>` |
| `perf` | Performance analysis | `perf top -p <pid>` |
| `netstat`/`ss` | Network statistics | `ss -tunapl` |
| `iotop` | I/O monitoring | `iotop -oPa` |
| `nmon` | Performance monitoring | `nmon` |

CPU Profiling

```bash
# Basic CPU profiling
perf record -g -p <pid> -- sleep 30
perf report

# CPU flame graph (requires FlameGraph tools)
perf record -g -p <pid> -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu-flame.svg

# Process CPU time breakdown
pidstat -t -p <pid> 1 10
```

Memory Profiling

```bash
# Memory usage patterns
valgrind --tool=massif --massif-out-file=massif.out <program>
ms_print massif.out

# Process memory maps
pmap -x <pid>

# Track memory allocations (the program must call mtrace() and
# MALLOC_TRACE must point at the trace file it writes)
mtrace <program> <trace_file>
```

Disk I/O Profiling

```bash
# Disk I/O by process
iotop -aoP

# File system latency
ioping -c 10 /path/to/directory

# Block I/O tracing
blktrace -d /dev/sda -o - | blkparse -i -
```

Network Profiling

```bash
# Network traffic by process
nethogs

# Packet capture and analysis
tcpdump -i eth0 port 80 -w capture.pcap
wireshark capture.pcap

# Network connections by process
ss -tp
```

Note: Many of these specialized tools may need to be installed separately with your package manager.

🌐 Network Diagnostics

Network issues are common and can be challenging to troubleshoot without a systematic approach.

Network Troubleshooting Checklist

  1. Verify physical connectivity
    • Check cables, link lights, etc.
    • Verify interface status with ip link
  2. Check IP configuration
    • Proper IP address, subnet, gateway
    • DNS server configuration
  3. Test basic connectivity
    • Local network with ping
    • External networks with traceroute/mtr
  4. Check network services
    • Service status and ports
    • Firewall rules
  5. Analyze packet flow
    • Capture and analyze traffic
    • Check for packet loss or latency

Essential Network Diagnostic Commands

```bash
# Check network interfaces
ip addr
ip link

# Check routing
ip route
route -n

# DNS resolution
nslookup google.com
dig google.com

# Connectivity testing
ping -c 4 8.8.8.8
traceroute google.com
mtr google.com

# Port checking
nc -zv google.com 443
telnet google.com 443

# Open connections
ss -tunapl
netstat -tunapl

# Packet capture
tcpdump -i eth0 host 8.8.8.8
```

Common Network Issues and Solutions

| Symptom | Possible Causes | Investigation Commands | Common Solutions |
|---|---|---|---|
| No connectivity | Wrong IP/subnet | `ip addr`, `ip route` | Fix configuration |
|  | Interface down | `ip link` | Bring up interface |
|  | Routing issue | `traceroute`, `mtr` | Check/fix routing table |
| DNS problems | Wrong DNS servers | `cat /etc/resolv.conf` | Fix DNS configuration |
|  | DNS service issues | `dig @8.8.8.8 google.com` | Try alternative DNS |
|  | Caching issues | `systemctl restart systemd-resolved` | Restart DNS service |
| Can’t connect to service | Service down | `systemctl status <service>` | Start the service |
|  | Firewall blocking | `iptables -L`, `ufw status` | Adjust firewall rules |
|  | Wrong port | `ss -tunapl \| grep <service>` | Configure correct port |

Network Troubleshooting Examples

```bash
# Troubleshoot DNS
dig +trace google.com
host -v google.com
resolvectl status

# Troubleshoot routing
ip route get 8.8.8.8
traceroute -n 8.8.8.8
mtr -n 8.8.8.8

# Troubleshoot network performance
iperf3 -c iperf.server.com
ping -c 100 -i 0.2 8.8.8.8 | tail -1 | awk -F'/' '{print $5}'   # average rtt in ms

# Check for packet loss
ping -c 100 8.8.8.8 | grep -oP '\d+(?=% packet loss)'

# Find which process is using a port
lsof -i :80
ss -tunapl | grep :80
```

Tip: When troubleshooting network issues, work from the lowest layer up: physical → link → network → transport → application.
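
A minimal sketch of that bottom-up walk (the interface name, target host, and service hostname are placeholders to replace with your own):

```bash
#!/bin/bash
# layered-net-check.sh - walk the stack from link to application
IFACE=eth0              # placeholder interface name
TARGET=8.8.8.8          # placeholder external host
HOST=example.com        # placeholder service hostname

echo "== Link layer ==";     ip -br link show "$IFACE"
echo "== Network layer ==";  ip -br addr show "$IFACE"; ip route get "$TARGET"
echo "== Reachability ==";   ping -c 3 "$TARGET"
echo "== DNS ==";            dig +short "$HOST"
echo "== Transport ==";      nc -zv -w 3 "$HOST" 443
echo "== Application ==";    curl -sS -o /dev/null -w "HTTP %{http_code}\n" "https://$HOST"
```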

📝 Documentation Practices

Professional troubleshooters document their processes and findings methodically.

Why Document?

  1. Future reference - Similar issues can recur
  2. Knowledge sharing - Help others learn from your experience
  3. Audit trail - Record what changed and why
  4. Process improvement - Identify patterns and systemic issues
  5. Handover - Enable others to continue your work

What to Document

| Beginner | Professional |
|---|---|
| Just the solution | Problem description, diagnosis steps, and solution |
| Commands used | Commands with expected and actual outputs |
| Simplified steps | Detailed process with reasoning |
| Personal notes | Shareable, searchable documentation |
| Without context | Including timestamps and system state |

Documentation Template

````markdown
# Incident Report: [Brief Description]

## Summary
- **Date/Time**: [When the issue occurred]
- **System(s) Affected**: [Specific hosts/services]
- **Impact**: [User/business impact]
- **Root Cause**: [Brief cause statement]

## Timeline
- **[Time]**: Issue detected [how it was detected]
- **[Time]**: Initial investigation began
- **[Time]**: [Investigation step]
- **[Time]**: Root cause identified
- **[Time]**: Solution implemented
- **[Time]**: Service restored

## Investigation Details
1. **Initial symptoms observed**:

   ```
   [Output of initial commands/logs]
   ```

2. **Diagnostic steps**:
   - Checked [system component] using:
     ```
     [command and output]
     ```
   - Verified [condition] by:
     ```
     [command and output]
     ```

3. **Root cause analysis**:
   [Explanation of what caused the issue]

## Solution
1. **Immediate fix**:

   ```
   [Commands/actions taken to resolve]
   ```

2. **Verification**:

   ```
   [Commands/checks performed to verify fix]
   ```

## Prevention
- **Monitoring**: [New monitoring/alerts implemented]
- **Automation**: [Automation to prevent recurrence]
- **Process**: [Process changes recommended]

## References
- [Relevant documentation links]
- [Knowledge base articles]
- [Similar past incidents]
````

Documentation Best Practices

  1. Use version control for configuration files and scripts
  2. Maintain a systems journal to track changes
  3. Create a knowledge base of common issues and solutions
  4. Use standard templates for different types of documentation
  5. Include context (what, why, when, how) in all documentation
  6. Make documentation searchable with tags and categories
  7. Review and update documentation regularly
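
For the version-control practice in item 1, a minimal sketch using plain git on /etc (etckeeper is a common purpose-built alternative):

```bash
# Put /etc under version control (run as root)
cd /etc
git init
git add .
git commit -m "Baseline configuration"

# After every change, record what changed and why
git add -A
git commit -m "Describe the change and the reason for it"
```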

Tip: Good documentation transforms individual knowledge into team knowledge, reducing “bus factor” risk.

🔄 Recovery Tools and Techniques

When things go seriously wrong, professionals have a toolkit ready to recover systems.

Rescue Environments

| Tool | Description | When to Use |
|---|---|---|
| SystemRescue | Live Linux environment | File recovery, system repair |
| GParted Live | Partition management | Disk partitioning issues |
| Boot-Repair-Disk | Boot loader repair | GRUB/boot problems |
| Clonezilla | Disk cloning | System backup/restore |
| DBAN | Disk wiping | Secure data destruction |

Common Recovery Commands

```bash
# File system check (run only on unmounted filesystems)
fsck -f /dev/sda1

# Repair boot loader
grub-install /dev/sda
update-grub

# Recover deleted files
extundelete /dev/sda1 --restore-file /path/to/file

# Rescue data from a failing disk
ddrescue /dev/sda /dev/sdb rescue.log

# Mount filesystem read-only for inspection
mount -o ro /dev/sda1 /mnt/recovery

# Check disk for bad sectors
badblocks -v /dev/sda
```

Recovery Strategies

  1. Boot issues
    • Use rescue disk to boot
    • Check and repair boot loader
    • Examine boot logs
  2. File system corruption
    • Mount read-only if possible
    • Run fsck on unmounted filesystems
    • Check SMART status of disks
  3. Database recovery
    • Use database-specific tools (e.g., mysqlcheck)
    • Apply transaction logs
    • Restore from backups
  4. Compromised systems
    • Isolate the system
    • Collect forensic evidence
    • Scan for malware
    • Rebuild from known good state
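
The SMART check mentioned under filesystem corruption can be done with smartmontools (the device path is a placeholder):

```bash
# Overall health verdict
smartctl -H /dev/sda

# Full attribute dump; watch Reallocated_Sector_Ct and Current_Pending_Sector
smartctl -a /dev/sda

# Run a short self-test, then read the results a few minutes later
smartctl -t short /dev/sda
smartctl -l selftest /dev/sda
```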

Recovery Plan Template

```bash
# 1. Boot from rescue media
# [Instructions for specific rescue disk]

# 2. Mount filesystem read-only
mount -o ro /dev/sda1 /mnt/recover

# 3. Back up critical data
mkdir /mnt/backup
rsync -av /mnt/recover/home/ /mnt/backup/

# 4. Check filesystem
umount /mnt/recover
fsck -f /dev/sda1

# 5. Repair system files
mount /dev/sda1 /mnt/recover
# Bind-mount pseudo-filesystems so tools like grub-install work inside the chroot
mount --bind /dev /mnt/recover/dev
mount --bind /proc /mnt/recover/proc
mount --bind /sys /mnt/recover/sys
chroot /mnt/recover

# 6. Repair boot loader (inside the chroot)
grub-install /dev/sda
update-grub

# 7. Verify critical services
systemctl list-unit-files --state=enabled
```

Warning: Always back up data before attempting system recovery. Test recovery procedures on non-critical systems before applying to production.

🔎 Preventative Measures

Professionals know that preventing problems is more efficient than fixing them.

Monitoring Tools

| Tool | Description | What to Monitor |
|---|---|---|
| Nagios | Network/service monitoring | Service availability, performance |
| Prometheus | Metrics and alerting | System metrics, custom metrics |
| Grafana | Metrics visualization | Dashboards for all metrics |
| ELK Stack | Log aggregation/analysis | Centralized log management |
| Zabbix | Enterprise monitoring | Infrastructure monitoring |
| Netdata | Real-time monitoring | Low-level system metrics |

Preventative Checks

```bash
# Scheduled disk checks
tune2fs -l /dev/sda1 | grep 'Mount count'

# Check for updates
apt update && apt list --upgradable

# Check disk space growth trends
df -h >> /var/log/disk_usage.log

# Check for failed services
systemctl --failed

# Verify backups
find /backup/ -name "*.tar.gz" -mtime -1 | wc -l

# Check for compromised packages
debsums -c
```

Configuration Management

Using configuration management tools helps prevent configuration drift:

  1. Ansible - Agentless configuration management
  2. Puppet - Declarative configuration management
  3. Chef - Infrastructure as code
  4. Salt - Event-driven automation
  5. Terraform - Infrastructure provisioning

Regular Maintenance Schedule

| Interval | Task | Command Example |
|---|---|---|
| Daily | Check disk space | `df -h` |
|  | Review logs | `journalctl -p err -S yesterday` |
|  | Verify backups | `ls -l /backup/daily/` |
| Weekly | Update packages | `apt update && apt upgrade` |
|  | Check service status | `systemctl list-units --state=failed` |
|  | Performance baseline | `sar -A > /var/log/performance/weekly.log` |
| Monthly | Full security scan | `lynis audit system` |
|  | User account audit | `getent passwd \| sort` |
|  | Filesystem check | `tune2fs -C 999 /dev/sda1` |
| Quarterly | Disaster recovery test | [Restore test procedure] |
|  | Performance review | Compare baseline metrics |
|  | Security audit | [Security checklist] |

Tip: Automate as many maintenance tasks as possible. Human-performed maintenance should focus on reviewing automated reports and addressing exceptions.
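
A minimal sketch of automating the daily tasks with cron (the script path and log locations are illustrative):

```bash
# /etc/cron.d/daily-maintenance  (illustrative schedule)
# m   h  dom mon dow  user  command
0     6  *   *   *    root  /usr/local/sbin/quick-health-check.sh /var/log/health/daily.log
30    6  *   *   *    root  df -h >> /var/log/disk_usage.log
45    6  *   *   *    root  journalctl -p err -S yesterday --no-pager > /var/log/journal-errors-daily.log
```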

📋 Log Management and Standardization

Effective log management goes beyond just viewing logs when problems occur.

Log Rotation - Beyond Default Settings

| Beginner | Professional |
|---|---|
| Uses default log rotation settings | Customizes rotation based on system purpose |
| Ignores log files until disk space issues | Proactively manages log growth |
| Manually deletes old logs | Configures automated rotation policies |
| One-size-fits-all approach | Service-specific rotation strategies |

Key Log Rotation Parameters

```conf
# Example logrotate configuration
/var/log/myapp/*.log {
    daily                         # Rotate frequency
    rotate 14                     # Keep 14 rotated logs
    compress                      # Compress rotated logs
    delaycompress                 # Delay compression by one cycle
    missingok                     # Don't error if log is missing
    notifempty                    # Don't rotate empty logs
    create 0640 www-data www-data # Create new log with mode, owner, group
    sharedscripts                 # Run scripts once per rotation
    postrotate
        systemctl reload myapp
    endscript
}
```
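
To validate a policy like this without waiting for the schedule, logrotate can be run in debug or forced mode (the path assumes the snippet above is saved as /etc/logrotate.d/myapp):

```bash
# Dry run: show what would happen, change nothing
logrotate -d /etc/logrotate.d/myapp

# Force a rotation now, with verbose output
logrotate -vf /etc/logrotate.d/myapp
```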

Standardizing Log Formats

Professional logging includes standardized formats with these elements:

  1. Timestamps - ISO 8601 format (YYYY-MM-DDTHH:MM:SS.sss±ZZZZ)
  2. Severity levels - Use standard levels (ERROR, WARN, INFO, DEBUG)
  3. Source identification - Service, module, or function
  4. Correlation IDs - To track related events across services
  5. Structured data - JSON or similar for machine parsing
  6. Context - User IDs, request IDs, or other relevant context

Example Standardized Log Format

```text
2025-04-18T10:30:45.123+0530 INFO [web-server] [request-id: 1a2b3c4d] User authenticated: user_id=123, source_ip=192.168.1.100, result=success
```
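
A minimal shell helper that emits lines in this shape (the function name and fields are illustrative; GNU date is assumed for millisecond timestamps):

```bash
#!/bin/bash
# log_event SEVERITY SOURCE REQUEST_ID MESSAGE  (illustrative helper)
log_event() {
  local ts
  ts=$(date +%Y-%m-%dT%H:%M:%S.%3N%z)   # GNU date: milliseconds plus UTC offset
  printf '%s %s [%s] [request-id: %s] %s\n' "$ts" "$1" "$2" "$3" "$4"
}

# Usage
log_event INFO web-server 1a2b3c4d "User authenticated: user_id=123, result=success"
```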

Application Logging Best Practices

Configure applications to use these logging best practices:

```conf
# Example rsyslog configuration
# /etc/rsyslog.d/myapp.conf

template(name="JsonFormat" type="list") {
    constant(value="{")
    constant(value="\"timestamp\":\"")     property(name="timereported" dateFormat="rfc3339")
    constant(value="\",\"host\":\"")       property(name="hostname")
    constant(value="\",\"severity\":\"")   property(name="syslogseverity-text")
    constant(value="\",\"facility\":\"")   property(name="syslogfacility-text")
    constant(value="\",\"tag\":\"")        property(name="syslogtag" format="json")
    constant(value="\",\"message\":\"")    property(name="msg" format="json")
    constant(value="\"}\n")
}

# Send myapp logs to a dedicated file in JSON format
if $programname == 'myapp' then {
    action(type="omfile" file="/var/log/myapp/application.log" template="JsonFormat")
}
```

Centralized Logging

Professionals often implement centralized logging for easier analysis:

  1. Collection - Filebeat, Fluentd, Logstash
  2. Storage - Elasticsearch, Loki, Graylog
  3. Analysis - Kibana, Grafana
  4. Alerting - ElastAlert, Grafana alerts

Example Filebeat Configuration

```yaml
# /etc/filebeat/filebeat.yml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/myapp/*.log
  json.keys_under_root: true
  json.add_error_key: true
  json.message_key: message

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "myapp-%{+YYYY.MM.dd}"
```

Tip: Design logs for both human readability and machine parsing. Structured logs are easier to search, filter, and analyze at scale.

📌 Final Thought

“Problems are inevitable; being unprepared is optional.”

Effective troubleshooting is not just a technical skill—it’s a mindset. By developing a systematic approach to diagnosing and resolving issues, you transform from someone who fights fires to someone who manages systems proactively.

The difference between a beginner and a professional is not just knowledge of commands or tools, but the discipline to follow methodical processes, document thoroughly, and learn from each incident.

Remember that the best troubleshooters are those who:

  1. Understand their systems deeply
  2. Follow structured approaches
  3. Document everything
  4. Learn from every incident
  5. Implement preventative measures

This checklist is your roadmap to developing that professional troubleshooting mindset.

This post is licensed under CC BY 4.0 by the author.