
Linux Troubleshooting Checklist - A Professional Guide

A comprehensive guide to systematic Linux troubleshooting, covering system health checks, performance analysis, and standardized logging practices.

“Effective troubleshooting isn’t about knowing solutions—it’s about following a systematic process to discover them.”

🧠 Systematic Approach to Troubleshooting

When a system fails, how you approach the problem reveals your level of experience and professionalism.

| Beginner | Professional |
|---|---|
| Randomly tries solutions | Follows a methodical process |
| Relies on memorized fixes | Understands the underlying system |
| Gets frustrated easily | Remains calm and systematic |
| Reinstalls as first resort | Identifies root cause before fixing |
| Focuses only on symptoms | Understands cause-effect relationships |
| Overlooks documentation | Consults logs and documentation first |

The Professional Troubleshooting Framework

  1. Identify and Isolate
    • Define the problem precisely
    • Determine when it started
    • Identify affected components
  2. Gather Information
    • Check logs (system, application, security)
    • Review recent changes
    • Verify resource availability (CPU, memory, disk)
  3. Form Hypothesis
    • Based on evidence, not guesses
    • Consider multiple potential causes
    • Prioritize by likelihood and impact
  4. Test Hypothesis
    • Make one change at a time
    • Document each test and result
    • Use reversal tests when appropriate
  5. Implement Solution
    • Apply the proven fix
    • Verify full functionality
    • Document the resolution
  6. Review and Learn
    • Analyze the root cause
    • Improve monitoring for early detection
    • Share knowledge with team

Tip: Always ask, “What changed?” Most problems occur after a system alteration, whether obvious (like a software update) or subtle (like a configuration change).
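
A quick way to answer that question is a short "recent changes" sweep. A minimal sketch (the package log path assumes a Debian/Ubuntu-style system; adjust for your distribution):

```bash
# Configuration files changed in the last 2 days
find /etc -type f -mtime -2 -ls 2>/dev/null

# Recent package installs/upgrades (Debian/Ubuntu dpkg log; use dnf/yum history elsewhere)
grep -E " install | upgrade " /var/log/dpkg.log | tail -20

# Recent reboots and logins
last -n 10 reboot
last -n 10

# Boots known to the journal
journalctl --list-boots | tail -3
```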

🔍 System Health Checks

Before diving into specific problems, professional Linux administrators perform comprehensive health checks to establish a baseline and identify issues.

Essential System Status Commands

```bash
# Overall system status
uptime
vmstat 1 5
top -b -n 1

# Memory usage
free -h
cat /proc/meminfo

# Disk usage
df -h
du -sh /* 2>/dev/null | sort -hr

# CPU information
cat /proc/cpuinfo
lscpu

# Process status
ps aux --sort=-%cpu | head -10
ps aux --sort=-%mem | head -10
```

Quick Health Check Script

```bash
#!/bin/bash
# quick-health-check.sh
# Usage: ./quick-health-check.sh [output_file]

OUTPUT=${1:-health_$(hostname)_$(date +%Y%m%d_%H%M%S).log}

{
  echo "========== SYSTEM INFO =========="
  echo "Date: $(date)"
  echo "Hostname: $(hostname)"
  echo "Kernel: $(uname -r)"
  echo "Uptime: $(uptime)"

  echo -e "\n========== CPU USAGE =========="
  echo "Load average:"
  uptime
  echo -e "\nTop CPU processes:"
  ps aux --sort=-%cpu | head -6

  echo -e "\n========== MEMORY USAGE =========="
  free -h
  echo -e "\nTop memory processes:"
  ps aux --sort=-%mem | head -6

  echo -e "\n========== DISK USAGE =========="
  df -h
  echo -e "\nLargest directories:"
  du -sh /* 2>/dev/null | sort -hr | head -5

  echo -e "\n========== NETWORK STATUS =========="
  echo "Network interfaces:"
  ip -br addr
  echo -e "\nOpen connections:"
  ss -tuln

  echo -e "\n========== RECENT ERRORS =========="
  echo "Last 5 system errors:"
  journalctl -p err -n 5 --no-pager
} | tee "$OUTPUT"

echo "Health check complete. Results saved to $OUTPUT"
```

Advanced System Monitoring Commands

```bash
# System activity reports
sar -u 1 5         # CPU usage
sar -r 1 5         # Memory usage
sar -b 1 5         # I/O statistics
sar -n DEV 1 5     # Network statistics

# Process monitoring
htop               # Interactive process viewer
atop               # System bottleneck analysis
lsof               # List open files

# Disk I/O monitoring
iostat -xz 1 5     # Extended disk statistics
iotop              # I/O monitoring by process
```

Warning: Some monitoring tools like htop, atop, and iotop may need to be installed separately as they’re not always included in minimal installations.
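
For example, on common distributions the following should pull in these tools along with the sysstat suite (sar, iostat, mpstat); package names are the usual ones but may vary:

```bash
# Debian/Ubuntu
sudo apt install htop atop iotop sysstat

# Fedora/RHEL
sudo dnf install htop atop iotop sysstat
```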

🛠️ Common Problems and Solutions

CPU Issues

| Symptom | Possible Causes | Investigation Commands | Common Solutions |
|---|---|---|---|
| High CPU usage | Runaway process | `ps aux --sort=-%cpu \| head` | Kill/restart process |
|  | Malware/cryptominer | `ps aux \| grep -iE "crypto\|mine\|coin"` | Identify and remove |
|  | Service misconfiguration | `systemctl status <service>` | Reconfigure service |
| System slowness | Too many processes | `pstree -p` | Optimize startup services |
|  | Resource contention | `nice`, `ionice`, `cpulimit` | Apply resource constraints |
|  | Kernel issues | `dmesg \| grep -i error` | Update kernel |

Example: CPU Troubleshooting

```bash
# Find CPU-hungry processes
ps aux --sort=-%cpu | head -10

# Check if a specific process is using too much CPU
top -p $(pgrep -d ',' apache2)

# Limit CPU usage for a process
cpulimit -p 1234 -l 50  # Limit PID 1234 to 50% CPU

# Set process niceness (lower priority)
renice +10 -p 1234
```

Memory Issues

| Symptom | Possible Causes | Investigation Commands | Common Solutions |
|---|---|---|---|
| Out of memory | Memory leak | `ps aux --sort=-%mem` | Restart leaking service |
|  | Swap misconfiguration | `swapon --show`, `cat /proc/swaps` | Adjust swappiness |
|  | Too many processes | `pmap -x <pid>` | Optimize processes |
| High RAM usage | Caching | `free -h`, `cat /proc/meminfo` | Understand cache behavior |
|  | Database issues | `mysqltuner` | Tune database settings |
|  | Memory fragmentation | `cat /proc/buddyinfo` | Restart service or system |

Example: Memory Troubleshooting

```bash
# Check memory usage and cache
free -h
grep -E 'Mem|Cache|Swap' /proc/meminfo

# Find memory-consuming processes
ps aux --sort=-%mem | head -10

# Examine detailed memory usage of a process
pmap -x $(pgrep mysql)

# Empty page cache if needed (requires root; flush dirty pages first)
sync; echo 1 > /proc/sys/vm/drop_caches

# Monitor memory over time
watch -n 1 'free -h'
```

Disk Issues

| Symptom | Possible Causes | Investigation Commands | Common Solutions |
|---|---|---|---|
| Disk full | Large files | `du -sh /* \| sort -hr` | Clean up, compress, or archive |
|  | Temp files | `find /tmp -type f -size +100M` | Remove temp files |
|  | Log files | `find /var/log -type f -size +100M` | Rotate logs, compress logs |
|  | Orphaned files | `lsof +L1` | Remove orphaned files |
| High I/O | Inefficient process | `iotop` | Optimize process I/O |
|  | RAID issues | `cat /proc/mdstat` | Check RAID status |
|  | Filesystem fragmentation | `e2freefrag /dev/sda1` | Defragment if needed |

Example: Disk Troubleshooting

```bash
# Find largest files
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null | sort -k5hr

# Find largest directories
du -sh /* 2>/dev/null | sort -hr | head -10

# Check inode usage
df -i

# Check for open deleted files still consuming space
lsof +L1 | grep 'deleted'

# Check disk I/O by process
iotop -o

# Find recently modified large files
find / -type f -size +10M -mtime -7 -ls 2>/dev/null
```

Service Issues

| Symptom | Possible Causes | Investigation Commands | Common Solutions |
|---|---|---|---|
| Service won’t start | Misconfiguration | `systemctl status <service>` | Fix configuration |
|  | Dependencies | `systemctl list-dependencies <service>` | Ensure dependencies run |
|  | Permission issues | `journalctl -u <service>` | Fix permissions |
| Service crashes | Resource limits | `ulimit -a` | Adjust ulimit |
|  | Bugs | `journalctl -u <service> -p err` | Update software |
|  | Incompatibilities | `ldd $(which <binary>)` | Check library dependencies |

Example: Service Troubleshooting

```bash
# Check service status
systemctl status nginx

# View service logs
journalctl -u nginx --since "1 hour ago"

# Check configuration syntax
nginx -t

# Examine service dependencies
systemctl list-dependencies nginx

# Check process limits (master process)
cat /proc/$(pgrep -o nginx)/limits

# Check file descriptor usage (all nginx processes)
lsof -p "$(pgrep -d, nginx)" | wc -l
```

📊 Log Analysis Techniques

Logs are the first place professionals look when troubleshooting. Knowing how to effectively analyze logs is a core skill.

Key Log Files and Their Purpose

| Log File | Content | Common Issues |
|---|---|---|
| `/var/log/syslog` or `/var/log/messages` | General system messages | System-wide issues |
| `/var/log/auth.log` or `/var/log/secure` | Authentication events | Login failures, security |
| `/var/log/kern.log` | Kernel messages | Hardware, driver issues |
| `/var/log/dmesg` | Boot-time messages | Boot problems, hardware detection |
| `/var/log/apache2/` or `/var/log/httpd/` | Web server logs | Web application issues |
| `/var/log/mysql/` | Database logs | Database performance, errors |
| `/var/log/apt/` or `/var/log/yum.log` | Package management | Installation problems |
| `/var/log/fail2ban.log` | Intrusion prevention | Security violations |

Effective Log Filtering

```bash
# Find error messages
grep -i error /var/log/syslog

# Show context around errors
grep -i -A3 -B2 "fatal" /var/log/apache2/error.log

# Filter by timestamp
grep "Apr 18 10:" /var/log/auth.log

# Filter by multiple patterns
grep -E "error|warning|critical" /var/log/syslog

# Filter out noise
grep -v "irrelevant pattern" /var/log/application.log

# Find recent authentication failures
grep "Failed password" /var/log/auth.log | tail -20
```

Advanced Log Analysis Tools

```bash
# Use journalctl for systemd logs
journalctl -u nginx --since "1 hour ago"
journalctl -p err..emerg --since today

# Use logwatch for summary reports
logwatch --service apache --range yesterday --detail high

# Use multitail to watch multiple logs
multitail /var/log/nginx/error.log /var/log/mysql/error.log

# Use lnav for interactive log viewing
lnav /var/log/syslog /var/log/auth.log
```

Tip: Regularly reviewing logs even when there are no apparent issues helps you understand what “normal” looks like. This baseline knowledge is invaluable when troubleshooting.

📈 Performance Troubleshooting

System performance issues require a structured approach to identify bottlenecks.

Performance Analysis Workflow

  1. Establish a baseline
    • Document normal performance metrics
    • Use historical data if available
  2. Identify symptoms
    • Slowness during specific operations
    • High load averages
    • Poor response times
  3. Check the four key resources
    • CPU
    • Memory
    • Disk I/O
    • Network
  4. Analyze processes
    • Which processes are consuming resources
    • Are there unexpected processes
    • Process relationships (parent-child)
  5. Examine detailed metrics
    • System calls
    • Context switches
    • File descriptors
    • Thread counts
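
The first three steps of this workflow can be bootstrapped with a small snapshot script. A minimal sketch (the script name and output path are illustrative; the sysstat tools are assumed to be installed):

```bash
#!/bin/bash
# perf-snapshot.sh - capture a baseline of the four key resources
OUT=perf_$(hostname)_$(date +%Y%m%d_%H%M%S).log
{
  echo "== CPU ==";      mpstat -P ALL 1 3
  echo "== Memory ==";   free -h; vmstat 1 3
  echo "== Disk I/O =="; iostat -xz 1 3
  echo "== Network ==";  sar -n DEV 1 3
} > "$OUT"
echo "Baseline snapshot written to $OUT"
```

Run it during normal operation to establish the baseline, then again while the problem is occurring and compare the two outputs.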

Performance Monitoring Tools

| Tool | Purpose | Example Usage |
|---|---|---|
| `top`/`htop` | Real-time process monitoring | `htop -d 5` |
| `vmstat` | Virtual memory statistics | `vmstat 5 10` |
| `iostat` | I/O statistics | `iostat -xz 5 10` |
| `mpstat` | Multi-processor statistics | `mpstat -P ALL 5 10` |
| `sar` | System activity reporter | `sar -n DEV 5 10` |
| `strace` | Trace system calls | `strace -p <pid>` |
| `perf` | Performance analysis | `perf top -p <pid>` |
| `netstat`/`ss` | Network statistics | `ss -tunapl` |
| `iotop` | I/O monitoring | `iotop -oPa` |
| `nmon` | Performance monitoring | `nmon` |

CPU Profiling

```bash
# Basic CPU profiling
perf record -g -p <pid> -- sleep 30
perf report

# CPU flame graph (requires FlameGraph tools)
perf record -g -p <pid> -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu-flame.svg

# Process CPU time breakdown
pidstat -t -p <pid> 1 10
```

Memory Profiling

```bash
# Memory usage patterns
valgrind --tool=massif --massif-out-file=massif.out <program>
ms_print massif.out

# Process memory maps
pmap -x <pid>

# Track memory allocations (the program must call mtrace() and
# MALLOC_TRACE must point at the trace file it writes)
mtrace <program> <trace_file>
```

Disk I/O Profiling

```bash
# Disk I/O by process
iotop -aoP

# File system latency
ioping -c 10 /path/to/directory

# Block I/O tracing
blktrace -d /dev/sda -o - | blkparse -i -
```

Network Profiling

```bash
# Network traffic by process
nethogs

# Packet capture and analysis
tcpdump -i eth0 port 80 -w capture.pcap
wireshark capture.pcap

# Network connections by process
ss -tp
```

Note: Many of these specialized tools may need to be installed separately with your package manager.

🌐 Network Diagnostics

Network issues are common and can be challenging to troubleshoot without a systematic approach.

Network Troubleshooting Checklist

  1. Verify physical connectivity
    • Check cables, link lights, etc.
    • Verify interface status with ip link
  2. Check IP configuration
    • Proper IP address, subnet, gateway
    • DNS server configuration
  3. Test basic connectivity
    • Local network with ping
    • External networks with traceroute/mtr
  4. Check network services
    • Service status and ports
    • Firewall rules
  5. Analyze packet flow
    • Capture and analyze traffic
    • Check for packet loss or latency

Essential Network Diagnostic Commands

```bash
# Check network interfaces
ip addr
ip link

# Check routing
ip route
route -n

# DNS resolution
nslookup google.com
dig google.com

# Connectivity testing
ping -c 4 8.8.8.8
traceroute google.com
mtr google.com

# Port checking
nc -zv google.com 443
telnet google.com 443

# Open connections
ss -tunapl
netstat -tunapl

# Packet capture
tcpdump -i eth0 host 8.8.8.8
```

Common Network Issues and Solutions

| Symptom | Possible Causes | Investigation Commands | Common Solutions |
|---|---|---|---|
| No connectivity | Wrong IP/subnet | `ip addr`, `ip route` | Fix configuration |
|  | Interface down | `ip link` | Bring up interface |
|  | Routing issue | `traceroute`, `mtr` | Check/fix routing table |
| DNS problems | Wrong DNS servers | `cat /etc/resolv.conf` | Fix DNS configuration |
|  | DNS service issues | `dig @8.8.8.8 google.com` | Try alternative DNS |
|  | Caching issues | `systemctl restart systemd-resolved` | Restart DNS service |
| Can’t connect to service | Service down | `systemctl status <service>` | Start the service |
|  | Firewall blocking | `iptables -L`, `ufw status` | Adjust firewall rules |
|  | Wrong port | `ss -tunapl \| grep <service>` | Configure correct port |

Network Troubleshooting Examples

```bash
# Troubleshoot DNS
dig +trace google.com
host -v google.com
resolvectl status

# Troubleshoot routing
ip route get 8.8.8.8
traceroute -n 8.8.8.8
mtr -n 8.8.8.8

# Troubleshoot network performance
iperf3 -c iperf.server.com
ping -c 100 -i 0.2 8.8.8.8 | tail -1 | awk -F'/' '{print $5}'   # average rtt in ms

# Check for packet loss
ping -c 100 8.8.8.8 | grep -oP '\d+(?=% packet loss)'

# Find which process is using a port
lsof -i :80
ss -tunapl | grep :80
```

Tip: When troubleshooting network issues, work from the lowest layer up: physical → link → network → transport → application.
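
A minimal sketch of that bottom-up walk (the interface name, target host, and service hostname are placeholders to replace with your own):

```bash
#!/bin/bash
# layered-net-check.sh - walk the stack from link to application
IFACE=eth0              # placeholder interface name
TARGET=8.8.8.8          # placeholder external host
HOST=example.com        # placeholder service hostname

echo "== Link layer ==";     ip -br link show "$IFACE"
echo "== Network layer ==";  ip -br addr show "$IFACE"; ip route get "$TARGET"
echo "== Reachability ==";   ping -c 3 "$TARGET"
echo "== DNS ==";            dig +short "$HOST"
echo "== Transport ==";      nc -zv -w 3 "$HOST" 443
echo "== Application ==";    curl -sS -o /dev/null -w "HTTP %{http_code}\n" "https://$HOST"
```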

📝 Documentation Practices

Professional troubleshooters document their processes and findings methodically.

Why Document?

  1. Future reference - Similar issues can recur
  2. Knowledge sharing - Help others learn from your experience
  3. Audit trail - Record what changed and why
  4. Process improvement - Identify patterns and systemic issues
  5. Handover - Enable others to continue your work

What to Document

| Beginner | Professional |
|---|---|
| Just the solution | Problem description, diagnosis steps, and solution |
| Commands used | Commands with expected and actual outputs |
| Simplified steps | Detailed process with reasoning |
| Personal notes | Shareable, searchable documentation |
| Without context | Including timestamps and system state |

Documentation Template

````markdown
# Incident Report: [Brief Description]

## Summary
- **Date/Time**: [When the issue occurred]
- **System(s) Affected**: [Specific hosts/services]
- **Impact**: [User/business impact]
- **Root Cause**: [Brief cause statement]

## Timeline
- **[Time]**: Issue detected [how it was detected]
- **[Time]**: Initial investigation began
- **[Time]**: [Investigation step]
- **[Time]**: Root cause identified
- **[Time]**: Solution implemented
- **[Time]**: Service restored

## Investigation Details
1. **Initial symptoms observed**:

   ```
   [Output of initial commands/logs]
   ```

2. **Diagnostic steps**:
   - Checked [system component] using:
     ```
     [command and output]
     ```
   - Verified [condition] by:
     ```
     [command and output]
     ```

3. **Root cause analysis**:
   [Explanation of what caused the issue]

## Solution
1. **Immediate fix**:

   ```
   [Commands/actions taken to resolve]
   ```

2. **Verification**:

   ```
   [Commands/checks performed to verify fix]
   ```

## Prevention
- **Monitoring**: [New monitoring/alerts implemented]
- **Automation**: [Automation to prevent recurrence]
- **Process**: [Process changes recommended]

## References
- [Relevant documentation links]
- [Knowledge base articles]
- [Similar past incidents]
````

Documentation Best Practices

  1. Use version control for configuration files and scripts
  2. Maintain a systems journal to track changes
  3. Create a knowledge base of common issues and solutions
  4. Use standard templates for different types of documentation
  5. Include context (what, why, when, how) in all documentation
  6. Make documentation searchable with tags and categories
  7. Review and update documentation regularly
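
For the version-control practice in item 1, a minimal sketch using plain git on /etc (etckeeper is a common purpose-built alternative):

```bash
# Put /etc under version control (run as root)
cd /etc
git init
git add .
git commit -m "Baseline configuration"

# After every change, record what changed and why
git add -A
git commit -m "Describe the change and the reason for it"
```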

Tip: Good documentation transforms individual knowledge into team knowledge, reducing “bus factor” risk.

🔄 Recovery Tools and Techniques

When things go seriously wrong, professionals have a toolkit ready to recover systems.

Rescue Environments

| Tool | Description | When to Use |
|---|---|---|
| SystemRescue | Live Linux environment | File recovery, system repair |
| GParted Live | Partition management | Disk partitioning issues |
| Boot-Repair-Disk | Boot loader repair | GRUB/boot problems |
| Clonezilla | Disk cloning | System backup/restore |
| DBAN | Disk wiping | Secure data destruction |

Common Recovery Commands

```bash
# File system check (run only on unmounted filesystems)
fsck -f /dev/sda1

# Repair boot loader
grub-install /dev/sda
update-grub

# Recover deleted files
extundelete /dev/sda1 --restore-file /path/to/file

# Rescue data from a failing disk
ddrescue /dev/sda /dev/sdb rescue.log

# Mount filesystem read-only for inspection
mount -o ro /dev/sda1 /mnt/recovery

# Check disk for bad sectors
badblocks -v /dev/sda
```

Recovery Strategies

  1. Boot issues
    • Use rescue disk to boot
    • Check and repair boot loader
    • Examine boot logs
  2. File system corruption
    • Mount read-only if possible
    • Run fsck on unmounted filesystems
    • Check SMART status of disks
  3. Database recovery
    • Use database-specific tools (e.g., mysqlcheck)
    • Apply transaction logs
    • Restore from backups
  4. Compromised systems
    • Isolate the system
    • Collect forensic evidence
    • Scan for malware
    • Rebuild from known good state
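
The SMART check mentioned under filesystem corruption can be done with smartmontools (the device path is a placeholder):

```bash
# Overall health verdict
smartctl -H /dev/sda

# Full attribute dump; watch Reallocated_Sector_Ct and Current_Pending_Sector
smartctl -a /dev/sda

# Run a short self-test, then read the results a few minutes later
smartctl -t short /dev/sda
smartctl -l selftest /dev/sda
```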

Recovery Plan Template

```bash
# 1. Boot from rescue media
# [Instructions for specific rescue disk]

# 2. Mount filesystem read-only
mount -o ro /dev/sda1 /mnt/recover

# 3. Back up critical data
mkdir /mnt/backup
rsync -av /mnt/recover/home/ /mnt/backup/

# 4. Check filesystem
umount /mnt/recover
fsck -f /dev/sda1

# 5. Repair system files
mount /dev/sda1 /mnt/recover
# Bind-mount pseudo-filesystems so tools like grub-install work inside the chroot
mount --bind /dev /mnt/recover/dev
mount --bind /proc /mnt/recover/proc
mount --bind /sys /mnt/recover/sys
chroot /mnt/recover

# 6. Repair boot loader (inside the chroot)
grub-install /dev/sda
update-grub

# 7. Verify critical services
systemctl list-unit-files --state=enabled
```

Warning: Always back up data before attempting system recovery. Test recovery procedures on non-critical systems before applying to production.

🔎 Preventative Measures

Professionals know that preventing problems is more efficient than fixing them.

Monitoring Tools

| Tool | Description | What to Monitor |
|---|---|---|
| Nagios | Network/service monitoring | Service availability, performance |
| Prometheus | Metrics and alerting | System metrics, custom metrics |
| Grafana | Metrics visualization | Dashboards for all metrics |
| ELK Stack | Log aggregation/analysis | Centralized log management |
| Zabbix | Enterprise monitoring | Infrastructure monitoring |
| Netdata | Real-time monitoring | Low-level system metrics |

Preventative Checks

```bash
# Scheduled disk checks
tune2fs -l /dev/sda1 | grep 'Mount count'

# Check for updates
apt update && apt list --upgradable

# Check disk space growth trends
df -h >> /var/log/disk_usage.log

# Check for failed services
systemctl --failed

# Verify backups
find /backup/ -name "*.tar.gz" -mtime -1 | wc -l

# Check for compromised packages
debsums -c
```

Configuration Management

Using configuration management tools helps prevent configuration drift:

  1. Ansible - Agentless configuration management
  2. Puppet - Declarative configuration management
  3. Chef - Infrastructure as code
  4. Salt - Event-driven automation
  5. Terraform - Infrastructure provisioning

Regular Maintenance Schedule

| Interval | Task | Command Example |
|---|---|---|
| Daily | Check disk space | `df -h` |
|  | Review logs | `journalctl -p err -S yesterday` |
|  | Verify backups | `ls -l /backup/daily/` |
| Weekly | Update packages | `apt update && apt upgrade` |
|  | Check service status | `systemctl list-units --state=failed` |
|  | Performance baseline | `sar -A > /var/log/performance/weekly.log` |
| Monthly | Full security scan | `lynis audit system` |
|  | User account audit | `getent passwd \| sort` |
|  | Filesystem check | `tune2fs -C 999 /dev/sda1` |
| Quarterly | Disaster recovery test | [Restore test procedure] |
|  | Performance review | Compare baseline metrics |
|  | Security audit | [Security checklist] |

Tip: Automate as many maintenance tasks as possible. Human-performed maintenance should focus on reviewing automated reports and addressing exceptions.
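
A minimal sketch of automating the daily tasks with cron (the script path and log locations are illustrative):

```bash
# /etc/cron.d/daily-maintenance  (illustrative schedule)
# m   h  dom mon dow  user  command
0     6  *   *   *    root  /usr/local/sbin/quick-health-check.sh /var/log/health/daily.log
30    6  *   *   *    root  df -h >> /var/log/disk_usage.log
45    6  *   *   *    root  journalctl -p err -S yesterday --no-pager > /var/log/journal-errors-daily.log
```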

📋 Log Management and Standardization

Effective log management goes beyond just viewing logs when problems occur.

Log Rotation - Beyond Default Settings

| Beginner | Professional |
|---|---|
| Uses default log rotation settings | Customizes rotation based on system purpose |
| Ignores log files until disk space issues | Proactively manages log growth |
| Manually deletes old logs | Configures automated rotation policies |
| One-size-fits-all approach | Service-specific rotation strategies |

Key Log Rotation Parameters

```conf
# Example logrotate configuration
/var/log/myapp/*.log {
    daily                         # Rotate frequency
    rotate 14                     # Keep 14 rotated logs
    compress                      # Compress rotated logs
    delaycompress                 # Delay compression by one cycle
    missingok                     # Don't error if log is missing
    notifempty                    # Don't rotate empty logs
    create 0640 www-data www-data # Create new log with mode, owner, group
    sharedscripts                 # Run scripts once per rotation
    postrotate
        systemctl reload myapp
    endscript
}
```
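
To validate a policy like this without waiting for the schedule, logrotate can be run in debug or forced mode (the path assumes the snippet above is saved as /etc/logrotate.d/myapp):

```bash
# Dry run: show what would happen, change nothing
logrotate -d /etc/logrotate.d/myapp

# Force a rotation now, with verbose output
logrotate -vf /etc/logrotate.d/myapp
```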

Standardizing Log Formats

Professional logging includes standardized formats with these elements:

  1. Timestamps - ISO 8601 format (YYYY-MM-DDTHH:MM:SS.sss±ZZZZ)
  2. Severity levels - Use standard levels (ERROR, WARN, INFO, DEBUG)
  3. Source identification - Service, module, or function
  4. Correlation IDs - To track related events across services
  5. Structured data - JSON or similar for machine parsing
  6. Context - User IDs, request IDs, or other relevant context

Example Standardized Log Format

```text
2025-04-18T10:30:45.123+0530 INFO [web-server] [request-id: 1a2b3c4d] User authenticated: user_id=123, source_ip=192.168.1.100, result=success
```
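
A minimal shell helper that emits lines in this shape (the function name and fields are illustrative; GNU date is assumed for millisecond timestamps):

```bash
#!/bin/bash
# log_event SEVERITY SOURCE REQUEST_ID MESSAGE  (illustrative helper)
log_event() {
  local ts
  ts=$(date +%Y-%m-%dT%H:%M:%S.%3N%z)   # GNU date: milliseconds plus UTC offset
  printf '%s %s [%s] [request-id: %s] %s\n' "$ts" "$1" "$2" "$3" "$4"
}

# Usage
log_event INFO web-server 1a2b3c4d "User authenticated: user_id=123, result=success"
```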

Application Logging Best Practices

Configure applications to use these logging best practices:

```conf
# Example rsyslog configuration
# /etc/rsyslog.d/myapp.conf

template(name="JsonFormat" type="list") {
    constant(value="{")
    constant(value="\"timestamp\":\"")     property(name="timereported" dateFormat="rfc3339")
    constant(value="\",\"host\":\"")       property(name="hostname")
    constant(value="\",\"severity\":\"")   property(name="syslogseverity-text")
    constant(value="\",\"facility\":\"")   property(name="syslogfacility-text")
    constant(value="\",\"tag\":\"")        property(name="syslogtag" format="json")
    constant(value="\",\"message\":\"")    property(name="msg" format="json")
    constant(value="\"}\n")
}

# Send myapp logs to a dedicated file in JSON format
if $programname == 'myapp' then {
    action(type="omfile" file="/var/log/myapp/application.log" template="JsonFormat")
}
```

Centralized Logging

Professionals often implement centralized logging for easier analysis:

  1. Collection - Filebeat, Fluentd, Logstash
  2. Storage - Elasticsearch, Loki, Graylog
  3. Analysis - Kibana, Grafana
  4. Alerting - ElastAlert, Grafana alerts

Example Filebeat Configuration

```yaml
# /etc/filebeat/filebeat.yml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/myapp/*.log
  json.keys_under_root: true
  json.add_error_key: true
  json.message_key: message

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "myapp-%{+YYYY.MM.dd}"
```

Tip: Design logs for both human readability and machine parsing. Structured logs are easier to search, filter, and analyze at scale.

📌 Final Thought

“Problems are inevitable; being unprepared is optional.”

Effective troubleshooting is not just a technical skill—it’s a mindset. By developing a systematic approach to diagnosing and resolving issues, you transform from someone who fights fires to someone who manages systems proactively.

The difference between a beginner and a professional is not just knowledge of commands or tools, but the discipline to follow methodical processes, document thoroughly, and learn from each incident.

Remember that the best troubleshooters are those who:

  1. Understand their systems deeply
  2. Follow structured approaches
  3. Document everything
  4. Learn from every incident
  5. Implement preventative measures

This checklist is your roadmap to developing that professional troubleshooting mindset.

This post is licensed under CC BY 4.0 by the author.