“Effective troubleshooting isn’t about knowing solutions—it’s about following a systematic process to discover them.”
🧠 Systematic Approach to Troubleshooting
When a system fails, how you approach the problem reveals your level of experience and professionalism.
| Beginner | Professional |
|---|---|
| Randomly tries solutions | Follows a methodical process |
| Relies on memorized fixes | Understands the underlying system |
| Gets frustrated easily | Remains calm and systematic |
| Reinstalls as first resort | Identifies root cause before fixing |
| Focuses only on symptoms | Understands cause-effect relationships |
| Overlooks documentation | Consults logs and documentation first |
The Professional Troubleshooting Framework
1. Identify and Isolate
   - Define the problem precisely
   - Determine when it started
   - Identify affected components
2. Gather Information
   - Check logs (system, application, security)
   - Review recent changes
   - Verify resource availability (CPU, memory, disk)
3. Form Hypothesis
   - Based on evidence, not guesses
   - Consider multiple potential causes
   - Prioritize by likelihood and impact
4. Test Hypothesis
   - Make one change at a time
   - Document each test and result
   - Use reversal tests when appropriate
5. Implement Solution
   - Apply the proven fix
   - Verify full functionality
   - Document the resolution
6. Review and Learn
   - Analyze the root cause
   - Improve monitoring for early detection
   - Share knowledge with the team
Tip: Always ask, “What changed?” Most problems occur after a system alteration, whether obvious (like a software update) or subtle (like a configuration change).
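To make that question concrete, here are a few commands that often surface recent changes. Log paths and package managers vary by distribution, so treat this as a sketch rather than a recipe.

```bash
# Recent package installs/upgrades (Debian/Ubuntu; on RPM systems use: rpm -qa --last | head)
grep -E " install | upgrade " /var/log/dpkg.log | tail -20

# Configuration files modified in the last 24 hours
find /etc -type f -mtime -1 2>/dev/null

# Recent logins and reboots
last -20

# What the journal recorded around the time the problem started
journalctl --since "2 hours ago" --no-pager | tail -50
```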
🔍 System Health Checks
Before diving into specific problems, professional Linux administrators perform comprehensive health checks to establish a baseline and identify issues.
Essential System Status Commands
```bash
# Overall system status
uptime
vmstat 1 5
top -b -n 1

# Memory usage
free -h
cat /proc/meminfo

# Disk usage
df -h
du -sh /* 2>/dev/null | sort -hr

# CPU information
cat /proc/cpuinfo
lscpu

# Process status
ps aux --sort=-%cpu | head -10
ps aux --sort=-%mem | head -10
```
Quick Health Check Script
```bash
#!/bin/bash
# quick-health-check.sh
# Usage: ./quick-health-check.sh [output_file]

OUTPUT=${1:-health_$(hostname)_$(date +%Y%m%d_%H%M%S).log}

{
  echo "========== SYSTEM INFO =========="
  echo "Date: $(date)"
  echo "Hostname: $(hostname)"
  echo "Kernel: $(uname -r)"
  echo "Uptime: $(uptime)"

  echo -e "\n========== CPU USAGE =========="
  echo "Load average:"
  uptime
  echo -e "\nTop CPU processes:"
  ps aux --sort=-%cpu | head -6

  echo -e "\n========== MEMORY USAGE =========="
  free -h
  echo -e "\nTop memory processes:"
  ps aux --sort=-%mem | head -6

  echo -e "\n========== DISK USAGE =========="
  df -h
  echo -e "\nLargest directories:"
  du -sh /* 2>/dev/null | sort -hr | head -5

  echo -e "\n========== NETWORK STATUS =========="
  echo "Network interfaces:"
  ip -br addr
  echo -e "\nOpen connections:"
  ss -tuln

  echo -e "\n========== RECENT ERRORS =========="
  echo "Last 5 system errors:"
  journalctl -p err -n 5 --no-pager
} | tee "$OUTPUT"

echo "Health check complete. Results saved to $OUTPUT"
```
Advanced System Monitoring Commands
```bash
# System activity reports
sar -u 1 5        # CPU usage
sar -r 1 5        # Memory usage
sar -b 1 5        # I/O statistics
sar -n DEV 1 5    # Network statistics

# Process monitoring
htop              # Interactive process viewer
atop              # System bottleneck analysis
lsof              # List open files

# Disk I/O monitoring
iostat -xz 1 5    # Extended disk statistics
iotop             # I/O monitoring by process
```
Warning: Some monitoring tools like htop, atop, and iotop may need to be installed separately as they’re not always included in minimal installations.
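If they are missing, installing them is usually a one-liner. The package names below are the common ones and may differ slightly on your distribution; sysstat provides sar, iostat, mpstat, and pidstat.

```bash
# Debian/Ubuntu
sudo apt install htop atop iotop sysstat

# Fedora/RHEL family
sudo dnf install htop atop iotop sysstat
```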
🛠️ Common Problems and Solutions
CPU Issues
| Symptom | Possible Causes | Investigation Commands | Common Solutions |
|---|---|---|---|
| High CPU usage | Runaway process | ps aux --sort=-%cpu \| head | Kill/restart process |
| | Malware/cryptominer | ps aux \| grep -i "crypto\|mine\|coin" | Identify and remove |
| | Service misconfiguration | systemctl status <service> | Reconfigure service |
| System slowness | Too many processes | pstree -p | Optimize startup services |
| | Resource contention | nice, ionice, cpulimit | Apply resource constraints |
| | Kernel issues | dmesg \| grep -i error | Update kernel |
Example: CPU Troubleshooting
```bash
# Find CPU-hungry processes
ps aux --sort=-%cpu | head -10

# Check if a specific process is using too much CPU
top -p $(pgrep -d ',' apache2)

# Limit CPU usage for a process
cpulimit -p 1234 -l 50    # Limit PID 1234 to 50% CPU

# Set process niceness (lower priority)
renice +10 -p 1234
```
Memory Issues
| Symptom | Possible Causes | Investigation Commands | Common Solutions |
|---|---|---|---|
| Out of memory | Memory leak | ps aux --sort=-%mem | Restart leaking service |
| | Swap misconfiguration | swapon --show, cat /proc/swaps | Adjust swappiness |
| | Too many processes | pmap -x <pid> | Optimize processes |
| High RAM usage | Caching | free -h, cat /proc/meminfo | Understand cache behavior |
| | Database issues | mysqltuner | Tune database settings |
| | Memory fragmentation | cat /proc/buddyinfo | Restart service or system |
Example: Memory Troubleshooting
```bash
# Check memory usage and cache
free -h
grep -E 'Mem|Cache|Swap' /proc/meminfo

# Find memory-consuming processes
ps aux --sort=-%mem | head -10

# Examine detailed memory usage of a process
pmap -x $(pgrep mysql)

# Empty page cache (only if needed)
echo 1 > /proc/sys/vm/drop_caches

# Monitor memory over time
watch -n 1 'free -h'
```
Disk Issues
| Symptom | Possible Causes | Investigation Commands | Common Solutions |
|---|---|---|---|
| Disk full | Large files | du -sh /* \| sort -hr | Clean up, compress, or archive |
| | Temp files | find /tmp -type f -size +100M | Remove temp files |
| | Log files | find /var/log -type f -size +100M | Rotate logs, compress logs |
| | Orphaned files | lsof +L1 | Remove orphaned files |
| High I/O | Inefficient process | iotop | Optimize process I/O |
| | RAID issues | cat /proc/mdstat | Check RAID status |
| | Filesystem fragmentation | e2freefrag /dev/sda1 | Defragment if needed |
Example: Disk Troubleshooting
```bash
# Find largest files
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null | sort -k5hr

# Find largest directories
du -sh /* 2>/dev/null | sort -hr | head -10

# Check inode usage
df -i

# Check for open deleted files still consuming space
lsof +L1 | grep 'deleted'

# Check disk I/O by process
iotop -o

# Find recently modified large files
find / -type f -size +10M -mtime -7 -ls 2>/dev/null
```
Service Issues
| Symptom | Possible Causes | Investigation Commands | Common Solutions |
|---|---|---|---|
| Service won’t start | Misconfiguration | systemctl status <service> | Fix configuration |
| | Dependencies | systemctl list-dependencies <service> | Ensure dependencies run |
| | Permission issues | journalctl -u <service> | Fix permissions |
| Service crashes | Resource limits | ulimit -a | Adjust ulimit |
| | Bugs | journalctl -u <service> -p err | Update software |
| | Incompatibilities | ldd $(which <binary>) | Check library dependencies |
Example: Service Troubleshooting
```bash
# Check service status
systemctl status nginx

# View service logs
journalctl -u nginx --since "1 hour ago"

# Check configuration syntax
nginx -t

# Examine service dependencies
systemctl list-dependencies nginx

# Check process limits (pidof -s returns a single PID)
cat /proc/$(pidof -s nginx)/limits

# Check file descriptor usage across all nginx processes
lsof -p "$(pgrep -d ',' nginx)" | wc -l
```
📊 Log Analysis Techniques
Logs are the first place professionals look when troubleshooting. Knowing how to effectively analyze logs is a core skill.
Key Log Files and Their Purpose
| Log File | Content | Common Issues |
|---|---|---|
| /var/log/syslog or /var/log/messages | General system messages | System-wide issues |
| /var/log/auth.log or /var/log/secure | Authentication events | Login failures, security |
| /var/log/kern.log | Kernel messages | Hardware, driver issues |
| /var/log/dmesg | Boot-time messages | Boot problems, hardware detection |
| /var/log/apache2/ or /var/log/httpd/ | Web server logs | Web application issues |
| /var/log/mysql/ | Database logs | Database performance, errors |
| /var/log/apt/ or /var/log/yum.log | Package management | Installation problems |
| /var/log/fail2ban.log | Intrusion prevention | Security violations |
Effective Log Filtering
```bash
# Find error messages
grep -i error /var/log/syslog

# Show context around errors
grep -i -A3 -B2 "fatal" /var/log/apache2/error.log

# Filter by timestamp
grep "Apr 18 10:" /var/log/auth.log

# Filter by multiple patterns
grep -E "error|warning|critical" /var/log/syslog

# Filter out noise
grep -v "irrelevant pattern" /var/log/application.log

# Find recent authentication failures
grep "Failed password" /var/log/auth.log | tail -20
```
```bash
# Use journalctl for systemd logs
journalctl -u nginx --since "1 hour ago"
journalctl -p err..emerg --since today

# Use logwatch for summary reports
logwatch --service apache --range yesterday --detail high

# Use multitail to watch multiple logs
multitail /var/log/nginx/error.log /var/log/mysql/error.log

# Use lnav for interactive log viewing
lnav /var/log/syslog /var/log/auth.log
```
Tip: Regularly reviewing logs even when there are no apparent issues helps you understand what “normal” looks like. This baseline knowledge is invaluable when troubleshooting.
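One low-effort way to build that baseline, sketched here, is to track daily error and warning counts so that an unusual spike stands out immediately.

```bash
# Yesterday's error and warning volume; run daily and compare over time
journalctl -p err --since yesterday --until today --no-pager | wc -l
journalctl -p warning --since yesterday --until today --no-pager | wc -l
```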
⚡ Performance Analysis
System performance issues require a structured approach to identify bottlenecks.
1. Establish a baseline
   - Document normal performance metrics (see the baseline sketch after this list)
   - Use historical data if available
2. Identify symptoms
   - Slowness during specific operations
   - High load averages
   - Poor response times
3. Check the four key resources
   - CPU
   - Memory
   - Disk I/O
   - Network
4. Analyze processes
   - Which processes are consuming resources
   - Are there unexpected processes
   - Process relationships (parent-child)
5. Examine detailed metrics
   - System calls
   - Context switches
   - File descriptors
   - Thread counts
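A minimal baseline-capture sketch for step 1: it simply appends key metrics to a dated file so you have "normal" numbers to compare against later. The output path is an assumption; adjust it to your environment.

```bash
#!/bin/bash
# capture-baseline.sh - append a timestamped performance snapshot (sketch)
BASELINE=/var/log/performance/baseline_$(date +%Y%m%d).log
mkdir -p "$(dirname "$BASELINE")"
{
  echo "===== $(date) ====="
  uptime            # load averages
  vmstat 1 3        # CPU, memory, swap activity
  iostat -xz 1 3    # disk utilization (requires sysstat)
  ss -s             # socket summary
} >> "$BASELINE"
```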
Key Performance Analysis Tools
| Tool | Purpose | Example Usage |
|---|---|---|
| top/htop | Real-time process monitoring | htop -d 5 |
| vmstat | Virtual memory statistics | vmstat 5 10 |
| iostat | I/O statistics | iostat -xz 5 10 |
| mpstat | Multi-processor statistics | mpstat -P ALL 5 10 |
| sar | System activity reporter | sar -n DEV 5 10 |
| strace | Trace system calls | strace -p <pid> |
| perf | Performance analysis | perf top -p <pid> |
| netstat/ss | Network statistics | ss -tunapl |
| iotop | I/O monitoring | iotop -oPa |
| nmon | Performance monitoring | nmon |
CPU Profiling
```bash
# Basic CPU profiling
perf record -g -p <pid> -- sleep 30
perf report

# CPU flame graph (requires FlameGraph tools)
perf record -g -p <pid> -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu-flame.svg

# Process CPU time breakdown
pidstat -t -p <pid> 1 10
```
Memory Profiling
```bash
# Memory usage patterns
valgrind --tool=massif --massif-out-file=massif.out <program>
ms_print massif.out

# Process memory maps
pmap -x <pid>

# Track memory allocations
mtrace <program>
```
Disk I/O Profiling
```bash
# Disk I/O by process
iotop -aoP

# File system latency
ioping -c 10 /path/to/directory

# Block I/O tracing
blktrace -d /dev/sda -o - | blkparse -i -
```
Network Profiling
```bash
# Network traffic by process
nethogs

# Packet capture and analysis
tcpdump -i eth0 port 80 -w capture.pcap
wireshark capture.pcap

# Network connections by process
ss -tp
```
Note: Many of these specialized tools may need to be installed separately with your package manager.
🌐 Network Diagnostics
Network issues are common and can be challenging to troubleshoot without a systematic approach.
Network Troubleshooting Checklist
1. Verify physical connectivity
   - Check cables, link lights, etc.
   - Verify interface status with `ip link`
2. Check IP configuration
   - Proper IP address, subnet, gateway
   - DNS server configuration
3. Test basic connectivity
   - Local network with `ping`
   - External networks with `traceroute`/`mtr`
4. Check network services
   - Service status and ports
   - Firewall rules
5. Analyze packet flow
   - Capture and analyze traffic
   - Check for packet loss or latency
Essential Network Diagnostic Commands
```bash
# Check network interfaces
ip addr
ip link

# Check routing
ip route
route -n

# DNS resolution
nslookup google.com
dig google.com

# Connectivity testing
ping -c 4 8.8.8.8
traceroute google.com
mtr google.com

# Port checking
nc -zv google.com 443
telnet google.com 443

# Open connections
ss -tunapl
netstat -tunapl

# Packet capture
tcpdump -i eth0 host 8.8.8.8
```
Common Network Issues and Solutions
| Symptom | Possible Causes | Investigation Commands | Common Solutions |
|---|---|---|---|
| No connectivity | Wrong IP/subnet | ip addr, ip route | Fix configuration |
| | Interface down | ip link | Bring up interface |
| | Routing issue | traceroute, mtr | Check/fix routing table |
| DNS problems | Wrong DNS servers | cat /etc/resolv.conf | Fix DNS configuration |
| | DNS service issues | dig @8.8.8.8 google.com | Try alternative DNS |
| | Caching issues | systemctl restart systemd-resolved | Restart DNS service |
| Can’t connect to service | Service down | systemctl status <service> | Start the service |
| | Firewall blocking | iptables -L, ufw status | Adjust firewall rules |
| | Wrong port | ss -tunapl \| grep <service> | Configure correct port |
Network Troubleshooting Examples
```bash
# Troubleshoot DNS
dig +trace google.com
host -v google.com
resolvectl status

# Troubleshoot routing
ip route get 8.8.8.8
traceroute -n 8.8.8.8
mtr -n 8.8.8.8

# Troubleshoot network performance
iperf3 -c iperf.server.com
ping -c 100 -i 0.2 8.8.8.8 | grep -oP '\d+\.\d+(?=/)' | awk '{sum+=$1} END {print sum/NR}'

# Check for packet loss
ping -c 100 8.8.8.8 | grep -oP '\d+(?=% packet loss)'

# Find which process is using a port
lsof -i :80
ss -tunapl | grep :80
```
Tip: When troubleshooting network issues, work from the lowest layer up: physical → link → network → transport → application.
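A bottom-up check sketched as a script: each step roughly corresponds to one layer, and the first failing step tells you where to focus. The target host is an assumption; substitute your own.

```bash
#!/bin/bash
# layered-net-check.sh - work from the link layer up to the application (sketch)
TARGET=${1:-example.com}

ip -br link                                                        # 1. Link: interfaces up?
ip -br addr && ip route                                            # 2. Network: address and default route present?
ping -c 2 -W 2 "$(ip route | awk '/default/ {print $3; exit}')"    # 3. Gateway reachable?
ping -c 2 -W 2 8.8.8.8                                             # 4. External IP reachable (bypasses DNS)?
dig +short "$TARGET"                                               # 5. DNS resolution working?
nc -zv -w 3 "$TARGET" 443                                          # 6. Transport: can we reach the service port?
```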
📝 Documentation Practices
Professional troubleshooters document their processes and findings methodically.
Why Document?
- Future reference - Similar issues can recur
- Knowledge sharing - Help others learn from your experience
- Audit trail - Record what changed and why
- Process improvement - Identify patterns and systemic issues
- Handover - Enable others to continue your work
What to Document
| Beginner | Professional |
|---|---|
| Just the solution | Problem description, diagnosis steps, and solution |
| Commands used | Commands with expected and actual outputs |
| Simplified steps | Detailed process with reasoning |
| Personal notes | Shareable, searchable documentation |
| Omits context | Includes timestamps and system state |
Documentation Template
````markdown
# Incident Report: [Brief Description]

## Summary
- **Date/Time**: [When the issue occurred]
- **System(s) Affected**: [Specific hosts/services]
- **Impact**: [User/business impact]
- **Root Cause**: [Brief cause statement]

## Timeline
- **[Time]**: Issue detected [how it was detected]
- **[Time]**: Initial investigation began
- **[Time]**: [Investigation step]
- **[Time]**: Root cause identified
- **[Time]**: Solution implemented
- **[Time]**: Service restored

## Investigation Details
1. **Initial symptoms observed**:
   ```
   [Output of initial commands/logs]
   ```
2. **Diagnostic steps**:
   - Checked [system component] using:
     ```
     [command and output]
     ```
   - Verified [condition] by:
     ```
     [command and output]
     ```
3. **Root cause analysis**:
   [Explanation of what caused the issue]

## Solution
1. **Immediate fix**:
   ```
   [Commands/actions taken to resolve]
   ```
2. **Verification**:
   ```
   [Commands/checks performed to verify fix]
   ```

## Prevention
- **Monitoring**: [New monitoring/alerts implemented]
- **Automation**: [Automation to prevent recurrence]
- **Process**: [Process changes recommended]

## References
- [Relevant documentation links]
- [Knowledge base articles]
- [Similar past incidents]
````
Documentation Best Practices
- Use version control for configuration files and scripts
- Maintain a systems journal to track changes
- Create a knowledge base of common issues and solutions
- Use standard templates for different types of documentation
- Include context (what, why, when, how) in all documentation
- Make documentation searchable with tags and categories
- Review and update documentation regularly
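To put the first practice into action, etckeeper (or plain git) can keep configuration under version control. The sketch below assumes a Debian-style system with etckeeper available; the scripts directory is a placeholder.

```bash
# Option 1: etckeeper automatically commits /etc changes (e.g. on package operations)
sudo apt install etckeeper
sudo etckeeper init
sudo etckeeper commit "Initial snapshot"

# Option 2: plain git for a directory of admin scripts or configs
cd /opt/admin-scripts && git init && git add -A && git commit -m "Baseline"
```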
Tip: Good documentation transforms individual knowledge into team knowledge, reducing “bus factor” risk.
🚑 System Recovery
When things go seriously wrong, professionals have a toolkit ready to recover systems.
Rescue Environments
| Tool | Description | When to Use |
|---|---|---|
| SystemRescue | Live Linux environment | File recovery, system repair |
| GParted Live | Partition management | Disk partitioning issues |
| Boot-Repair-Disk | Boot loader repair | GRUB/boot problems |
| Clonezilla | Disk cloning | System backup/restore |
| DBAN | Disk wiping | Secure data destruction |
Common Recovery Commands
```bash
# File system check
fsck -f /dev/sda1

# Repair boot loader
grub-install /dev/sda
update-grub

# Recover deleted files
extundelete /dev/sda1 --restore-file /path/to/file

# Rescue data from failing disk
ddrescue /dev/sda /dev/sdb rescue.log

# Mount filesystem read-only for inspection
mount -o ro /dev/sda1 /mnt/recovery

# Check disk for bad sectors
badblocks -v /dev/sda
```
Recovery Strategies
- Boot issues
  - Use a rescue disk to boot
  - Check and repair the boot loader
  - Examine boot logs
- File system corruption
  - Mount read-only if possible
  - Run fsck on unmounted filesystems
  - Check SMART status of disks (see the sketch after this list)
- Database recovery
  - Use database-specific tools (e.g., mysqlcheck)
  - Apply transaction logs
  - Restore from backups
- Compromised systems
  - Isolate the system
  - Collect forensic evidence
  - Scan for malware
  - Rebuild from a known good state
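For the SMART check mentioned above, smartmontools is the usual tool; a quick sketch follows (the device name is an assumption).

```bash
# Overall health verdict
sudo smartctl -H /dev/sda

# Full attribute report (watch reallocated and pending sector counts)
sudo smartctl -a /dev/sda

# Start a short self-test; check the result afterwards with smartctl -a
sudo smartctl -t short /dev/sda
```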
Recovery Plan Template
```bash
# 1. Boot from rescue media
# [Instructions for specific rescue disk]

# 2. Mount filesystem read-only
mount -o ro /dev/sda1 /mnt/recover

# 3. Back up critical data
mkdir /mnt/backup
rsync -av /mnt/recover/home/ /mnt/backup/

# 4. Check filesystem
umount /mnt/recover
fsck -f /dev/sda1

# 5. Repair system files
mount /dev/sda1 /mnt/recover
chroot /mnt/recover

# 6. Repair boot loader
grub-install /dev/sda
update-grub

# 7. Verify critical services
systemctl list-unit-files --state=enabled
```
Warning: Always back up data before attempting system recovery. Test recovery procedures on non-critical systems before applying to production.
🔎 Preventative Measures
Professionals know that preventing problems is more efficient than fixing them.
| Tool | Description | What to Monitor |
|---|---|---|
| Nagios | Network/service monitoring | Service availability, performance |
| Prometheus | Metrics and alerting | System metrics, custom metrics |
| Grafana | Metrics visualization | Dashboards for all metrics |
| ELK Stack | Log aggregation/analysis | Centralized log management |
| Zabbix | Enterprise monitoring | Infrastructure monitoring |
| Netdata | Real-time monitoring | Low-level system metrics |
Preventative Checks
```bash
# Scheduled disk checks
tune2fs -l /dev/sda1 | grep 'Mount count'

# Check for updates
apt update && apt list --upgradable

# Check disk space growth trends
df -h >> /var/log/disk_usage.log

# Check for failed services
systemctl --failed

# Verify backups
find /backup/ -name "*.tar.gz" -mtime -1 | wc -l

# Check for compromised packages
debsums -c
```
Configuration Management
Using configuration management tools helps prevent configuration drift:
- Ansible - Agentless configuration management
- Puppet - Declarative configuration management
- Chef - Infrastructure as code
- Salt - Event-driven automation
- Terraform - Infrastructure provisioning
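As one example of catching drift with the tools above, Ansible can run in check mode to report differences between the declared state and reality; `site.yml` and the inventory path here are placeholders.

```bash
# Report (but do not apply) any drift from the desired state
ansible-playbook -i inventory/hosts.ini site.yml --check --diff

# Quick ad-hoc sanity check that all hosts are reachable
ansible all -i inventory/hosts.ini -m ping
```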
Regular Maintenance Schedule
| Interval | Task | Command Example |
|---|---|---|
| Daily | Check disk space | df -h |
| | Review logs | journalctl -p err -s yesterday |
| | Verify backups | ls -l /backup/daily/ |
| Weekly | Update packages | apt update && apt upgrade |
| | Check service status | systemctl list-units --state=failed |
| | Performance baseline | sar -A > /var/log/performance/weekly.log |
| Monthly | Full security scan | lynis audit system |
| | User account audit | getent passwd \| sort |
| | Filesystem check | tune2fs -C 999 /dev/sda1 |
| Quarterly | Disaster recovery test | [Restore test procedure] |
| | Performance review | compare baseline metrics |
| | Security audit | [Security checklist] |
Tip: Automate as many maintenance tasks as possible. Human-performed maintenance should focus on reviewing automated reports and addressing exceptions.
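A simple way to automate the daily items, sketched as a cron entry that reuses the health-check script from earlier; the script path and report directory are assumptions.

```bash
# Run the health check every morning at 06:00 (add with "crontab -e" as root)
# The /var/log/health/ directory must already exist; % is escaped because cron treats it specially.
0 6 * * * /usr/local/sbin/quick-health-check.sh /var/log/health/$(date +\%F).log
```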
📋 Log Management and Standardization
Effective log management goes beyond just viewing logs when problems occur.
Log Rotation - Beyond Default Settings
| Beginner | Professional |
|---|---|
| Uses default log rotation settings | Customizes rotation based on system purpose |
| Ignores log files until disk space issues | Proactively manages log growth |
| Manually deletes old logs | Configures automated rotation policies |
| One-size-fits-all approach | Service-specific rotation strategies |
Key Log Rotation Parameters
```
# Example logrotate configuration
/var/log/myapp/*.log {
    daily                   # Rotation frequency
    rotate 14               # Keep 14 rotated logs
    compress                # Compress rotated logs
    delaycompress           # Delay compression by one cycle
    missingok               # Don't error if log is missing
    notifempty              # Don't rotate empty logs
    create 0640 www-data    # Create new log with permissions
    sharedscripts           # Run scripts once per rotation
    postrotate
        systemctl reload myapp
    endscript
}
```
Log Format Standardization
Professional logging relies on a standardized format that includes these elements:
- Timestamps - ISO 8601 format (YYYY-MM-DDTHH:MM:SS.sss±ZZZZ)
- Severity levels - Use standard levels (ERROR, WARN, INFO, DEBUG)
- Source identification - Service, module, or function
- Correlation IDs - To track related events across services
- Structured data - JSON or similar for machine parsing
- Context - User IDs, request IDs, or other relevant context
```
2025-04-18T10:30:45.123+0530 INFO [web-server] [request-id: 1a2b3c4d] User authenticated: user_id=123, source_ip=192.168.1.100, result=success
```
Application Logging Best Practices
Configure the logging pipeline to apply these practices. For example, the following rsyslog configuration writes an application's logs to a dedicated file as structured JSON:
```
# Example rsyslog configuration
# /etc/rsyslog.d/myapp.conf
template(name="JsonFormat" type="list") {
    constant(value="{")
    constant(value="\"timestamp\":\"")   property(name="timereported" dateFormat="rfc3339")
    constant(value="\",\"host\":\"")     property(name="hostname")
    constant(value="\",\"severity\":\"") property(name="syslogseverity-text")
    constant(value="\",\"facility\":\"") property(name="syslogfacility-text")
    constant(value="\",\"tag\":\"")      property(name="syslogtag" format="json")
    constant(value="\",\"message\":\"")  property(name="msg" format="json")
    constant(value="\"}")
}

# Send myapp logs to a dedicated file in JSON format
if $programname == 'myapp' then {
    action(type="omfile" file="/var/log/myapp/application.log" template="JsonFormat")
}
```
Centralized Logging
Professionals often implement centralized logging for easier analysis:
- Collection - Filebeat, Fluentd, Logstash
- Storage - Elasticsearch, Loki, Graylog
- Analysis - Kibana, Grafana
- Alerting - ElastAlert, Grafana alerts
Example Filebeat Configuration
```yaml
# /etc/filebeat/filebeat.yml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/myapp/*.log
  json.keys_under_root: true
  json.add_error_key: true
  json.message_key: message

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "myapp-%{+YYYY.MM.dd}"
```
Tip: Design logs for both human readability and machine parsing. Structured logs are easier to search, filter, and analyze at scale.
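Structured logs pay off immediately for machine parsing. For example, with the JSON format above you can filter by field using jq; note that rsyslog's syslogseverity-text uses short names such as "err" and "warning".

```bash
# All error-level entries from the structured application log
jq -c 'select(.severity == "err")' /var/log/myapp/application.log

# Count entries per severity
jq -r '.severity' /var/log/myapp/application.log | sort | uniq -c | sort -rn
```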
📌 Final Thought
“Problems are inevitable; being unprepared is optional.”
Effective troubleshooting is not just a technical skill—it’s a mindset. By developing a systematic approach to diagnosing and resolving issues, you transform from someone who fights fires to someone who manages systems proactively.
The difference between a beginner and a professional is not just knowledge of commands or tools, but the discipline to follow methodical processes, document thoroughly, and learn from each incident.
Remember that the best troubleshooters are those who:
- Understand their systems deeply
- Follow structured approaches
- Document everything
- Learn from every incident
- Implement preventative measures
This checklist is your roadmap to developing that professional troubleshooting mindset.