Linux System Internals

🖥️ Linux System Internals — Understanding the Engine Under the Hood


🎯 What You Will Learn

  • What actually happens inside the OS when you type and run a command in a terminal
  • What a process is, how it’s represented, and how it differs from a program
  • What a system call (syscall) is and why it’s the bridge between user space and kernel space
  • What file descriptors are and why almost everything in Linux is treated as a file
  • How /proc and /dev expose the kernel’s internal state as navigable file trees

📝 Topic Overview


🔹 What Happens When You Run a Command?

When you type ls -la and press Enter, a multi-step sequence unfolds across several layers of the operating system.

Step-by-step breakdown:

  1. The shell (e.g., bash, zsh) reads your input and parses it into a command name and arguments.
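  2. The shell calls fork(), which creates a child process as a near-identical copy of the shell itself. On current glibc, fork() is implemented with the clone() syscall, so clone is the name that shows up in an strace trace.
  3. The child, still running the shell's code, calls execve() (another syscall) to replace its own memory image with the ls binary, which the shell located by searching $PATH (typically /usr/bin/ls).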
  4. The kernel loads the ELF binary into memory, sets up the stack, heap, and text segments, and begins execution.
  5. ls runs, makes syscalls like getdents64() to read directory entries, and writes output via write() to stdout (file descriptor 1).
  6. The parent shell calls wait(), blocking until the child exits. On exit, the child’s resources are cleaned up and its exit code is returned to the shell.

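Key insight: For an external program, the shell does not run the binary in place. It forks itself first, then replaces the forked child with the target binary. This is the fork-exec pattern, the cornerstone of Unix process creation. Shell builtins such as cd are the exception: they run inside the shell process without a fork.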
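The fork, exec, and wait steps above can be approximated from the shell itself. This is a minimal sketch, not how bash is implemented internally: run_external, child_pid, and child_status are hypothetical names, and the subshell stands in for the fork() step.

```shell
# Hypothetical helper that mimics the fork-exec-wait steps for one command.
run_external() {
    # fork: the parenthesised subshell is a child process copied from this shell.
    # exec: inside that child, exec replaces the shell with the target program.
    ( exec "$@" ) &
    child_pid=$!

    # wait: block until the child exits and collect its exit status.
    child_status=0
    wait "$child_pid" || child_status=$?
    return "$child_status"
}

run_external ls -la / >/dev/null
echo "child $child_pid exited with status $child_status"
```

The exit status the helper returns is the same value the shell would place in $? after running the command directly.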


🔹 What Is a Process?

A process is an instance of a running program — a program in execution, complete with its own isolated resources.

Every process has:

  • A unique PID (Process ID)
  • Its own virtual address space (stack, heap, text/code, data segments)
  • File descriptors (open files, sockets, pipes)
  • A process state: running, sleeping, zombie, stopped, etc.
  • A parent process (every process except PID 1 has one)

Program vs Process: A program is a static file on disk (e.g., /usr/bin/python3). A process is that program actively loaded into memory and running. You can have 10 Python processes all spawned from the same binary.
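To make the distinction concrete, the sketch below starts two processes from the same program file and reads their parent PIDs from /proc. It assumes a Linux system with /proc mounted; the variable names are illustrative only.

```shell
# Start two processes from the same program file (the sleep binary).
sleep 30 &
pid_a=$!
sleep 30 &
pid_b=$!

# Read each process's parent PID from the kernel's per-process status file.
ppid_a=$(sed -n 's/^PPid:[[:space:]]*//p' "/proc/$pid_a/status")
ppid_b=$(sed -n 's/^PPid:[[:space:]]*//p' "/proc/$pid_b/status")

echo "sleep #1: PID $pid_a, parent PID $ppid_a"
echo "sleep #2: PID $pid_b, parent PID $ppid_b"

# Clean up the two background processes.
kill "$pid_a" "$pid_b"
wait "$pid_a" "$pid_b" 2>/dev/null || true
```

The two PIDs differ even though both processes come from one binary, and both report the same parent: the shell that launched them.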

# View all running processes with full details
ps aux

# View process tree (shows parent-child relationships)
pstree -p

# Monitor processes in real time
top
htop  # friendlier interface; often has to be installed separately

🔹 What Is a Syscall?

A system call (syscall) is a controlled entry point into the kernel — the mechanism user-space programs use to request privileged operations from the OS.

Modern CPUs enforce a hard boundary between user space (where applications run) and kernel space (where the OS kernel runs). User code cannot directly touch hardware, allocate memory from the kernel, or read another process’s memory. To do any of these, it must ask the kernel via a syscall.

Common syscalls you use every day:

| Syscall | What It Does |
| --- | --- |
| fork() | Create a child process |
| execve() | Replace current process with a new program |
| open() | Open a file, return a file descriptor |
| read() | Read bytes from a file descriptor |
| write() | Write bytes to a file descriptor |
| close() | Close a file descriptor |
| mmap() | Map files or memory into the process address space |
| exit() | Terminate the current process |
| wait() | Wait for a child process to finish |
| socket() | Create a network socket |

How it works mechanically: The program places the syscall number in a CPU register (e.g., rax on x86-64), puts arguments in other registers, then executes a special CPU instruction (syscall on x86-64). The CPU switches to kernel mode, the kernel dispatches to the appropriate handler, executes it, and returns the result.

# Trace every syscall a command makes as it runs
strace ls -la

# Print a summary table instead: calls, errors, and time per syscall
strace -c ls -la
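To see which syscalls a command actually issues, a small wrapper around strace can reduce the trace to a list of names. This is a sketch under two assumptions: strace is installed, and the environment permits tracing (some containers do not). trace_syscalls is a hypothetical helper name.

```shell
# Hypothetical helper: count the distinct syscalls a command makes.
trace_syscalls() {
    if ! command -v strace >/dev/null 2>&1; then
        echo "strace not installed"
        return 0
    fi

    trace_file=$(mktemp)
    # -f follows child processes; -o writes the trace to a file instead of stderr.
    strace -f -o "$trace_file" "$@" >/dev/null 2>&1 || true

    # With -f and -o, each call is logged as: PID name(args) = result
    # Keep only the syscall name, then count how often each one appears.
    sed -n 's/^[0-9]* *\([a-z_0-9]*\)(.*/\1/p' "$trace_file" | sort | uniq -c | sort -rn
    rm -f "$trace_file"
}

trace_syscalls ls -la /
```

When tracing works, execve appears in the list, since every traced command starts with that call.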

Analogy: Syscalls are like a restaurant menu — user programs can only order from the menu (the defined syscall interface). They cannot walk into the kitchen (kernel space) directly.


🔹 What Is a File Descriptor?

A file descriptor (FD) is a non-negative integer that represents an open I/O resource within a process. In Linux, almost everything — regular files, directories, pipes, sockets, terminals, devices — is accessed through file descriptors.

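Standard file descriptors (opened by convention before a program starts; a parent process can close or redirect them):

| FD | Name | Default Target |
| --- | --- | --- |
| 0 | stdin | Keyboard (terminal input) |
| 1 | stdout | Terminal output |
| 2 | stderr | Terminal error output |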
When you call open(), the kernel returns the lowest-numbered FD that is not already in use, which is 3 in a process that still has 0, 1, and 2 open. Reading and writing then use that integer via read(fd, ...) and write(fd, ...).
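The open, write, read, and close sequence can be tried from the shell with numbered redirections. One difference from open() is worth noting: here the script picks the descriptor numbers (3 and 4) itself rather than letting the kernel choose. fd_roundtrip and the temporary file are illustrative names, not part of any standard tool.

```shell
# The parentheses make the function body run in a subshell,
# so the descriptors opened here do not leak into the calling shell.
fd_roundtrip() (
    demo_file=$(mktemp)

    exec 3>"$demo_file"          # open the file for writing as descriptor 3
    echo "written via fd 3" >&3  # write to descriptor 3
    exec 3>&-                    # close descriptor 3

    exec 4<"$demo_file"          # open the same file for reading as descriptor 4
    read -r line <&4             # read one line from descriptor 4
    exec 4<&-                    # close descriptor 4

    rm -f "$demo_file"
    echo "$line"
)

fd_roundtrip
```

Running it prints the line that was written through descriptor 3 and read back through descriptor 4.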

# See all open file descriptors for a process (replace PID)
ls -la /proc/<PID>/fd

# Example: see FDs for your shell
ls -la /proc/$$/fd

“Everything is a file” is one of Unix’s foundational philosophies. A network socket, a hardware device, even inter-process communication via pipes — all use the same open/read/write/close interface. This uniformity is what makes shell pipelines (cmd1 | cmd2) and I/O redirection (cmd > file) so powerful.
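Because stdout and stderr are just descriptors 1 and 2, redirection and pipes can route them independently. The following sketch uses a hypothetical emit function that writes one line to each stream.

```shell
# Hypothetical command that writes one line to stdout and one to stderr.
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

out_only=$(emit 2>/dev/null)        # discard fd 2, capture fd 1
err_only=$(emit 2>&1 >/dev/null)    # point fd 2 at the capture, then discard fd 1
line_count=$(emit 2>&1 | wc -l)     # merge fd 2 into fd 1 and pipe both to wc

echo "stdout only: $out_only"
echo "stderr only: $err_only"
echo "lines through the pipe: $line_count"
```

The order of redirections matters in the second case: 2>&1 copies the current target of fd 1 before fd 1 is sent to /dev/null.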


🔹 What Is /proc?

/proc is a virtual filesystem (not on disk) — a window into the live state of the kernel and all running processes, presented as a directory tree.

ls /proc
# Output: numbered dirs (one per PID), plus: cpuinfo, meminfo, uptime, version, etc.

cat /proc/cpuinfo       # CPU model, cores, flags
cat /proc/meminfo       # RAM usage and breakdown
cat /proc/uptime        # System uptime in seconds
cat /proc/version       # Kernel version string
cat /proc/$$/status     # Status of the current shell process
cat /proc/$$/maps       # Virtual memory map of the current process

Each numbered directory (e.g., /proc/1234) represents process 1234, containing: cmdline, environ, fd/, maps, status, stat, and more.
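Since these entries are plain text, ordinary tools can pull individual fields out of them. A minimal sketch, assuming a Linux /proc layout; proc_field is a hypothetical helper.

```shell
# Hypothetical helper: print one field from /proc/<PID>/status.
# $1 = field name (for example Name, State, PPid), $2 = PID
proc_field() {
    sed -n "s/^$1:[[:space:]]*//p" "/proc/$2/status"
}

echo "name:   $(proc_field Name $$)"
echo "state:  $(proc_field State $$)"
echo "parent: $(proc_field PPid $$)"

# cmdline separates arguments with NUL bytes; show them space-separated.
tr '\0' ' ' < "/proc/$$/cmdline"
echo

# Each entry in fd/ is one open file descriptor.
echo "open descriptors: $(ls "/proc/$$/fd" | wc -l)"
```

Tools such as ps and top get their data from these same files.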


🔹 What Is /dev?

/dev is a directory containing device files — special files that represent hardware and virtual devices. Interacting with a device is as simple as reading or writing to its file.

| Device File | Description |
| --- | --- |
| /dev/sda, /dev/nvme0n1 | Block storage (hard drive, SSD) |
| /dev/null | Discards all writes; reads return EOF |
| /dev/zero | Reads return an endless stream of zero bytes |
| /dev/random, /dev/urandom | Cryptographically secure random bytes |
| /dev/tty | The process's controlling terminal |
| /dev/stdin, /dev/stdout | Symlinks to FD 0 and 1 |

# Silence error messages by redirecting stderr to /dev/null
rm important_file 2>/dev/null

# Fill a file with zeros (e.g., create a 1MB blank file)
dd if=/dev/zero of=blank.bin bs=1M count=1
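The behaviour listed in the table can be checked with ordinary read commands. The variable names in this sketch are illustrative.

```shell
# Reading /dev/null returns end-of-file immediately, so zero bytes arrive.
null_bytes=$(wc -c < /dev/null)

# Read 1024 bytes from /dev/zero, delete every zero byte, count what is left.
nonzero_bytes=$(head -c 1024 /dev/zero | tr -d '\0' | wc -c)

# Read 8 random bytes and print them as 16 hexadecimal digits.
random_hex=$(head -c 8 /dev/urandom | od -An -tx1 | tr -d ' \n')

echo "bytes read from /dev/null: $null_bytes"
echo "non-zero bytes in 1024 bytes of /dev/zero: $nonzero_bytes"
echo "8 random bytes: $random_hex"
```

The first two counts are always 0; the hex string changes on every run.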

🔹 How Does the Kernel Schedule Tasks?

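The Linux scheduler determines which process runs on which CPU core at any given moment. The Completely Fair Scheduler (CFS) was the default scheduler for normal tasks from kernel 2.6.23 until kernel 6.6, which replaced it with the EEVDF scheduler. The CFS concepts below remain a useful mental model, and nice values still adjust a task's weight under both.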

Core concepts:

  • The scheduler maintains a red-black tree of runnable processes, sorted by virtual runtime (vruntime) — how long each process has run weighted by its priority.
  • The process with the smallest vruntime always runs next — ensuring every process gets a fair share of CPU time.
  • Nice values (-20 to +19) adjust priority. Lower nice = higher priority. Default is 0.
  • Processes can be preempted: the kernel can forcibly interrupt a running process to give CPU time to another.

# Run a command with lower priority (nicer to other processes)
nice -n 10 ./my_script.sh

# Change priority of a running process
renice +5 -p <PID>

# View scheduler stats per process
cat /proc/<PID>/sched
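The effect of nice values on virtual runtime can be sketched with a little arithmetic. This is a simplified model, not the kernel's code: it assumes vruntime advances by runtime × 1024 / weight, and uses the kernel's load weights for nice 0, 5, and 10 (1024, 335, and 110). vruntime_delta is a hypothetical helper.

```shell
# nice with no arguments prints the niceness of the current shell.
base_nice=$(nice)
# Run the same query five steps "nicer" and compare.
lowered_nice=$(nice -n 5 nice)
echo "shell niceness: $base_nice, child started with nice -n 5: $lowered_nice"

# Simplified CFS accounting: vruntime advances by runtime * 1024 / weight.
# $1 = microseconds of CPU time used, $2 = load weight of the task
vruntime_delta() {
    echo $(( $1 * 1024 / $2 ))
}

# Weights 1024, 335 and 110 correspond to nice 0, 5 and 10 in the kernel's table.
echo "nice 0:  $(vruntime_delta 10000 1024)"
echo "nice 5:  $(vruntime_delta 10000 335)"
echo "nice 10: $(vruntime_delta 10000 110)"
```

For the same 10 ms of CPU time, the nice 10 task accumulates roughly nine times the vruntime of the nice 0 task, so it moves to the right of the tree faster and is picked less often.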

💡 References & Learning Resources

  • “The Linux Programming Interface” by Michael Kerrisk — the definitive reference (Advanced/Deep dive)
  • “Linux Kernel Development” by Robert Love — internals explained clearly (Intermediate)
  • man 2 syscalls — complete syscall list from the Linux man pages (Beginner-friendly)
  • “Operating Systems: Three Easy Pieces” (ostep.org) — free online OS textbook (Beginner-friendly)
  • Linux kernel source: https://elixir.bootlin.com/linux/latest/source (Advanced/Deep dive)

📊 Quick Recap

  • External commands run via the fork-exec pattern: the shell forks a child, then the child execs the target binary.
  • A process is a running instance of a program with its own PID, memory space, and file descriptors.
  • Syscalls are the only safe, controlled gateway from user-space code into the kernel — triggered by a CPU instruction that switches privilege levels.
  • File descriptors are integers that abstract all I/O resources; FDs 0, 1, 2 are stdin, stdout, and stderr.
  • /proc is a live, in-memory pseudo-filesystem exposing the kernel’s view of every process and system state.
  • /dev contains device files — reading/writing them interacts with hardware or virtual devices like /dev/null and /dev/urandom.
  • The CFS scheduler uses virtual runtime on a red-black tree to ensure fair CPU time distribution; nice values tune priority.

🏷️ Tags

#Linux #SystemInternals #Kernel #Process #Syscall #FileDescriptor #proc #dev #CFS #Scheduler #UnixPhilosophy #OperatingSystems #CLI #Intermediate #ComputerScience
This post is licensed under CC BY 4.0 by the author.