Files and Directories
The UNIX file abstraction, descriptors, inodes, hard vs symbolic links, fsync, mount, and the permission-bit model.
What it is#
A file in UNIX is a linear sequence of bytes with a name and some metadata. A directory is a special file whose contents are name → inode mappings. Together they form a tree (or, with hard links, a DAG of byte sequences with multiple names). The abstraction is uniform across magnetic disks, SSDs, RAM-backed tmpfs, network mounts, and special pseudo-file systems like /proc and /sys.
This uniformity is one of UNIX’s enduring design wins. A program that reads bytes from stdin doesn’t care if stdin is a file on disk, a pipe from another process, a TTY, or a network socket — all of them are file descriptors, and read() works the same way.
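A minimal Python sketch of that uniformity. The helper names `read_all` and `demo_uniform_read` are made up for illustration; the point is only that the same `os.read` loop drains a regular file and a pipe without knowing which one it was handed.

```python
import os
import tempfile

def read_all(fd, chunk_size=4096):
    """Read from any file descriptor until end-of-file."""
    chunks = []
    while True:
        chunk = os.read(fd, chunk_size)
        if not chunk:  # an empty read signals EOF
            break
        chunks.append(chunk)
    return b"".join(chunks)

def demo_uniform_read():
    # Source 1: a regular file on disk.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"from a file")
        path = tmp.name
    file_fd = os.open(path, os.O_RDONLY)

    # Source 2: a pipe. Closing the write end lets the reader see EOF.
    read_end, write_end = os.pipe()
    os.write(write_end, b"from a pipe")
    os.close(write_end)

    try:
        return read_all(file_fd), read_all(read_end)
    finally:
        os.close(file_fd)
        os.close(read_end)
        os.unlink(path)
```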
When to use it#
You don’t “choose” the file abstraction — it’s the default I/O surface on every UNIX-derived system. The interesting choices are around how to use it:
- Files vs databases. Files are the right answer for blobs (images, video, logs, configs, source code, build artefacts) and for anything you want a human or another tool to read with cat. Databases win once you need indexed access, atomic multi-record updates, or many concurrent writers.
- Hard links vs symlinks. Hard links share the same inode and reference count; deletion is reference-counted. Symlinks store a path string and resolve at access time; they break when the target moves. Use hard links for deduplication and atomic publishing, symlinks for human-readable shortcuts.
- fsync() discipline. Whether your file’s contents survive a crash depends entirely on calling fsync at the right time. The default write() only buffers in the page cache.
How it works#
The inode is the actual file#
A directory entry maps name → inode number. The inode is the file. It stores:
- Type (regular, directory, symlink, device, FIFO, socket).
- Permission bits (rwxrwxrwx for owner, group, others; plus setuid, setgid, sticky).
- Owner UID and GID.
- Size in bytes.
- Timestamps: mtime (modify), atime (access), ctime (inode change).
- Link count: how many directory entries point at this inode.
- Block pointers: direct, indirect, double-indirect, triple-indirect (more in File System Implementation).
When you rm a file, you remove the directory entry; the inode is reclaimed only when the link count drops to zero and no process has the file open. That second condition is why you can rm /var/log/big.log while a daemon is writing to it — the inode persists until the daemon closes its file descriptor.
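The "name is gone, inode lives on" behaviour can be observed from user space. The sketch below is illustrative only (the function name `unlink_while_open` is an assumption): it unlinks a temporary file while still holding its descriptor, then reads the data back through that descriptor.

```python
import os
import tempfile

def unlink_while_open():
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"still here")
        os.unlink(path)  # removes the directory entry only

        name_gone = not os.path.exists(path)
        # On typical local Linux file systems this reports 0 links.
        links = os.fstat(fd).st_nlink

        # The open descriptor still reaches the inode's data.
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 100)
        return name_gone, links, data
    finally:
        os.close(fd)  # last reference dropped; the inode can be reclaimed
```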
File descriptors#
A process opens a file with open(path, flags) and gets back a small integer — the file descriptor (fd). The kernel keeps a per-process table of open files; index fd points at a kernel-side struct file which holds the inode reference, the current offset, and the access mode.
    process fd table        kernel open-file table                     inode cache
    [0] stdin    ───> struct file { inode=..., off=0 } ─> inode 42
    [1] stdout   ───> struct file { inode=..., off=0 } ─> inode 43
    [2] stderr   ───> struct file { inode=..., off=0 } ─> inode 43 (shared!)
    [3] data.log ───> struct file { inode=..., off=0 } ─> inode 7891

dup2(fd, newfd) lets a child process redirect stdout into a file; shells use this to implement > out.txt. pipe() returns two fds that are connected to each other; shells use this for |.
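As a rough sketch of what a shell does for `cmd > out.txt`, the hypothetical helper below forks, points fd 1 at a file with `os.dup2`, and then execs the command. It assumes a POSIX system with `fork`; real shells handle many more cases (append mode, error reporting, job control).

```python
import os

def run_with_stdout_to(path, argv):
    """Run argv with its stdout redirected to path; return the exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: rewire fd 1, then replace this process image.
        try:
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            if fd != 1:
                os.dup2(fd, 1)  # fd 1 now refers to the same open file
                os.close(fd)    # the extra descriptor is no longer needed
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)       # only reached if open or exec failed
    # Parent: wait for the child and report how it exited.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

The command itself never learns that its stdout is a file: it just writes to fd 1.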
Hard links vs symbolic links#
A hard link is just another directory entry pointing at the same inode. The inode’s link count goes up. There’s no notion of “the original” — all hard links are equal.
    ln a.txt b.txt   # both names now point at the same inode; link count = 2
    rm a.txt         # link count = 1, file still accessible as b.txt
    rm b.txt         # link count = 0, blocks freed

A symbolic link is a special inode whose contents are a path string. When you open() a symlink, the kernel resolves the path stored in it.
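The difference between the two link types can be checked from a script. This is a small sketch under the assumption that the temporary directory lives on a file system that supports both hard links and symlinks; `link_demo` and the file names are illustrative.

```python
import os
import tempfile

def link_demo():
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.txt")
        b = os.path.join(d, "b.txt")
        s = os.path.join(d, "s.txt")
        with open(a, "w") as f:
            f.write("payload")

        os.link(a, b)  # hard link: a second name for the same inode
        same_inode = os.stat(a).st_ino == os.stat(b).st_ino
        links_after_ln = os.stat(a).st_nlink

        os.symlink(a, s)  # symlink: a new inode that stores the path string
        stores_path = os.readlink(s) == a

        os.unlink(a)  # b still holds the inode; s now points at nothing
        with open(b) as f:
            survivor = f.read()
        dangling = os.path.islink(s) and not os.path.exists(s)

        return {
            "same_inode": same_inode,
            "links_after_ln": links_after_ln,
            "stores_path": stores_path,
            "survivor": survivor,
            "dangling": dangling,
        }
```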
    ln -s /usr/bin/python3 mypy        # mypy is a symlink whose 16-byte contents are "/usr/bin/python3"
    mv /usr/bin/python3 /usr/bin/py3   # mypy now dangles

Permissions#
The classical mode bits (rwxrwxrwx plus three special bits) encode access for three classes: user, group, and other. The setuid bit on an executable makes it run with the file owner’s effective UID (the classical example: /usr/bin/passwd runs as root so it can edit /etc/shadow). The sticky bit on a directory restricts deleting or renaming an entry to the file’s owner, the directory’s owner, or root (used on /tmp).
Modern systems layer on ACLs (POSIX setfacl / getfacl) for finer-grained permissions, capabilities (Linux cap_*) to drop the all-or-nothing setuid model, and mandatory access control (SELinux, AppArmor) for policy beyond owner+permission bits.
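To make the mode bits concrete, here is a small sketch that decodes a raw `st_mode` value with Python's `stat` module. The helper name `describe_mode` is an assumption for illustration; ACLs, capabilities, and MAC policy are not visible in these bits.

```python
import stat

def describe_mode(mode):
    """Decode the classical permission and special bits of an st_mode value."""
    return {
        "symbolic": stat.filemode(mode),        # ls -l style string
        "setuid": bool(mode & stat.S_ISUID),
        "setgid": bool(mode & stat.S_ISGID),
        "sticky": bool(mode & stat.S_ISVTX),
    }
```

A typical call would be `describe_mode(os.stat("/tmp").st_mode)`, which on most systems reports the sticky bit as set.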
fsync(), mount, and the path namespace#
- fsync(fd) forces all dirty data and metadata for that file to durable storage. Cost: one or more disk operations; can be milliseconds on HDDs, tens of microseconds on NVMe.
- mount(source, target, fstype, ...) attaches a file system at a path. The same source can be mounted in multiple places; mount --bind does this without re-reading the underlying device. Containers rely heavily on bind mounts.
- The path namespace can be made per-process on Linux with unshare(CLONE_NEWNS); a process can have its own / that’s invisible to others. This is the foundation of container isolation.
Variants#
Regular files vs special files#
- Block devices (/dev/sda): random-access devices, buffered through the page cache.
- Character devices (/dev/tty, /dev/random): sequential, unbuffered, byte-stream.
- FIFOs (named pipes): in-memory queue with a name on the file system.
- UNIX domain sockets: local IPC with a path.
- Pseudo-files (/proc/$pid/status, /sys/class/net/eth0/...): kernel data structures presented as files.
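The file type is recorded in the inode's mode, so a program can tell these variants apart with a single `lstat`. A minimal sketch, with `file_kind` as an illustrative name:

```python
import os
import stat

def file_kind(path):
    """Classify a path by the type bits in its inode (symlinks not followed)."""
    mode = os.lstat(path).st_mode
    checks = [
        (stat.S_ISREG, "regular file"),
        (stat.S_ISDIR, "directory"),
        (stat.S_ISLNK, "symlink"),
        (stat.S_ISBLK, "block device"),
        (stat.S_ISCHR, "character device"),
        (stat.S_ISFIFO, "FIFO"),
        (stat.S_ISSOCK, "socket"),
    ]
    for predicate, label in checks:
        if predicate(mode):
            return label
    return "unknown"
```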
Path resolution#
open("/a/b/c") walks the path component by component: read /’s inode, look up a in its directory entries, read a’s inode, look up b, …. Each step can miss the page cache and cost a disk read. The kernel’s dcache (directory entry cache) is critical to making this fast on hot paths.
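The walk can be imitated in user space with directory-relative lookups. This is a sketch only: the kernel does the equivalent internally with the dcache, and it needs just search permission on each directory, whereas this version opens each directory for reading. The name `walk_inodes` is an assumption.

```python
import os

def walk_inodes(path):
    """Resolve an absolute path one component at a time.

    Returns a list of (component, inode number) pairs, starting at "/".
    """
    if not path.startswith("/"):
        raise ValueError("expects an absolute path")
    names = [n for n in path.split("/") if n]
    trail = []
    dir_fd = os.open("/", os.O_RDONLY | os.O_DIRECTORY)
    try:
        trail.append(("/", os.fstat(dir_fd).st_ino))
        for i, name in enumerate(names):
            # Look the name up in the current directory, then read its inode.
            st = os.stat(name, dir_fd=dir_fd)
            trail.append((name, st.st_ino))
            if i < len(names) - 1:
                # Descend: the component just found becomes the new directory.
                next_fd = os.open(name, os.O_RDONLY | os.O_DIRECTORY, dir_fd=dir_fd)
                os.close(dir_fd)
                dir_fd = next_fd
    finally:
        os.close(dir_fd)
    return trail
```

Each loop iteration corresponds to one "look up the name, read the inode" step, which is the unit of work the dcache short-circuits.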
File holes (sparse files)#
Writing at offset 1 GB into a fresh 0-byte file produces a 1-GB file that occupies almost no disk space. The intervening blocks have no on-disk allocation; reads return zeros. cp typically preserves holes; cp --sparse=never materialises them.
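A sketch of creating a hole and comparing logical size with allocated space. The helper `make_sparse` is illustrative, and it assumes `st_blocks` counts 512-byte units, as it does on Linux; whether the hole is actually left unallocated depends on the file system.

```python
import os

def make_sparse(path, hole_bytes=64 * 1024 * 1024):
    """Write one byte after a hole; return (logical size, allocated bytes)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.lseek(fd, hole_bytes, os.SEEK_SET)  # seek past EOF, no allocation yet
        os.write(fd, b"x")                     # only this byte needs a block
        st = os.fstat(fd)
    finally:
        os.close(fd)
    return st.st_size, st.st_blocks * 512
```

On a file system that supports holes, the second number should be far smaller than the first.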
Trade-offs#
- Atomic rename. rename(src, dst) on the same file system is atomic at the directory-entry level. The “write to tmp, fsync, rename” pattern is how applications safely overwrite a config file: readers always see either the old or the new contents, never a partial write.
- fsync vs fdatasync. fdatasync skips flushing inode metadata that doesn’t affect data integrity (mtime, atime). Marginally faster; rarely worth the complexity outside high-IO databases.
- O_DIRECT. Bypasses the page cache. Reads and writes go straight to the device. Databases that manage their own buffer pool use this to avoid double-caching. Cost: the application must do its own alignment and buffering.
- Page cache. All read()/write() traffic goes through the kernel page cache by default. Hot files stay in RAM; modifications are written back lazily. Without fsync, changes that exist only in the cache are lost if the machine crashes before writeback.
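The “write to tmp, fsync, rename” pattern might look like the sketch below. The function name `atomic_write` and the fixed temporary file name are assumptions for illustration; a production version would likely use a unique temporary name and clean it up on failure.

```python
import os

def atomic_write(path, data):
    """Replace path with data so readers see the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same file system for rename to be atomic.
    tmp_path = os.path.join(directory, "." + os.path.basename(path) + ".tmp")

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:           # write() may be partial
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                   # file contents reach durable storage
    finally:
        os.close(fd)

    os.rename(tmp_path, path)          # atomically swap the directory entry

    # Make the rename itself durable by syncing the parent directory.
    dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The order matters: syncing the data before the rename ensures the new name never points at a file whose contents are still only in the page cache.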
Common pitfalls#
- rm doesn’t free space if a process has the file open. A daemon is writing to /var/log/huge.log, you rm it, and df still shows the disk full. Run lsof | grep deleted to find offenders; restart the daemon.
- Writing without fsync. “I called write() and the call returned, why did I lose my data?” Because write() only copies into the page cache. See Crash Consistency — fsck and Journaling.
- Forgetting to fsync the directory. Creating a new file and fsync-ing the file is not enough; you also need fsync(parent_dir_fd) so the directory entry is durable.
- Path-based race conditions (TOCTOU). With stat("/tmp/foo") followed by open("/tmp/foo"), the path can be replaced between the two calls. Use *at() calls (openat, fstatat) anchored to a directory fd, or O_NOFOLLOW.
- Treating mtime as a reliable cache key. mtime has 1-second resolution on some file systems, and applications can set it arbitrarily with utimes(). Build systems that rely on mtime ordering (Make) occasionally misfire; content-hash caches (Bazel) are more robust.
- Hard-linking across file systems. Fails with EXDEV, because hard links live within one inode space. Symlinks cross file systems fine.
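One way to sidestep the stat-then-open race is to open first and then inspect the descriptor, since `fstat` describes the object that was actually opened. The sketch below assumes that approach; `open_regular_file` is an illustrative name, and `O_NOFOLLOW` only guards the final path component.

```python
import os
import stat

def open_regular_file(path):
    """Open path read-only, refusing symlinks and non-regular files."""
    # O_NOFOLLOW makes open() fail if the last component is a symlink.
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    try:
        # Check the object we hold, not whatever the path names now.
        st = os.fstat(fd)
        if not stat.S_ISREG(st.st_mode):
            raise OSError(f"{path} is not a regular file")
    except BaseException:
        os.close(fd)
        raise
    return fd
```

The caller owns the returned descriptor and is responsible for closing it.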
Why does Linux still distinguish atime from mtime in 2026?
POSIX requires it. The find -atime and tmpwatch style of “delete files not read in N days” depends on atime. Most systems mount with relatime, which updates atime only when the existing atime is no later than mtime or ctime, or is more than a day old. That answers the common “has this been read since the last write” question without paying for an atime update on every read. Disabling atime entirely (noatime) is a common database-server tweak.
Related building blocks#
- File System Implementation — how the inode and directory abstractions are actually laid out on disk.
- The Fast File System (FFS) — the canonical disk-aware implementation.
- Crash Consistency — fsck and Journaling — what protects the file system from torn writes.
- Hard Disk Drives — the substrate that shaped the inode layout.
- I/O Devices and Drivers — what’s under the file system.