Files and Directories — Operating Systems

What it is

A file in UNIX is a linear sequence of bytes with a name and some metadata. A directory is a special file whose contents are name → inode mappings. Together they form a tree (or, with hard links, a DAG of byte sequences with multiple names). The abstraction is uniform across magnetic disks, SSDs, RAM-backed tmpfs, network mounts, and special pseudo-file systems like /proc and /sys.

This uniformity is one of UNIX’s enduring design wins. A program that reads bytes from stdin doesn’t care if stdin is a file on disk, a pipe from another process, a TTY, or a network socket — all of them are file descriptors, and read() works the same way.
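A minimal sketch of that uniformity, written in Python because its os module is a thin wrapper over the underlying system calls. The helper names read_all and demo are hypothetical. The read loop only ever sees an integer descriptor, so the same code drains a regular file and a pipe.

```python
import os
import tempfile

def read_all(fd, chunk_size=4096):
    """Drain a file descriptor with read() until end-of-file."""
    chunks = []
    while True:
        chunk = os.read(fd, chunk_size)
        if not chunk:          # an empty result means EOF
            break
        chunks.append(chunk)
    return b"".join(chunks)

def demo():
    """Read the same payload back through a regular file and through a pipe."""
    payload = b"hello, bytes\n"

    # Source 1: a regular file on disk.
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(payload)
        tmp.flush()
        fd = os.open(tmp.name, os.O_RDONLY)
        try:
            from_file = read_all(fd)
        finally:
            os.close(fd)

    # Source 2: a pipe. Closing the write end is what lets the reader see EOF.
    read_end, write_end = os.pipe()
    os.write(write_end, payload)
    os.close(write_end)
    try:
        from_pipe = read_all(read_end)
    finally:
        os.close(read_end)

    return from_file, from_pipe
```

Calling demo() should return the same bytes twice; read_all never needed to know which kind of object was behind the descriptor.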

When to use it

You don’t “choose” the file abstraction — it’s the default I/O surface on every UNIX-derived system. The interesting choices are around how to use it:

Files vs databases. Files are the right answer for blobs (images, video, logs, configs, source code, build artefacts) and for anything you want a human or another tool to read with cat. Databases win once you need indexed access, atomic multi-record updates, or many concurrent writers.
Hard links vs symlinks. A hard link is another name for the same inode, and the file is freed only when its link count reaches zero. A symlink stores a path string that is resolved at access time, so it dangles when the target moves. Use hard links for deduplication and atomic publishing, symlinks for human-readable shortcuts.
fsync() discipline. Data is guaranteed to survive a crash only once fsync has returned successfully. By default, write() just copies into the page cache, and the kernel writes it back some time later.

How it works

The inode is the actual file

A directory entry maps name → inode number. The inode is the file. It stores:

Type (regular, directory, symlink, device, FIFO, socket).
Permission bits (rwxrwxrwx for owner, group, others; plus setuid, setgid, sticky).
Owner UID and GID.
Size in bytes.
Timestamps — mtime (modify), atime (access), ctime (inode change).
Link count — how many directory entries point at this inode.
Block pointers — direct, indirect, double-indirect, triple-indirect (more in File System Implementation).

When you rm a file, you remove the directory entry; the inode is reclaimed only when the link count drops to zero and no process has the file open. That second condition is why you can rm /var/log/big.log while a daemon is writing to it — the inode persists until the daemon closes its file descriptor.
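The second condition can be observed directly. The sketch below (function and file names are hypothetical) removes a file's only name while a descriptor is still open, then keeps reading through that descriptor. It assumes a local Linux file system; network file systems may report the link count differently.

```python
import os
import tempfile

def unlink_while_open():
    """Remove a file's only name while holding it open, then keep using it."""
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "big.log")    # hypothetical file name

    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, b"still here")
        os.unlink(path)                          # drops the directory entry

        name_exists = os.path.exists(path)       # the name is gone
        link_count = os.fstat(fd).st_nlink       # no directory entries remain
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 100)                  # the inode's data is still readable
    finally:
        os.close(fd)                             # last reference: inode can be reclaimed
        os.rmdir(directory)
    return name_exists, link_count, data
```

On such a system the function should report that the name no longer exists, that the link count is 0, and that the data is still readable.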

File descriptors

A process opens a file with open(path, flags) and gets back a small integer — the file descriptor (fd). The kernel keeps a per-process table of open files; index fd points at a kernel-side struct file which holds the inode reference, the current offset, and the access mode.

process fd table         kernel open-file table        inode cache
[0] stdin     ───>  struct file { inode=...,off=0 }─>  inode 42
[1] stdout    ───>  struct file { inode=...,off=0 }─>  inode 43
[2] stderr    ───>  struct file { inode=...,off=0 }─>  inode 43  (shared!)
[3] data.log  ───>  struct file { inode=...,off=0 }─>  inode 7891
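One consequence of keeping the offset in the kernel-side struct file rather than in the fd itself: descriptors created with dup() share an offset, while a second open() of the same path gets its own. A small sketch, with a hypothetical function name:

```python
import os
import tempfile

def offsets_after_read():
    """Show which descriptors share a file offset."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"abcdefgh")
        tmp.flush()

        fd_a = os.open(tmp.name, os.O_RDONLY)   # new open-file entry
        fd_b = os.dup(fd_a)                     # second fd, same open-file entry
        fd_c = os.open(tmp.name, os.O_RDONLY)   # independent open-file entry
        try:
            os.read(fd_a, 3)                    # advances the offset behind fd_a
            # lseek(fd, 0, SEEK_CUR) reports the current offset without moving it.
            return tuple(os.lseek(fd, 0, os.SEEK_CUR) for fd in (fd_a, fd_b, fd_c))
        finally:
            for fd in (fd_a, fd_b, fd_c):
                os.close(fd)
```

Reading three bytes through fd_a moves the offset seen through fd_b as well, but not the one behind fd_c.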

dup2(fd, newfd) lets a child process redirect stdout into a file — shells use this to implement > out.txt. pipe() returns two fds that are connected to each other — shells use this for |.
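A sketch of how a shell might implement > out.txt with fork, dup2, and exec. The names run_redirected and demo are hypothetical, and error handling is reduced to a single exit code. It assumes a UNIX-like system where os.fork is available.

```python
import os
import sys
import tempfile

def run_redirected(argv, out_path):
    """Run argv in a child process with its stdout redirected to out_path."""
    pid = os.fork()
    if pid == 0:
        # Child: make fd 1 refer to the file, then replace the process image.
        try:
            fd = os.open(out_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            os.dup2(fd, 1)      # fd 1 (stdout) now points at out_path
            os.close(fd)        # the extra descriptor is no longer needed
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)       # only reached if open, dup2, or exec failed
    # Parent: wait for the child; assumes it exited normally rather than by signal.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)

def demo():
    """Rough equivalent of: python3 -c "print('hi')" > out.txt"""
    out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
    status = run_redirected([sys.executable, "-c", "print('hi')"], out_path)
    with open(out_path) as f:
        return status, f.read()
```

The child program is unaware of the redirection: it writes to fd 1 as usual, and the kernel delivers the bytes to the file.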

Hard links vs symbolic links

A hard link is just another directory entry pointing at the same inode. The inode’s link count goes up. There’s no notion of “the original” — all hard links are equal.

ln a.txt b.txt   # both names now point at the same inode; link count = 2
rm a.txt         # link count = 1, file still accessible as b.txt
rm b.txt         # link count = 0, blocks freed

A symbolic link is a special inode whose contents are a path string. When you open() a symlink, the kernel resolves the path stored in it.

ln -s /usr/bin/python3 mypy   # mypy is a symlink whose 16-byte content is "/usr/bin/python3"
mv /usr/bin/python3 /usr/bin/py3   # mypy now dangles
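The same behaviour can be checked programmatically. This sketch (function and file names are hypothetical) creates one hard link and one symlink to a file in a temporary directory, then removes the original name.

```python
import os
import shutil
import tempfile

def link_demo():
    """Create a hard link and a symlink to the same file and compare them."""
    d = tempfile.mkdtemp()
    a, b, s = (os.path.join(d, n) for n in ("a.txt", "b.txt", "s.txt"))
    with open(a, "w") as f:
        f.write("data")

    os.link(a, b)        # hard link: a second name for the same inode
    os.symlink(a, s)     # symlink: a separate inode holding the path string

    facts = {
        "same_inode": os.stat(a).st_ino == os.stat(b).st_ino,
        "link_count": os.stat(a).st_nlink,
        "symlink_is_own_inode": os.lstat(s).st_ino != os.stat(a).st_ino,
        "symlink_stores_path": os.readlink(s) == a,
    }

    os.unlink(a)         # remove the original name
    with open(b) as f:
        facts["hard_link_survives"] = f.read() == "data"
    # exists() follows the link; islink() looks at the link itself.
    facts["symlink_dangles"] = os.path.islink(s) and not os.path.exists(s)

    shutil.rmtree(d)
    return facts
```

After the original name is removed, the data is still reachable through the hard link, while the symlink still exists but points at nothing.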

Permissions

The classical mode bits (rwxrwxrwx plus three special bits) encode access for user, group, and other. The setuid bit on an executable makes it run with the file owner’s effective UID (the classical example: /usr/bin/passwd runs as root so it can edit /etc/shadow); setgid does the same for the file’s group. The sticky bit on a directory means an entry can be deleted or renamed only by the entry’s owner, the directory’s owner, or root (used on /tmp).
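To make the bit layout concrete, here is a small sketch that decodes an st_mode value with Python's stat module. The function name and the two example constants are illustrative assumptions, not values read from a real system.

```python
import stat

def describe_mode(mode):
    """Summarise the classical permission bits of an st_mode value."""
    return {
        "text": stat.filemode(mode),              # the string `ls -l` would show
        "setuid": bool(mode & stat.S_ISUID),
        "setgid": bool(mode & stat.S_ISGID),
        "sticky": bool(mode & stat.S_ISVTX),
        "owner_can_write": bool(mode & stat.S_IWUSR),
        "others_can_write": bool(mode & stat.S_IWOTH),
    }

# Illustrative mode values (assumed, not read from a real system):
SETUID_EXECUTABLE = stat.S_IFREG | 0o4755    # typical for a setuid binary
WORLD_WRITABLE_TMP = stat.S_IFDIR | 0o1777   # typical for /tmp
```

For a real file, pass os.stat(path).st_mode. With the constants above, describe_mode(SETUID_EXECUTABLE)["text"] is "-rwsr-xr-x" and describe_mode(WORLD_WRITABLE_TMP)["text"] is "drwxrwxrwt".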

Modern systems layer on ACLs (POSIX setfacl / getfacl) for finer-grained permissions, capabilities (Linux cap_*) to drop the all-or-nothing setuid model, and mandatory access control (SELinux, AppArmor) for policy beyond owner+permission bits.

fsync(), mount, and the path namespace

fsync(fd) forces all dirty data and metadata for that file to durable storage. Cost: one or more disk operations; can be milliseconds on HDDs, tens of microseconds on NVMe.
mount(source, target, fstype, ...) attaches a file system at a path. The same source can be mounted in multiple places; mount --bind makes an already-mounted directory tree visible at a second path without creating a new file system instance. Containers rely heavily on bind mounts.
The mount namespace can be private to a process on Linux: after unshare(CLONE_NEWNS), a process has its own mount table, so it can have a / that is invisible to others. This is one of the foundations of container isolation.

Variants

Regular files vs special files

Block devices (/dev/sda) — random-access devices, buffered through the page cache.
Character devices (/dev/tty, /dev/random) — sequential, unbuffered, byte-stream.
FIFOs (named pipes) — in-memory queue with a name on the file system.
UNIX domain sockets — local IPC with a path.
Pseudo-files — /proc/$pid/status, /sys/class/net/eth0/.... Kernel data structures presented as files.
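All of these share one namespace, and the inode's type field tells them apart. A sketch that classifies a path by that field (file_kind and demo are hypothetical names; the demo assumes /dev/null exists and that the temporary directory supports FIFOs):

```python
import os
import stat
import tempfile

def file_kind(path):
    """Name the inode type of path without following a final symlink."""
    mode = os.lstat(path).st_mode
    checks = [
        (stat.S_ISREG, "regular file"),
        (stat.S_ISDIR, "directory"),
        (stat.S_ISLNK, "symlink"),
        (stat.S_ISBLK, "block device"),
        (stat.S_ISCHR, "character device"),
        (stat.S_ISFIFO, "FIFO"),
        (stat.S_ISSOCK, "socket"),
    ]
    for predicate, name in checks:
        if predicate(mode):
            return name
    return "unknown"

def demo():
    """Classify a directory, a freshly created named pipe, and /dev/null."""
    d = tempfile.mkdtemp()
    fifo = os.path.join(d, "queue")   # hypothetical name
    os.mkfifo(fifo)
    try:
        return file_kind(d), file_kind(fifo), file_kind("/dev/null")
    finally:
        os.unlink(fifo)
        os.rmdir(d)
```

Pseudo-files under /proc and /sys usually report as regular files or directories; what makes them special is the file system behind them, not the inode type.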

Path resolution

open("/a/b/c") walks the path component by component: read /’s inode, look up a in its directory entries, read a’s inode, look up b, …. Each step can miss the page cache and cost a disk read. The kernel’s dcache (directory entry cache) is critical to making this fast on hot paths.
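The walk can be imitated from user space with directory-relative lookups. This is only a sketch of the idea, not how the kernel implements it: walk_components and demo are hypothetical names, symlinks are deliberately not followed, and every directory on the way must be readable by the caller.

```python
import os
import tempfile

def walk_components(path):
    """Resolve an absolute path one component at a time.

    Returns a list of (name, inode_number) pairs, starting with "/".
    """
    dir_fd = os.open("/", os.O_RDONLY | os.O_DIRECTORY)
    steps = [("/", os.fstat(dir_fd).st_ino)]
    try:
        parts = [p for p in path.split("/") if p]
        for i, name in enumerate(parts):
            # Look up `name` in the directory we currently hold open.
            info = os.stat(name, dir_fd=dir_fd, follow_symlinks=False)
            steps.append((name, info.st_ino))
            if i < len(parts) - 1:
                # Descend: open the child directory relative to the current one.
                flags = os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW
                next_fd = os.open(name, flags, dir_fd=dir_fd)
                os.close(dir_fd)
                dir_fd = next_fd
    finally:
        os.close(dir_fd)
    return steps

def demo():
    """Build <tmp>/a/b/c and walk to it from the root."""
    base = os.path.realpath(tempfile.mkdtemp())   # realpath removes symlinked prefixes
    nested = os.path.join(base, "a", "b")
    os.makedirs(nested)
    target = os.path.join(nested, "c")
    open(target, "w").close()
    return walk_components(target), os.stat(target).st_ino
```

Each loop iteration corresponds to one "look up a name in a directory, then read its inode" step; the dcache exists so that the kernel can skip the directory read on hot paths.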

File holes (sparse files)

Writing at offset 1 GB into a fresh 0-byte file produces a 1-GB file that occupies almost no disk space. The intervening blocks have no on-disk allocation; reads return zeros. cp typically preserves holes; cp --sparse=never materialises them.
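A sketch that creates such a hole and compares logical size with allocated space. It uses a 64 MiB offset instead of 1 GB to stay cheap on file systems that do not support holes; make_sparse is a hypothetical name, and the 512-byte unit for st_blocks is the Linux convention.

```python
import os
import tempfile

def make_sparse(offset=64 * 1024 * 1024):
    """Write one byte at `offset` in an empty file and report what results."""
    fd, path = tempfile.mkstemp()
    try:
        os.lseek(fd, offset, os.SEEK_SET)   # move past the end; nothing is allocated
        os.write(fd, b"x")                  # the only byte actually stored
        info = os.fstat(fd)
        os.lseek(fd, 0, os.SEEK_SET)
        head = os.read(fd, 4096)            # this range lies inside the hole
        return {
            "logical_size": info.st_size,
            "allocated_bytes": info.st_blocks * 512,   # st_blocks counts 512-byte units
            "head_is_zeros": head == bytes(4096),
        }
    finally:
        os.close(fd)
        os.unlink(path)
```

On a file system with sparse-file support, allocated_bytes should come out far smaller than logical_size; the exact figure depends on the file system's block size.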

Trade-offs

Files as the universal interface — simple, portable, scriptable. Any tool that can read bytes can read your data. Cost: no schema, no atomicity beyond a single rename, weak indexing (you grep through the file).

Database as the universal interface — rich queries, transactions across many records, indexes. Cost: schema rigidity, network hop (for client-server DBs), backup complexity, can’t cat it.

Atomic rename. rename(src, dst) on the same file system is atomic at the directory-entry level. The “write to tmp, fsync, rename” pattern is how applications safely overwrite a config file — readers always see either the old or the new contents, never a partial write.
fsync vs fdatasync. fdatasync skips flushing inode metadata that doesn’t affect data integrity (mtime, atime). Marginally faster; rarely worth the complexity outside high-IO databases.
O_DIRECT. Bypasses the page cache. Reads and writes go straight to the device. Databases that manage their own buffer pool use this to avoid double-caching. Cost: the application must do its own alignment and buffering.
Page cache. All read()/write() traffic goes through the kernel page cache by default. Hot files stay in RAM; modifications are written back lazily. Without fsync, a file modified only in the cache is not guaranteed to survive a crash: whatever the kernel has not yet written back is lost.
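The “write to tmp, fsync, rename” pattern can be sketched as follows. atomic_write and demo are hypothetical names; the final directory fsync assumes a file system that allows fsync on a directory descriptor, as common Linux file systems do.

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace the contents of path with data so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be on the same file system as the target,
    # so create it in the same directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()                 # Python buffer -> page cache
            os.fsync(f.fileno())      # page cache -> durable storage
        os.replace(tmp_path, path)    # rename(2): atomic switch of the name
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Make the new directory entry itself durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

def demo():
    """Overwrite a hypothetical config file twice and list what is left behind."""
    d = tempfile.mkdtemp()
    target = os.path.join(d, "app.conf")
    atomic_write(target, b"v1\n")
    atomic_write(target, b"v2\n")
    with open(target, "rb") as f:
        return f.read(), sorted(os.listdir(d))
```

One design consequence to be aware of: mkstemp creates the file with mode 0600, so the replaced file takes on that mode unless the caller restores the intended permissions before the rename.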

Common pitfalls

rm doesn’t free space if a process has the file open. If a daemon is writing to /var/log/huge.log and you rm it, df still shows the disk as full. Run lsof | grep deleted to find the offenders, then restart the daemon (or make it reopen its log) so the descriptor is closed.
Writing without fsync. “I called write() and the call returned, why did I lose my data?” — because write() only copies into the page cache. See Crash Consistency — fsck and Journaling.
Forgetting to fsync the directory. Creating a new file and fsync-ing the file is not enough — you also need fsync(parent_dir_fd) so the directory entry is durable.
Path-based race conditions (TOCTOU). Between stat("/tmp/foo") and open("/tmp/foo"), the path can be replaced. Use *at() calls (openat, fstatat) anchored to a directory fd, add O_NOFOLLOW, or open first and then fstat() the descriptor you actually hold.
Treating mtime as a reliable cache key. mtime has 1-second resolution on some file systems, and applications can set it arbitrarily with utimes(). Build systems that rely on mtime ordering (Make) occasionally misfire; content-hash caches (as in Bazel) are more robust.
Hard-linking across file systems. Fails with EXDEV, because a hard link names an inode number and inode numbers are only meaningful within one file system. Symlinks work fine across file systems.
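For the TOCTOU pitfall, one common remedy is to open first and inspect the descriptor afterwards, so the check and the use refer to the same object. A sketch under that assumption (open_regular_file and demo are hypothetical names):

```python
import os
import stat
import tempfile

def open_regular_file(path):
    """Open path without following a final symlink, then check what was opened.

    The check uses fstat() on the descriptor, so it describes the object we
    actually hold, not whatever the path names a moment later.
    """
    # O_NONBLOCK keeps open() from hanging if the path turns out to be a FIFO.
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW | os.O_NONBLOCK)
    try:
        if not stat.S_ISREG(os.fstat(fd).st_mode):
            raise ValueError("not a regular file: %r" % (path,))
    except BaseException:
        os.close(fd)
        raise
    return fd

def demo():
    """Open a real file, then try the same through a symlink."""
    d = tempfile.mkdtemp()
    real = os.path.join(d, "foo")
    link = os.path.join(d, "foo-link")
    with open(real, "w") as f:
        f.write("ok")
    os.symlink(real, link)

    fd = open_regular_file(real)
    data = os.read(fd, 10)
    os.close(fd)

    try:
        os.close(open_regular_file(link))
        symlink_refused = False
    except OSError:                  # O_NOFOLLOW makes open() fail on a symlink
        symlink_refused = True
    return data, symlink_refused
```

O_NOFOLLOW only guards the final path component; when intermediate directories may be attacker-controlled, anchoring lookups to a trusted directory fd with openat-style calls is the more complete answer.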

Why does Linux still distinguish atime from mtime in 2026?

POSIX requires both timestamps, and the find -atime and tmpwatch style of “delete files not read in N days” depends on atime. Most systems mount with relatime, so atime is updated only when it is not newer than mtime or ctime, or when it is more than 24 hours old. That answers the common “has this been read since the last write” question without paying for an atime update on every read. Disabling atime entirely (noatime) is a common database-server tweak.

File System Implementation — how the inode and directory abstractions are actually laid out on disk.
The Fast File System (FFS) — the canonical disk-aware implementation.
Crash Consistency — fsck and Journaling — what protects the file system from torn writes.
Hard Disk Drives — the substrate that shaped the inode layout.
I/O Devices and Drivers — what’s under the file system.