NFS — Network File System

Stateless protocol, idempotent operations, client-side caching, the cache-consistency problem, server-write buffering.

System Intermediate
nfs distributed-file-systems sun-rpc statelessness
Companies this resembles: Sun Microsystems

What it is#

NFS (Network File System) is the distributed file system that Sun Microsystems introduced in 1984; it became the dominant way to share files across UNIX hosts on a LAN. Its v2 protocol is short, with fewer than twenty RPC procedures, and is famous as the textbook example of a stateless protocol with idempotent operations. Mount an NFS share and /home/alice on your workstation is actually a directory on a server somewhere; reads and writes turn into RPCs over UDP (later TCP).

The design choice that made NFS robust beyond its peers was statelessness: the server keeps no per-client memory. If it reboots mid-session, clients don’t notice anything beyond a few extra retries. That choice rippled into everything else — caching, write semantics, and consistency all derive from it.

Architecture overview#

user process (cd, cat, vi)
|
system calls (open, read, write, close)
|
VFS layer in client kernel
|
NFS client
| Sun RPC over UDP (later TCP), port 2049
v
NFS server
|
local file system (UFS, ext, ZFS, ...)
|
disk

A client mounts a remote export. Every file operation that crosses into that subtree goes through the NFS client, which translates VFS calls into NFS RPCs. The server is just a long-running process that receives RPCs and executes them against its local file system. Authentication is by UID — the server trusts the client to assert the user identity (the classic NFS security weakness).

The protocol’s core type is the file handle: an opaque token (a fixed 32 bytes in v2, up to 64 bytes in v3) that a typical server fills with a file system ID, an inode number, and a generation number. The client gets its first handle, for the root of an export, from the separate MOUNT protocol, and every later handle from LOOKUP, which resolves one name inside a directory to a handle plus attributes. Every subsequent operation carries the handle in the request, so the server can find the file without remembering anything between calls.
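
To make the handle concrete, here is a minimal sketch of how a server might pack and unpack one. The layout (three 32-bit fields, zero-padded to 32 bytes) and the names pack_handle and unpack_handle are assumptions for illustration; real servers choose their own encoding, and clients never look inside.

```python
import struct

HANDLE_SIZE = 32                  # v2 handles are a fixed 32 bytes
_FIELDS = struct.Struct("!III")   # hypothetical layout: fsid, inode, generation


def pack_handle(fsid: int, inode: int, generation: int) -> bytes:
    """Build an opaque handle; the client treats the result as raw bytes."""
    return _FIELDS.pack(fsid, inode, generation).ljust(HANDLE_SIZE, b"\x00")


def unpack_handle(handle: bytes) -> tuple[int, int, int]:
    """Server-side decode: recover (fsid, inode, generation)."""
    if len(handle) != HANDLE_SIZE:
        raise ValueError("bad handle length")
    return _FIELDS.unpack_from(handle)


handle = pack_handle(fsid=7, inode=1234, generation=2)
print(len(handle), unpack_handle(handle))
```

Because everything the server needs is inside the token, it can locate the file from the request alone.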

Protocol and statelessness#

The v2 operations (paraphrased):

  • LOOKUP(dirhandle, name) → handle, attributes
  • GETATTR(handle) → attributes
  • SETATTR(handle, attrs) → attributes
  • READ(handle, offset, count) → data, attributes
  • WRITE(handle, offset, count, data) → attributes
  • CREATE, REMOVE, RENAME, MKDIR, RMDIR, READDIR, SYMLINK, READLINK, STATFS

Every request is self-contained: the file handle, offset, length, and data all travel together. The server doesn’t remember which client opened which file or where in the file the client was last reading. There’s no OPEN and no CLOSE — those happen entirely on the client. The “file descriptor” is a client-side concept layered on top.

This is the whole point. If the server crashes and reboots, clients keep retrying their last RPC; the server processes them as if they were brand-new requests. Recovery is invisible to the application — just a brief hang while RPCs time out and retry.
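
A toy model of that recovery story, under loud assumptions: ToyServer is a hypothetical in-memory stand-in, handles are plain strings, and a "reboot" is just constructing a new object over the same disk. It is a sketch of the idea, not of any real NFS server.

```python
class ToyServer:
    """Toy stand-in for an NFS server: the only durable thing is `disk`."""

    def __init__(self, disk: dict[str, bytearray]):
        self.disk = disk  # survives "reboots"; there are no per-client tables

    def read(self, handle: str, offset: int, count: int) -> bytes:
        return bytes(self.disk[handle][offset:offset + count])

    def write(self, handle: str, offset: int, data: bytes) -> int:
        buf = self.disk[handle]
        if len(buf) < offset:
            buf.extend(b"\x00" * (offset - len(buf)))  # zero-fill any gap
        buf[offset:offset + len(data)] = data
        return len(data)


disk = {"fh-1": bytearray(b"hello world")}
server = ToyServer(disk)
first = server.read("fh-1", 6, 5)

server = ToyServer(disk)             # simulate crash + reboot: same disk, no memory
retried = server.read("fh-1", 6, 5)  # the client resends the identical request
print(first, retried)
```

The retried request succeeds because it carries the handle, offset, and count itself; the new server instance needs nothing from the old one.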

To make retry safe, operations should be idempotent. READ and WRITE carry explicit offsets, so reading or writing twice at the same offset gives the same result. Operations that change the namespace are the awkward cases: if the first attempt succeeded and only its reply was lost, a retried MKDIR can come back with “already exists” and a retried REMOVE with ENOENT. Servers soften this by keeping a small cache of recently answered requests and replaying the saved reply when a duplicate arrives; clients learned to tolerate the spurious errors that still slip through.
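
One common server-side mitigation for retried non-idempotent calls is a duplicate-request (reply) cache keyed by client and RPC transaction ID. The sketch below is illustrative only: ReplyCache, remove, and the string status codes are hypothetical names, and a real cache is bounded and expires entries.

```python
class ReplyCache:
    """Remember recent replies by (client, xid) so a retransmission
    gets the original answer instead of being executed again."""

    def __init__(self):
        self.replies = {}

    def call(self, client, xid, op, *args):
        key = (client, xid)
        if key not in self.replies:
            self.replies[key] = op(*args)  # first time: really execute
        return self.replies[key]           # duplicate: replay saved reply


files = {"a.txt": b"data"}


def remove(name):
    """Toy REMOVE: not idempotent, the second call sees no such file."""
    if name not in files:
        return "NOENT"
    del files[name]
    return "OK"


cache = ReplyCache()
first = cache.call("client-a", 41, remove, "a.txt")
retry = cache.call("client-a", 41, remove, "a.txt")  # same xid: replayed
fresh = cache.call("client-a", 42, remove, "a.txt")  # new xid: executes again
print(first, retry, fresh)
```

The cache is an optimisation rather than required state: if a reboot wipes it, the worst case is the spurious error the client already knows how to tolerate.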

Caching and consistency#

To make NFS fast on a LAN, clients aggressively cache: file blocks for reads, attributes (for stat-like calls), and writes (buffered before flushing). The caching is where the otherwise-clean design gets murky.

Read caching#

A client reads a block once; subsequent reads hit the client’s buffer cache without going to the server. To detect server-side changes, the client periodically re-fetches attributes (the file’s mtime and size) and compares. If the mtime moved, the cached blocks are invalidated. On top of this polling, clients implement close-to-open consistency: the client revalidates attributes on open() and flushes writes on close(), so a process that opens a file after another client has closed it sees that client’s writes.

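The polling interval is the cache-coherence knob: too short and you spam GETATTRs; too long and stale reads persist. The default is a few seconds for files and longer for directories; on Linux clients, for example, the timeout adapts between 3 and 60 seconds for files and between 30 and 60 seconds for directories.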
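
The polling scheme can be sketched as a small client-side cache. Everything here is an assumption made for illustration: CachedFile, the fetch_attrs and fetch_data callables (stand-ins for GETATTR and READ), and the fixed 3-second TTL. A fake clock is injected so the behaviour is deterministic.

```python
import time


class CachedFile:
    """Client-side view of one file: cached data plus the mtime it matches."""

    def __init__(self, fetch_attrs, fetch_data, ttl=3.0, clock=time.monotonic):
        self.fetch_attrs = fetch_attrs  # stand-in for GETATTR, returns mtime
        self.fetch_data = fetch_data    # stand-in for READ, returns bytes
        self.ttl = ttl
        self.clock = clock
        self.data = None
        self.mtime = None
        self.checked_at = None

    def read(self) -> bytes:
        now = self.clock()
        if self.checked_at is None or now - self.checked_at >= self.ttl:
            mtime = self.fetch_attrs()      # attributes expired: revalidate
            if mtime != self.mtime:
                self.data = None            # file changed: drop cached data
                self.mtime = mtime
            self.checked_at = now
        if self.data is None:
            self.data = self.fetch_data()
        return self.data


server = {"mtime": 1, "data": b"v1"}
now = [0.0]
f = CachedFile(lambda: server["mtime"], lambda: server["data"],
               ttl=3.0, clock=lambda: now[0])

a = f.read()                         # first read goes to the server
server.update(mtime=2, data=b"v2")   # another client changes the file
now[0] = 1.0
b = f.read()                         # within the TTL: stale cached data
now[0] = 4.0
c = f.read()                         # TTL expired: mtime moved, refetch
print(a, b, c)
```

The middle read is the stale-read window the text describes; shrinking ttl narrows it at the cost of more attribute fetches.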

Write caching#

The client buffers writes locally — the application’s write() returns immediately, before bytes hit the server. Eventually the client flushes (on close(), on fsync(), or under memory pressure).

On the server side, the natural temptation is to also buffer — write to the page cache, ack, flush to disk later. The problem: if the server crashes after the ack but before the flush, the client believes the write is durable but the data is gone. NFS v2 sidesteps this by requiring synchronous server writes: WRITE blocks on the server until bytes hit disk. Safe but slow.

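NFS v3 introduced the COMMIT RPC: clients send WRITEs marked UNSTABLE (the server may buffer them), then send COMMIT to force a flush, typically on close() or fsync(). Each reply carries a write verifier that changes when the server reboots; if the verifier returned by COMMIT differs from the one returned with the WRITEs, the client knows buffered data may have been lost and resends it. Much faster, still safe.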
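
A minimal sketch of the unstable-write and COMMIT exchange, assuming a toy V3Server class and a simple boot counter as the write verifier (in the real protocol the verifier is an opaque 8-byte value). The names and the in-memory "disk" are hypothetical.

```python
import itertools

_boot_ids = itertools.count(1)


class V3Server:
    """Toy server: `disk` is durable, `pending` is lost on a crash."""

    def __init__(self, disk: dict):
        self.disk = disk
        self.pending = {}                # buffered unstable writes (volatile)
        self.verifier = next(_boot_ids)  # changes on every boot

    def write_unstable(self, handle, offset, data):
        self.pending.setdefault(handle, []).append((offset, data))
        return self.verifier

    def commit(self, handle):
        buf = self.disk.setdefault(handle, bytearray())
        for offset, data in self.pending.pop(handle, []):
            if len(buf) < offset:
                buf.extend(b"\x00" * (offset - len(buf)))
            buf[offset:offset + len(data)] = data
        return self.verifier


disk = {}
server = V3Server(disk)
v_write = server.write_unstable("fh", 0, b"hello")

server = V3Server(disk)            # crash + reboot before the COMMIT arrives
v_commit = server.commit("fh")     # nothing was pending on the new instance

needs_resend = v_write != v_commit  # verifier changed: data may be lost
if needs_resend:
    server.write_unstable("fh", 0, b"hello")
    server.commit("fh")
print(needs_resend, bytes(disk["fh"]))
```

The client must keep its own copy of the data until COMMIT returns a matching verifier, which is what makes the faster path safe.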

The consistency model#

Multiple clients writing the same file at the same time is not consistent in the strict sense. Two clients can both have cached attributes saying mtime = T; one writes; the other reads its stale cache. NFS does not give you the local-file-system illusion when multiple hosts share a file. Production sites use NFS for read-mostly workloads (home directories, shared software trees) and avoid concurrent writers. When concurrent writers are needed (/var/spool/mail), they use file locking via the rpc.lockd daemon.

Crash recovery#

The fault tolerance story is the part NFS got famously right:

  • Server crash and reboot. Clients retry their pending RPCs. The server, on reboot, accepts new RPCs without remembering anything from before. From the application’s perspective, file operations hang briefly during the outage and resume when the server comes back. No reconnect dance, no “lost connection” errors bubble up to applications.
  • Network partition. Clients block on RPCs (or retry with exponential backoff). On reconnection, retries succeed. Soft mounts vs. hard mounts: a hard mount blocks indefinitely (default for / and important data); a soft mount fails with EIO after a timeout (used when liveness is more important than data correctness).
  • Client crash. The server has no state to clean up. Locks held via rpc.lockd are tracked separately and recovered via a separate rpc.statd notification protocol — the lock state was always the awkward bolt-on.
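
The hard versus soft mount behaviour comes down to the client's retry loop. The sketch below assumes a hypothetical send(timeout) callable that raises TimeoutError when the server does not answer; call_with_retry, flaky_server, and the retry limit are illustrative, not the kernel's actual parameters.

```python
import errno


def call_with_retry(send, *, soft: bool, max_tries: int = 3,
                    base_timeout: float = 1.0):
    """Retry `send` with a doubling timeout.
    Soft: give up with EIO after max_tries. Hard: retry forever."""
    timeout = base_timeout
    tries = 0
    while True:
        try:
            return send(timeout)
        except TimeoutError:
            tries += 1
            if soft and tries >= max_tries:
                raise OSError(errno.EIO, "NFS server not responding")
            timeout *= 2  # exponential backoff before the next attempt


def flaky_server(failures: int):
    """Toy transport that times out `failures` times, then answers."""
    remaining = [failures]

    def send(timeout):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TimeoutError
        return "reply"

    return send


hard_result = call_with_retry(flaky_server(5), soft=False)

try:
    call_with_retry(flaky_server(5), soft=True)
    soft_error = None
except OSError as exc:
    soft_error = exc.errno
print(hard_result, soft_error == errno.EIO)
```

With the same five lost replies, the hard caller eventually gets its answer while the soft caller surfaces EIO to the application.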

The cost of statelessness is that locks are hard. There’s no server-side notion of “client A has this file open”; locking requires a separate daemon and a separate recovery protocol. NFSv4 brought locking inside the main protocol at the cost of significant complexity.

Operational characteristics#

  • Performance. On a quiet LAN, NFS reads approach local-disk throughput because the client buffer cache handles repeated reads. Writes are slower (synchronous server writes in v2, COMMIT-driven in v3).
  • Caching footprint. Every client maintains its own block cache; aggregate memory used can be many times the working set. Largely a win — reads stay local.
  • Latency tails. A slow server creates D-state (uninterruptible-sleep) processes on every client trying to touch the share. ls /mounted/dir on a degraded NFS server hangs your shell. The traditional mitigations are the intr mount option (which lets signals interrupt a blocked call) or soft mounts; on Linux, intr has been ignored since kernel 2.6.25, and only SIGKILL interrupts a pending NFS operation.
  • Versions deployed. v3 (1995) is still the workhorse in many UNIX shops: UDP or TCP, 64-bit offsets, COMMIT RPC. v4 (2003) added security (Kerberos), compound RPCs (pipeline multiple ops in one round trip), and delegations (AFS-like callback hints). v4.1 (2010) added pNFS, which spreads data across parallel data servers for HPC.
  • Security. v2/v3 trust the client’s UID. Anyone with root on a client can spoof any UID. v4 + Kerberos finally addressed this; widely deployed only in environments that already had Kerberos (universities, large enterprises).
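
To see why trusting the client’s UID is weak, consider this sketch of a server-side permission check. The functions, the simplified owner-only rule, and the anonymous UID 65534 are assumptions for illustration; root squashing is shown as the usual export-side mapping of UID 0 to an unprivileged user.

```python
NOBODY = 65534  # assumed anonymous UID used when squashing root


def effective_uid(claimed_uid: int, root_squash: bool = True) -> int:
    """Map the UID asserted in the RPC to the UID the server acts as."""
    if root_squash and claimed_uid == 0:
        return NOBODY
    return claimed_uid


def may_write(file_owner: int, claimed_uid: int, root_squash: bool = True) -> bool:
    """Simplified check: only the owner or (unsquashed) root may write."""
    uid = effective_uid(claimed_uid, root_squash)
    return uid == 0 or uid == file_owner


alice = 1000
print(may_write(alice, claimed_uid=1000))                   # any client claiming 1000
print(may_write(alice, claimed_uid=0))                      # root is squashed
print(may_write(alice, claimed_uid=0, root_squash=False))   # root passes through
```

Squashing blocks a claimed UID 0, but the first call still succeeds for anyone who can assert Alice’s UID, which is the gap that Kerberos closes.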

Trade-offs and gotchas#

NFS (stateless): server reboots are transparent; recovery is “clients retry.” Simple to operate; locking is bolted on; cache consistency is weak (close-to-open, not POSIX-strict). Best for read-mostly shared trees on a single LAN.

AFS (stateful, with callbacks): the server tracks who has each file cached and notifies on writes; whole-file caching minimises round trips; consistency is “last writer wins” but well-defined. More complex recovery; scales to thousands of clients across a WAN.

Other tensions:

  • Synchronous writes vs. throughput. v2’s synchronous server writes are the gating factor on write performance. v3’s COMMIT is the principled fix; some sites still misconfigure with async exports (acks before disk flush) and lose data on power failure.
  • Block cache vs. consistency. Long cache TTL is fast and stale; short is slow and fresh. Production sites tune to read-mostly workloads where staleness is tolerable.
  • hard,intr vs. soft. Hard mounts preserve data at the cost of hanging the client; soft mounts preserve liveness at the cost of EIO. Use hard for home directories and code; soft only for caches or telemetry.
  • no_root_squash. Server option that allows the client’s root to be the server’s root. Almost always wrong — turns a compromised workstation into a server-side root attacker.
  • File handles surviving server changes. A handle encodes inode + generation number. Backing up and restoring the server’s file system may renumber inodes, after which handles cached by clients fail with ESTALE. A fresh LOOKUP by path returns a valid handle, but operations on files that were already open typically surface the error to the application until the file is reopened.
  • Locking gotchas. flock() and POSIX fcntl() locks behave differently over NFS. flock was originally local-only; only newer kernels propagate it via rpc.lockd. Many subtle bugs here.
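
The stale-handle case in the list above can be sketched as a generation check on the server. The inode table, the resolve function, and the numbers are hypothetical; the point is only that a handle naming an old generation of an inode is rejected rather than silently mapped to a different file.

```python
import errno

# Hypothetical server-side table: inode number -> current generation.
inode_generations = {1234: 2}


def resolve(inode: int, generation: int) -> int:
    """Accept a handle only if its generation matches the live inode."""
    current = inode_generations.get(inode)
    if current is None or current != generation:
        raise OSError(errno.ESTALE, "stale file handle")
    return inode


ok = resolve(1234, 2)        # handle issued for the current file

inode_generations[1234] = 3  # inode freed and reused (or file system restored)
try:
    resolve(1234, 2)         # the old handle now names a dead generation
    stale = False
except OSError as exc:
    stale = exc.errno == errno.ESTALE
print(ok, stale)
```

Without the generation number, the old handle would quietly resolve to whatever new file reused inode 1234.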

Why did Sun pick UDP for NFS v2?

In 1984 on 10-Mbit Ethernet LANs, UDP was the obvious choice: lower per-packet overhead, no connection state, no head-of-line blocking. The RPC layer handled timeouts and retransmission itself, which is adequate for stateless idempotent requests. The downside became visible as WAN deployments and 100-Mbit/1-Gbit links exposed UDP’s weaknesses: a large READ or WRITE is split into several IP fragments, losing any one of them forces the whole request to be resent, and there is no congestion control. TCP support became common with NFS v3, and modern deployments run NFS over TCP almost exclusively. UDP for NFS was the right answer in its decade and the wrong answer thereafter.
