OpenClaw — Personal AI Assistant

A personal-assistant design: how to compose an agent that mixes calendar, mail, search, and reminders behind a single conversational surface.

System Intermediate
11 min read
system personal-assistant tool-use multi-modal-input

Context#

Personal-assistant agents are deceptively hard. The promise is simple — “ask me about your day, your inbox, your tasks, and act on them” — but the system has to integrate four or five disconnected services, manage credentials per user, reason about user intent across categories (“did they want to read mail or schedule a meeting?”), and stay safe with side-effecting tools (send an email, accept a meeting) that can’t be cleanly undone.

The dominant design failure in this space is tool sprawl: an assistant grows to 40+ tools across calendar, mail, search, notes, reminders, weather, and navigation, and the agent’s intent classifier collapses under the load. The system then calls the wrong tool, calls no tool when one was needed, or chains tools in nonsensical orders.

OpenClaw is an open personal-assistant framework that handles this by structuring the agent around capability surfaces — bounded groups of related tools, each with its own sub-agent, fronted by a router that picks the right surface for each turn. The shape isn’t novel (it’s a router/specialist multi-agent pattern) but the design choices around how to bound a surface, how to hand off between surfaces, and how to handle side effects safely are worth studying as a reference design.

Problem#

The concrete problems OpenClaw addresses:

  • Tool sprawl. A personal assistant naturally accumulates dozens of tools. A flat tool list overwhelms the routing model.
  • Cross-domain reasoning. “Schedule a follow-up about the doc Alice sent me last week” touches mail (find Alice’s mail), the doc (open and skim), and calendar (find a slot, send invite). The assistant has to chain across domains without losing context.
  • Side-effect safety. Emails sent and meetings accepted on the user’s behalf cannot be cleanly undone. The system needs a confirmation policy that protects against agent errors without making every interaction friction-heavy.
  • Long-running context. A personal assistant gets used across days and weeks. “Remind me about that thing next Tuesday” expects the assistant to know what “that thing” was. Episodic memory is part of the product.
  • Cost and latency. Personal-assistant interactions are conversational; latencies under 2 seconds matter. A tower of agent calls per turn isn’t viable.
  • Per-user state. Calendars, mail, and tasks are personal. The agent has to handle credentials, scopes, and consent without dropping into a security mess.

Architecture#

OpenClaw structures the assistant as a router with four capability surfaces, a shared memory layer, and a confirmation gate for side effects.

 ┌─────────────────────────────────────────────────┐
 │                   User Input                    │
 │   (text or voice, sometimes with attachment)    │
 └────────────────────────┬────────────────────────┘
                          ▼
 ┌─────────────────────────────────────────────────┐
 │                  Router Agent                   │
 │  Classifies intent into one of:                 │
 │  calendar | mail | search | tasks | meta        │
 │  Reads short-term context to disambiguate.      │
 └────────────────────────┬────────────────────────┘
                          │
     ┌─────────────┬──────┴──────┬─────────────┐
     ▼             ▼             ▼             ▼
┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│ Calendar │  │   Mail   │  │  Search  │  │  Tasks/  │
│ Surface  │  │ Surface  │  │ Surface  │  │ Reminders│
│  (sub-   │  │  (sub-   │  │  (sub-   │  │ Surface  │
│  agent)  │  │  agent)  │  │  agent)  │  │          │
└────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘
     └─────────────┴──────┬──────┴─────────────┘
                          ▼
                 ┌──────────────────┐
                 │   Action Type?   │
                 └─┬──────────────┬─┘
              read │              │ write / side-effect
                   ▼              ▼
              ┌──────────┐  ┌──────────────────┐
              │ Execute  │  │ Confirmation Gate│
              └────┬─────┘  └─────┬────────────┘
                   │              │
                   │       ┌──────┴───────┐
                   │       │ User confirms│
                   │       └──────┬───────┘
                   │              │
                   └──────┬───────┘
                          ▼
                 ┌──────────────────┐
                 │  Shared Memory   │
                 │ (episodic + user │
                 │     profile)     │
                 └──────────────────┘

The router#

A small fast model classifies each user turn into one of the four capability surfaces (calendar, mail, search, tasks) or a meta-category (the user is asking about prior interactions, or correcting the agent). The router has access to a short rolling context — the last few turns — to disambiguate references (“schedule a follow-up for that”).

The router is deliberately small and fast. If it mis-classifies, the destination sub-agent has an escape hatch to bounce the turn back to the router with a corrected hint. This keeps the router cheap without losing capability.
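The routing step described above can be sketched roughly as follows. This is a minimal illustration, not OpenClaw's actual code: the keyword table stands in for the small routing model, and the names `Router`, `classify`, `KEYWORDS`, and the `hint` parameter are all hypothetical. It shows the two behaviours the text describes: falling back on a short rolling context when the current turn is ambiguous, and honouring a corrected hint when a sub-agent bounces a turn back.

```python
from collections import deque
from typing import Deque, Optional

SURFACES = ("calendar", "mail", "search", "tasks", "meta")

# Toy keyword table standing in for the small routing model (assumption).
KEYWORDS = {
    "calendar": ("schedule", "meeting", "calendar"),
    "mail": ("email", "inbox", "reply"),
    "tasks": ("remind", "todo", "task"),
    "search": ("search", "look up"),
}


def classify(text: str) -> Optional[str]:
    """Return the first surface whose keywords appear in the text, else None."""
    lowered = text.lower()
    for surface, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return surface
    return None


class Router:
    def __init__(self, context_turns: int = 4) -> None:
        # Short rolling context: only the last few turns are kept.
        self.context: Deque[str] = deque(maxlen=context_turns)

    def route(self, turn: str, hint: Optional[str] = None) -> str:
        if hint in SURFACES:
            # A sub-agent bounced the turn back with a corrected hint.
            surface: Optional[str] = hint
        else:
            surface = classify(turn)
            if surface is None:
                # Ambiguous turn: reuse the most recent classifiable turn.
                for previous in reversed(self.context):
                    surface = classify(previous)
                    if surface is not None:
                        break
        self.context.append(turn)
        # Anything still unclassified is treated as a meta turn.
        return surface or "meta"
```

In a real deployment the `classify` function would be a call to a small model; the surrounding control flow (hint first, then the current turn, then recent context) is the part this sketch is meant to show.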

Capability surfaces#

Each surface is a sub-agent with its own tool list — typically 4 to 8 tools per surface, well-named, with rich docstrings. The surface’s system prompt describes its scope (“you handle calendar reads and writes; if the user asks about mail, return out_of_scope”).

The Calendar Surface has tools like list_events, find_free_slot, create_event, update_event, cancel_event. The Mail Surface has list_inbox, read_thread, search_mail, draft_reply, send_mail. The Search Surface fronts a web search API plus an in-document search. The Tasks Surface manages a per-user todo store.

Per-surface scoping is what makes the agent work. Each surface’s sub-agent only ever sees its 4–8 tools — the model never has to choose from 30 — and its prompt is tailored to its domain.
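One way to picture a bounded surface is as a small registry of tools plus a scope statement. The sketch below is an assumption-laden illustration: the `Tool` and `Surface` classes, the result dictionary shape, and the stubbed tool bodies are hypothetical, and only the tool names `list_events` and `find_free_slot` come from the text above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Tool:
    name: str
    description: str  # the "rich docstring" shown to the sub-agent
    fn: Callable[..., Any]


@dataclass
class Surface:
    name: str
    scope: str  # scope statement placed in the surface's system prompt
    tools: Dict[str, Tool]

    def call(self, tool_name: str, **kwargs: Any) -> Dict[str, Any]:
        tool = self.tools.get(tool_name)
        if tool is None:
            # The surface only knows its own tools; anything else is out of scope.
            return {"status": "out_of_scope", "surface": self.name, "requested": tool_name}
        return {"status": "ok", "result": tool.fn(**kwargs)}


# Stubbed tool bodies; a real surface would call a calendar API here.
def list_events(day: str) -> list:
    return [f"Standup on {day}"]


def find_free_slot(day: str, minutes: int) -> str:
    return f"{day} 14:00 ({minutes} min)"


calendar = Surface(
    name="calendar",
    scope="You handle calendar reads and writes; for mail, return out_of_scope.",
    tools={
        "list_events": Tool("list_events", "List events on a given day.", list_events),
        "find_free_slot": Tool("find_free_slot", "Find a free slot of a given length.", find_free_slot),
    },
)
```

Calling `calendar.call("send_mail")` returns an `out_of_scope` result rather than attempting the action, which is the property that keeps each sub-agent's choice set small.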

Confirmation gate#

Every tool is tagged read or write. Read tools execute immediately. Write tools — anything that sends, schedules, deletes, or otherwise changes external state — pass through a confirmation gate. The gate renders the proposed action as a structured summary (subject + recipients for an email; title + time + attendees for a calendar event) and waits for user confirm / edit / cancel.

The gate is configurable per-user — you can mark “auto-confirm low-risk writes” (e.g., creating a todo for myself) while keeping mail-sending confirmation-gated. Defaults err toward confirmation.
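A minimal sketch of the gate logic, assuming a per-user policy that lists the write tools the user has chosen to auto-confirm. The names `Effect`, `ProposedAction`, `GatePolicy`, and `run_action`, and the `"confirm"` / `"edit"` / `"cancel"` strings returned by `ask_user`, are illustrative rather than part of any published OpenClaw API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, Set


class Effect(Enum):
    READ = "read"
    WRITE = "write"


@dataclass(frozen=True)
class ProposedAction:
    tool: str
    effect: Effect
    summary: Dict[str, str]  # structured summary rendered to the user


@dataclass
class GatePolicy:
    # Write tools this user has marked as low-risk; empty by default,
    # so every write is confirmation-gated unless the user opts out.
    auto_confirm: Set[str] = field(default_factory=set)


def run_action(
    action: ProposedAction,
    policy: GatePolicy,
    execute: Callable[[ProposedAction], None],
    ask_user: Callable[[Dict[str, str]], str],
) -> str:
    """Execute reads immediately; route writes through the confirmation gate."""
    if action.effect is Effect.READ or action.tool in policy.auto_confirm:
        execute(action)
        return "executed"
    decision = ask_user(action.summary)  # expected: "confirm", "edit", or "cancel"
    if decision == "confirm":
        execute(action)
        return "executed"
    return "needs_edit" if decision == "edit" else "cancelled"
```

Because the default `auto_confirm` set is empty, a new user gets the conservative behaviour the text describes, and loosening it is an explicit per-tool choice.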

Shared memory#

A two-tier memory:

  • Episodic memory — a rolling log of recent interactions, persisted across sessions. Used to resolve “that thing” / “the meeting we discussed” / “the doc I asked about earlier.”
  • User profile — long-term preferences (working hours, default meeting length, frequent contacts, time-zone) stored as a structured record. Populated incrementally from interactions.

Both stores are scoped per user and accessible to all surfaces. The router uses them to disambiguate, and the surfaces read them for personalisation. Writes go through dedicated memory-update tools rather than happening implicitly.
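The two tiers can be sketched as a passive, per-user store. Everything here is an assumption made for illustration: the class names, the in-process dictionary (a deployment would use a persistent database), and the 200-entry cap on the episodic log. The `remember_fact` name mirrors the explicit memory tool mentioned later in this article.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, Dict, List


@dataclass
class UserMemory:
    # Episodic tier: rolling log of recent interaction summaries.
    episodic: Deque[str] = field(default_factory=lambda: deque(maxlen=200))
    # Profile tier: structured long-term preferences.
    profile: Dict[str, Any] = field(default_factory=dict)


class MemoryStore:
    """Passive store: it never decides what to remember, it only records."""

    def __init__(self) -> None:
        self._users: Dict[str, UserMemory] = {}

    def _for(self, user_id: str) -> UserMemory:
        return self._users.setdefault(user_id, UserMemory())

    def log_episode(self, user_id: str, summary: str) -> None:
        self._for(user_id).episodic.append(summary)

    def recent_episodes(self, user_id: str, n: int = 5) -> List[str]:
        if n <= 0:
            return []
        return list(self._for(user_id).episodic)[-n:]

    def remember_fact(self, user_id: str, key: str, value: Any) -> None:
        # Explicit, observable write to the profile tier.
        self._for(user_id).profile[key] = value

    def get_profile(self, user_id: str) -> Dict[str, Any]:
        return dict(self._for(user_id).profile)
```

Every method takes a `user_id`, so one user's episodes and profile are never visible through another user's key.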

Key innovations#

What makes OpenClaw work as a personal-assistant reference design:

  1. Routing-then-specialising as the architecture default. Personal assistants are the canonical case for router + specialist; OpenClaw’s contribution is in the boundaries — four surfaces, not twelve. Coarse-grained surfaces beat fine-grained ones because the router’s job stays simple and each surface stays generalist enough to handle its category.

  2. Read/write tagging and a separate gate. Most agent frameworks treat all tools symmetrically. OpenClaw distinguishes them at the schema level and routes writes through a different code path with confirmation. The cost is a per-action confirmation; the benefit is that an agent error can’t send a wrong email without the user seeing it first.

  3. Memory as a shared service, not a sub-agent. OpenClaw doesn’t have a “memory agent.” Memory is a passive store with read/write tools that every surface uses. This avoids the failure mode where a memory agent decides what to remember (often wrongly) and creates a separate inference call per turn.

  4. Per-surface failure escapes. When a surface gets a turn that doesn’t fit its scope, it returns a structured out_of_scope signal that the router treats as a re-route. This avoids the “wrong agent answered weirdly” failure mode where mis-routing produces nonsense.

  5. Streaming-aware design. The router runs as a single small-model call; the surface call streams its response and tool plans. The user sees the agent “thinking” within a few hundred milliseconds, which is qualitatively different from a 2-second blank wait.
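The failure escape in point 4 amounts to a bounded re-route loop. The sketch below assumes a particular shape for the signal (a dictionary with `status` set to `"out_of_scope"` and an optional `suggest` field carrying the corrected hint) and a `max_hops` cap; these are hypothetical choices, not a documented OpenClaw contract.

```python
from typing import Callable, Dict, List, Optional

RouteFn = Callable[[str, Optional[str]], str]
SurfaceFn = Callable[[str], Dict[str, object]]


def dispatch(
    turn: str,
    route: RouteFn,
    surfaces: Dict[str, SurfaceFn],
    max_hops: int = 2,
) -> Dict[str, object]:
    """Route a turn, re-routing when a surface reports out_of_scope."""
    hint: Optional[str] = None
    path: List[str] = []
    for _ in range(max_hops):
        name = route(turn, hint)
        path.append(name)
        reply = surfaces[name](turn)
        if reply.get("status") != "out_of_scope":
            return {**reply, "path": path}
        # The surface declined; pass its suggestion back to the router.
        suggestion = reply.get("suggest")
        hint = suggestion if isinstance(suggestion, str) else None
    # Give up after max_hops and ask the user to clarify instead of guessing.
    return {"status": "clarify", "path": path}
```

Capping the number of hops keeps a confused router from bouncing a turn between surfaces indefinitely, at the cost of an occasional clarification question.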

Evaluation#

Personal-assistant evaluation is unusually messy — there’s no single correct answer for “compose a polite reply” — so OpenClaw’s eval is structured as task-completion + interaction-quality:

  • Task-completion accuracy. For a held-out set of structured tasks (“schedule a 30-min meeting with X on Tuesday morning”, “find emails from Y in the last week”, “remind me to do Z at 5pm Thursday”), did the agent reach the desired end state? Read-only tasks are scored by retrieval correctness; write tasks by whether the side effect was the right one.
  • Routing accuracy. What fraction of user turns went to the correct capability surface on the first try? Mis-routes are recoverable but each one costs a turn.
  • Confirmation-rate experience. How often did the confirmation gate fire? Users have a tolerance budget; too many confirmations and the assistant becomes annoying. The metric tracks this against task-success.
  • Memory-recall correctness. Given a follow-up turn that references a prior interaction, did the agent retrieve the right context? This is the hardest metric to automate.
  • User satisfaction. Standard NPS / CSAT in deployments, with attention to specific failure modes (wrong send, missed reminder).
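The automatable metrics above reduce to simple ratios over logged turns. As a rough sketch, assuming each evaluated turn is logged with the hypothetical fields below (the record layout and metric names are illustrative):

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TurnRecord:
    expected_surface: str  # label from the held-out task set
    first_surface: str     # where the router sent the turn on the first try
    gate_fired: bool       # did the confirmation gate interrupt the user?
    task_succeeded: bool   # did the agent reach the desired end state?


def summarise(records: List[TurnRecord]) -> Dict[str, float]:
    """Compute routing accuracy, task completion, and confirmation rate."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    return {
        "routing_accuracy": sum(r.first_surface == r.expected_surface for r in records) / n,
        "task_completion": sum(r.task_succeeded for r in records) / n,
        "confirmation_rate": sum(r.gate_fired for r in records) / n,
    }
```

Reporting confirmation rate next to task completion makes the trade-off visible: a policy change that lowers the first should not silently lower the second.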

Trade-offs and limitations#

Where OpenClaw works well. Users with well-organised calendars and mail. Tasks that fit cleanly into one of the four surfaces. Workflows where the user is happy to confirm writes. Deployments where per-user credentials are easy to provision (workspace integrations).

Where it struggles. Users whose calendar/mail is unstructured (no consistent labelling). Tasks that span more than two surfaces in a single turn (e.g., “summarise yesterday and send the summary to my manager” touches all four). Workflows where any confirmation feels like friction. Long multi-turn tasks that need cross-surface state.

Other limitations:

  • Router error is expensive. A mis-routed turn costs the user a turn of clarification. The router is a single point of failure; investing in its prompt and its model choice is high-leverage.
  • Cross-surface tasks need a coordinator. When a single turn legitimately needs two surfaces (read mail then create calendar event), the architecture’s “one surface per turn” rule forces the agent to split the work across multiple turns. This is fine if the user wants confirmation between steps, awkward otherwise. Some implementations add a meta-agent layer that orchestrates surfaces; OpenClaw’s reference design keeps it simple.
  • Per-user memory grows. Unbounded episodic memory becomes a context-cost problem. Retention policies (what to evict, what to summarise) are part of the deployment, not the framework.
  • Security model is the implementer’s problem. OpenClaw routes; it doesn’t authenticate. Wiring per-user credentials, scope-checking tool calls against the user’s permissions, and protecting the memory store are all left to the integration layer.
  • Voice surface is harder than text. A voice-first assistant has tighter latency budgets (under 1s), worse turn-taking signals, and a smaller confirmation surface (confirming an email read out loud is awkward). The reference design is text-first; voice deployments require additional UX work.
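Since retention is left to the deployment, here is one possible policy for the memory-growth limitation, sketched under stated assumptions: keep the most recent episodes verbatim and fold everything older into a single summary entry. The `compact` function and its `summarise` callback (which in practice might be a model call) are hypothetical.

```python
from typing import Callable, List


def compact(
    episodes: List[str],
    keep_recent: int,
    summarise: Callable[[List[str]], str],
) -> List[str]:
    """Fold all but the newest `keep_recent` episodes into one summary entry."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    if len(episodes) <= keep_recent:
        return list(episodes)  # nothing old enough to fold
    cut = len(episodes) - keep_recent
    old, recent = episodes[:cut], episodes[cut:]
    return [summarise(old)] + recent
```

Other policies (time-based eviction, per-topic summaries) are equally plausible; the point is only that the episodic log needs some bound before it is loaded into context.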

Lessons#

Generalisable takeaways from OpenClaw:

  • Bound your tool surface. Five well-named tools per agent beat fifteen vague ones. When you have more capabilities than that, split them across sub-agents rather than adding tools to one giant agent.
  • Separate reads from writes at the schema level. This makes confirmation policy easy to express. Every framework should have a side_effects flag on tools and a default gate that fires when it’s set.
  • Make routing its own (small) model call. Don’t try to make the destination agent also be the router. Single-purpose calls are cheap and accurate; multi-purpose calls drift.
  • Make memory a shared store, not an agent. Memory writes should be deliberate (an explicit remember_fact tool) or structured (a profile field update), not implicit (the model writes whatever it finds salient). This keeps the memory observable and small.
  • Build the confirmation UX first, the agent capability second. The hardest part of shipping a personal assistant is trust. The agent that can send emails has to first earn the right to send emails. Confirmation is the gateway; design it before you design the model.

Why four surfaces, not eight?

Four is the largest coarse-grained classification a small fast routing model handles reliably. Eight tends to produce mis-classifications between adjacent categories (e.g., search vs mail-search). If your domain genuinely needs more capabilities, the right move is a two-level router — coarse routing first (which super-category), then a second routing call within the super-category — rather than a flat eight-way classifier. The cost is one extra call on those turns; the benefit is preserved routing accuracy.
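The two-level alternative can be sketched as a coarse classifier followed by an optional second classifier for groups that need one. The function name, the `group/surface` return format, and the idea of passing classifiers as plain callables are assumptions made for illustration.

```python
from typing import Callable, Dict

Classifier = Callable[[str], str]


def two_level_route(
    turn: str,
    coarse: Classifier,
    fine: Dict[str, Classifier],
) -> str:
    """Pick a super-category first, then refine only where a second router exists."""
    group = coarse(turn)
    refine = fine.get(group)
    if refine is None:
        # Single-surface group: no second call is needed.
        return group
    # One extra classification call, paid only on turns in this group.
    return f"{group}/{refine(turn)}"
```

Each classifier still chooses among a handful of options, so neither level faces the flat eight-way decision the paragraph above warns against.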
