Operator Authentication — Design & Build Plan

Status: in development (feat/operator-auth) · Targets: 0.15.x · Date: 2026-06-17

Adds a login + roles to the internal Terra-View app — the operator-facing surface that today has zero auth. This is the prerequisite that makes the app safe to expose to the internet (the office-deployment sequencing: operator auth → expose). Expands the "Deferred A" section of 2026-06-15-portal-auth-design.md into a standalone spec.

Goal

Anyone reaching the internal app must log in. Three known users to start (you + two parents), two effective roles, and a dead-simple password-reset story for a family-run shop. Reuses the building blocks the client portal already shipped: the argon2 hasher (backend/auth_passwords.py) and the HMAC signed-cookie pattern (backend/portal_auth.py).

Scope

v1 (this spec): email + password login (argon2) · long-lived "remember this device" session · brute-force lockout · a deny-by-default gate over the whole internal app · superadmin/admin roles · superadmin-only user management · password reset (superadmin-resets-anyone + self-service change + forced change) · a seed CLI to bootstrap · the OPERATOR_AUTH_ENABLED feature-flag rollout.

Deferred (designed-not-built): TOTP 2FA (near-term follow-up, superadmin account first) · the operator restricted role · email-based self-service password reset (needs the email infra coming with the report work).

Principles

  1. Deny by default. Every route requires a login except an explicit allow-list. A route added next year is protected automatically — you can't forget to gate it.
  2. Can't lock yourself out. Ship dark behind a feature flag; seed + verify before enforcing; the flag is an instant escape hatch; a CLI is the break-glass.
  3. Reuse, don't reinvent. argon2 + the signed-cookie HMAC already exist and are tested. Operator auth is a thin new layer, not a parallel crypto stack.
  4. Easy recovery. For a 3-person shop, "forgot my password" must be a 10-second fix — the superadmin resets it, no email round-trip required.

Architecture

                         OPERATOR_AUTH_ENABLED=false ──▶ pass everything (app as today)
 request ──▶ gate middleware ─┤
                              └ enabled ─▶ path exempt? ──yes──▶ serve (no login)
                                              │  exempt: /login /logout /health
                                              │  /static/* /portal/* + 3 machine endpoints
                                              └no─▶ valid operator session?
                                                       ├ no  ─▶ HTML: 303 → /login?next=…
                                                       │         /api/*: 401 JSON
                                                       ├ must_change_password ─▶ 303 → /change-password
                                                       └ yes ─▶ request.state.operator = user
                                                                 ─▶ route runs; require_role() may 403

One Starlette HTTP middleware is the gate (not per-route dependencies — a middleware can't miss a route). It resolves the operator from the cookie using its own SessionLocal() (same pattern the portal WS handler uses), stashes the user on request.state.operator, and a require_role(...) dependency reads it for the few routes that need more than "logged in."
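The branches in the diagram can be sketched as a pure decision function, which is easy to test separately from the middleware plumbing. This is a minimal sketch, not the actual middleware: the names GateDecision and decide are hypothetical, and the caller is assumed to have already resolved whether the path is exempt and whether a valid operator session exists.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import quote


@dataclass
class GateDecision:
    """Hypothetical result type: what the gate should do with a request."""
    action: str                      # "pass", "redirect", or "unauthorized"
    status: int = 200
    location: Optional[str] = None   # redirect target, when action == "redirect"


def decide(enabled: bool, path: str, exempt: bool,
           has_session: bool, must_change_password: bool) -> GateDecision:
    """Mirror the branches of the architecture diagram, top to bottom."""
    if not enabled:
        return GateDecision("pass")              # flag off: app behaves as today
    if exempt:
        return GateDecision("pass")              # allow-listed path, no login
    if not has_session:
        if path.startswith("/api/"):
            return GateDecision("unauthorized", 401)   # JSON clients get a 401
        target = f"/login?next={quote(path, safe='/')}"
        return GateDecision("redirect", 303, target)   # browsers go to /login
    if must_change_password and path != "/change-password":
        return GateDecision("redirect", 303, "/change-password")
    return GateDecision("pass")                  # logged in: the route runs
```

In the real middleware, the "pass" case with a session would also set request.state.operator before calling the next handler.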

Data model

New table operator_users (brand-new → create_all builds it on startup, no migration needed, same as the portal's clients table):

| Column | Type | Notes |
| --- | --- | --- |
| id | str (UUID) | caller-supplied str(uuid.uuid4()) (codebase convention) |
| email | str, unique, indexed | login handle, stored lowercased |
| display_name | str | "Brian", "Dad"; shown in UI + history |
| password_hash | str | argon2id via auth_passwords.hash_password |
| role | str | "superadmin" or "admin" ("operator" reserved, deferred) |
| active | bool, default True | disable a login without deleting |
| must_change_password | bool, default False | set on create/reset → forces a change on next login |
| sessions_valid_from | datetime, default utcnow | bump to invalidate ALL of a user's sessions |
| failed_login_count | int, default 0 | lockout counter |
| locked_until | datetime, nullable | set after too many bad tries |
| created_at | datetime, default utcnow | |
| last_login_at | datetime, nullable | |

(Deferred columns, not in v1: totp_secret, totp_enabled.)

Role ladder — a rank map so checks read naturally and operator slots in later:

_ROLE_RANK = {"operator": 10, "admin": 20, "superadmin": 30}

require_role("admin") = admin or above; require_role("superadmin") for account mgmt.
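The "at or above" comparison behind require_role can be sketched as a small helper. This is an illustration only: role_allows is a hypothetical name, and treating an unknown role as rank 0 (always denied) is an assumption rather than something the spec states.

```python
_ROLE_RANK = {"operator": 10, "admin": 20, "superadmin": 30}


def role_allows(user_role: str, minimum: str) -> bool:
    """True when user_role ranks at or above the required minimum role.

    An unrecognized user_role gets rank 0 (assumed), so it never passes.
    An unrecognized minimum raises KeyError, surfacing a typo in the route.
    """
    return _ROLE_RANK.get(user_role, 0) >= _ROLE_RANK[minimum]
```

A require_role("superadmin") dependency would then respond 403 whenever role_allows(request.state.operator.role, "superadmin") is False.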

Sessions

New shared module backend/auth_cookies.py — lift the generic signer out so both auth systems share one implementation:

import os

def sign(payload: dict) -> str: ...   # returns f"{b64url(json)}.{hmac_sha256(b64, SECRET_KEY)}"
def read(raw: str, max_age: int) -> dict | None: ...   # verify sig (compare_digest) + iat expiry; None on tamper/expiry

SECRET_KEY = os.getenv("SECRET_KEY", "dev-insecure-change-me")   # same env the portal reads
_TRUTHY = {"1", "true", "yes", "on"}   # assumed set; the spec only says "truthy"
COOKIE_SECURE = os.getenv("COOKIE_SECURE", "false").lower() in _TRUTHY

Operator auth uses it now. (Portal's existing cookie helpers keep working untouched; migrating them onto auth_cookies is an optional later dedupe, gated on the portal tests staying green — don't destabilize the shipped portal for it.)
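One way the sign/read pair could be filled in, using only the standard library. This is a sketch under assumptions, not the portal's existing code: the hex encoding of the signature, the unpadded base64url body, and the private helper names _b64url and _mac are illustrative choices.

```python
import base64
import hashlib
import hmac
import json
import os
import time
from typing import Optional

SECRET_KEY = os.getenv("SECRET_KEY", "dev-insecure-change-me")


def _b64url(data: bytes) -> str:
    # URL-safe base64 without padding, so the value is cookie-friendly.
    return base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")


def _mac(body: str) -> str:
    # HMAC-SHA256 over the encoded body; hex output is an assumed encoding.
    return hmac.new(SECRET_KEY.encode(), body.encode(), hashlib.sha256).hexdigest()


def sign(payload: dict) -> str:
    body = _b64url(json.dumps(payload, separators=(",", ":")).encode())
    return f"{body}.{_mac(body)}"


def read(raw: str, max_age: int) -> Optional[dict]:
    body, sep, sig = raw.rpartition(".")
    # Constant-time comparison on bytes; reject anything without a valid MAC.
    if not sep or not hmac.compare_digest(sig.encode(), _mac(body).encode()):
        return None
    try:
        padded = body + "=" * (-len(body) % 4)
        payload = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:
        return None
    iat = payload.get("iat") if isinstance(payload, dict) else None
    # Server-side expiry: the cookie's own iat must be within max_age seconds.
    if not isinstance(iat, (int, float)) or time.time() - iat > max_age:
        return None
    return payload
```

The signature is checked before the body is decoded, so a forged cookie is never parsed as JSON.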

Operator session cookie: name tv_session (distinct from the portal's portal_session), payload {"uid": <id>, "iat": <epoch>}, max_age 30 days (= the "remember this device" — a small trusted set re-logs in rarely), httponly, samesite=lax, secure=COOKIE_SECURE.

Validation each request (current_operator(request, db)): read+verify cookie → load OperatorUser by uid → require active, iat >= sessions_valid_from (epoch), and not locked_until > now. Any failure → no session. Bumping sessions_valid_from (on password change / "log out everywhere") instantly kills all live cookies with no session table.
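The three per-request checks can be sketched as a pure function over the loaded user row. OperatorSnapshot and session_is_valid are hypothetical names, and the sketch assumes timezone-aware UTC datetimes; with naive utcnow values, .timestamp() would interpret them as local time, so the real code would need to normalize first.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class OperatorSnapshot:
    """Hypothetical stand-in for the OperatorUser fields the check reads."""
    active: bool
    sessions_valid_from: datetime
    locked_until: Optional[datetime] = None


def session_is_valid(user: OperatorSnapshot, iat: float, now: datetime) -> bool:
    if not user.active:
        return False                                  # disabled account
    if iat < user.sessions_valid_from.timestamp():
        return False                                  # cookie predates a bump
    if user.locked_until is not None and user.locked_until > now:
        return False                                  # currently locked out
    return True


# Usage: a cookie issued "now" for a user whose sessions became valid yesterday.
now = datetime(2026, 6, 17, 12, 0, tzinfo=timezone.utc)
user = OperatorSnapshot(active=True, sessions_valid_from=now - timedelta(days=1))
print(session_is_valid(user, now.timestamp(), now))  # True
```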

Authorization

The gate (middleware) exempt list:

  • /login, /logout, /health, /static/*, plus PWA assets (/manifest.json, /sw.js, /favicon.ico)
  • /portal/* — the client portal keeps its own (separate) auth
  • machine endpoints (LAN-only, automated, no human): /emitters/report, /api/series3/heartbeat, /api/series4/heartbeat

/change-password is not exempt — it requires a logged-in session (you change your own password). It's only excluded from the must_change_password redirect, so a forced-change user can actually reach it (no redirect loop).
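The exempt list above might be encoded as exact paths plus prefixes. This is a sketch: the constant and function names are hypothetical, and matching /static/* and /portal/* by string prefix is an assumption about how the wildcard is meant.

```python
# Paths served without a login, taken from the exempt list above.
EXEMPT_EXACT = {
    "/login", "/logout", "/health",
    "/manifest.json", "/sw.js", "/favicon.ico",      # PWA assets
    "/emitters/report",                               # machine endpoints
    "/api/series3/heartbeat", "/api/series4/heartbeat",
}
EXEMPT_PREFIXES = ("/static/", "/portal/")


def is_exempt(path: str) -> bool:
    """True when the gate should serve this path without a session."""
    return path in EXEMPT_EXACT or path.startswith(EXEMPT_PREFIXES)
```

Exact matching for the single-path entries keeps a look-alike path such as /loginx gated, and /change-password stays out of the set so it still requires a session.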

Permission split: minimal by design. Because the operator role is deferred, every real v1 user is admin or superadmin, so "logged in" already means "full app." The only thing gated above plain admin is account management: require_role("superadmin") on the user-management routes. Everything else just requires a valid session (the middleware). One extra guard, not a sprawling matrix.

The flag governs everything. Both the middleware and require_role respect OPERATOR_AUTH_ENABLED: when it's off, neither enforces anything (no session is set, and require_role passes through) — the app behaves exactly as it does today. When it's on, the middleware guarantees request.state.operator is set before any require_role check runs.

Password management & reset (the emphasized requirement)

Three paths, no email infra required:

  1. Superadmin resets anyone — from the user-management UI, "Reset password" → generates a strong password (auth_passwords.generate_password), stores its hash, sets must_change_password=True, shows the temp password once for you to hand off. Covers "easy for me to reset their password."
  2. Self-service change via /change-password (any logged-in user): current + new password. Used for routine changes and the forced post-reset change. On success, bump sessions_valid_from (logs out other devices) and clear must_change_password.
  3. Forced change — after a reset/first login, must_change_password=True → the gate routes them to /change-password until they set their own.

Forgot it entirely (can't log in): v1 has no email reset. /login shows "Forgot your password? Contact your administrator," and you (superadmin) reset it via the UI or CLI. For a 3-person shop that's a text message, not a feature. (Email-based self-service is the deferred follow-up once email infra lands.)
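The state changes for the reset and self-service paths can be sketched with the hasher passed in as a parameter. This is illustrative only: Account, reset_password, and change_password are hypothetical names, secrets.token_urlsafe stands in for auth_passwords.generate_password, and the verify(hash, password) argument order is an assumption.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class Account:
    """Hypothetical stand-in for the OperatorUser fields these paths touch."""
    password_hash: str
    sessions_valid_from: datetime
    must_change_password: bool = False


def reset_password(user: Account, hash_password: Callable[[str], str]) -> str:
    """Path 1: superadmin reset. Returns the temp password to show once."""
    temp = secrets.token_urlsafe(12)   # stand-in for generate_password
    user.password_hash = hash_password(temp)
    user.must_change_password = True   # the gate forces a change on next login
    return temp                        # never stored or logged by this function


def change_password(user: Account, current: str, new: str,
                    hash_password: Callable[[str], str],
                    verify_password: Callable[[str, str], bool],
                    now: datetime) -> bool:
    """Paths 2 and 3: the user sets their own password."""
    if not verify_password(user.password_hash, current):
        return False
    user.password_hash = hash_password(new)
    user.must_change_password = False
    user.sessions_valid_from = now     # invalidates cookies issued before now
    return True
```

Passing the hasher in keeps the sketch independent of argon2; the real routes would pass the functions from auth_passwords.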

Bootstrapping — seed CLI

backend/operator_admin.py (modeled on the existing portal_admin.py), run inside the container against the live DB:

create-superadmin --email you@x.com --name "Brian"   # prompts for a password (or --generate)
create-user --email dad@x.com --name "Dad" --role admin   # generates a temp password, must_change=True
reset-password --email dad@x.com                     # generates a temp, must_change=True
list                                                 # users + roles + active/locked state
disable --email dad@x.com  /  enable --email dad@x.com

The CLI is the bootstrap (first superadmin, before any UI is reachable) and the break-glass (locked out / forgot everything).
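The command surface above could be wired up with argparse subcommands. This is a sketch, not portal_admin.py's actual structure: build_parser is a hypothetical name, and the choices for --role (admin or superadmin) are an assumption, since the spec only shows --role admin.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="operator_admin")
    sub = parser.add_subparsers(dest="command", required=True)

    p = sub.add_parser("create-superadmin", help="bootstrap the first superadmin")
    p.add_argument("--email", required=True)
    p.add_argument("--name", required=True)
    p.add_argument("--generate", action="store_true",
                   help="generate a password instead of prompting")

    p = sub.add_parser("create-user", help="create a user with a temp password")
    p.add_argument("--email", required=True)
    p.add_argument("--name", required=True)
    p.add_argument("--role", choices=("admin", "superadmin"), default="admin")

    # These three commands only need to identify the account.
    for name in ("reset-password", "disable", "enable"):
        p = sub.add_parser(name)
        p.add_argument("--email", required=True)

    sub.add_parser("list", help="users + roles + active/locked state")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.command)   # dispatch to the matching handler would go here
```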

Account-management UI (superadmin-only)

GET /admin/users (page, require_role("superadmin")) + JSON endpoints:

  • list operators (name, email, role, active, locked, last login)
  • add operator (email, name, role) → temp password shown once
  • reset password → temp shown once
  • enable / disable, change role

Template: templates/admin/users.html. Admins (parents) don't see this page; it is superadmin-only.

Login / logout / change-password

  • GET /login → templates/login.html (email + password, optional ?next=).
  • POST /login → lowercase the email, check lockout, argon2-verify. On success: set tv_session, stamp last_login_at, clear failed_login_count, and redirect to next or /; if must_change_password is set, redirect to /change-password instead. On failure: increment failed_login_count and show a generic "invalid email or password" (no user-enumeration); lock after 5 failures for 15 minutes.
  • GET /logout → clear cookie → /login.
  • GET/POST /change-password → templates/change_password.html.
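The lockout bookkeeping in POST /login can be sketched as a small state machine. LoginState, is_locked, and record_attempt are hypothetical names. The spec does not say whether the counter resets when a lock expires; this sketch leaves it untouched until a successful login, which is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_FAILED_LOGINS = 5
LOCKOUT = timedelta(minutes=15)


@dataclass
class LoginState:
    """Hypothetical stand-in for the lockout columns on operator_users."""
    failed_login_count: int = 0
    locked_until: Optional[datetime] = None
    last_login_at: Optional[datetime] = None


def is_locked(state: LoginState, now: datetime) -> bool:
    return state.locked_until is not None and state.locked_until > now


def record_attempt(state: LoginState, password_ok: bool, now: datetime) -> bool:
    """Apply one login attempt; return True when the login should succeed."""
    if is_locked(state, now):
        return False                       # refuse even a correct password
    if password_ok:
        state.failed_login_count = 0
        state.locked_until = None
        state.last_login_at = now
        return True
    state.failed_login_count += 1
    if state.failed_login_count >= MAX_FAILED_LOGINS:
        state.locked_until = now + LOCKOUT
    return False
```

The caller shows the same generic error for every False result except the locked case, which gets the "too many attempts" message.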

Error handling

  • Wrong email/password → generic message, increment fail count.
  • ≥5 fails → "too many attempts, try again in 15 minutes" (locked_until).
  • No/expired/forged cookie → HTML routes 303→/login?next=…; /api/* → 401 JSON.
  • Disabled / role-changed / password-changed-elsewhere → bounced on next request (re-validated against the DB every request).
  • Superadmin-only route hit by an admin → 403.

Rollout — the no-self-lockout sequence

  1. Ship with OPERATOR_AUTH_ENABLED=false (default) → the middleware short-circuits, app behaves exactly as today. Deploying can't break or lock anything.
  2. Seed your superadmin via operator_admin.py.
  3. Hit /login and confirm you get a session while the flag is still off (the login routes work regardless of the flag).
  4. Flip OPERATOR_AUTH_ENABLED=true → the gate enforces. Your cookie is valid → you're in. Anything wrong → flip it back off (instant escape hatch).
  5. Create your parents' accounts from /admin/users (temp passwords, they change on first login).
  • Break-glass: operator_admin.py reset-password / create-superadmin in the container; or flag off.

Testing

Reuses the pytest harness from the portal work (docker exec … python -m pytest).

  • Middleware: flag off → every path passes; flag on → exempt paths + the 3 machine endpoints pass with no cookie, a gated HTML path 303s to /login, a gated /api/* path 401s, must_change_password user is routed to /change-password.
  • Login: success sets tv_session; wrong password rejected + counts; 5 wrong → locked (even correct password refused).
  • Roles: require_role("superadmin") route → admin gets 403, superadmin 200.
  • Sessions: bumping sessions_valid_from invalidates an existing cookie.
  • Password: self-change works + clears must_change_password; superadmin reset sets a new hash + must_change_password + returns the raw once.
  • Machine endpoints: /api/series3/heartbeat etc. still 200 with the gate ON and no cookie (regression guard so we never silently break the watchers).

File structure

| File | Responsibility |
| --- | --- |
| backend/auth_cookies.py (new) | generic sign/read + SECRET_KEY/COOKIE_SECURE |
| backend/models.py | add OperatorUser |
| backend/operator_auth.py (new) | current_operator, require_role, the gate middleware, login/lockout helpers |
| backend/routers/operator_auth_routes.py (new) | /login, /logout, /change-password |
| backend/routers/operator_users.py (new) | /admin/users page + CRUD (superadmin) |
| backend/operator_admin.py (new) | seed/break-glass CLI |
| backend/main.py | register the gate middleware + routers; OPERATOR_AUTH_ENABLED |
| templates/login.html, templates/change_password.html, templates/admin/users.html (new) | UI |

Going to prod

  • New table auto-creates; no migration. Just code + seeding.
  • Set a real SECRET_KEY (shared with the portal cookie) and COOKIE_SECURE=true once on HTTPS — same env knobs already wired in docker-compose.yml.
  • Operator auth is what makes internet-exposing the internal app safe; pair with the (deferred) office deployment + reverse-proxy/TLS work.

Security notes

  • Deny-by-default; client-supplied ids never trusted; every request re-validates the session against the DB (instant revoke via active / sessions_valid_from).
  • Passwords argon2-hashed; generic login errors (no user-enumeration); lockout on brute force; raw temp passwords shown once, never stored or logged.
  • Cookies HttpOnly + SameSite=Lax + Secure (on TLS), HMAC-signed with server-side iat expiry.
  • Known residual until deploy: without TLS the password crosses the wire in cleartext — fix is the deployment-phase TLS (Synology Let's Encrypt / Cloudflare Tunnel). The login is still a massive improvement over today's zero-auth exposure.
  • TOTP 2FA is the near-term follow-up (superadmin first), especially without the UniFi edge in front on the home network.