docs: operator-auth design spec (v1 password login + roles + easy reset; 2FA/operator-role deferred)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:39:49 +00:00
parent 0beba9bfbc
commit 59c19291ca
@@ -0,0 +1,266 @@
# Operator Authentication — Design & Build Plan
**Status:** in development (`feat/operator-auth`) · **Targets:** 0.15.x · **Date:** 2026-06-17
Adds a login + roles to the **internal** Terra-View app — the operator-facing
surface that today has **zero auth**. This is the prerequisite that makes the app
safe to expose to the internet (the office-deployment sequencing: operator auth →
expose). Expands the "Deferred A" section of
[2026-06-15-portal-auth-design.md](2026-06-15-portal-auth-design.md) into a
standalone spec.
## Goal
Anyone reaching the internal app must log in. Three known users to start (you +
two parents), two effective roles, and a **dead-simple password-reset story** for a
family-run shop. Reuses the building blocks the client portal already shipped: the
argon2 hasher (`backend/auth_passwords.py`) and the HMAC signed-cookie pattern
(`backend/portal_auth.py`).
## Scope
**v1 (this spec):** email + password login (argon2) · long-lived "remember this
device" session · brute-force lockout · a **deny-by-default gate** over the whole
internal app · `superadmin`/`admin` roles · **superadmin-only user management** ·
**password reset** (superadmin-resets-anyone + self-service change + forced change)
· a **seed CLI** to bootstrap · the `OPERATOR_AUTH_ENABLED` **feature-flag rollout**.
**Deferred (designed-not-built):** TOTP 2FA (near-term follow-up, `superadmin`
account first) · the `operator` restricted role · email-based self-service
password reset (needs the email infra coming with the report work).
## Principles
1. **Deny by default.** Every route requires a login *except* an explicit allow-list.
A route added next year is protected automatically — you can't forget to gate it.
2. **Can't lock yourself out.** Ship dark behind a feature flag; seed + verify before
enforcing; the flag is an instant escape hatch; a CLI is the break-glass.
3. **Reuse, don't reinvent.** argon2 + the signed-cookie HMAC already exist and are
tested. Operator auth is a thin new layer, not a parallel crypto stack.
4. **Easy recovery.** For a 3-person shop, "forgot my password" must be a 10-second
fix — the superadmin resets it, no email round-trip required.
## Architecture
```
                               OPERATOR_AUTH_ENABLED=false ──▶ pass everything (app as today)
request ──▶ gate middleware ─┤
                             └ enabled ─▶ path exempt? ──yes──▶ serve (no login)
                                         │  exempt: /login /logout /health
                                         │          /static/* /portal/* + 3 machine endpoints
                                         └no─▶ valid operator session?
                                              ├ no ─▶ HTML: 303 → /login?next=…
                                              │       /api/*: 401 JSON
                                              ├ must_change_password ─▶ 303 → /change-password
                                              └ yes ─▶ request.state.operator = user
                                                    ─▶ route runs; require_role() may 403
```
One **Starlette HTTP middleware** is the gate (not per-route dependencies — a
middleware can't miss a route). It resolves the operator from the cookie using its
own `SessionLocal()` (same pattern the portal WS handler uses), stashes the user on
`request.state.operator`, and a `require_role(...)` dependency reads it for the few
routes that need more than "logged in."
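
A minimal sketch of the gate's decision logic, written as a pure function so it can be tested without a running app. The names `gate_decision` and `Operator` are hypothetical, and the caller is assumed to have already worked out whether the path is exempt and which operator (if any) the cookie resolves to.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import quote


@dataclass
class Operator:
    # Hypothetical stand-in for the OperatorUser row; only the field the gate reads.
    must_change_password: bool = False


def gate_decision(path: str, enabled: bool, exempt: bool,
                  operator: Optional[Operator]) -> tuple[int, Optional[str]]:
    """Return (status, redirect_target) for one request.

    200 means "let the route run", 303 carries a Location, 401 is the JSON case.
    """
    if not enabled or exempt:
        return 200, None                      # flag off, or allow-listed path
    if operator is None:
        if path.startswith("/api/"):
            return 401, None                  # API callers get JSON, not a redirect
        return 303, "/login?next=" + quote(path, safe="/")
    if operator.must_change_password and path != "/change-password":
        return 303, "/change-password"        # forced change, without a redirect loop
    return 200, None                          # caller sets request.state.operator
```

In a Starlette middleware the returned pair would map onto a `RedirectResponse`, a `JSONResponse`, or a call to the next handler; that wiring is left out here to keep the sketch dependency-free.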
## Data model
New table **`operator_users`** (brand-new → `create_all` builds it on startup, **no
migration needed**, same as the portal's `clients` table):
| Column | Type | Notes |
|---|---|---|
| `id` | str UUID | caller-supplied `str(uuid.uuid4())` (codebase convention) |
| `email` | str, unique, indexed | login handle, stored lowercased |
| `display_name` | str | "Brian", "Dad" — shown in UI + history |
| `password_hash` | str | argon2id via `auth_passwords.hash_password` |
| `role` | str | `"superadmin"` \| `"admin"` (`"operator"` reserved, deferred) |
| `active` | bool, default True | disable a login without deleting |
| `must_change_password` | bool, default False | set on create/reset → forces a change on next login |
| `sessions_valid_from` | datetime, default `utcnow` | bump to invalidate ALL of a user's sessions |
| `failed_login_count` | int, default 0 | lockout counter |
| `locked_until` | datetime, nullable | set after too many bad tries |
| `created_at` | datetime, default `utcnow` | |
| `last_login_at` | datetime, nullable | |
(Deferred columns, not in v1: `totp_secret`, `totp_enabled`.)
**Role ladder** — a rank map so checks read naturally and `operator` slots in later:
```python
_ROLE_RANK = {"operator": 10, "admin": 20, "superadmin": 30}
```
`require_role("admin")` = admin or above; `require_role("superadmin")` for account mgmt.
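
The rank comparison behind `require_role` could look roughly like this. `role_allows` is a hypothetical helper name; the real dependency would presumably wrap a check like it and raise a 403 when it returns `False`.

```python
_ROLE_RANK = {"operator": 10, "admin": 20, "superadmin": 30}


def role_allows(user_role: str, minimum: str) -> bool:
    """True when user_role ranks at or above minimum; unknown user roles never pass."""
    required = _ROLE_RANK[minimum]   # KeyError here means a typo in the route's own requirement
    return _ROLE_RANK.get(user_role, 0) >= required
```

Treating an unknown stored role as rank 0 keeps the check deny-by-default if a row ever carries an unexpected value.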
## Sessions
**New shared module `backend/auth_cookies.py`** — lift the generic signer out so both
auth systems share one implementation:
```python
import os

def sign(payload: dict) -> str: ...  # f"{b64url(json)}.{hmac_sha256(b64, SECRET_KEY)}"
def read(raw: str, max_age: int) -> dict | None: ...  # verify sig (compare_digest) + iat expiry; None on tamper/expiry

SECRET_KEY = os.getenv("SECRET_KEY", "dev-insecure-change-me")  # same env the portal reads
COOKIE_SECURE = os.getenv("COOKIE_SECURE", "false").lower() in {"1", "true", "yes"}  # truthy set is an assumption
```
Operator auth uses it now. (Portal's existing cookie helpers keep working untouched;
migrating them onto `auth_cookies` is an optional later dedupe, gated on the portal
tests staying green — don't destabilize the shipped portal for it.)
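
One way the `sign`/`read` pair could be filled in with the standard library only. This is a sketch, not the portal's actual implementation: the hex-encoded signature, the padded URL-safe base64 body, and the `_mac` helper are assumptions.

```python
import base64
import hashlib
import hmac
import json
import os
import time
from typing import Optional

SECRET_KEY = os.getenv("SECRET_KEY", "dev-insecure-change-me")


def _mac(body: str) -> str:
    # Hypothetical helper: HMAC-SHA256 over the encoded body, hex-encoded.
    return hmac.new(SECRET_KEY.encode(), body.encode(), hashlib.sha256).hexdigest()


def sign(payload: dict) -> str:
    raw = json.dumps(payload, separators=(",", ":")).encode()
    body = base64.urlsafe_b64encode(raw).decode()
    return f"{body}.{_mac(body)}"


def read(raw: str, max_age: int) -> Optional[dict]:
    body, sep, sig = raw.partition(".")
    # Compare as bytes so a non-ASCII cookie value cannot raise inside compare_digest.
    if not sep or not hmac.compare_digest(sig.encode(), _mac(body).encode()):
        return None                            # malformed or tampered
    try:
        payload = json.loads(base64.urlsafe_b64decode(body))
    except ValueError:
        return None                            # undecodable body
    iat = payload.get("iat") if isinstance(payload, dict) else None
    if not isinstance(iat, (int, float)) or time.time() - iat > max_age:
        return None                            # missing or expired iat
    return payload
```

The signature is checked before the body is decoded, so untrusted input never reaches the JSON parser unless it was signed with the server's key.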
**Operator session cookie:** name **`tv_session`** (distinct from the portal's
`portal_session`), payload `{"uid": <id>, "iat": <epoch>}`, `max_age` 30 days
(this is the "remember this device" behavior: a small trusted group rarely has to
log in again), `httponly`, `samesite=lax`, `secure=COOKIE_SECURE`.
**Validation each request** (`current_operator(request, db)`): read+verify cookie →
load `OperatorUser` by `uid` → require `active`, `iat >= sessions_valid_from`
(epoch), and not `locked_until > now`. Any failure → no session. Bumping
`sessions_valid_from` (on password change / "log out everywhere") instantly kills all
live cookies with no session table.
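
The three per-request checks can be sketched as a pure function over an already signature-verified cookie. `OperatorRow` and `session_is_live` are hypothetical names, and the sketch assumes timezone-aware UTC datetimes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class OperatorRow:
    # Hypothetical stand-in for the OperatorUser model; only the fields checked here.
    active: bool
    sessions_valid_from: datetime
    locked_until: Optional[datetime] = None


def session_is_live(iat: int, user: OperatorRow, now: datetime) -> bool:
    """Apply the active / sessions_valid_from / locked_until checks."""
    if not user.active:
        return False
    # Truncate to whole seconds so a cookie issued in the same second as the bump still validates.
    if iat < int(user.sessions_valid_from.timestamp()):
        return False                  # cookie predates the last "log out everywhere"
    if user.locked_until is not None and user.locked_until > now:
        return False
    return True
```

One pitfall worth noting: `datetime.timestamp()` interprets a naive datetime as local time, so values stored via `utcnow` would need `.replace(tzinfo=timezone.utc)` before being compared against the cookie's epoch `iat`.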
## Authorization
**The gate (middleware) exempt list:**
- `/login`, `/logout`, `/health`, `/static/*`, plus PWA assets
(`/manifest.json`, `/sw.js`, `/favicon.ico`)
- `/portal/*` — the client portal keeps its own (separate) auth
- **machine endpoints (LAN-only, automated, no human):** `/emitters/report`,
`/api/series3/heartbeat`, `/api/series4/heartbeat`
`/change-password` is **not** exempt — it requires a logged-in session (you change
*your own* password). It's only *excluded from the `must_change_password` redirect*,
so a forced-change user can actually reach it (no redirect loop).
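
The exempt list above could be expressed as an exact-match set plus a prefix tuple. The constant names and `is_exempt` are hypothetical; the paths are the ones listed in this section.

```python
EXEMPT_EXACT = {
    "/login", "/logout", "/health",
    "/manifest.json", "/sw.js", "/favicon.ico",
    # machine endpoints (LAN-only, no human behind them)
    "/emitters/report", "/api/series3/heartbeat", "/api/series4/heartbeat",
}
EXEMPT_PREFIXES = ("/static/", "/portal/")


def is_exempt(path: str) -> bool:
    """True when the gate should serve this path without a session."""
    return path in EXEMPT_EXACT or path.startswith(EXEMPT_PREFIXES)
```

Exact matching for the single endpoints means a neighbouring path such as `/login-help` stays gated. A bare `/portal` (no trailing slash) would also be gated under this sketch, so the real list may need that entry added.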
**Permission split — minimal by design.** Because the `operator` role is deferred,
every real v1 user is `admin` or `superadmin`, so "logged in" already means "full
app." The *only* thing gated above plain-admin is **account management**:
`require_role("superadmin")` on the user-management routes. Everything else just
requires a valid session (the middleware). One extra guard, not a sprawling matrix.
**The flag governs everything.** Both the middleware *and* `require_role` respect
`OPERATOR_AUTH_ENABLED`: when it's off, neither enforces anything (the middleware does not
populate `request.state.operator`, and `require_role` passes through), so the app behaves exactly as it does today. When
it's on, the middleware guarantees `request.state.operator` is set before any
`require_role` check runs.
## Password management & reset *(the emphasized requirement)*
Three paths, no email infra required:
1. **Superadmin resets anyone** — from the user-management UI, "Reset password" →
generates a strong password (`auth_passwords.generate_password`), stores its hash,
sets `must_change_password=True`, **shows the temp password once** for you to hand
off. Covers "easy for *me* to reset *their* password."
2. **Self-service change** at `/change-password` (any logged-in user): current + new.
Used for routine changes **and** the forced post-reset change. On success, bump
`sessions_valid_from` (logs out other devices) and clear `must_change_password`.
3. **Forced change** — after a reset/first login, `must_change_password=True` → the
gate routes them to `/change-password` until they set their own.
**Forgot it entirely (can't log in):** v1 has **no email reset**. `/login` shows
"Forgot your password? Contact your administrator," and you (superadmin) reset it via
the UI or CLI. For a 3-person shop that's a text message, not a feature. (Email-based
self-service is the deferred follow-up once email infra lands.)
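
A sketch of the state changes behind the superadmin reset and the self-service change. The hasher is passed in so the example does not depend on argon2; `secrets.token_urlsafe` stands in for `auth_passwords.generate_password`, whose real signature is not shown in this spec. `OperatorRow`, `admin_reset`, and `self_change` are hypothetical names.

```python
import secrets
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class OperatorRow:
    # Hypothetical stand-in for the OperatorUser model.
    password_hash: str
    must_change_password: bool = False
    sessions_valid_from: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))


def admin_reset(user: OperatorRow, hash_password: Callable[[str], str]) -> str:
    """Superadmin path: returns the temp password so it can be displayed once."""
    temp = secrets.token_urlsafe(12)          # stand-in for auth_passwords.generate_password
    user.password_hash = hash_password(temp)  # only the hash is stored
    user.must_change_password = True
    return temp


def self_change(user: OperatorRow, new_password: str,
                hash_password: Callable[[str], str], now: datetime) -> None:
    """Self-service path; the caller has already verified the current password."""
    user.password_hash = hash_password(new_password)
    user.must_change_password = False
    user.sessions_valid_from = now            # invalidates cookies issued earlier
```

Because the bump invalidates every earlier cookie, the device performing the change would presumably need a fresh `tv_session` issued afterwards to stay logged in.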
## Bootstrapping — seed CLI
`backend/operator_admin.py` (modeled on the existing `portal_admin.py`), run inside
the container against the live DB:
```
create-superadmin --email you@x.com --name "Brian" # prompts for a password (or --generate)
create-user --email dad@x.com --name "Dad" --role admin # generates a temp password, must_change=True
reset-password --email dad@x.com # generates a temp, must_change=True
list # users + roles + active/locked state
disable --email dad@x.com / enable --email dad@x.com
```
The CLI is the bootstrap (first superadmin, before any UI is reachable) **and** the
break-glass (locked out / forgot everything).
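
The subcommands above could be parsed with `argparse` along these lines. Whether `portal_admin.py` uses `argparse` is not stated here, so treat the parser layout, the `build_parser` name, and the `--role` default as assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="operator_admin")
    sub = parser.add_subparsers(dest="command", required=True)

    p = sub.add_parser("create-superadmin")
    p.add_argument("--email", required=True)
    p.add_argument("--name", required=True)
    p.add_argument("--generate", action="store_true",
                   help="generate a password instead of prompting")

    p = sub.add_parser("create-user")
    p.add_argument("--email", required=True)
    p.add_argument("--name", required=True)
    p.add_argument("--role", choices=["admin", "superadmin"],
                   default="admin")          # default is an assumption

    # These three take only the account's email.
    for name in ("reset-password", "disable", "enable"):
        sub.add_parser(name).add_argument("--email", required=True)

    sub.add_parser("list")
    return parser
```

Restricting `--role` to the two v1 roles means the reserved `operator` role cannot be assigned by accident before it is built.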
## Account-management UI (superadmin-only)
`GET /admin/users` (page, `require_role("superadmin")`) + JSON endpoints:
- list operators (name, email, role, active, locked, last login)
- add operator (email, name, role) → temp password shown once
- reset password → temp shown once
- enable / disable, change role
Template `templates/admin/users.html`. Admins (parents) don't see this; superadmin only.
## Login / logout / change-password
- `GET /login` → `templates/login.html` (email + password, optional `?next=`).
- `POST /login` → lowercase the email, check the lockout, argon2-verify. On success: set
`tv_session`, stamp `last_login_at`, clear `failed_login_count`, and redirect to `next`
or `/` (to `/change-password` instead when `must_change_password` is set). On failure:
increment the counter and return a generic "invalid email or password" (no
user-enumeration); lock for 15 min after 5 failures.
- `GET /logout` → clear cookie → `/login`.
- `GET/POST /change-password` → `templates/change_password.html`.
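
The lockout bookkeeping in `POST /login` might be sketched as below. `OperatorRow` and `record_attempt` are hypothetical names, and resetting the counter when a lock is applied is an assumption this spec does not settle.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_FAILS = 5
LOCK_FOR = timedelta(minutes=15)


@dataclass
class OperatorRow:
    # Hypothetical stand-in for the OperatorUser model.
    failed_login_count: int = 0
    locked_until: Optional[datetime] = None
    last_login_at: Optional[datetime] = None


def record_attempt(user: OperatorRow, password_ok: bool, now: datetime) -> bool:
    """Update lockout state for one attempt; return True only if login may proceed."""
    if user.locked_until is not None and user.locked_until > now:
        return False                           # locked: refuse even a correct password
    if password_ok:
        user.failed_login_count = 0
        user.locked_until = None
        user.last_login_at = now
        return True
    user.failed_login_count += 1
    if user.failed_login_count >= MAX_FAILS:
        user.locked_until = now + LOCK_FOR
        user.failed_login_count = 0            # assumption: counter restarts after a lock
    return False
```

An unknown email never reaches this function; the route would return the same generic error so the two cases stay indistinguishable to the caller.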
## Error handling
- Wrong email/password → generic message, increment fail count.
- ≥5 fails → "too many attempts, try again in 15 minutes" (`locked_until`).
- No/expired/forged cookie → HTML routes 303→`/login?next=…`; `/api/*` → 401 JSON.
- Disabled / role-changed / password-changed-elsewhere → bounced on next request
(re-validated against the DB every request).
- Superadmin-only route hit by an admin → 403.
## Rollout — the no-self-lockout sequence
1. Ship with `OPERATOR_AUTH_ENABLED=false` (default) → the middleware short-circuits,
app behaves **exactly as today**. Deploying can't break or lock anything.
2. Seed your `superadmin` via `operator_admin.py`.
3. Hit `/login` and confirm you get a session **while the flag is still off** (the
login routes work regardless of the flag).
4. Flip `OPERATOR_AUTH_ENABLED=true` → the gate enforces. Your cookie is valid → you're
in. Anything wrong → flip it back off (instant escape hatch).
5. Create your parents' accounts from `/admin/users` (temp passwords, they change on
first login).
- **Break-glass:** `operator_admin.py reset-password` / `create-superadmin` in the
container; or flag off.
## Testing
Reuses the pytest harness from the portal work (`docker exec … python -m pytest`).
- **Middleware:** flag off → every path passes; flag on → exempt paths + the 3 machine
endpoints pass with no cookie, a gated HTML path 303s to `/login`, a gated `/api/*`
path 401s, `must_change_password` user is routed to `/change-password`.
- **Login:** success sets `tv_session`; wrong password rejected + counts; 5 wrong →
locked (even correct password refused).
- **Roles:** `require_role("superadmin")` route → admin gets 403, superadmin 200.
- **Sessions:** bumping `sessions_valid_from` invalidates an existing cookie.
- **Password:** self-change works + clears `must_change_password`; superadmin reset
sets a new hash + `must_change_password` + returns the raw password once.
- **Machine endpoints:** `/api/series3/heartbeat` etc. still 200 with the gate ON and
no cookie (regression guard so we never silently break the watchers).
## File structure
| File | Responsibility |
|---|---|
| `backend/auth_cookies.py` *(new)* | generic `sign`/`read` + `SECRET_KEY`/`COOKIE_SECURE` |
| `backend/models.py` | add `OperatorUser` |
| `backend/operator_auth.py` *(new)* | `current_operator`, `require_role`, the gate middleware, login/lockout helpers |
| `backend/routers/operator_auth_routes.py` *(new)* | `/login`, `/logout`, `/change-password` |
| `backend/routers/operator_users.py` *(new)* | `/admin/users` page + CRUD (superadmin) |
| `backend/operator_admin.py` *(new)* | seed/break-glass CLI |
| `backend/main.py` | register the gate middleware + routers; `OPERATOR_AUTH_ENABLED` |
| `templates/login.html`, `templates/change_password.html`, `templates/admin/users.html` *(new)* | UI |
## Going to prod
- New table auto-creates; **no migration**. Just code + seeding.
- Set a real `SECRET_KEY` (shared with the portal cookie) and `COOKIE_SECURE=true`
once on HTTPS — same env knobs already wired in `docker-compose.yml`.
- Operator auth is what makes internet-exposing the internal app safe; pair with the
(deferred) office deployment + reverse-proxy/TLS work.
## Security notes
- Deny-by-default; client-supplied ids never trusted; every request re-validates the
session against the DB (instant revoke via `active` / `sessions_valid_from`).
- Passwords argon2-hashed; generic login errors (no user-enumeration); lockout on
brute force; raw temp passwords shown once, never stored or logged.
- Cookies `HttpOnly` + `SameSite=Lax` + `Secure` (on TLS), HMAC-signed with server-side
`iat` expiry.
- **Known residual until deploy:** without TLS the password crosses the wire in
cleartext — fix is the deployment-phase TLS (Synology Let's Encrypt / Cloudflare
Tunnel). The login is still a massive improvement over today's zero-auth exposure.
- TOTP 2FA is the near-term follow-up (superadmin first), especially without the UniFi
edge in front on the home network.