Draft · date TBD

AI-Augmented Research & Development

How I Use AI for Research & Development

A working tour of the patterns I actually use with claude.ai and Claude Code — and the places I'd still improve them. Most of it is model-agnostic.

Victor Weeks · Research Software Engineer
NSF NCAR · Research Applications Laboratory (RAL)

This material is based upon work supported by the NSF National Center for Atmospheric Research, a major facility sponsored by the U.S. National Science Foundation and managed by the University Corporation for Atmospheric Research. This work is also supported by the Better Scientific Software Fellowship Program, funded by the U.S. Department of Energy and National Science Foundation.

The top row is the whole story. Your track is just how deep you stop.

🔬 Scientist → claude.ai

Read & critique papers, run analysis, make figures. Stay on the top row; drop into the L1–L2 dives.

⚙️ Engineer → Claude Code

Multi-file changes, tests, agent orchestration in your repo. Press DOWN into the L3 floors.

Shared core (everyone)

Data & Terms of Use · tool security · model & effort · cost.

Press DOWN = more technical: L1 concept — what & why L2 practice — how, recipes L3 expert — config, code

Stay on the top row and you miss nothing required. The vertical descents are optional — take them on demand or save them for Q&A. newcomer & skeptic sheet

Tell each half of the room which slides are "theirs," and that the down-arrow lets either go deeper on demand. This is the navigation contract for the whole talk: horizontal = shared narrative, vertical = opt-in depth. The slide number itself encodes it — a plain 3 is the spine, 3.2 means you've gone two floors down, and it matches the URL. Reassure scientists explicitly; bait engineers downward. Point first-timers and skeptics at the newcomer-skeptic handout (getting started + when to trust the tool).

The horizontal axis is the story everyone follows. Vertical descents are optional deep dives.
One depth scheme everywhere: L1 concept → L2 how → L3 code/config.
You will not miss anything required by staying on the top row — I'll say so out loud each time.

Track tags appear on every slide: Both Scientist Engineer — so you always know whether a slide is aimed at you.

Act 1

The throughline: small teams of agents

One move underlies everything that follows. Learn it once, recognise it everywhere.

Decompose → assign altitude → make the hand-off concrete

I don't "chat with an AI." I stand up a small, role-specialized team of agents.
Each agent gets a persona (its job, and its named blind spots) and a budget — altitude = which model and how much effort.
Artifacts carry the state between agents: a review doc, an HTML deck, a JSON change-set, a security-warning file.

1 · Decompose

split the task into roles a fresh context can own

2 · Assign altitude

model × effort per role

3 · Concrete hand-off

an inspectable artifact, not a vibe

Most of what follows — tool security, skills, the collab wrapper, model routing — is a variation on this one move.

A sub-agent = a child Claude with its own context window and one scoped job

The diagram below is the actual pipeline that produced this deck's research and review (a DAG — a one-way graph of steps).

Orchestrator

plans & routes

Phase 1 · fan-out

9 discovery + 8 personas
Sonnet · medium

⊟ barrier

wait for all — synthesis can't start early

Phase 2 · synthesize

3 agents · Opus · high

Phase 3–4

build → self-critique

Fan-out turns latency into throughput — conditionally. 9 agents ≈ the time of one only below your usage-bucket ceiling and when the work is truly independent. Above the ceiling, throttling erases the wall-clock win and you still pay the extra tokens.

Isolation buys focus & context separation — not correctness, and not an OS security boundary. You add correctness at the seams (a schema, a critic agent, a human gate).

Personas are cheap reviewers with named blind spots. orchestrator sheet

Separation of concerns: a planner that can't write code can't paper over a bad design with a clever implementation.
A fresh-context reviewer isn't anchored by the author's rationalizations — it reads the work, not the intentions behind it.
Real examples: a two-persona documentation review (domain-expert vs. target-reader); science/engineering/UX review loops baked into a project's docs/phase-N/.

Scientists already know this shape: an independent-then-dependent pipeline of stages.

Pick the roles, set the barriers, choose per-agent model + effort, and define the hand-off artifact.
Plan-then-implement split: a read-only architect emits a file:line blueprint → a fresh implementer executes it and verifies (e.g. runs the tests, or drives Playwright).
Validation lives at the seams — pick at least one: an assertion, a schema check, a critic agent, or a human gate before a result is consumed.

Orchestration buys you parallelism and separation. It does not buy reliability — you add that deliberately.

Today my DAG lives in orchestrator prose. The improvement: promote it to a versioned workflow schema — diffable, re-runnable, with per-agent model + effort pinned.
Pin roles as first-class agents/*.md definitions; publish them as plugin agents so they're portable, not retyped.
Harden the planner with plan mode + a read-only tool allowlist — enforced, not by convention.
A real run: a ~46-agent Sonnet review spread across 44 sessions, kept alive by a heartbeat/cron across the 5-hour window.
Cap reviewer count by marginal yield — a meta-agent that reports "new findings per added reviewer," so you stop when it flattens.

Honest beat: my 14- and 46-agent fan-outs are probably past the point of diminishing returns.

Act 2

Two surfaces: claude.ai vs Claude Code

Where the scientist and the engineer split — and the one question that tells you which to use.

Same model family, very different relationship to your files

	claude.ai (web · scientist home)	Claude Code (CLI · engineer home)
Access	Conversational, self-contained chat	Reads & writes the real files in your repo
Tools	Artifacts you copy out by hand	Runs commands, tests, MCP tools, sub-agents
Autonomy	One turn at a time	Iterates on its own; can be scheduled or driven remotely (start a run, check it from your phone)
Verification	Your eyeballs	The agent runs the tests / Playwright for real
Cost shape	Lower per turn…	…but draws on the same weekly bucket

The tipping point

Am I copy-pasting code or file contents back and forth more than twice? → switch to the CLI.

Scientist, plainly

You mostly don't need the CLI. Its whole payoff is the file/command/repo loop you don't have — if you don't have that loop, web is your home.

Research / explain / brainstorm

→ web chat

One-file change

→ CLI, single agent

Parallel or overnight work

→ CLI workflow with sub-agents

Teams: standardize on git as the shared artifact — web explorers and CLI engineers then converge in one history instead of two parallel worlds.

I used to set response styles on claude.ai (Normal / Concise / Explanatory / Formal + custom) from the composer's "Use style" menu.
I can no longer reliably find it — at least not with Opus.
Honest read: it appears to be relocating/merging into Skills, not disappearing — but I'd verify live in my own account before relying on it.

This is a teaching moment, not a complaint: features move. Build habits that survive the UI changing under you.

Web baseline

chat · copy-out artifacts · your eyeballs

Claude Code adds

filesystem read/edit · run tools & tests · MCP · plugins · hooks · sub-agents · scheduled + remote-controlled runs · self-verification via Playwright

The styles knob's spiritual successor in the CLI is declarative: CLAUDE.md + skills (including an output-style skill) shape response form. (The standalone "output styles" knob has been in flux and looks to be folding into skills — ⚠ verify live.)
Verification is the real divide: web = a human reads it; CLI = the agent runs it in a real environment.

More context isn't more signal — curate the window like you curate an experiment

`/context` — see what fills the window

A colored map of what's consuming context: history, open files, CLAUDE.md, memory, skill + MCP definitions. The honest answer to "why is it drifting?"

`/btw` — ask an aside, for free

Spins up an ephemeral agent for a quick question (a commit SHA, a flag) — the answer is not kept in history, so the main thread stays lean. One reply, no tools.

`/compact` — shrink & stay

summarize & keep going — same session, smaller. CLAUDE.md + memory re-inject automatically.

`/clear` — fresh start

new session; the old one stays in /resume. Project files & memory persist; the chat history doesn't.

`/rewind` — undo a turn

roll code + conversation back to a checkpoint — not just the text on screen.

Bonus keystroke · Ctrl+S stashes your draft: half-written a prompt but need to run a command or ask a /btw first? Set the draft aside, handle the side-thing, then bring it back — no retyping. One draft held; a hint shows it's waiting.

Rule of thumb: /compact when the thread is long but still on-task · /clear when you switch tasks · /btw for a one-off you don't want to remember. Auto-compact fires near the limit — but a deliberate reset beats a forced one. engineer / RSE sheet

This is the practical complement to the "context bloat" challenge. The durable lesson for the whole room: a long context isn't a smarter one — buried details get out-weighted and quality drifts, so reset deliberately. Map the four moves to intent: /context = diagnose, /compact = stay-and-shrink, /clear = switch tasks, /btw = aside-without-cost, /rewind = undo, Ctrl+S = stash a half-written prompt while you run a quick side-task. /btw and the Ctrl+S stash are both genuinely underused — they keep tactical lookups and abandoned-then-resumed drafts out of the working thread. Scientists: this is the same instinct as not re-running an entire notebook to answer one question. ⚠ the stash keystroke and the slash-command set were verified against the installed build (v2.1.x) — but they're version-volatile, so confirm with /help live. (Note: this "stash" is the prompt stash, distinct from the git working-tree auto-stash Claude uses internally for "teleport.")

claude --worktree <name> (-w) spins up an isolated git checkout at .claude/worktrees/<name>/ — run several agents in parallel without them stepping on each other's files.
The /clear-safety question you asked for: if you wipe context mid-worktree, how does the agent remember it's on one? It doesn't need a special marker — the working directory persists across a clear, and the worktree's CLAUDE.md reloads every session. Location is the memory.
Pair with /rewind (checkpoints) and /resume (its picker labels the worktree) so a long, sandboxed job is restartable, not all-or-nothing — the safe shape for Act 1's overnight runs.

Transfers: Codex CLI has its own /clear and /compact, and uses AGENTS.md where Claude Code uses CLAUDE.md. /btw, /rewind, and the Ctrl+S prompt-stash look Claude-Code-specific — no Codex analogue found — ⚠ verify before assuming one.

Act 3

Extending Claude Code

MCP, skills, plugins — where the CLI stops being a chat box and becomes a platform. Security is the load-bearing slide.

MCP = Model Context Protocol — the model calls a typed tool against the live system

With MCP

model → typed tool call → live system → a real, current result

Without

model → guesses from training data → plausible, maybe stale or wrong

That's the gap between a plausible answer and a correct one — it matters when a scientist acts on the output.
My setup: 6 plugin MCP servers (Cloudflare API / bindings / docs / observability / builds, + Playwright), activated per-session via /mcp rather than left always-on.

Read this before you connect one: a local (stdio) MCP server runs as you, with your full OS permissions. Security is the next slide. security & governance sheet

Surface 1 · the model

can be tricked into calling a tool — prompt injection, including text that arrives inside a tool's own output ("ignore previous instructions…").

Surface 2 · the tool

can itself be malicious — a supply-chain risk, like an unreviewed dependency or a hostile browser extension.

"Trust the model" is not a security boundary. Treat MCP tool output as untrusted input.

Mitigations, cheapest & highest-value first

first-party servers · pin versions · read the source

scope to a project dir

no creds in the inherited env

prefer read-only tools

human approval for write/exec

keep an allowlist

Project-scoped .mcp.json is the secure default — a deck-building project simply has no Cloudflare API surface attached.
Least privilege limits the blast radius — what a manipulated model or a bad tool can actually reach.

security-guidance@claude-plugins-official — layer 1 observed directly; layers 2–3 inferred from behavior

① instant regex pattern warnings on Edit/Write (~25 patterns)

② appears to run a model-backed diff review each turn, before you see the reply

③ appears to run a commit-time review that reasons across files on git commit

Real catches: an innerHTML XSS in a weather-avoidance tool; a newline-injection where attacker-controllable JSON reached a workflow comment and injected an executed directive.
High-recall, low-precision by design: it false-fired on the literal string torch.load(weights_only=False) sitting in prose — which is exactly why layers 2 & 3 exist.
Improvement: promote the commit review into a PreToolUse / commit hook that BLOCKS high-severity findings; curate the regex set per project to fight alarm fatigue.

A packaged, versioned, auto-discovered prompt (+ optional scripts) loaded only when it's triggered

~15 installed (point-in-time count from a ~/.claude scan); several are retrieval-first — notably the Cloudflare-platform skills.
Why retrieval-first matters: the Cloudflare API changes faster than the model's training, so the skill fetches current docs instead of trusting memory — fixing the failure mode of a confident, stale answer.
In-house example: I authored ncar-brand-toolkit@local — SVG waves, logo lookup, brand & accessibility rules. Institutional knowledge → an executable capability.

engineer / RSE sheet

A skill is a packaged prompt. For a one-off, a well-crafted prompt is equivalent — and lower overhead.
Heuristic: prompt for personal/once, skill for shared/repeated/governed. Don't add the abstraction until duplication actually hurts.
Solo grad student → keep a prompt library in a repo. Team / RSE → the skill earns its keep as shared, maintained machinery.

A skill only helps if it fires — write trigger-quality descriptions, and use skill-creator's eval harness to measure & tune trigger accuracy.
A thin "house style" output skill can make HTML-artifact-first the default (see the Artifacts slide) instead of asking for it every time.
Right-size the catalog: ~11 Cloudflare skills loaded globally is dead weight on non-Cloudflare work — scope them per project.

How a team standardizes practice instead of re-teaching every engineer

One plugin =

skills

agents

commands

hooks

.mcp.json

One installable, versioned unit — the lever is shared governance, not the install count (~22 installed, point-in-time).
Improvement: publish my recurring roles — the two-persona reviewer, the sci/eng/UX loop — as plugin agents, not just docs. One command, reusable, demoable.

What goes in: agents/*.md, skills, slash-commands, hooks, .mcp.json.
Versioning + distribution = team governance: control which servers and skills are configured, centrally — ties straight to the governance discussion in Act 7.

This is the lever that makes my signature methods reproducible by someone who isn't me.

Act 4

Tuning the dials

Model & effort, artifact-driven output, and the one technique I most want to ship.

Task type	Model	Effort
Discovery · grep · triage	Haiku → Sonnet	low–medium
Codegen · refactor · drafting	Sonnet	medium
Boilerplate · retrieval	Sonnet	low
Synthesis · architecture · judgment	Opus	high
Multi-step physics · legacy ports	Opus	high–xhigh

Default: Sonnet + medium. Escalate only when the first pass is wrong, or when being wrong is expensive.

Keep two Opus ratios distinct: ~5× Sonnet per API token (the price ratio) ≠ the ~10–12× weekly-hours gap on a subscription (Opus burns more thinking tokens against a tighter cap). ⚠ both move — verify live.

The low / medium / high / xhigh ladder and /effort are CLI / API concepts; claude.ai is coarser — a model picker + an extended-thinking toggle.

Same instinct as not running a coarse sensitivity sweep at full production resolution.
Recall problems (discovery, grep, triage) → Sonnet/medium, run many in parallel.
Precision problems (synthesis, architecture, judgment) → Opus/high, where one bad call poisons everything downstream.

Scientists get this immediately through the grid-resolution analogy: match the resolution to the question.

My globals are maxed: effortLevel: xhigh, alwaysThinkingEnabled: true. Haiku never appears.
The precise gap — and it's narrower than it looks: my documented discovery sub-agents are already correctly tiered (Sonnet/medium). The two real leaks are exactly (a) no Haiku tier at all and (b) a maxed main-loop / orchestrator default.
Fix: set a modest main-loop default, add a Haiku scout tier (Haiku scout → Sonnet worker → Opus judge), raise /effort only where precision pays.
Demonstrate a deliberate low-effort run so the room sees where it's plenty — calibration beats "always max."

skipWorkflowUsageWarning: true only suppresses the one-time pre-launch dialog for multi-agent runs — it is not a quota meter and says nothing about being near a limit. Easy to misread.

For a dense answer, ask for a self-contained HTML artifact instead of a wall of prose

A dense answer becomes something you can act on: scrub an animation, sort a table, toggle a parameter — you build intuition far faster than by reading linearly.
SVG is resolution-independent and diffable as text, and the model authors it directly.

claude.ai · Artifacts

a sandboxed, size-limited preview pane — great for quick explainers you copy out.

Claude Code · on disk

writes a self-contained file to disk — no size sandbox. Opens offline, embeds in a deck, attaches to a PR.

Live, in this slide → fan-out width: 3 agents

That slider is the technique demonstrating itself. Every diagram in this deck is the same pattern. scientist sheet

A 9-section path-finding explainer: embedded SVG + an interactive heading-rotation toy + a side-by-side search animation.
A per-PR HTML review aid that renders the diff with margin annotations and severity colors.
Reveal.js slides that embed live artifacts — like the one you're in.
A self-contained flat-vector explainer comic: The Collab Loop.

Template the recurring genres (explainer, PR-review aid) as skills or frontend-design templates so they're one ask, not a rebuild.

Mind the weight: a 17.6 MB single-file Leaflet viewer (35 flight×CDO combos pre-serialized) is portable but a perf/review liability. Note: that's a Claude Code on-disk file — it would blow past claude.ai's Artifact size limit. Offer lazy-load / split-data variants.
Artifacts are a security surface: the innerHTML XSS catch from Act 3 lived inside one. Pair "request an artifact" with "use safe DOM construction."

A convenience pattern must not quietly teach an unsafe one.

An idea I have a design for, not a tool I've built — here's exactly what it is and why I want it

① click any element

② comment / edit its text

③ export one JSON change-set

④ one batch follow-up prompt

⑤ all edits applied as one diff

It fixes the pointing problem — "make that heading smaller — no, the other one" costs several exchanges in prose; a click makes it one.
It would collapse N round-trips into one structured, batched pass, and the JSON is an audit trail of exactly what changed.

Honest status: precisely articulated, not yet built — the worktree is near-empty. This is the thing I most want to ship. orchestrator sheet

Pointing: let the human click instead of describe → a fuzzy spatial request becomes a machine-addressable, element-anchored change.
Batching: one context-rich pass over 20 edits is cheaper and more coherent than 20 round-trips.
It's a lightweight typed protocol for design feedback — like MCP, but human → agent.

For a scientist that's quota saved; for a reviewer it's an archived change-set — the JSON is the record.

Robustness: positional references break if the page is reordered between export and apply → anchor to stable IDs and validate before committing.
Audit trail: the JSON archives next to the committed diff — a precise record of what changed and why.
Hygiene: for unpublished work, remember the changed content round-trips through the model.
Accessibility: using a provided wrapper needs no coding; building one does — presenter ships the wrapper, the scientist just clicks and exports.

Anchor each comment to a stable selector / data-collab-id injected at wrap time, so edits survive a DOM re-render.
Develop in an isolated worktree so the wrapper iterates without touching the deck it wraps.
Make the JSON a versioned schema, validated on apply, with a dry-run / diff preview before changes land.
Alternative ingest: a Playwright / claude-in-chrome path that reads the annotated DOM directly.

the payoff: wrap THIS deck with the tool and iterate live on stage

Act 5

Economics & honest limits

What it costs, how the buckets work, and the data question to settle before you connect anything.

Modality	The trade	Relative tokens
a · copy/paste chat	max human effort, can't verify, lowest risk	low
b · inline autocomplete	external tool, not Claude Code — Copilot / Cursor Tab / Windsurf; high frequency	moderate
c · CLI single-agent	one rich context, many tool-call turns	mod–high
d · CLI workflow + sub-agents	min labor, max capability, hardest to inspect	highest

Decision rule: research → web chat (a); one-file change → (c); parallel/overnight → (d). I live in (d); tiering models inside it is what keeps it affordable.

LSP (pyright/clangd/gopls) is a different thing entirely: deterministic, zero-token code intelligence for the agent (go-to-def, types, diagnostics) — not a model-based completion modality. (b) isn't a Claude Code modality at all.

(a) low–moderate per turn, but re-pastes context every turn (waste) + high human cost.
(b) low per suggestion, high frequency → moderate aggregate.
(c) moderate–high: one rich context, many tool-call turns.
(d) highest: N concurrent contexts × extended thinking — and the biggest lever you control is Haiku/Sonnet/Opus tiering.

Silent budget killers (Q7): a long context re-sent every turn, and orchestration overhead you can't see. The fix is per-workflow token/hour telemetry — see Act 7.

⚠ Every figure here is a placeholder — verify live before you present. Teach the framework, not the number; prefer a live /status check to a quoted limit.

	Claude	Codex (OpenAI)	Antigravity (Google)
Short window	~5-hr rolling	5-hr message ranges	agent-request based
Long window	weekly + a separate Opus cap	credit rate card	~tripled at I/O 2026
Metering	tokens × premium × effort	tokens (since 2026-04-02)	requests
Training default	consumer opt-out ⚠	verify	verify
To use Claude models	—	—	needs your own Anthropic API key (a prerequisite, not an overage valve)

What it costs (⚠ verify live)

Pro $20/mo · Max 5× $100 · Max 20× $200. On a subscription, Opus drains the weekly budget ~10–12× faster than Sonnet (≈480 Sonnet vs ≈40 Opus hrs on Max 20× — verify). Claude shares one bucket across Claude Code + claude.ai + Cowork.

Bucket 1 · ~5-hour rolling

the burst limit most interactive users hit first.

Bucket 2 · weekly cap

rolling 7-day — Claude has two: overall + a separate Opus cap.

"Suddenly less helpful mid-conversation" = you hit a window. Design long jobs to be resumable, not all-or-nothing. Check it in-product: /status.

Claude: Pro $20 / Max 5× $100 / Max 20× $200; weekly published as expected hours (Max20× ≈ 240–480 Sonnet / 24–40 Opus hrs ⚠). 5-hour limits doubled in 2026.
Codex: token-based since 2026-04-02; per-5-hour message ranges + a credit rate card per 1M tokens; images drain several× faster; buy extra credits at the cap.
Antigravity: free public preview, agent-request based, limits ~tripled at I/O 2026; using Claude models reportedly needs your own Anthropic API key ⚠.

Every number gets a source + an access date. The point is that the checklist exists, not the digits.

Add per-workflow token/hour telemetry: capture real consumption per run (tokens/hours by model tier) and show a measured cost breakdown — e.g. of building this very deck.
Adopt the Haiku tier as the single budget move that most extends the weekly cap.

A measured cost breakdown beats any vendor headline number. (skipWorkflowUsageWarning is unrelated — it hides a dialog, not a quota.)

Consumer claude.ai (Free/Pro/Max)

Trains on your conversations by default — you must opt out.

Commercial / API / Team / Enterprise / Gov / Edu

Not trained by default · ~30-day retention · zero-retention agreements available.

⚠ Safety-review carve-out: flagged conversations may still be used regardless of your opt-out — verify the live wording.

Decide your trust boundary first. Document which plan/surface your group is approved to use. Claude Code on a commercial/API plan inherits the no-train posture.
Never paste into a consumer free-tier chat: unpublished or embargoed results, export-controlled output, or partner/NDA data.

The consumer vs commercial split is the whole game: consumer trains-by-default (opt out); commercial/API/Team/Enterprise/Gov/Edu don't, with ~30-day retention and zero-retention options.
Decide the trust boundary before connecting anything; document the approved plan; never paste restricted data into a consumer free-tier chat.
Subscriptions are personal-use seats with fair-use limits — automation/fleets likely belong on API/Enterprise terms.

Act 6

The deck as the demo

The thing you've been navigating all talk is itself the worked example.

This is Reveal.js 6: vertical sub-slides (the 3.2 you see bottom-right = URL #/3/2), live data-background-iframe content, and a self-contained offline build.
A horizontal spine everyone follows, with optional vertical descents that get more technical the deeper you press. You've been using the convention all talk — that is the artifact.
PDF handout via decktape (Reveal-aware, slide-by-slide) — which solves the live-iframe → PDF problem a screenshot can't.

Cover

Track
▾ L1

One move

Agents
▾ L1·L2·L3

Web/CLI
▾ L1·L2·L3

MCP
▾ L1·L2·L3

Skills
▾ L1·L3

…

horizontal spine → · press ↓ for depth

Honest scope: the research, persona review, and outline behind this deck were produced by the multi-agent workflow; these Reveal.js slides I assembled by hand from that material.

Offline-first: vendored Reveal.js, self-hosted fonts, no CDN at runtime → opens from a file or any static host.
decktape → a PDF handout that captures even live-iframe slides; verify slide-by-slide via Playwright.
Apply the ncar-brand-toolkit theme so brand & accessibility (contrast, logo, waves) are inherited, not hand-applied each time.
Promote the PDF build into CI / a hook so the live deck and the handout never drift apart.

Act 7

Honesty & close

What AI gets confidently wrong, what transfers, how I'd improve — and the evidence.

Verification

Code can be fluently, confidently wrong — a swapped axis, a wrong unit conversion, a plausible-but-wrong normalization that runs cleanly. Test units/ranges/conservation; check a known-good baseline; have it cite the formula; diff everything — never one check alone.

Reproducibility

Reproduce the checked-in artifact + environment, not the conversation. Commit code + a provenance record: model+version, effort, prompt, tools, date, diff.

Air-gap

There is no fully air-gapped Claude. If your data needs a true air gap, Claude Code is not your tool — full stop. Bedrock/Vertex give residency + no-train + BAA for restricted-but-not-air-gapped data.

Attribution

AI is not an author. Disclose the tool + model+version + what it did. Over-disclose. Keep prompt + diff provenance.

Treat AI output like a capable-but-unfamiliar collaborator's PR. Risk is highest where supervision is lowest — normalize verify-then-trust.

Methods-section recipe: cite model+version, state that AI assisted codegen, archive the prompt + diff in the repo/supplement.
Pin the model version explicitly (not "latest"); treat the prompt as a documented input.
Integrity line for students: the violation is undisclosed substitution of AI work for your own judgment — not use of the tool.

Route through Bedrock/Vertex for residency; sign enterprise agreements (BAA/DPA) for retention & jurisdiction; run least-privilege (scoped working dir, rootless container, no secrets in env); capture session logs for SIEM; control MCP config centrally.
Bound unattended autonomy: restrictive permission mode + tool allowlist + worktree sandbox + PreToolUse block-hooks — so "running while I sleep" is safe by construction.
Checkpoint long jobs to committed files, not session state — a hard restart then loses minutes, not hours.

If you're not a Claude user: the patterns are the lesson, the tool is just the vehicle

Transferable — most of the craft

Decomposition · role/persona design · context-setting · verify/critic loops · artifact-driven output · batching edits · treating output as reviewable.

Non-transferable plumbing

CLI/sub-agent mechanics · skill & plugin format · .mcp.json wiring · model names · effort knobs · prompt-dialect quirks.

Moving to Codex / Gemini / Antigravity: the concepts survive, the integration glue gets rewritten. Keep prompts declarative; treat vendor wiring as replaceable adapters. (I haven't personally used the others — but the patterns are the point.)

prompting-patterns sheet — the transferable prompt craft

Today	Improved
DAG lives in orchestrator prose	Versioned workflow schema + published plugin agents
Main-loop default maxed; no Haiku tier (sub-agents are already Sonnet/medium)	Modest main-loop default + a Haiku scout tier + token/hour telemetry
Collab wrapper is an idea, no artifact	Shipped — stable IDs, validated schema, diff-preview
Unattended autonomy lightly bounded	Least-privilege + worktree sandbox + PreToolUse block-hooks

Plus: validation at the seams — there's no built-in inter-agent correctness guarantee, so I add one.

These are discipline & economy gaps, not capability gaps. I'd genuinely like the room's better practices.

thank you

Recap & handouts

Decompose → assign altitude → concrete, inspectable hand-offs

That's the whole spine. Everything else was a variation on it. The talk was built around 13 high-return questions from a persona panel — they're the backbone of Q&A, so please ask them.

Scientist — research & artifacts Engineer / RSE — skills & CLI Security & governance Orchestrator — workflows Prompting patterns Newcomer & skeptic PI / manager — ROI & governance all handouts →

Victor Weeks · Research Software Engineer · NSF NCAR / RAL