NSF NCAR — Research Applications Laboratory Draft · date TBD

AI-Augmented Research & Development

How I Use AI for
Research & Development

A working tour of the patterns I actually use with claude.ai and Claude Code — and the places I'd still improve them. Most of it is model-agnostic.

Victor Weeks  ·  Research Software Engineer
NSF NCAR  ·  Research Applications Laboratory (RAL)

This material is based upon work supported by the NSF National Center for Atmospheric Research, a major facility sponsored by the U.S. National Science Foundation and managed by the University Corporation for Atmospheric Research. This work is also supported by the Better Scientific Software Fellowship Program, funded by the U.S. Department of Energy and National Science Foundation.  #NSFfunded

Both

Choose your track

The top row is the whole story. Your track is just how deep you stop.

🔬 Scientist → claude.ai

Read & critique papers, run analysis, make figures. Stay on the top row; drop into the L1–L2 dives.

⚙️ Engineer → Claude Code

Multi-file changes, tests, agent orchestration in your repo. Press DOWN into the L3 floors.

Shared core (everyone)

Data & Terms of Use · tool security · model & effort · cost.

Press DOWN = more technical: L1 concept — what & why L2 practice — how, recipes L3 expert — config, code

Stay on the top row and you miss nothing required. The vertical descents are optional — take them on demand or save them for Q&A.   newcomer & skeptic sheet

L1L2L3
L1 · concept

How to read this deck

  • The horizontal axis is the story everyone follows. Vertical descents are optional deep dives.
  • One depth scheme everywhere: L1 concept → L2 how → L3 code/config.
  • You will not miss anything required by staying on the top row — I'll say so out loud each time.

Track tags appear on every slide: Both Scientist Engineer — so you always know whether a slide is aimed at you.

Act 1

The throughline: small teams of agents

One move underlies everything that follows. Learn it once, recognise it everywhere.

Both

The one move

Decompose → assign altitude → make the hand-off concrete

  • I don't "chat with an AI." I stand up a small, role-specialized team of agents.
  • Each agent gets a persona (its job, and its named blind spots) and a budgetaltitude = which model and how much effort.
  • Artifacts carry the state between agents: a review doc, an HTML deck, a JSON change-set, a security-warning file.

1 · Decompose

split the task into roles a fresh context can own

2 · Assign altitude

model × effort per role

3 · Concrete hand-off

an inspectable artifact, not a vibe

Most of what follows — tool security, skills, the collab wrapper, model routing — is a variation on this one move.

Both ↻ this deck

Sub-agents, workflows & personas

A sub-agent = a child Claude with its own context window and one scoped job

The diagram below is the actual pipeline that produced this deck's research and review (a DAG — a one-way graph of steps).

Orchestrator

plans & routes

Phase 1 · fan-out

9 discovery + 8 personas
Sonnet · medium

⊟ barrier

wait for all — synthesis can't start early

Phase 2 · synthesize

3 agents · Opus · high

Phase 3–4

build → self-critique

Fan-out turns latency into throughput — conditionally. 9 agents ≈ the time of one only below your usage-bucket ceiling and when the work is truly independent. Above the ceiling, throttling erases the wall-clock win and you still pay the extra tokens.
Isolation buys focus & context separation — not correctness, and not an OS security boundary. You add correctness at the seams (a schema, a critic agent, a human gate).

Personas are cheap reviewers with named blind spots.   orchestrator sheet

L1L2L3
L1 · concept

Why role separation works

  • Separation of concerns: a planner that can't write code can't paper over a bad design with a clever implementation.
  • A fresh-context reviewer isn't anchored by the author's rationalizations — it reads the work, not the intentions behind it.
  • Real examples: a two-persona documentation review (domain-expert vs. target-reader); science/engineering/UX review loops baked into a project's docs/phase-N/.

Scientists already know this shape: an independent-then-dependent pipeline of stages.

L1L2L3
L2 · practice

Designing a workflow in practice

  • Pick the roles, set the barriers, choose per-agent model + effort, and define the hand-off artifact.
  • Plan-then-implement split: a read-only architect emits a file:line blueprint → a fresh implementer executes it and verifies (e.g. runs the tests, or drives Playwright).
  • Validation lives at the seams — pick at least one: an assertion, a schema check, a critic agent, or a human gate before a result is consumed.
Orchestration buys you parallelism and separation. It does not buy reliability — you add that deliberately.
Engineer
L1L2L3
L3 · expert

Declarative workflows & the real overnight job

  • Today my DAG lives in orchestrator prose. The improvement: promote it to a versioned workflow schema — diffable, re-runnable, with per-agent model + effort pinned.
  • Pin roles as first-class agents/*.md definitions; publish them as plugin agents so they're portable, not retyped.
  • Harden the planner with plan mode + a read-only tool allowlist — enforced, not by convention.
  • A real run: a ~46-agent Sonnet review spread across 44 sessions, kept alive by a heartbeat/cron across the 5-hour window.
  • Cap reviewer count by marginal yield — a meta-agent that reports "new findings per added reviewer," so you stop when it flattens.

Honest beat: my 14- and 46-agent fan-outs are probably past the point of diminishing returns.

Act 2

Two surfaces: claude.ai vs Claude Code

Where the scientist and the engineer split — and the one question that tells you which to use.

Both

claude.ai vs Claude Code

Same model family, very different relationship to your files

claude.ai  (web · scientist home)Claude Code  (CLI · engineer home)
AccessConversational, self-contained chatReads & writes the real files in your repo
ToolsArtifacts you copy out by handRuns commands, tests, MCP tools, sub-agents
AutonomyOne turn at a timeIterates on its own; can be scheduled or driven remotely (start a run, check it from your phone)
VerificationYour eyeballsThe agent runs the tests / Playwright for real
Cost shapeLower per turn……but draws on the same weekly bucket

The tipping point

Am I copy-pasting code or file contents back and forth more than twice? → switch to the CLI.

Scientist, plainly

You mostly don't need the CLI. Its whole payoff is the file/command/repo loop you don't have — if you don't have that loop, web is your home.

L1L2L3
L1 · concept

When to use which

Research / explain / brainstorm

→ web chat

One-file change

→ CLI, single agent

Parallel or overnight work

→ CLI workflow with sub-agents

Teams: standardize on git as the shared artifact — web explorers and CLI engineers then converge in one history instead of two parallel worlds.

Scientist
L1L2L3
L2 · practice

A "things move fast" aside: response styles

  • I used to set response styles on claude.ai (Normal / Concise / Explanatory / Formal + custom) from the composer's "Use style" menu.
  • I can no longer reliably find it — at least not with Opus.
  • Honest read: it appears to be relocating/merging into Skills, not disappearing — but I'd verify live in my own account before relying on it.
This is a teaching moment, not a complaint: features move. Build habits that survive the UI changing under you.
Engineer
L1L2L3
L3 · expert

What Claude Code adds over the web

Web baseline

chat · copy-out artifacts · your eyeballs

Claude Code adds

filesystem read/edit · run tools & tests · MCP · plugins · hooks · sub-agents · scheduled + remote-controlled runs · self-verification via Playwright

  • The styles knob's spiritual successor in the CLI is declarative: CLAUDE.md + skills (including an output-style skill) shape response form. (The standalone "output styles" knob has been in flux and looks to be folding into skills — ⚠ verify live.)
  • Verification is the real divide: web = a human reads it; CLI = the agent runs it in a real environment.

Act 3

Extending Claude Code

MCP, skills, plugins — where the CLI stops being a chat box and becomes a platform. Security is the load-bearing slide.

Both

MCP servers: typed tools, not guesswork

MCP = Model Context Protocol — the model calls a typed tool against the live system

With MCP

model → typed tool call → live system → a real, current result

Without

model → guesses from training data → plausible, maybe stale or wrong

  • That's the gap between a plausible answer and a correct one — it matters when a scientist acts on the output.
  • My setup: 6 plugin MCP servers (Cloudflare API / bindings / docs / observability / builds, + Playwright), activated per-session via /mcp rather than left always-on.
Read this before you connect one: a local (stdio) MCP server runs as you, with your full OS permissions. Security is the next slide.   security & governance sheet
L1L2L3
L1 · concept

The MCP threat model

Surface 1 · the model

can be tricked into calling a tool — prompt injection, including text that arrives inside a tool's own output ("ignore previous instructions…").

Surface 2 · the tool

can itself be malicious — a supply-chain risk, like an unreviewed dependency or a hostile browser extension.

"Trust the model" is not a security boundary. Treat MCP tool output as untrusted input.

L1L2L3
L2 · practice

Using MCP securely: least privilege

Mitigations, cheapest & highest-value first

first-party servers · pin versions · read the source
scope to a project dir
no creds in the inherited env
prefer read-only tools
human approval for write/exec
keep an allowlist
  • Project-scoped .mcp.json is the secure default — a deck-building project simply has no Cloudflare API surface attached.
  • Least privilege limits the blast radius — what a manipulated model or a bad tool can actually reach.
Engineer
L1L2L3
L3 · expert

Defense in depth: the security plugin

security-guidance@claude-plugins-official — layer 1 observed directly; layers 2–3 inferred from behavior

① instant regex pattern warnings on Edit/Write (~25 patterns)
appears to run a model-backed diff review each turn, before you see the reply
appears to run a commit-time review that reasons across files on git commit
  • Real catches: an innerHTML XSS in a weather-avoidance tool; a newline-injection where attacker-controllable JSON reached a workflow comment and injected an executed directive.
  • High-recall, low-precision by design: it false-fired on the literal string torch.load(weights_only=False) sitting in prose — which is exactly why layers 2 & 3 exist.
  • Improvement: promote the commit review into a PreToolUse / commit hook that BLOCKS high-severity findings; curate the regex set per project to fight alarm fatigue.
Both

Skills: progressive disclosure

A packaged, versioned, auto-discovered prompt (+ optional scripts) loaded only when it's triggered

  • ~15 installed (point-in-time count from a ~/.claude scan); several are retrieval-first — notably the Cloudflare-platform skills.
  • Why retrieval-first matters: the Cloudflare API changes faster than the model's training, so the skill fetches current docs instead of trusting memory — fixing the failure mode of a confident, stale answer.
  • In-house example: I authored ncar-brand-toolkit@local — SVG waves, logo lookup, brand & accessibility rules. Institutional knowledge → an executable capability.

engineer / RSE sheet

L1L2L3
L1 · concept

Skill vs a good prompt

  • A skill is a packaged prompt. For a one-off, a well-crafted prompt is equivalent — and lower overhead.
  • Heuristic: prompt for personal/once, skill for shared/repeated/governed. Don't add the abstraction until duplication actually hurts.
  • Solo grad student → keep a prompt library in a repo. Team / RSE → the skill earns its keep as shared, maintained machinery.
Engineer
L1L2L3
L3 · expert

Authoring & triggering skills

  • A skill only helps if it fires — write trigger-quality descriptions, and use skill-creator's eval harness to measure & tune trigger accuracy.
  • A thin "house style" output skill can make HTML-artifact-first the default (see the Artifacts slide) instead of asking for it every time.
  • Right-size the catalog: ~11 Cloudflare skills loaded globally is dead weight on non-Cloudflare work — scope them per project.
Both

Plugins: one versioned container

How a team standardizes practice instead of re-teaching every engineer

One plugin =

skills
agents
commands
hooks
.mcp.json
  • One installable, versioned unit — the lever is shared governance, not the install count (~22 installed, point-in-time).
  • Improvement: publish my recurring roles — the two-persona reviewer, the sci/eng/UX loop — as plugin agents, not just docs. One command, reusable, demoable.
Engineer
L1L2L3
L3 · expert

Anatomy of a plugin

  • What goes in: agents/*.md, skills, slash-commands, hooks, .mcp.json.
  • Versioning + distribution = team governance: control which servers and skills are configured, centrally — ties straight to the governance discussion in Act 7.

This is the lever that makes my signature methods reproducible by someone who isn't me.

Act 4

Tuning the dials

Model & effort, artifact-driven output, and the one technique I most want to ship.

Both

Model & effort: route, don't max

Task typeModelEffort
Discovery · grep · triageHaiku → Sonnetlow–medium
Codegen · refactor · draftingSonnetmedium
Boilerplate · retrievalSonnetlow
Synthesis · architecture · judgmentOpushigh
Multi-step physics · legacy portsOpushigh–xhigh

Default: Sonnet + medium. Escalate only when the first pass is wrong, or when being wrong is expensive.

Keep two Opus ratios distinct: ~5× Sonnet per API token (the price ratio) ≠ the ~10–12× weekly-hours gap on a subscription (Opus burns more thinking tokens against a tighter cap). ⚠ both move — verify live.

The low / medium / high / xhigh ladder and /effort are CLI / API concepts; claude.ai is coarser — a model picker + an extended-thinking toggle.

L1L2L3
L1 · concept

Cost/quality routing

  • Same instinct as not running a coarse sensitivity sweep at full production resolution.
  • Recall problems (discovery, grep, triage) → Sonnet/medium, run many in parallel.
  • Precision problems (synthesis, architecture, judgment) → Opus/high, where one bad call poisons everything downstream.

Scientists get this immediately through the grid-resolution analogy: match the resolution to the question.

Engineer
L1L2L3
L3 · expert

My settings & the honest gap

  • My globals are maxed: effortLevel: xhigh, alwaysThinkingEnabled: true. Haiku never appears.
  • The precise gap — and it's narrower than it looks: my documented discovery sub-agents are already correctly tiered (Sonnet/medium). The two real leaks are exactly (a) no Haiku tier at all and (b) a maxed main-loop / orchestrator default.
  • Fix: set a modest main-loop default, add a Haiku scout tier (Haiku scout → Sonnet worker → Opus judge), raise /effort only where precision pays.
  • Demonstrate a deliberate low-effort run so the room sees where it's plenty — calibration beats "always max."
skipWorkflowUsageWarning: true only suppresses the one-time pre-launch dialog for multi-agent runs — it is not a quota meter and says nothing about being near a limit. Easy to misread.
Both

Artifact-driven interaction: HTML + SVG

For a dense answer, ask for a self-contained HTML artifact instead of a wall of prose

  • A dense answer becomes something you can act on: scrub an animation, sort a table, toggle a parameter — you build intuition far faster than by reading linearly.
  • SVG is resolution-independent and diffable as text, and the model authors it directly.

claude.ai · Artifacts

a sandboxed, size-limited preview pane — great for quick explainers you copy out.

Claude Code · on disk

writes a self-contained file to disk — no size sandbox. Opens offline, embeds in a deck, attaches to a PR.

Live, in this slide → fan-out width: 3 agents

That slider is the technique demonstrating itself. Every diagram in this deck is the same pattern.   scientist sheet

L1L2L3
L2 · practice

Concrete artifact genres

  • A 9-section path-finding explainer: embedded SVG + an interactive heading-rotation toy + a side-by-side search animation.
  • A per-PR HTML review aid that renders the diff with margin annotations and severity colors.
  • Reveal.js slides that embed live artifacts — like the one you're in.
  • A self-contained flat-vector explainer comic: The Collab Loop.

Template the recurring genres (explainer, PR-review aid) as skills or frontend-design templates so they're one ask, not a rebuild.

Engineer
L1L2L3
L3 · expert

Weight & safety of artifacts

  • Mind the weight: a 17.6 MB single-file Leaflet viewer (35 flight×CDO combos pre-serialized) is portable but a perf/review liability. Note: that's a Claude Code on-disk file — it would blow past claude.ai's Artifact size limit. Offer lazy-load / split-data variants.
  • Artifacts are a security surface: the innerHTML XSS catch from Act 3 lived inside one. Pair "request an artifact" with "use safe DOM construction."
A convenience pattern must not quietly teach an unsafe one.
Both ★ not yet shipped

The collab wrapper: point, don't describe

An idea I have a design for, not a tool I've built — here's exactly what it is and why I want it

① click any element
② comment / edit its text
③ export one JSON change-set
④ one batch follow-up prompt
⑤ all edits applied as one diff
  • It fixes the pointing problem — "make that heading smaller — no, the other one" costs several exchanges in prose; a click makes it one.
  • It would collapse N round-trips into one structured, batched pass, and the JSON is an audit trail of exactly what changed.
Honest status: precisely articulated, not yet built — the worktree is near-empty. This is the thing I most want to ship.   orchestrator sheet
L1L2L3
L1 · concept

Why it would work: pointing & batching

  • Pointing: let the human click instead of describe → a fuzzy spatial request becomes a machine-addressable, element-anchored change.
  • Batching: one context-rich pass over 20 edits is cheaper and more coherent than 20 round-trips.
  • It's a lightweight typed protocol for design feedback — like MCP, but human → agent.

For a scientist that's quota saved; for a reviewer it's an archived change-set — the JSON is the record.

L1L2L3
L2 · practice

Robustness, audit trail, accessibility

  • Robustness: positional references break if the page is reordered between export and apply → anchor to stable IDs and validate before committing.
  • Audit trail: the JSON archives next to the committed diff — a precise record of what changed and why.
  • Hygiene: for unpublished work, remember the changed content round-trips through the model.
  • Accessibility: using a provided wrapper needs no coding; building one does — presenter ships the wrapper, the scientist just clicks and exports.
Engineer
L1L2L3
L3 · expert

Build plan & schema

  • Anchor each comment to a stable selector / data-collab-id injected at wrap time, so edits survive a DOM re-render.
  • Develop in an isolated worktree so the wrapper iterates without touching the deck it wraps.
  • Make the JSON a versioned schema, validated on apply, with a dry-run / diff preview before changes land.
  • Alternative ingest: a Playwright / claude-in-chrome path that reads the annotated DOM directly.
↻ the payoff: wrap THIS deck with the tool and iterate live on stage

Act 5

Economics & honest limits

What it costs, how the buckets work, and the data question to settle before you connect anything.

Both

Co-development modalities & their token cost

ModalityThe tradeRelative tokens
a · copy/paste chatmax human effort, can't verify, lowest risk low
b · inline autocompleteexternal tool, not Claude Code — Copilot / Cursor Tab / Windsurf; high frequency moderate
c · CLI single-agentone rich context, many tool-call turns mod–high
d · CLI workflow + sub-agentsmin labor, max capability, hardest to inspect highest

Decision rule: research → web chat (a); one-file change → (c); parallel/overnight → (d). I live in (d); tiering models inside it is what keeps it affordable.

LSP (pyright/clangd/gopls) is a different thing entirely: deterministic, zero-token code intelligence for the agent (go-to-def, types, diagnostics) — not a model-based completion modality. (b) isn't a Claude Code modality at all.
L1L2L3
L2 · practice

Token implications per modality

  • (a) low–moderate per turn, but re-pastes context every turn (waste) + high human cost.
  • (b) low per suggestion, high frequency → moderate aggregate.
  • (c) moderate–high: one rich context, many tool-call turns.
  • (d) highest: N concurrent contexts × extended thinking — and the biggest lever you control is Haiku/Sonnet/Opus tiering.

Silent budget killers (Q7): a long context re-sent every turn, and orchestration overhead you can't see. The fix is per-workflow token/hour telemetry — see Act 7.

Both

The cost model — Claude vs Codex vs Antigravity

Every figure here is a placeholder — verify live before you present. Teach the framework, not the number; prefer a live /status check to a quoted limit.
ClaudeCodex (OpenAI)Antigravity (Google)
Short window~5-hr rolling5-hr message rangesagent-request based
Long windowweekly + a separate Opus capcredit rate card~tripled at I/O 2026
Meteringtokens × premium × efforttokens (since 2026-04-02)requests
Training defaultconsumer opt-out ⚠verifyverify
To use Claude modelsneeds your own Anthropic API key (a prerequisite, not an overage valve)

What it costs (⚠ verify live)

Pro $20/mo · Max 5× $100 · Max 20× $200. On a subscription, Opus drains the weekly budget ~10–12× faster than Sonnet (≈480 Sonnet vs ≈40 Opus hrs on Max 20× — verify). Claude shares one bucket across Claude Code + claude.ai + Cowork.

L1L2L3
L1 · concept

Two buckets, everywhere

Bucket 1 · ~5-hour rolling

the burst limit most interactive users hit first.

Bucket 2 · weekly cap

rolling 7-day — Claude has two: overall + a separate Opus cap.

"Suddenly less helpful mid-conversation" = you hit a window. Design long jobs to be resumable, not all-or-nothing. Check it in-product: /status.

L1L2L3
L2 · practice

Vendor specifics, dated

  • Claude: Pro $20 / Max 5× $100 / Max 20× $200; weekly published as expected hours (Max20× ≈ 240–480 Sonnet / 24–40 Opus hrs ⚠). 5-hour limits doubled in 2026.
  • Codex: token-based since 2026-04-02; per-5-hour message ranges + a credit rate card per 1M tokens; images drain several× faster; buy extra credits at the cap.
  • Antigravity: free public preview, agent-request based, limits ~tripled at I/O 2026; using Claude models reportedly needs your own Anthropic API key ⚠.

Every number gets a source + an access date. The point is that the checklist exists, not the digits.

Engineer
L1L2L3
L3 · improvement

Instrument your own usage

  • Add per-workflow token/hour telemetry: capture real consumption per run (tokens/hours by model tier) and show a measured cost breakdown — e.g. of building this very deck.
  • Adopt the Haiku tier as the single budget move that most extends the weekly cap.
A measured cost breakdown beats any vendor headline number. (skipWorkflowUsageWarning is unrelated — it hides a dialog, not a quota.)
Both read this first

Data & Terms of Use: settle this before you connect anything

Consumer claude.ai (Free/Pro/Max)

Trains on your conversations by default — you must opt out.

Commercial / API / Team / Enterprise / Gov / Edu

Not trained by default · ~30-day retention · zero-retention agreements available.

Safety-review carve-out: flagged conversations may still be used regardless of your opt-out — verify the live wording.
  • Decide your trust boundary first. Document which plan/surface your group is approved to use. Claude Code on a commercial/API plan inherits the no-train posture.
  • Never paste into a consumer free-tier chat: unpublished or embargoed results, export-controlled output, or partner/NDA data.
Scientist
L1L2L3
L2 · practice

Terms of Use & data, in practice

  • The consumer vs commercial split is the whole game: consumer trains-by-default (opt out); commercial/API/Team/Enterprise/Gov/Edu don't, with ~30-day retention and zero-retention options.
  • Decide the trust boundary before connecting anything; document the approved plan; never paste restricted data into a consumer free-tier chat.
  • Subscriptions are personal-use seats with fair-use limits — automation/fleets likely belong on API/Enterprise terms.

Act 6

The deck as the demo

The thing you've been navigating all talk is itself the worked example.

Both ↻ this deck

The deck explains itself

  • This is Reveal.js 6: vertical sub-slides (#/4/1, #/4/2), live data-background-iframe content, and a self-contained offline build.
  • A horizontal spine everyone follows, with optional vertical descents that get more technical the deeper you press. You've been using the convention all talk — that is the artifact.
  • PDF handout via decktape (Reveal-aware, slide-by-slide) — which solves the live-iframe → PDF problem a screenshot can't.
Cover
Track
▾ L1
One move
Agents
▾ L1·L2·L3
Web/CLI
▾ L1·L2·L3
MCP
▾ L1·L2·L3
Skills
▾ L1·L3
Close

horizontal spine → · press ↓ for depth

Honest scope: the research, persona review, and outline behind this deck were produced by the multi-agent workflow; these Reveal.js slides I assembled by hand from that material.
Engineer
L1L2L3
L3 · expert

Build & verify pipeline

  • Offline-first: vendored Reveal.js, self-hosted fonts, no CDN at runtime → opens from a file or any static host.
  • decktape → a PDF handout that captures even live-iframe slides; verify slide-by-slide via Playwright.
  • Apply the ncar-brand-toolkit theme so brand & accessibility (contrast, logo, waves) are inherited, not hand-applied each time.
  • Promote the PDF build into CI / a hook so the live deck and the handout never drift apart.

Act 7

Honesty & close

What AI gets confidently wrong, what transfers, how I'd improve — and the evidence.

Both

The things AI gets confidently wrong

Verification

Code can be fluently, confidently wrong — a swapped axis, a wrong unit conversion, a plausible-but-wrong normalization that runs cleanly. Test units/ranges/conservation; check a known-good baseline; have it cite the formula; diff everything — never one check alone.

Reproducibility

Reproduce the checked-in artifact + environment, not the conversation. Commit code + a provenance record: model+version, effort, prompt, tools, date, diff.

Air-gap

There is no fully air-gapped Claude. If your data needs a true air gap, Claude Code is not your tool — full stop. Bedrock/Vertex give residency + no-train + BAA for restricted-but-not-air-gapped data.

Attribution

AI is not an author. Disclose the tool + model+version + what it did. Over-disclose. Keep prompt + diff provenance.

Treat AI output like a capable-but-unfamiliar collaborator's PR. Risk is highest where supervision is lowest — normalize verify-then-trust.

Scientist
L1L2L3
L2 · practice

Provenance & disclosure recipe

  • Methods-section recipe: cite model+version, state that AI assisted codegen, archive the prompt + diff in the repo/supplement.
  • Pin the model version explicitly (not "latest"); treat the prompt as a documented input.
  • Integrity line for students: the violation is undisclosed substitution of AI work for your own judgment — not use of the tool.
Engineer
L1L2L3
L3 · expert

Governance & deployment reality

  • Route through Bedrock/Vertex for residency; sign enterprise agreements (BAA/DPA) for retention & jurisdiction; run least-privilege (scoped working dir, rootless container, no secrets in env); capture session logs for SIEM; control MCP config centrally.
  • Bound unattended autonomy: restrictive permission mode + tool allowlist + worktree sandbox + PreToolUse block-hooks — so "running while I sleep" is safe by construction.
  • Checkpoint long jobs to committed files, not session state — a hard restart then loses minutes, not hours.
Both

What transfers, what's lock-in

If you're not a Claude user: the patterns are the lesson, the tool is just the vehicle

Transferable — most of the craft

Decomposition · role/persona design · context-setting · verify/critic loops · artifact-driven output · batching edits · treating output as reviewable.

Non-transferable plumbing

CLI/sub-agent mechanics · skill & plugin format · .mcp.json wiring · model names · effort knobs · prompt-dialect quirks.

Moving to Codex / Gemini / Antigravity: the concepts survive, the integration glue gets rewritten. Keep prompts declarative; treat vendor wiring as replaceable adapters. (I haven't personally used the others — but the patterns are the point.)

prompting-patterns sheet — the transferable prompt craft

Both

How I'd improve this workflow

TodayImproved
DAG lives in orchestrator proseVersioned workflow schema + published plugin agents
Main-loop default maxed; no Haiku tier (sub-agents are already Sonnet/medium)Modest main-loop default + a Haiku scout tier + token/hour telemetry
Collab wrapper is an idea, no artifactShipped — stable IDs, validated schema, diff-preview
Unattended autonomy lightly boundedLeast-privilege + worktree sandbox + PreToolUse block-hooks

Plus: validation at the seams — there's no built-in inter-agent correctness guarantee, so I add one.

These are discipline & economy gaps, not capability gaps. I'd genuinely like the room's better practices.

NSF NCAR — Research Applications Laboratory thank you

Recap & handouts

Decompose → assign altitude → concrete, inspectable hand-offs

That's the whole spine. Everything else was a variation on it. The talk was built around 13 high-return questions from a persona panel — they're the backbone of Q&A, so please ask them.

Victor Weeks · Research Software Engineer · NSF NCAR / RAL  ·  #NSFfunded

This material is based upon work supported by the NSF National Center for Atmospheric Research, a major facility sponsored by the U.S. National Science Foundation and managed by the University Corporation for Atmospheric Research. This work is also supported by the Better Scientific Software Fellowship Program, funded by the U.S. Department of Energy and National Science Foundation.