PI / Research-Group Manager — AI-Augmented Development Cheat Sheet

1. ROI in One Card

Where AI genuinely pays off across a research group — and where it doesn't.

Where it pays off

Time-to-PoC: scaffolding, prototypes, and exploratory analysis you'd otherwise defer.
Boilerplate & glue: config, parsers, test scaffolds, data wrangling, format conversions.
Documentation: docstrings, READMEs, tutorials, release/changelog notes.
Review & comprehension: explaining unfamiliar code, first-pass review, onboarding.
Exploration: surveying approaches, drafting alternatives, rubber-ducking design.

Where it doesn't (yet)

Novel research judgment: what's worth doing, what a result means, scientific framing.
Unverified correctness: numbers, proofs, citations, and edge cases still need a human.
High-stakes, hard-to-catch errors: anywhere a confident-wrong answer is expensive.

Tip Treat AI as a fast junior collaborator: strong on volume and first drafts, not accountable for truth. The PI still owns the science.

2. Pilot & Measure

Decide with data, not vibes. Run a small, time-boxed pilot before any group-wide mandate.

Pick 2–3 real tasks (one boilerplate-heavy, one analysis, one docs) and a few volunteers across skill levels.
Set a baseline: roughly how long these take today and where quality problems show up.
Time-box it (e.g., a few weeks) and define go/no-go criteria up front.
Measure both sides: time saved and verification + rework cost, plus reviewer load.
Capture qualitative signal: where it helped, wasted time, or felt risky.

Metrics worth tracking

Cycle time to first working draft / PoC.
Reviewer time per AI-assisted PR vs baseline.
Defect / rework rate on AI-touched code.
Adoption & sentiment — do people keep using it voluntarily?

Caution Beware vanity metrics. "Lines generated" or "prompts run" measure activity, not value. Net time including verification is what matters.

3. Institutional ToU & Data Policy

What data may go to which surface depends on the plan and your institution. Verify both before approving.

Surface	Typical use	What to confirm
claude.ai (chat)	Individual exploration, drafting	Plan tier; whether inputs may be used for training; retention window
Claude Code (CLI)	Coding in a repo on a dev machine	Same account/plan terms; what files/context it sends; local permissions
API / Enterprise / Team	Programmatic + org-governed use	Admin-managed retention, zero-retention options, contract/BAA terms

Defaults differ by plan and change over time — verify data-retention and training-use defaults against the current ToU.
Enterprise/Team & API typically offer stronger controls (admin-managed retention, zero-data-retention options) — confirm what your contract actually includes.
Classify the data: public / internal / confidential / regulated (PII, PHI, export-controlled, embargoed results) → "may / may not go to a model."
Indirect egress counts: file uploads, MCP tool calls, and web fetches all send data outward.

Caution Defaults, retention windows, and training-use terms are volatile and plan-specific. Always verify with your institution's policy and the current Terms of Use — do not rely on this sheet or on memory.

Tip Publish a one-page green/yellow/red table mapping data class → allowed surface, so no one re-decides every session.

4. Provenance, Disclosure & Citation

For citable, reproducible research, record how AI was used — not the chat transcript.

Record for each AI-assisted artifact

Tool + model NAME and VERSION — pin a specific version, never "latest."
Effort / reasoning level used (the setting, if any).
The prompts / instructions that produced or shaped the artifact.
Which tools / MCP servers were enabled during the run.

Reproduce the right thing

Commit the checked-in artifact (code, config, processing script) + environment (lockfile/container) — that's what reproduces.
The chat is provenance / audit, not the reproducible unit. Don't try to reproduce the conversation.

Methods-section disclosure line (example)

Portions of the analysis code were drafted with <Tool>
(<model>, version <X.Y>) and reviewed, tested, and edited by
the authors. The model was used to <scaffold the data-loading
pipeline / draft unit tests>; all results were verified against
<reference>.

Tip Check the target venue/journal and your institution for AI-disclosure requirements — they vary and change. Pin the version so the methods line stays reproducible.

5. Cost & Usage Governance

Limits and premiums are budget levers — set expectations and watch the burn.

Lever	What to watch
Subscription usage budgets	Plans bundle usage; heavy users can exhaust caps. Know your plan's allowances.
Per-session & weekly limits	Stagger large runs so one person doesn't drain a shared cap mid-week.
Model premiums	Opus costs more per token than Sonnet/Haiku — reserve it for hard reasoning.
Shared-bucket effects	On shared/team plans, one runaway session can starve the group.
Who-pays	Decide grant vs department vs personal, and clarify it before rollout.

Lightweight monitoring

Ask the team to note model + effort in PRs ("Sonnet/medium" vs "Opus/high").
Periodically review spend/usage dashboards where available.
Cap autonomous/sub-agent loops; require checkpoints.

Caution Pricing, limits, and what's bundled change frequently and differ by plan. Verify current pricing and limits live before budgeting — don't quote numbers from memory.

6. Responsible Use & Review

Speed without verification is a liability. Keep a human accountable for every output.

Human-in-the-loop: a person reviews and signs off before AI-touched work is merged, published, or shipped.
Verify before trust: numbers, citations, and correctness are checked against trusted sources — AI can be confidently wrong.
Security review of AI-touched code: same bar as any contribution — secrets, injection, unsafe shell, over-broad permissions.
MCP server vetting: treat MCP servers, plugins, and skills as untrusted dependencies — review source, pin versions, least-privilege credentials, keep an allow-list.
Integrity: be clear that AI assisted; authors remain responsible for the work.

Tip Make "who verified this, and how" a required field in review — just like a second reviewer's sign-off.

7. Rollout & Procurement

Adopt in stages; don't mandate before you've learned.

Pilot → Policy → Training

Pilot: small group, real tasks, explicit go/no-go criteria (see Card 2).
Policy: a short data-use + disclosure + cost policy grounded in your institution's ToU.
Training: teach prompting, verification habits, and the policy — not just the tool.

Which roles get which tools

Role	Likely fit
Research scientists	Chat (claude.ai) for analysis/drafting/lit work; light Claude Code on-ramp.
RSEs / developers	Claude Code in repos, with project permissions + review discipline.
Newcomers	Chat first, with strong verify-before-trust coaching.
PIs / managers	Drafting, summarizing, policy/grant prose; oversight of governance.

Tip Match the tool to the workflow, not seniority. Buy for the pilot's scope first; expand seats/tiers after you've measured value and written policy.

8. Copyable Prompts

Paste these into Claude / Claude Code. Adapt names, data classes, and policy to your org.

Design a pilot plan

Help me design a 4-week pilot to evaluate an AI coding assistant for
my research group. Propose: 3 representative tasks, success/abort
criteria, baseline measurements, what to log, and a one-page report
template comparing time saved vs verification + rework cost.

Draft a data-classification table

Draft a green/yellow/red data-use table for my group mapping data
classes (public, internal, confidential, PII/PHI, export-controlled,
embargoed results) to allowed AI surfaces (chat vs in-repo CLI vs
enterprise/API). Leave retention/training specifics as TODOs for me
to confirm against our institution's current Terms of Use.

Write a methods-section disclosure line

Write a concise methods-section sentence disclosing that we used an
AI assistant. Include placeholders for tool name, model + version,
and exactly what it did (e.g., drafted code, wrote tests), and state
that the authors verified all results. Keep it journal-appropriate.

Build a provenance checklist for a result

Given this commit, list what we should record to make the AI's role
reproducible and citable: model name + pinned version, effort level,
the prompts used, enabled tools/MCP servers, and the environment
(lockfile/container). Output a short checklist I can attach to the PR.

Cost sanity check

Here is how our group uses the assistant (roles, typical tasks,
model choices). Identify where spend likely concentrates, where a
cheaper model or smaller context would cut cost without losing rigor,
and which lightweight usage signals to monitor. Do not assume specific
prices — tell me what to verify live.

Tip These prompts are largely model-agnostic — the same wording transfers to other AI tools. Keep them in a shared snippets file so governance doesn't depend on who's driving.

9. Do / Avoid

Do

Pilot and measure before any group-wide mandate.
Map data classes to allowed surfaces; verify against institutional ToU.
Pin model name + version in provenance; disclose AI use in methods.
Reserve premium models (Opus) for hard reasoning; default to cheaper.
Keep a human accountable for verifying every merged/published output.
Vet MCP servers, plugins, and skills like any dependency.
Decide who-pays and monitor spend lightly but regularly.

Avoid

Mandating tools org-wide on vibes, before a pilot.
Sending confidential/regulated/embargoed data to an unverified surface.
Citing "latest" instead of a pinned model version.
Treating the chat transcript as the reproducible artifact.
Letting Opus + high effort run bulk mechanical work.
Trusting AI numbers or citations without verification.
Quoting prices/limits from memory instead of verifying live.

10. How This Could Be Better (Honest Notes)

This is a starting posture for group leaders, not a finished standard. Known gaps to keep improving:

ROI measurement is hard: time saved is easy to feel and hard to attribute cleanly; verification cost is routinely under-counted.
Disclosure norms are unsettled: journals and institutions differ and keep changing — re-check per submission.
Provenance tooling is immature: capturing model + version, prompts, and enabled tools automatically is still largely manual today.
Policy drift: Terms of Use, retention, pricing, and limits change across vendors — re-verify on a schedule, not once.

Tip Treat this sheet as a living document. As your group learns a stronger practice, fold it back in — the goal is better governance and better science, not just recording current habits.