1. ROI in One Card
Where AI genuinely pays off across a research group — and where it doesn't.
Where it pays off
- Time-to-PoC: scaffolding, prototypes, and exploratory analysis you'd otherwise defer.
- Boilerplate & glue: config, parsers, test scaffolds, data wrangling, format conversions.
- Documentation: docstrings, READMEs, tutorials, release/changelog notes.
- Review & comprehension: explaining unfamiliar code, first-pass review, onboarding.
- Exploration: surveying approaches, drafting alternatives, rubber-ducking design.
Where it doesn't (yet)
- Novel research judgment: what's worth doing, what a result means, scientific framing.
- Unverified correctness: numbers, proofs, citations, and edge cases still need a human.
- High-stakes, hard-to-catch errors: anywhere a confident-wrong answer is expensive.
2. Pilot & Measure
Decide with data, not vibes. Run a small, time-boxed pilot before any group-wide mandate.
- Pick 2–3 real tasks (one boilerplate-heavy, one analysis, one docs) and a few volunteers across skill levels.
- Set a baseline: roughly how long these take today and where quality problems show up.
- Time-box it (e.g., a few weeks) and define go/no-go criteria up front.
- Measure both sides: time saved and verification + rework cost, plus reviewer load.
- Capture qualitative signal: where it helped, wasted time, or felt risky.
Metrics worth tracking
- Cycle time to first working draft / PoC.
- Reviewer time per AI-assisted PR vs baseline.
- Defect / rework rate on AI-touched code.
- Adoption & sentiment — do people keep using it voluntarily?
3. Institutional ToU & Data Policy
What data may go to which surface depends on the plan and your institution. Verify both before approving.
| Surface | Typical use | What to confirm |
|---|---|---|
| claude.ai (chat) | Individual exploration, drafting | Plan tier; whether inputs may be used for training; retention window |
| Claude Code (CLI) | Coding in a repo on a dev machine | Same account/plan terms; what files/context it sends; local permissions |
| API / Enterprise / Team | Programmatic + org-governed use | Admin-managed retention, zero-retention options, contract/BAA terms |
- Defaults differ by plan and change over time — verify data-retention and training-use defaults against the current ToU.
- Enterprise/Team & API typically offer stronger controls (admin-managed retention, zero-data-retention options) — confirm what your contract actually includes.
- Classify the data: public / internal / confidential / regulated (PII, PHI, export-controlled, embargoed results) → "may / may not go to a model."
- Indirect egress counts: file uploads, MCP tool calls, and web fetches all send data outward.
4. Provenance, Disclosure & Citation
For citable, reproducible research, record how AI was used — not the chat transcript.
Record for each AI-assisted artifact
- Tool + model NAME and VERSION — pin a specific version, never "latest."
- Effort / reasoning level used (the setting, if any).
- The prompts / instructions that produced or shaped the artifact.
- Which tools / MCP servers were enabled during the run.
Reproduce the right thing
- Commit the checked-in artifact (code, config, processing script) + environment (lockfile/container) — that's what reproduces.
- The chat is provenance / audit, not the reproducible unit. Don't try to reproduce the conversation.
Methods-section disclosure line (example)
Portions of the analysis code were drafted with <Tool>
(<model>, version <X.Y>) and reviewed, tested, and edited by
the authors. The model was used to <scaffold the data-loading
pipeline / draft unit tests>; all results were verified against
<reference>.
5. Cost & Usage Governance
Limits and premiums are budget levers — set expectations and watch the burn.
| Lever | What to watch |
|---|---|
| Subscription usage budgets | Plans bundle usage; heavy users can exhaust caps. Know your plan's allowances. |
| Per-session & weekly limits | Stagger large runs so one person doesn't drain a shared cap mid-week. |
| Model premiums | Opus costs more per token than Sonnet/Haiku — reserve it for hard reasoning. |
| Shared-bucket effects | On shared/team plans, one runaway session can starve the group. |
| Who-pays | Decide grant vs department vs personal, and clarify it before rollout. |
Lightweight monitoring
- Ask the team to note model + effort in PRs ("Sonnet/medium" vs "Opus/high").
- Periodically review spend/usage dashboards where available.
- Cap autonomous/sub-agent loops; require checkpoints.
6. Responsible Use & Review
Speed without verification is a liability. Keep a human accountable for every output.
- Human-in-the-loop: a person reviews and signs off before AI-touched work is merged, published, or shipped.
- Verify before trust: numbers, citations, and correctness are checked against trusted sources — AI can be confidently wrong.
- Security review of AI-touched code: same bar as any contribution — secrets, injection, unsafe shell, over-broad permissions.
- MCP server vetting: treat MCP servers, plugins, and skills as untrusted dependencies — review source, pin versions, least-privilege credentials, keep an allow-list.
- Integrity: be clear that AI assisted; authors remain responsible for the work.
7. Rollout & Procurement
Adopt in stages; don't mandate before you've learned.
Pilot → Policy → Training
- Pilot: small group, real tasks, explicit go/no-go criteria (see Card 2).
- Policy: a short data-use + disclosure + cost policy grounded in your institution's ToU.
- Training: teach prompting, verification habits, and the policy — not just the tool.
Which roles get which tools
| Role | Likely fit |
|---|---|
| Research scientists | Chat (claude.ai) for analysis/drafting/lit work; light Claude Code on-ramp. |
| RSEs / developers | Claude Code in repos, with project permissions + review discipline. |
| Newcomers | Chat first, with strong verify-before-trust coaching. |
| PIs / managers | Drafting, summarizing, policy/grant prose; oversight of governance. |
8. Copyable Prompts
Paste these into Claude / Claude Code. Adapt names, data classes, and policy to your org.
Design a pilot plan
Help me design a 4-week pilot to evaluate an AI coding assistant for
my research group. Propose: 3 representative tasks, success/abort
criteria, baseline measurements, what to log, and a one-page report
template comparing time saved vs verification + rework cost.
Draft a data-classification table
Draft a green/yellow/red data-use table for my group mapping data
classes (public, internal, confidential, PII/PHI, export-controlled,
embargoed results) to allowed AI surfaces (chat vs in-repo CLI vs
enterprise/API). Leave retention/training specifics as TODOs for me
to confirm against our institution's current Terms of Use.
Write a methods-section disclosure line
Write a concise methods-section sentence disclosing that we used an
AI assistant. Include placeholders for tool name, model + version,
and exactly what it did (e.g., drafted code, wrote tests), and state
that the authors verified all results. Keep it journal-appropriate.
Build a provenance checklist for a result
Given this commit, list what we should record to make the AI's role
reproducible and citable: model name + pinned version, effort level,
the prompts used, enabled tools/MCP servers, and the environment
(lockfile/container). Output a short checklist I can attach to the PR.
Cost sanity check
Here is how our group uses the assistant (roles, typical tasks,
model choices). Identify where spend likely concentrates, where a
cheaper model or smaller context would cut cost without losing rigor,
and which lightweight usage signals to monitor. Do not assume specific
prices — tell me what to verify live.
9. Do / Avoid
Do
- Pilot and measure before any group-wide mandate.
- Map data classes to allowed surfaces; verify against institutional ToU.
- Pin model name + version in provenance; disclose AI use in methods.
- Reserve premium models (Opus) for hard reasoning; default to cheaper.
- Keep a human accountable for verifying every merged/published output.
- Vet MCP servers, plugins, and skills like any dependency.
- Decide who-pays and monitor spend lightly but regularly.
Avoid
- Mandating tools org-wide on vibes, before a pilot.
- Sending confidential/regulated/embargoed data to an unverified surface.
- Citing "latest" instead of a pinned model version.
- Treating the chat transcript as the reproducible artifact.
- Letting Opus + high effort run bulk mechanical work.
- Trusting AI numbers or citations without verification.
- Quoting prices/limits from memory instead of verifying live.
10. How This Could Be Better (Honest Notes)
This is a starting posture for group leaders, not a finished standard. Known gaps to keep improving:
- ROI measurement is hard: time saved is easy to feel and hard to attribute cleanly; verification cost is routinely under-counted.
- Disclosure norms are unsettled: journals and institutions differ and keep changing — re-check per submission.
- Provenance tooling is immature: capturing model + version, prompts, and enabled tools automatically is still largely manual today.
- Policy drift: Terms of Use, retention, pricing, and limits change across vendors — re-verify on a schedule, not once.