AI-Augmented Development & Research Tutorial

PI / Research-Group Manager Cheat Sheet

Who this is for: principal investigators and managers weighing ROI, governance, and responsible/citable use of AI across a research group.

1. ROI in One Card

Where AI genuinely pays off across a research group — and where it doesn't.

Where it pays off

  • Time-to-PoC: scaffolding, prototypes, and exploratory analysis you'd otherwise defer.
  • Boilerplate & glue: config, parsers, test scaffolds, data wrangling, format conversions.
  • Documentation: docstrings, READMEs, tutorials, release/changelog notes.
  • Review & comprehension: explaining unfamiliar code, first-pass review, onboarding.
  • Exploration: surveying approaches, drafting alternatives, rubber-ducking design.

Where it doesn't (yet)

  • Novel research judgment: what's worth doing, what a result means, scientific framing.
  • Unverified correctness: numbers, proofs, citations, and edge cases still need a human.
  • High-stakes, hard-to-catch errors: anywhere a confident-wrong answer is expensive.
Tip Treat AI as a fast junior collaborator: strong on volume and first drafts, not accountable for truth. The PI still owns the science.

2. Pilot & Measure

Decide with data, not vibes. Run a small, time-boxed pilot before any group-wide mandate.

  • Pick 2–3 real tasks (one boilerplate-heavy, one analysis, one docs) and a few volunteers across skill levels.
  • Set a baseline: roughly how long these take today and where quality problems show up.
  • Time-box it (e.g., a few weeks) and define go/no-go criteria up front.
  • Measure both sides: time saved and verification + rework cost, plus reviewer load.
  • Capture qualitative signal: where it helped, wasted time, or felt risky.

Metrics worth tracking

  • Cycle time to first working draft / PoC.
  • Reviewer time per AI-assisted PR vs baseline.
  • Defect / rework rate on AI-touched code.
  • Adoption & sentiment — do people keep using it voluntarily?
Caution Beware vanity metrics. "Lines generated" or "prompts run" measure activity, not value. Net time including verification is what matters.

3. Institutional ToU & Data Policy

What data may go to which surface depends on the plan and your institution. Verify both before approving.

SurfaceTypical useWhat to confirm
claude.ai (chat)Individual exploration, draftingPlan tier; whether inputs may be used for training; retention window
Claude Code (CLI)Coding in a repo on a dev machineSame account/plan terms; what files/context it sends; local permissions
API / Enterprise / TeamProgrammatic + org-governed useAdmin-managed retention, zero-retention options, contract/BAA terms
  • Defaults differ by plan and change over time — verify data-retention and training-use defaults against the current ToU.
  • Enterprise/Team & API typically offer stronger controls (admin-managed retention, zero-data-retention options) — confirm what your contract actually includes.
  • Classify the data: public / internal / confidential / regulated (PII, PHI, export-controlled, embargoed results) → "may / may not go to a model."
  • Indirect egress counts: file uploads, MCP tool calls, and web fetches all send data outward.
Caution Defaults, retention windows, and training-use terms are volatile and plan-specific. Always verify with your institution's policy and the current Terms of Use — do not rely on this sheet or on memory.
Tip Publish a one-page green/yellow/red table mapping data class → allowed surface, so no one re-decides every session.

4. Provenance, Disclosure & Citation

For citable, reproducible research, record how AI was used — not the chat transcript.

Record for each AI-assisted artifact

  • Tool + model NAME and VERSION — pin a specific version, never "latest."
  • Effort / reasoning level used (the setting, if any).
  • The prompts / instructions that produced or shaped the artifact.
  • Which tools / MCP servers were enabled during the run.

Reproduce the right thing

  • Commit the checked-in artifact (code, config, processing script) + environment (lockfile/container) — that's what reproduces.
  • The chat is provenance / audit, not the reproducible unit. Don't try to reproduce the conversation.

Methods-section disclosure line (example)

Portions of the analysis code were drafted with <Tool>
(<model>, version <X.Y>) and reviewed, tested, and edited by
the authors. The model was used to <scaffold the data-loading
pipeline / draft unit tests>; all results were verified against
<reference>.
Tip Check the target venue/journal and your institution for AI-disclosure requirements — they vary and change. Pin the version so the methods line stays reproducible.

5. Cost & Usage Governance

Limits and premiums are budget levers — set expectations and watch the burn.

LeverWhat to watch
Subscription usage budgetsPlans bundle usage; heavy users can exhaust caps. Know your plan's allowances.
Per-session & weekly limitsStagger large runs so one person doesn't drain a shared cap mid-week.
Model premiumsOpus costs more per token than Sonnet/Haiku — reserve it for hard reasoning.
Shared-bucket effectsOn shared/team plans, one runaway session can starve the group.
Who-paysDecide grant vs department vs personal, and clarify it before rollout.

Lightweight monitoring

  • Ask the team to note model + effort in PRs ("Sonnet/medium" vs "Opus/high").
  • Periodically review spend/usage dashboards where available.
  • Cap autonomous/sub-agent loops; require checkpoints.
Caution Pricing, limits, and what's bundled change frequently and differ by plan. Verify current pricing and limits live before budgeting — don't quote numbers from memory.

6. Responsible Use & Review

Speed without verification is a liability. Keep a human accountable for every output.

  • Human-in-the-loop: a person reviews and signs off before AI-touched work is merged, published, or shipped.
  • Verify before trust: numbers, citations, and correctness are checked against trusted sources — AI can be confidently wrong.
  • Security review of AI-touched code: same bar as any contribution — secrets, injection, unsafe shell, over-broad permissions.
  • MCP server vetting: treat MCP servers, plugins, and skills as untrusted dependencies — review source, pin versions, least-privilege credentials, keep an allow-list.
  • Integrity: be clear that AI assisted; authors remain responsible for the work.
Tip Make "who verified this, and how" a required field in review — just like a second reviewer's sign-off.

7. Rollout & Procurement

Adopt in stages; don't mandate before you've learned.

Pilot → Policy → Training

  • Pilot: small group, real tasks, explicit go/no-go criteria (see Card 2).
  • Policy: a short data-use + disclosure + cost policy grounded in your institution's ToU.
  • Training: teach prompting, verification habits, and the policy — not just the tool.

Which roles get which tools

RoleLikely fit
Research scientistsChat (claude.ai) for analysis/drafting/lit work; light Claude Code on-ramp.
RSEs / developersClaude Code in repos, with project permissions + review discipline.
NewcomersChat first, with strong verify-before-trust coaching.
PIs / managersDrafting, summarizing, policy/grant prose; oversight of governance.
Tip Match the tool to the workflow, not seniority. Buy for the pilot's scope first; expand seats/tiers after you've measured value and written policy.

8. Copyable Prompts

Paste these into Claude / Claude Code. Adapt names, data classes, and policy to your org.

Design a pilot plan

Help me design a 4-week pilot to evaluate an AI coding assistant for
my research group. Propose: 3 representative tasks, success/abort
criteria, baseline measurements, what to log, and a one-page report
template comparing time saved vs verification + rework cost.

Draft a data-classification table

Draft a green/yellow/red data-use table for my group mapping data
classes (public, internal, confidential, PII/PHI, export-controlled,
embargoed results) to allowed AI surfaces (chat vs in-repo CLI vs
enterprise/API). Leave retention/training specifics as TODOs for me
to confirm against our institution's current Terms of Use.

Write a methods-section disclosure line

Write a concise methods-section sentence disclosing that we used an
AI assistant. Include placeholders for tool name, model + version,
and exactly what it did (e.g., drafted code, wrote tests), and state
that the authors verified all results. Keep it journal-appropriate.

Build a provenance checklist for a result

Given this commit, list what we should record to make the AI's role
reproducible and citable: model name + pinned version, effort level,
the prompts used, enabled tools/MCP servers, and the environment
(lockfile/container). Output a short checklist I can attach to the PR.

Cost sanity check

Here is how our group uses the assistant (roles, typical tasks,
model choices). Identify where spend likely concentrates, where a
cheaper model or smaller context would cut cost without losing rigor,
and which lightweight usage signals to monitor. Do not assume specific
prices — tell me what to verify live.
Tip These prompts are largely model-agnostic — the same wording transfers to other AI tools. Keep them in a shared snippets file so governance doesn't depend on who's driving.

9. Do / Avoid

Do

  • Pilot and measure before any group-wide mandate.
  • Map data classes to allowed surfaces; verify against institutional ToU.
  • Pin model name + version in provenance; disclose AI use in methods.
  • Reserve premium models (Opus) for hard reasoning; default to cheaper.
  • Keep a human accountable for verifying every merged/published output.
  • Vet MCP servers, plugins, and skills like any dependency.
  • Decide who-pays and monitor spend lightly but regularly.

Avoid

  • Mandating tools org-wide on vibes, before a pilot.
  • Sending confidential/regulated/embargoed data to an unverified surface.
  • Citing "latest" instead of a pinned model version.
  • Treating the chat transcript as the reproducible artifact.
  • Letting Opus + high effort run bulk mechanical work.
  • Trusting AI numbers or citations without verification.
  • Quoting prices/limits from memory instead of verifying live.

10. How This Could Be Better (Honest Notes)

This is a starting posture for group leaders, not a finished standard. Known gaps to keep improving:

  • ROI measurement is hard: time saved is easy to feel and hard to attribute cleanly; verification cost is routinely under-counted.
  • Disclosure norms are unsettled: journals and institutions differ and keep changing — re-check per submission.
  • Provenance tooling is immature: capturing model + version, prompts, and enabled tools automatically is still largely manual today.
  • Policy drift: Terms of Use, retention, pricing, and limits change across vendors — re-verify on a schedule, not once.
Tip Treat this sheet as a living document. As your group learns a stronger practice, fold it back in — the goal is better governance and better science, not just recording current habits.