Simulated NIH Review Panel
A multi-agent system that models a National Institutes of Health study-section review of a grant application, producing scored critiques aligned with current NIH criteria. The system is grounded in publicly available Center for Scientific Review rosters and calibrated against published summary statements.
Single-call large-language-model reviewers conflate reviewer identity, scoring rubric, and panel dynamics into one request and consequently mimic the format of an NIH summary statement without reproducing the process that produces one. We describe a system that decomposes pre-submission review into six stages — grant parsing, funding-call alignment, study-section recommendation, reviewer profiling, multi-round simulated review, and summary synthesis — in which each stage is a specialized agent with access to external data sources and a structured, validated output contract. Nine reviewer personas, derived from real CSR roster members and their publication records, deliberate across three rounds with figure-aware multimodal inputs and programmatically enforced scoring constraints. Calibration runs against de-identified R01 applications with published summary statements establish that alignment with the real panel depends on matching a single “criticality” parameter to the apparent competitiveness of the submission.
Uploading a grant to a general-purpose conversational model yields plausible prose but elides four things that peer review actually rests on:
- Reviewer identity. An NIH reviewer brings a specific publication record, active grants, and sub-disciplinary lens. “Imagine you are an NIH reviewer” collapses a thousand such lenses into one.
- Panel composition. Study sections are assembled to cover overlapping but distinct areas of expertise. Disagreement between reviewers is a feature of the process, not noise to be averaged out.
- Scoring discipline. Current NIH criteria impose structural constraints (e.g., Approach cannot score better than Significance) that are routinely violated by unconstrained model output. These must be enforced, not suggested.
- Evidence traceability. Reviewers cite sections, figures, and prior literature. A single-shot summarization loses the grounding that makes a critique actionable during revision.
The system described here addresses each of these through explicit pipeline stages rather than longer prompts.
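The staged decomposition can be sketched as a minimal pipeline. All names here (`StageResult`, `run_pipeline`, the stage identifiers) are illustrative placeholders, not the system's actual contracts; the sketch only shows the control flow: each stage runs in order and its output is validated before the next agent sees it.

```python
from dataclasses import dataclass, field

# Hypothetical stage-output container; the real output contracts are
# richer and stage-specific.
@dataclass
class StageResult:
    stage: str
    payload: dict
    issues: list = field(default_factory=list)

    def validated(self) -> "StageResult":
        # Stand-in for the structured, validated output contract:
        # here the only check is that the stage produced something.
        if not self.payload:
            self.issues.append(f"{self.stage}: empty output")
        return self

# The six stages, run strictly in order so each specialized agent
# consumes only validated output from its predecessor.
STAGES = [
    "grant_parsing",
    "funding_call_alignment",
    "study_section_recommendation",
    "reviewer_profiling",
    "multi_round_review",
    "summary_synthesis",
]

def run_pipeline(grant_text: str) -> list:
    results = []
    payload = {"text": grant_text}
    for stage in STAGES:
        # In the real system each stage is an agent with external data
        # access; this sketch just threads the payload through.
        result = StageResult(stage=stage, payload=payload).validated()
        results.append(result)
        payload = result.payload
    return results
```

Keeping each stage's output as a discrete, inspectable value is what lets the user review intermediate results (parsed grant, FOA match, roster) between stages rather than receiving only the final critique.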
We benchmarked the pipeline against three de-identified R01 applications with published summary statements, spanning two funded submissions (one high-priority, one borderline) and one submission that was not discussed. For each application, we ran the full simulation at four criticality settings (3, 5, 7, 9) and compared the mean simulated scores to the mean scores recorded in the corresponding NIH summary statement on the two criteria that carry a numeric score (Significance and Approach).
The calibration is therefore not a claim that the system predicts a real review outcome. It establishes that the scoring distribution is tunable to plausible panel behavior across the competitiveness spectrum, and that a user who seeks a stress-test reading of a draft can do so deliberately by dialing the panel up or down.
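The criticality selection above amounts to a small search: for each setting, compare the mean simulated scores to the published panel means and keep the closest. The sketch below uses invented numbers purely for illustration; they are not the benchmark results.

```python
# Hypothetical published panel means for the two numerically scored
# criteria (illustrative values, not real summary-statement data).
published = {"significance": 3.0, "approach": 4.0}

# Hypothetical mean simulated scores at each criticality setting.
simulated = {
    3: {"significance": 2.0, "approach": 2.5},
    5: {"significance": 3.0, "approach": 3.5},
    7: {"significance": 4.0, "approach": 4.5},
    9: {"significance": 5.5, "approach": 6.0},
}

def best_criticality(published: dict, simulated: dict) -> int:
    """Pick the criticality setting whose mean simulated scores are
    closest (by mean absolute difference) to the published means."""
    def gap(setting: int) -> float:
        diffs = [abs(simulated[setting][c] - published[c]) for c in published]
        return sum(diffs) / len(diffs)
    return min(simulated, key=gap)
```

With these illustrative numbers the search selects criticality 5; a user who wants a harsher stress-test would instead pin the setting above the best-fit value.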
What the system does
- Produces a scored critique in current NIH format (Significance, Approach, Expertise, Overall Impact).
- Draws reviewer personas from real, publicly listed Center for Scientific Review standing roster members.
- Profiles each reviewer against NIH RePORTER and web sources to assign plausible sub-domain expertise.
- Enforces scoring constraints (Approach ≤ Significance; Overall Impact ≤ Significance) both in the prompt and in code.
- Supplies reviewers with extracted figures alongside text; figures meaningfully affect scoring commentary.
- Exposes intermediate outputs (parsed grant, FOA match, recommended sections, roster) to the user between stages.
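The scoring constraints listed above can be enforced with a small post-hoc check. This is a sketch, not the system's actual code, and it assumes the standard NIH 1–9 scale where a lower numeral is a better score, so "Approach cannot score better than Significance" means the Approach numeral may not be lower than the Significance numeral.

```python
NIH_SCALE = range(1, 10)  # 1 = exceptional, 9 = poor; lower is better

def enforce_score_constraints(scores: dict) -> dict:
    """Clamp scores so Approach and Overall Impact never beat Significance.

    On the 1-9 scale 'cannot score better' means the constrained
    criterion's numeral must be >= the Significance numeral, so a
    violation is repaired by raising it to the Significance score.
    """
    out = dict(scores)
    for criterion in ("approach", "overall_impact"):
        if out[criterion] < out["significance"]:
            out[criterion] = out["significance"]
    # Reject anything the model emitted outside the legal scale.
    for criterion, value in out.items():
        if value not in NIH_SCALE:
            raise ValueError(f"{criterion} score {value} outside 1-9 scale")
    return out
```

Running the check in code, after the model responds, is what makes the constraint an invariant rather than a suggestion: `enforce_score_constraints({"significance": 4, "approach": 2, "overall_impact": 3})` raises both constrained scores to 4.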
What the system does not do
- Predict funding outcomes. The output is a modeled critique, not a probability estimate.
- Contact, identify, or disclose any real reviewer's private information. Profiling uses only public data.
- Replace programmatic review or substitute for an NIH Scientific Review Officer's judgment.
- Bypass institutional review or grants-management processes.
Calibration applications are de-identified in this documentation by design. Summary statements and application text used internally are not redistributed, and no applicant submission is retained beyond a session unless the user explicitly exports it.
The application is served at a separate, authenticated origin. Credential checks are handled by Cloudflare Access; the application itself does not manage credentials.