Research tooling · Pre-submission review modeling

Simulated NIH Review Panel

A multi-agent system that models a National Institutes of Health study-section review of a grant application, producing scored critiques aligned with current NIH criteria. The system is grounded in publicly available Center for Scientific Review rosters and calibrated against published summary statements.

Access to the application is restricted; clicking through triggers authentication via Cloudflare Access.

Single-call large-language-model reviewers conflate reviewer identity, scoring rubric, and panel dynamics into one request and consequently mimic the format of an NIH summary statement without reproducing the process that produces one. We describe a system that decomposes pre-submission review into six stages — grant parsing, funding-call alignment, study-section recommendation, reviewer profiling, multi-round simulated review, and summary synthesis — in which each stage is a specialized agent with access to external data sources and a structured, validated output contract. Nine reviewer personas, derived from real CSR roster members and their publication records, deliberate across three rounds with figure-aware multimodal inputs and programmatically enforced scoring constraints. Calibration runs against de-identified R01 applications with published summary statements establish that alignment with the real panel depends on matching a single “criticality” parameter to the apparent competitiveness of the submission.

Uploading a grant to a general-purpose conversational model yields plausible prose but elides four things that peer review actually rests on:

  1. Reviewer identity. An NIH reviewer brings a specific publication record, active grants, and sub-disciplinary lens. “Imagine you are an NIH reviewer” collapses a thousand such lenses into one.
  2. Panel composition. Study sections are assembled to cover overlapping but distinct areas of expertise. Disagreement between reviewers is a feature of the process, not noise to be averaged out.
  3. Scoring discipline. Current NIH criteria impose structural constraints (e.g., Approach cannot score better than Significance) that are routinely violated by unconstrained model output. These must be enforced, not suggested.
  4. Evidence traceability. Reviewers cite sections, figures, and prior literature. A single-shot summarization loses the grounding that makes a critique actionable during revision.

The system described here addresses each of these through explicit pipeline stages rather than longer prompts.
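One way to make the per-stage "structured, validated output contract" concrete: each stage emits a typed payload that the next stage consumes, with a user checkpoint between stages. A minimal sketch under assumed names (`GrantParse`, `FOAMatch`, `checkpoint`, and `run_pipeline` are illustrative, not the actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class GrantParse:
    """Illustrative output contract of stage 1 (Grant Reader)."""
    text: str
    figures: list = field(default_factory=list)  # extracted figure images

@dataclass
class FOAMatch:
    """Illustrative output contract of stage 2 (Funding Analyzer)."""
    foa_ids: list            # 1-3 matched FOA identifiers
    institute_ranking: list  # candidate NIH institutes, best first

def checkpoint(stage_name, payload):
    """Pause between stages: surface the payload for inspection and
    editing before the next agent runs (stubbed as a pass-through)."""
    print(f"[{stage_name}] output ready: {type(payload).__name__}")
    return payload

def run_pipeline(grant_text):
    """First two handoffs of the six-stage pipeline, sketched."""
    parsed = checkpoint("grant_reader", GrantParse(text=grant_text))
    match = checkpoint("funding_analyzer",
                       FOAMatch(foa_ids=["PA-XX-000"],  # placeholder id
                                institute_ranking=["NIGMS"]))
    return parsed, match
```

The point of the typed contracts is that a malformed handoff fails loudly at the stage boundary instead of propagating into the simulated review.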

[Figure: six-stage agentic pipeline — flow diagram of six specialized agents connected by data handoffs, with external data sources.]

  1. Grant Reader — parse PDF/DOCX; extract text and figures.
  2. Funding Analyzer — align to 1–3 FOAs; rank institutes.
  3. Panel Recommender — match against 221 CSR study sections.
  4. Roster Collector — profile each reviewer; tag relevance (parallel ×5).
  5. Simulated Review — 9 reviewers across 3 rounds, each receiving the full grant text and extracted figures (multimodal); scoring constraints enforced in prompt and code.
  6. Summary Report — score statistics computed deterministically; narrative synthesis of consensus strengths, weaknesses, disagreements, and revision priorities (hybrid Python + LLM).

External sources: NIH FOA pages (fetched on demand); a local CSR study-section database (221 sections, 2,832 members); the NIH RePORTER API plus web search (per reviewer, cached). User checkpoints sit between every stage: intermediate outputs are inspected and editable before the next agent runs.
Figure 1. Six specialized agents with explicit data handoffs. Dashed arrows mark external data consumption: NIH Funding Opportunity Announcements (fetched per submission), a bundled database of 221 Center for Scientific Review study sections and their 2,832 public roster members, and per-reviewer profiling through the NIH RePORTER API and web search. Stage 5 is the only stage where reviewers receive figures alongside text; scoring constraints (Approach cannot score better than Significance) are enforced both inside the reviewer prompt and as post-validation on the returned schema. Every inter-agent handoff is structurally typed and paused for user review.
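The per-reviewer profiling step queries the public NIH RePORTER v2 project-search API and caches the result per reviewer. A sketch of what that query could look like — the request shape follows the documented RePORTER v2 API, but the function names and caching strategy here are assumptions, not the actual implementation:

```python
import json
import urllib.request
from functools import lru_cache

REPORTER_URL = "https://api.reporter.nih.gov/v2/projects/search"

def build_query(pi_name: str, limit: int = 10) -> dict:
    """Request body for the public RePORTER v2 project-search API."""
    return {"criteria": {"pi_names": [{"any_name": pi_name}]},
            "limit": limit}

@lru_cache(maxsize=None)  # cache per reviewer, as stage 4 does
def reporter_projects(pi_name: str, limit: int = 10) -> list:
    """Fetch a reviewer's NIH-funded projects (network call)."""
    payload = json.dumps(build_query(pi_name, limit)).encode()
    req = urllib.request.Request(
        REPORTER_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read()).get("results", [])
```

Only public grant records come back from this endpoint, which is what keeps the profiling step within the "public data only" boundary stated below.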

We benchmarked the pipeline against three de-identified R01 applications with published summary statements, spanning two funded submissions (one high-priority, one borderline) and one submission that was not discussed. For each application, we ran the full simulation at four criticality settings (3, 5, 7, 9) and compared the mean simulated scores to the mean scores recorded in the corresponding NIH summary statement on the two criteria that carry a numeric score (Significance and Approach).

[Figure: line chart of Approach-score delta (simulated − real) vs. criticality setting (3, 5, 7, 9) for three grants — Grant A, funded (mid-payline); Grant B, funded (strong); Grant C, not discussed — with a shaded band marking |Δ| ≤ 0.5.]
Figure 2. Approach-score delta (simulated minus real) for three benchmark applications across four criticality settings. A criticality of 3–5 best approximates real funded panels; a criticality of 7 approximates panels that declined to discuss. Significance-score deltas (not shown) follow the same shape with smaller amplitude. The single criticality setting is the dominant lever in matching a real panel; matching it to the apparent competitiveness of the submission is a user-facing choice, not an internal autotune.

The calibration is therefore not a claim that the system predicts a real review outcome. It establishes that the scoring distribution is tunable to plausible panel behavior across the competitiveness spectrum, and that a user who seeks a stress-test reading of a draft can do so deliberately by dialing the panel up or down.
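The comparison behind Figure 2 is simple arithmetic: for each criticality setting, take the mean simulated score across the nine reviewers and subtract the mean real score from the summary statement, then look for the setting with the smallest-magnitude delta. A sketch with made-up numbers (the benchmark data itself is not redistributed):

```python
from statistics import mean

def score_delta(simulated_scores, real_scores):
    """Mean simulated score minus mean real score. Positive means the
    simulation scored the grant worse (NIH scale: lower is better)."""
    return mean(simulated_scores) - mean(real_scores)

def best_criticality(runs, real_scores):
    """Criticality setting whose Approach delta is smallest in magnitude.
    `runs` maps setting -> list of nine simulated Approach scores."""
    return min(runs, key=lambda c: abs(score_delta(runs[c], real_scores)))

# Illustrative numbers only — not the benchmark data.
runs = {3: [2, 3, 3, 2, 3, 3, 2, 3, 3],
        5: [3, 4, 4, 3, 4, 4, 3, 4, 4],
        7: [5, 5, 6, 5, 5, 6, 5, 5, 6]}
real = [3, 3, 4]  # Approach scores from a hypothetical summary statement
```

With these numbers, criticality 5 minimizes |Δ|, matching the Figure 2 pattern for a mid-payline funded grant.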

What the system does

  • Produces a scored critique in current NIH format (Significance, Approach, Expertise, Overall Impact).
  • Draws reviewer personas from real, publicly listed Center for Scientific Review standing roster members.
  • Profiles each reviewer against NIH RePORTER and web sources to assign plausible sub-domain expertise.
  • Enforces scoring constraints (neither Approach nor Overall Impact may score better — i.e., numerically lower on the NIH scale — than Significance) both in prompt and in code.
  • Supplies reviewers with extracted figures alongside text; figures meaningfully affect scoring commentary.
  • Exposes intermediate outputs (parsed grant, FOA match, recommended sections, roster) to the user between stages.
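The code-side half of the constraint enforcement amounts to post-validating each reviewer's returned schema. A minimal sketch, assuming a flat score dictionary; the real system might reject and re-prompt rather than clamp, and the field names here are hypothetical:

```python
def validate_scores(review: dict) -> dict:
    """Post-validate a reviewer's scores against the structural
    constraints. NIH scale: 1 = best, 9 = worst, so "cannot score
    better than Significance" means numerically >= the Significance
    score. Hypothetical schema:
    {"significance": int, "approach": int, "overall_impact": int}."""
    for crit in ("significance", "approach", "overall_impact"):
        if not 1 <= review[crit] <= 9:
            raise ValueError(f"{crit} out of range: {review[crit]}")
    sig = review["significance"]
    for crit in ("approach", "overall_impact"):
        if review[crit] < sig:          # scored better than Significance
            review[crit] = sig          # clamp to the constraint boundary
    return review
```

Running this on `{"significance": 3, "approach": 2, "overall_impact": 4}` clamps Approach up to 3 and leaves Overall Impact untouched — the same rule the reviewer prompt states, enforced a second time on the output.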

What the system does not do

  • Predict funding outcomes. The output is a modeled critique, not a probability estimate.
  • Contact, identify, or disclose any real reviewer's private information. Profiling uses only public data.
  • Replace programmatic review or substitute for an NIH Scientific Review Officer's judgment.
  • Bypass institutional review or grants-management processes.

Calibration applications are de-identified in this documentation by design. Summary statements and application text used internally are not redistributed, and no applicant submission is retained beyond a session unless the user explicitly exports it.

The application is served at a separate, authenticated origin. Credential checks are handled by Cloudflare Access; the application itself does not manage credentials.