Sponsor-Study Cutover Checklist (M7)
Purpose. Every M7 §6 fix codifies a guarantee the platform makes about a REDCap-managed study running in production. This checklist turns those guarantees into operator-verifiable steps so the first sponsor cutover is a sequence of green checks, not a series of improvisations.
Audience. Coordinator + study admin doing the cutover, with one engineer on standby. Each item names the file, env var, test, or audit-log entry that proves the check passes.
Scope. REDCap-managed studies (Study.integration_mode = "redcap") only. Metricis-managed studies are M9 territory and are not covered here.
Cross-references.
- Engineering log: docs/project-plan.md §6 — every check below points back to the numbered §6 entry that built the guarantee.
- Generic deployment: docs/guides/deployment.md — server/portal infra, nginx, docker, TLS. This document layers on top of that.
- Architecture context: CLAUDE.md "Key Concepts" — integration mode, unified scheduler, consent gate, REDCap fail-safe policy.
1. Go / No-Go Matrix
Every row must be ✅ before the study can take live participants. A single ❌ blocks cutover; flag the row to the engineer on standby.
| # | Check | Verify by | §6 ref |
|---|---|---|---|
| 1 | ENVIRONMENT=production everywhere it matters | printenv ENVIRONMENT on every app/worker container; healthcheck /api/health returns environment field | #1, #3 |
| 2 | REDCAP_ENCRYPTION_KEY set, distinct from JWT_SECRET_KEY | python -c "from app.config import get_settings; print(bool(get_settings().redcap_encryption_key))" returns True; verify env var ≠ JWT secret | #2 |
| 3 | All REDCap tokens stored as enc:v2: ciphertext | python server/migrate_redcap_tokens.py (dry-run) reports zero rotations needed | #2, #7, #19 |
| 4 | Webhook secret encrypted at rest | Same migration script reports zero rotations on webhook_secret field | #19 |
| 5 | Study.integration_mode = "redcap" matches redcap_enabled = True | pytest tests/test_compliance_invariants.py::TestIntegrationModeRedcapEnabledConsistency green | #15 |
| 6 | Time simulation OFF | pytest tests/test_compliance_invariants.py::TestTimeSimulationProductionGate green; Study.test_mode_config.time_simulation_offset_days is 0 or null in DB | #1 |
| 7 | Dev/test routers gated | tests/test_production_gating.py (25 tests) green in production env | #3 |
| 8 | DET webhook idempotent | tests/test_webhook_idempotency.py (7 tests) green | #17 |
| 9 | DET webhook signed by webhook_secret | A test DET fire with wrong signature returns 401; with the correct one, 200 | #17, #19 |
| 10 | Sync no-fallback invariant | tests/test_compliance_invariants.py::TestREDCapSyncNoFallbackInvariant green | #14 |
| 11 | Per-study admin assigned for the sponsor study | UserStudy(role="admin"\|"owner") exists for at least one operator on this study | #16 |
| 12 | Coordinator dashboard "Failed Sync" tile renders | Visit dashboard while a synthetic failed-sync session exists; tile shows count > 0 | #14 |
| 13 | Patient-portal stale-data banner renders | While a synthetic failed sync exists, hit /api/portal/data-status for an authenticated participant; has_failed_sync: true | #4, #14 |
| 14 | Audit-log hash chain valid | Run nightly verification (python server/audit_integrity.py or wait for nightly CI) | (audit infrastructure) |
| 15 | All M7 regression suites green | pytest tests/test_redcap_*.py tests/test_anchor_shift_reconciliation.py tests/test_schedule_versioning_atomicity.py tests/test_portal_redcap_ingestion.py tests/test_compliance_invariants.py | M7 §6 |
The 9 pre-existing failures documented in the project plan (clock-skew flake, M9-deferred randomization fixtures, etc.) do not block cutover; everything else must pass.
2. Environment Prerequisites
2.1 Required env vars (server + workers)
ENVIRONMENT=production # gates §6 #1, #3, #16, #17
DATABASE_URL=postgresql+asyncpg://… # production DB, not dev
REDIS_URL=redis://… # required for session storage in prod
SESSION_STORAGE_BACKEND=redis # `memory` is dev-only
JWT_SECRET_KEY=… # generate fresh; do NOT reuse
SESSION_SECRET_KEY=… # generate fresh; do NOT reuse
REDCAP_ENCRYPTION_KEY=… # 32+ urlsafe-base64 chars; distinct from JWT
# REDCAP_ENCRYPTION_KEY_PREVIOUS=… # optional; set during a rotation window
ALLOWED_ORIGINS=https://app.metricis.app,… # tighten for production
RATE_LIMIT_PER_MINUTE=60
AUTH_RATE_LIMIT_PER_MINUTE=10
Generate fresh secrets:
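A minimal sketch for producing the three secrets, built on the same `secrets` module that §9.3 uses for the webhook secret. The function name `generate_cutover_secrets` is hypothetical. The padded urlsafe-base64 encoding of 32 random bytes for REDCAP_ENCRYPTION_KEY is an assumption that satisfies the "32+ urlsafe-base64 chars" note above; confirm the format the server actually expects before using it.

```python
import base64
import secrets
from typing import Dict


def generate_cutover_secrets() -> Dict[str, str]:
    """Return fresh, mutually independent values for the secret env vars."""
    return {
        # 32 random bytes rendered as 43 urlsafe characters (no padding).
        "JWT_SECRET_KEY": secrets.token_urlsafe(32),
        "SESSION_SECRET_KEY": secrets.token_urlsafe(32),
        # Assumption: 32 random bytes, urlsafe-base64 with padding (44 chars).
        "REDCAP_ENCRYPTION_KEY": base64.urlsafe_b64encode(
            secrets.token_bytes(32)
        ).decode("ascii"),
    }


if __name__ == "__main__":
    # Print in KEY=value form so the output can be pasted into a secrets store.
    for name, value in generate_cutover_secrets().items():
        print(f"{name}={value}")
```

Each value comes from an independent draw, which keeps REDCAP_ENCRYPTION_KEY distinct from JWT_SECRET_KEY as Go/No-Go row 2 requires.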
2.2 Forbidden env vars in production
These must NOT be set (or must be false/0):
DEV_MODE # bypasses every dev-mode gate (§6 #1, #3)
REDCAP_ENCRYPTION_ALLOW_PLAINTEXT_READS # one-time migration bridge only (§6 #2)
If DEV_MODE=true slips through, every /api/dev/* and /api/testing/* route becomes reachable. The router-level production gate returns 404 only when ENVIRONMENT=production AND DEV_MODE is unset/falsy (§6 #3).
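The gate condition above can be sketched as a small predicate, which is handy in a preflight script. This is an illustration, not the server's actual gate: the helper names and the set of strings treated as truthy are assumptions.

```python
from typing import List, Mapping, Optional

# Assumption: strings the server would parse as "true"; the real parser may differ.
TRUTHY = frozenset({"1", "true", "yes", "on"})
FORBIDDEN_IN_PRODUCTION = ("DEV_MODE", "REDCAP_ENCRYPTION_ALLOW_PLAINTEXT_READS")


def is_truthy(value: Optional[str]) -> bool:
    """Unset, empty, 'false' and '0' all count as falsy."""
    return value is not None and value.strip().lower() in TRUTHY


def dev_routes_return_404(env: Mapping[str, str]) -> bool:
    """True only when ENVIRONMENT=production AND DEV_MODE is unset/falsy."""
    return env.get("ENVIRONMENT") == "production" and not is_truthy(env.get("DEV_MODE"))


def forbidden_vars_enabled(env: Mapping[str, str]) -> List[str]:
    """Names of forbidden variables that are switched on in this environment."""
    return [name for name in FORBIDDEN_IN_PRODUCTION if is_truthy(env.get(name))]
```

Calling `forbidden_vars_enabled(os.environ)` on each container and requiring an empty list is one way to turn this subsection into a scripted check.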
2.3 Optional tunables
REDCAP_DET_RATE_LIMIT_PER_MINUTE=120 # default 120; raise if a busy project is dropping events
REDCAP_DET_TEST_RATE_LIMIT_PER_MINUTE=10 # never reachable in prod (404), but still rate-limited
REDCAP_DET_IDEMPOTENCY_TTL_SECONDS=86400 # default 24h; shorten if legitimate re-saves are deduped
REDCAP_CIRCUIT_BREAKER_FAILURE_THRESHOLD=5 # default 5 consecutive failures before open
REDCAP_CIRCUIT_BREAKER_COOLDOWN_SECONDS=60 # default 60s; raise if upstream takes longer to stabilize
REDCAP_CIRCUIT_BREAKER_ENABLED=true # set to false ONLY for explicit operator override
Keep defaults unless there's evidence to change them.
3. Database Prerequisites
3.1 Migration head
The required head as of M7 closeout: f0a1b2c3d4e5_add_redcap_project_id_index. Earlier heads relevant to M7:
- d2e3f4a5b6c7_add_webhook_events — DET idempotency table (§6 #17)
- e3f4a5b6c7d8_add_survey_task_fields — patient-portal survey task (§6 #4, #20)
3.2 Index verification
The _find_study_by_redcap_project lookup (§6 #24) is index-backed via a partial functional index. Verify:
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'app' AND indexname = 'ix_studies_redcap_project_id';
Expected: the returned indexdef contains WHERE ((integration_mode)::text = 'redcap'::text). Asserted by tests/test_redcap_project_id_lookup.py::TestProjectIdIndexExists.
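3.3 Audit-log hash chain
Before going live, run the audit-integrity check (python server/audit_integrity.py, the same script named in Go/No-Go row 14).
Expected output: chain valid through the most recent row. Any tampering would print the offending row's id and previous_hash mismatch.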
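To make the check concrete, here is a minimal sketch of how a hash chain detects tampering. It does not reproduce audit_integrity.py: the row layout, the SHA-256 construction, and the all-zero genesis value are assumptions chosen for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional

GENESIS = "0" * 64  # assumed previous_hash for the first row


@dataclass
class AuditRow:
    id: int
    payload: str
    previous_hash: str
    row_hash: str


def compute_hash(previous_hash: str, payload: str) -> str:
    """Hash of this row's payload bound to the hash of the row before it."""
    return hashlib.sha256(f"{previous_hash}|{payload}".encode("utf-8")).hexdigest()


def append_row(chain: List[AuditRow], payload: str) -> AuditRow:
    """Append a row whose previous_hash points at the current tail."""
    prev = chain[-1].row_hash if chain else GENESIS
    row = AuditRow(len(chain) + 1, payload, prev, compute_hash(prev, payload))
    chain.append(row)
    return row


def first_broken_row(chain: List[AuditRow]) -> Optional[int]:
    """Return the id of the first row that fails verification, or None."""
    expected_prev = GENESIS
    for row in chain:
        link_ok = row.previous_hash == expected_prev
        hash_ok = row.row_hash == compute_hash(row.previous_hash, row.payload)
        if not (link_ok and hash_ok):
            return row.id
        expected_prev = row.row_hash
    return None
```

Because each hash covers the previous one, editing any historical row invalidates that row's own hash, so verification stops at the edited row.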
4. REDCap Project Configuration
4.1 Token + webhook secret
Store both via the portal's REDCap config screen — never directly in study.config:
- The portal calls routers/redcap.py::update_redcap_config which encrypts via encrypt_token() (§6 #2).
- Webhook secret takes the same path via update_webhook_config (§6 #19).
After saving, verify the at-rest format is enc:v2:…:
psql -c "SELECT id, code,
(config->'redcap'->>'api_token') LIKE 'enc:v2:%' AS api_v2,
(config->'redcap'->>'webhook_secret') LIKE 'enc:v2:%' AS secret_v2
FROM app.studies WHERE integration_mode = 'redcap';"
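Expected: t in both columns for the sponsor study row.
If either is f, run the migration script (python server/migrate_redcap_tokens.py; dry-run first, then --apply as in §9.3).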
4.2 DET webhook URL
In REDCap → Project Setup → Additional customizations → Data Entry Trigger, enter the production DET webhook URL.
Confirm ENVIRONMENT=production. The /api/webhooks/redcap/det/test endpoint returns 404 in production (§6 #16, #25), so REDCap's built-in DET test feature will fail. This is correct and expected; use a live form save for end-to-end verification.
4.3 Event-instrument mapping sync
Pre-cutover, run an event sync to populate VisitWindow rows from REDCap:
# via REDCap admin API in the portal, or programmatically
POST /api/studies/{study_id}/redcap/sync-events
Expected: success=true, created>0 on first run; subsequent runs return created=0, updated=N (idempotency, §6 #21).
4.4 Field mappings
Field-level mappings (participant_field_mapping, redcap_field_mapping) are study-specific. Two non-negotiable items:
- record_id_field set to participant_code or external_id (must match the column REDCap uses for the record ID).
- anchor_date_field set to whatever REDCap field carries the enrollment/baseline date for this protocol.
If participant_field_mapping is missing, the DET sync uses the default mapping in redcap_det_sync._get_default_field_mapping (§6 #18) — confirm that default is appropriate for the sponsor study or override.
5. Metricis Study Configuration
5.1 Integration mode + consent gate
In the portal's Configure → Study Settings:
- integration_mode = "redcap" (NOT inferred from redcap_enabled; set explicitly per §6 #15).
- constraints.consent_mode chosen explicitly: digital, manual, waived, or implied.
- constraints.require_consent_for_scheduling = true (default).
- constraints.require_consent_for_messaging = true (default).
Cross-check via SQL:
SELECT code, integration_mode, redcap_enabled,
config->'constraints'->>'consent_mode' AS consent_mode
FROM app.studies WHERE code = '<sponsor study code>';
5.2 Per-study admin / owner
The seven high-risk REDCap endpoints (including token rotation, init, dict push, form delete, participant import, and webhook secret update; see §6 #16) require UserStudy.role ∈ {admin, owner} for the specific study. Global User.role = "admin" is not sufficient.
Verify:
SELECT u.email, us.role
FROM app.user_studies us
JOIN app.users u ON u.id = us.user_id
JOIN app.studies s ON s.id = us.study_id
WHERE s.code = '<sponsor study code>' AND us.role IN ('admin', 'owner');
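The per-study rule can be sketched as a predicate that deliberately ignores the global role. The function name and the plain-string roles are hypothetical; the real check lives in the endpoints covered by §6 #16.

```python
from typing import Optional

PRIVILEGED_STUDY_ROLES = frozenset({"admin", "owner"})


def may_call_high_risk_endpoint(global_role: str, study_role: Optional[str]) -> bool:
    """Authorize on the UserStudy role for this study only.

    global_role is accepted to make the point explicit: it plays no part
    in the decision, so a global admin without a per-study role is refused.
    """
    return study_role in PRIVILEGED_STUDY_ROLES
```

An operator who appears in the query above passes this check; a global admin who does not appear in it is refused.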
5.3 Anchor date policy
Configure via portal → Configure → Enrollment Date Policy. Confirm:
- sources ordered by priority (typically consent_workflow → manual_entry for REDCap-managed studies relying on REDCap's enrollment instrument).
- re_anchoring.completed_visit_handling = "flag_for_review" (default; preserves history per §6 #8).
- permissions.can_override restricted appropriately.
Anchor-shift reconciliation is asserted by 12 tests in tests/test_anchor_shift_reconciliation.py.
5.4 Sites
Each REDCap DAG used by the project must map to a Metricis Site row.
The DET sync stores DAG provenance in participant.extra_data['redcap_dag'] (§6 #18); coordinator filtering uses the Metricis site row.
6. Pre-Cutover Smoke Tests
Run these in order against a staging deployment that mirrors production env vars. Each one must pass before the next begins.
6.1 Compliance invariants (must never fail)
Run pytest tests/test_compliance_invariants.py. The suite includes consent gate, REDCap sync no-fallback, integration_mode consistency, time-simulation gate, dev-mode-disabled-in-production, and audit-log invariants.
6.2 Production gating
Run tests/test_production_gating.py. Its 25 tests confirm that every /api/dev/*, /api/testing/*, and /api/webhooks/redcap/det/test endpoint returns 404 in production. Adding a new dev/test endpoint requires updating DEV_TEST_ROUTES in the test file, which is the single point of update.
6.3 REDCap-managed lifecycle E2E
10 tests across 5 stages: event sync → visit scheduling → portal delivery → completion sync (success + failure) → portal stale-data signalling. REDCap I/O is mocked, so this runs offline.
6.4 Live DET webhook end-to-end (one-shot)
In a staging study mapped to a staging REDCap project:
1. Configure the DET URL in REDCap to point at staging.
2. Save a record in REDCap that triggers the configured enrollment_instrument.
3. Verify within ~30s:
SELECT id, source, project_id, record_id, status, processed_at
FROM app.webhook_events ORDER BY received_at DESC LIMIT 1;
Expected: status='processed', processed_at populated.
4. Verify the participant landed:
SELECT participant_code, status, enrollment_date FROM app.participants
WHERE external_id = '<the REDCap record id>';
5. Verify a ScheduleVersion was created (§6 #18):
SELECT version_number, status, is_current, anchor_date_used FROM app.schedule_versions
WHERE participant_id = '<participant uuid>';
Expected: version_number=1, status='active', is_current=true.
6.5 Replay protection
Re-fire the same webhook payload. Expected: a second webhook_events row with status='duplicate' and duplicate_of_id pointing at the original. No second Participant/ScheduleVersion created (§6 #17).
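The expected behaviour can be sketched with an in-memory ledger. This is a simplified model under stated assumptions: the idempotency key is taken to be a SHA-256 of the canonical JSON payload, and the class and field names are hypothetical stand-ins for the webhook_events table.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class WebhookEvent:
    id: int
    key: str
    received_at: float
    status: str
    duplicate_of_id: Optional[int] = None


class WebhookLedger:
    """Every delivery gets a row; only the first one inside the TTL is processed."""

    def __init__(self, ttl_seconds: float = 86400.0) -> None:
        self.ttl_seconds = ttl_seconds
        self.events: List[WebhookEvent] = []

    @staticmethod
    def idempotency_key(payload: Dict[str, str]) -> str:
        # Sorting keys makes the hash independent of field order.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def record(self, payload: Dict[str, str], now: float) -> WebhookEvent:
        key = self.idempotency_key(payload)
        original = next(
            (e for e in self.events
             if e.key == key and e.status == "processed"
             and now - e.received_at < self.ttl_seconds),
            None,
        )
        event = WebhookEvent(
            id=len(self.events) + 1,
            key=key,
            received_at=now,
            status="duplicate" if original else "processed",
            duplicate_of_id=original.id if original else None,
        )
        self.events.append(event)
        return event
```

Only events with status "processed" would go on to create a Participant or ScheduleVersion, which is why the replay leaves those tables unchanged. Shortening the TTL, as REDCAP_DET_IDEMPOTENCY_TTL_SECONDS allows, narrows the window in which a legitimate re-save is treated as a duplicate.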
6.6 Failure-path smoke
Misconfigure the REDCap URL on the staging study so the API call will fail. Submit a Metricis cognitive-assessment session for a participant and watch for:
- Session.sync_status flips to failed.
- CRITICAL log emitted with source: "redcap_sync_failure", requires_investigation: true, fallback_used: false (§6 #14).
- Patient-portal /api/portal/data-status returns has_failed_sync: true.
- After ≥5 consecutive failures (default REDCAP_CIRCUIT_BREAKER_FAILURE_THRESHOLD), subsequent submissions short-circuit with error_type='circuit_open' and PyCap is not called (§6 #23). Verify by counting httpx/requests mocks or by watching log frequency.
Restore the URL, wait REDCAP_CIRCUIT_BREAKER_COOLDOWN_SECONDS, submit again — the half-open trial succeeds, breaker closes, sync resumes.
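The open, half-open, and closed transitions exercised above can be sketched as a small state machine. This is an illustrative model, not the code in app.services.redcap_circuit_breaker; the class shape is an assumption, and the defaults mirror the tunables in §2.3.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CircuitBreaker:
    failure_threshold: int = 5      # REDCAP_CIRCUIT_BREAKER_FAILURE_THRESHOLD
    cooldown_seconds: float = 60.0  # REDCAP_CIRCUIT_BREAKER_COOLDOWN_SECONDS
    consecutive_failures: int = 0
    opened_at: Optional[float] = None

    def allow_request(self, now: float) -> bool:
        """Closed: always allow. Open: allow a trial only after the cooldown."""
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        """A successful call (including a half-open trial) closes the breaker."""
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self, now: float) -> None:
        """Count the failure; at or past the threshold, (re)open and restart the cooldown."""
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = now
```

A caller that gets False from `allow_request` would return error_type='circuit_open' without contacting REDCap. The sketch lets every request through once the cooldown has elapsed until a result is recorded; a production breaker would typically admit a single trial.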
7. Cutover Sequence
Recommended order. Each step assumes the previous one succeeded.
- Freeze writes on the staging study so QA state doesn't drift into prod.
- Apply migrations on the production DB: alembic upgrade head. Verify head matches §3.1.
- Set production env vars including REDCAP_ENCRYPTION_KEY. Restart server + Celery workers + Celery beat.
- Run Go/No-Go matrix (§1). Stop on the first ❌.
- Configure the production REDCap project: token, webhook secret, DET URL pointing at production.
- Sync events: POST /api/studies/{id}/redcap/sync-events. Confirm VisitWindow rows.
- Provision per-study admin (§5.2).
- Smoke a single participant end-to-end via DET (§6.4) using a real-but-non-PHI test record in REDCap.
- Enable the study for live enrolment (status flip draft → active if not already).
- Notify the coordinator that the study is open.
Total elapsed time on a healthy environment: ~30 minutes once §1 is green.
8. Post-Cutover Monitoring (first 7 days)
Watch these signals daily for the first week, then weekly:
| Signal | Where | Action when seen |
|---|---|---|
| Session.sync_status='failed' count > 0 | Coordinator dashboard "Failed Sync" tile (§6 #14) | Open a ticket; investigate the CRITICAL log; never auto-retry without root-cause |
| webhook_events.status='failed' | SELECT * FROM app.webhook_events WHERE status='failed' ORDER BY received_at DESC | Check error_message; manual replay only after fix |
| webhook_events.status='duplicate' count growing fast | Same query | Expected on REDCap retries; if growth is non-linear, check upstream behaviour |
| Portal stale-data banner persistent | /api/portal/data-status has_failed_sync=true | Tied to the count of failed sessions; clearing requires successful re-sync |
| Circuit breaker open | Log message REDCap circuit breaker opened for study <id> | Investigate REDCap reachability; breaker will half-open after cooldown automatically |
| Audit-log chain break | Nightly CI audit_integrity.py failure | Stop. Hash chain breaks are tamper signals; escalate to security |
| Time-simulation offset persisted | SELECT test_mode_config->>'time_simulation_offset_days' FROM app.studies WHERE … | Should be null/0 in production. get_effective_date ignores it (§6 #1) but a stale offset is still a smell |
9. Failure-Mode Runbook
9.1 Sync failed: sync_status='failed'
Symptom. Coordinator dashboard shows the "Failed Sync" tile with count > 0. Patient-portal banner reads "Your responses were saved on this device. The study team has been notified."
Triage.
1. Find the failed session:
SELECT id, participant_id, completed_at, sync_status FROM app.sessions
WHERE sync_status = 'failed' ORDER BY completed_at DESC;
2. Locate the matching CRITICAL log entry (source: "redcap_sync_failure"). Read error_type + error_message.
3. Investigate per error type:
- api_error → REDCap unreachable or returning 5xx. Check REDCap status page; check token validity.
- validation → REDCap rejected the payload. Read the error message; fix the field mapping or the source data.
- exception → Metricis-side bug. Capture the full traceback, file an issue.
- circuit_open → upstream is currently unreachable enough to trip the breaker. Wait for cooldown; investigate root cause.
4. Once the root cause is fixed, re-sync via POST /api/studies/{id}/sync for the affected sessions. The re-sync goes through the same canonical pipeline.
Do not. Disable the no-fallback invariant, compute a Metricis-side substitute, or auto-retry without operator action. The §6 #14 invariant test refuses any service that branches on sync_status for a fallback decision; bypassing it requires an explicit allowlist edit, which CODEOWNERS will flag.
9.2 Circuit breaker open
Symptom. Logs report REDCap circuit breaker opened for study <id>; subsequent sync attempts return success=false without touching PyCap.
Triage.
1. Confirm REDCap reachability: curl -I <redcap_url>.
2. If REDCap is down, wait for it to recover; the breaker will half-open after REDCAP_CIRCUIT_BREAKER_COOLDOWN_SECONDS and probe automatically.
3. If REDCap is up but Metricis can't reach it (DNS, firewall, cert), fix the network.
4. To force a manual reset (rarely needed):
from app.services.redcap_circuit_breaker import circuit_breaker
await circuit_breaker.reset(study_id)
9.3 Webhook secret rotation
Use cases: routine rotation, suspected secret leak, change of REDCap admin.
- Generate a new secret: python -c "import secrets; print(secrets.token_urlsafe(32))".
- Update REDCap project's DET secret first (REDCap sees it immediately).
- Update Metricis: portal → REDCap config → "Rotate webhook secret" (calls update_webhook_config, which encrypts via encrypt_token, §6 #19).
- Confirm by firing a test DET in REDCap (admin tools); Metricis should accept the signature. Wrong-secret signatures return 401 (asserted by tests/test_webhook_secret_encryption.py).
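The accept/reject behaviour around a rotation can be sketched as follows. The signature scheme is an assumption: this document does not state how the DET request is signed, so the example uses HMAC-SHA256 over the raw request body, and both function names are hypothetical.

```python
import hashlib
import hmac


def sign_body(secret: str, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (assumed scheme, for illustration)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def webhook_status(current_secret: str, body: bytes, presented_signature: str) -> int:
    """Return 200 for a signature made with the current secret, else 401."""
    expected = sign_body(current_secret, body)
    # compare_digest avoids leaking how many leading characters matched.
    matches = hmac.compare_digest(
        expected.encode("utf-8"), presented_signature.encode("utf-8")
    )
    return 200 if matches else 401
```

Once Metricis holds the new secret, a request signed with the old one maps to 401 here, which is the signal to look for if the two sides fall out of step during the rotation.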
If a key rotation is needed (REDCAP_ENCRYPTION_KEY itself), follow the dual-key bridge:
1. Set REDCAP_ENCRYPTION_KEY_PREVIOUS=<current key>.
2. Set REDCAP_ENCRYPTION_KEY=<new key>. Restart.
3. Run python server/migrate_redcap_tokens.py --apply to re-encrypt every token onto the new key.
4. Unset REDCAP_ENCRYPTION_KEY_PREVIOUS. Restart.
The MultiFernet decrypt chain handles the bridge window (§6 #2, #7).
9.4 Anchor date shift mid-study
The coordinator changes a participant's enrollment date after some visits have completed. Expected behaviour (§6 #8):
- Old ScheduleVersion is marked superseded; old completed visits stay on it.
- Old completed visits get anchor_reconciled=true + original_target_date snapshot; their scheduled_date and actual_visit_date are NOT rewritten.
- Old pending visits become status='cancelled' (soft-deleted, audit-preserved).
- New ScheduleVersion v2 is created; new visits use the new anchor date.
- An AuditLog row records the reconciled count and old/new anchor dates.
Operator action: confirm the audit-log row exists and the participant's portal schedule reflects the new anchor. No manual cleanup needed.
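The reconciliation rules can be sketched with plain dataclasses. This is a simplified model, not the service code: the class shapes, the `shift_anchor` name, the "anchor_shift" action label, and the choice to derive original_target_date from the old anchor plus a per-visit day offset are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Optional, Tuple


@dataclass
class Visit:
    offset_days: int
    scheduled_date: date
    status: str  # "pending", "completed" or "cancelled"
    anchor_reconciled: bool = False
    original_target_date: Optional[date] = None


@dataclass
class ScheduleVersion:
    version_number: int
    anchor_date: date
    status: str = "active"
    is_current: bool = True
    visits: List[Visit] = field(default_factory=list)


def shift_anchor(old: ScheduleVersion, new_anchor: date) -> Tuple[ScheduleVersion, Dict[str, object]]:
    """Supersede `old`, keep completed history, rebuild pending visits on the new anchor."""
    reconciled = 0
    pending_offsets: List[int] = []
    for visit in old.visits:
        if visit.status == "completed":
            # Snapshot the old target; scheduled_date is left untouched.
            visit.anchor_reconciled = True
            visit.original_target_date = old.anchor_date + timedelta(days=visit.offset_days)
            reconciled += 1
        elif visit.status == "pending":
            visit.status = "cancelled"  # soft delete, row is kept
            pending_offsets.append(visit.offset_days)
    old.status = "superseded"
    old.is_current = False
    new = ScheduleVersion(
        version_number=old.version_number + 1,
        anchor_date=new_anchor,
        visits=[Visit(o, new_anchor + timedelta(days=o), "pending") for o in pending_offsets],
    )
    audit = {
        "action": "anchor_shift",  # hypothetical action name
        "reconciled_count": reconciled,
        "old_anchor": old.anchor_date.isoformat(),
        "new_anchor": new_anchor.isoformat(),
    }
    return new, audit
```

The returned audit dictionary stands in for the AuditLog row that the operator is asked to confirm.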
9.5 Stuck webhook_events.status='processing'
A row left in processing means _process_det_webhook started but didn't finish — likely a process restart mid-flight.
SELECT id, source, project_id, record_id, status, received_at, retry_count
FROM app.webhook_events WHERE status = 'processing' AND received_at < NOW() - interval '15 minutes';
Treat as a retry candidate: hand back to a coordinator-facing retry endpoint (currently a future enhancement; for now, a manual replay via the REDCap admin tools is the approved workaround). Do not auto-retry from a worker.
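For reviewing an exported list of events outside the database, the same 15-minute rule can be applied in Python. This is a sketch under the assumption that each event is a dict with status and a timezone-aware received_at; the function name is hypothetical. It only reports candidates and never retries them.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stuck_events(
    events: List[Dict[str, object]],
    now: datetime,
    max_age: timedelta = timedelta(minutes=15),
) -> List[Dict[str, object]]:
    """Events still in 'processing' whose received_at is older than max_age."""
    return [
        e for e in events
        if e["status"] == "processing" and now - e["received_at"] > max_age
    ]
```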
10. Rollback
The cutover is mostly reversible if caught early.
| Action | Reversible by |
|---|---|
| alembic upgrade head | alembic downgrade -1 (per migration); test on staging first |
| Token encryption rotation | REDCAP_ENCRYPTION_KEY_PREVIOUS bridge restores prior decrypt path |
| Webhook secret rotation | Re-enter the previous secret in REDCap + Metricis |
| Study status='active' | Revert to paused via portal; cleanly halts new enrolments without affecting existing participants |
| Bad participant created via DET | Participant.status='withdrawn' + audit log explanation; do NOT delete (audit invariants) |
A full DB rollback (point-in-time restore) is not an M7-scoped capability — that's a deployment-infra concern. Coordinate with the DB admin if this is needed.
11. Sign-off
When every Go/No-Go row is ✅, the engineer on standby and the study admin should each record a sign-off in the project's runbook log (or as an entry in the audit log via AuditLog(action="cutover_signoff")).
After sign-off, M7 ships and routine ops begin. Subsequent sponsor studies follow this same checklist; if any row needs to change for a different sponsor, update the checklist itself, not the per-study workflow.