Sponsor-Study Cutover Checklist (M7)

Purpose. Every M7 §6 fix codifies a guarantee the platform makes about a REDCap-managed study running in production. This checklist turns those guarantees into operator-verifiable steps so the first sponsor cutover is a sequence of green checks, not a series of improvisations.

Audience. Coordinator + study admin doing the cutover, with one engineer on standby. Each item names the file, env var, test, or audit-log entry that proves the check passes.

Scope. REDCap-managed studies (Study.integration_mode = "redcap") only. Metricis-managed studies are M9 territory and are not covered here.

Cross-references.

  • Engineering log: docs/project-plan.md §6. Every check below points back to the numbered §6 entry that built the guarantee.
  • Generic deployment: docs/guides/deployment.md (server/portal infra, nginx, docker, TLS). This document layers on top of that.
  • Architecture context: CLAUDE.md "Key Concepts" (integration mode, unified scheduler, consent gate, REDCap fail-safe policy).


1. Go / No-Go Matrix

Every row must be ✅ before the study can take live participants. A single ❌ blocks cutover; flag the row to the engineer on standby.

| # | Check | Verify by | §6 ref |
|---|-------|-----------|--------|
| 1 | ENVIRONMENT=production everywhere it matters | printenv ENVIRONMENT on every app/worker container; healthcheck /api/health returns environment field | #1, #3 |
| 2 | REDCAP_ENCRYPTION_KEY set, distinct from JWT_SECRET_KEY | python -c "from app.config import get_settings; print(bool(get_settings().redcap_encryption_key))" prints True; verify env var ≠ JWT secret | #2 |
| 3 | All REDCap tokens stored as enc:v2: ciphertext | python server/migrate_redcap_tokens.py (dry-run) reports zero rotations needed | #2, #7, #19 |
| 4 | Webhook secret encrypted at rest | Same migration script reports zero rotations on webhook_secret field | #19 |
| 5 | Study.integration_mode = "redcap" matches redcap_enabled = True | pytest tests/test_compliance_invariants.py::TestIntegrationModeRedcapEnabledConsistency green | #15 |
| 6 | Time simulation OFF | pytest tests/test_compliance_invariants.py::TestTimeSimulationProductionGate green; Study.test_mode_config.time_simulation_offset_days is 0 or null in DB | #1 |
| 7 | Dev/test routers gated | tests/test_production_gating.py (25 tests) green in production env | #3 |
| 8 | DET webhook idempotent | tests/test_webhook_idempotency.py (7 tests) green | #17 |
| 9 | DET webhook signed by webhook_secret | A test DET fire with wrong signature returns 401; with the correct one, 200 | #17, #19 |
| 10 | Sync no-fallback invariant | tests/test_compliance_invariants.py::TestREDCapSyncNoFallbackInvariant green | #14 |
| 11 | Per-study admin assigned for the sponsor study | A UserStudy row with role "admin" or "owner" exists for at least one operator on this study | #16 |
| 12 | Coordinator dashboard "Failed Sync" tile renders | Visit dashboard while a synthetic failed-sync session exists; tile shows count > 0 | #14 |
| 13 | Patient-portal stale-data banner renders | While a synthetic failed sync exists, hit /api/portal/data-status for an authenticated participant; has_failed_sync: true | #4, #14 |
| 14 | Audit-log hash chain valid | Run nightly verification (python server/audit_integrity.py or wait for nightly CI) | (audit infrastructure) |
| 15 | All M7 regression suites green | pytest tests/test_redcap_*.py tests/test_anchor_shift_reconciliation.py tests/test_schedule_versioning_atomicity.py tests/test_portal_redcap_ingestion.py tests/test_compliance_invariants.py | M7 §6 |

The 9 pre-existing failures documented in the project plan (clock-skew flake, M9-deferred randomization fixtures, etc.) do not block cutover; everything else must pass.


2. Environment Prerequisites

2.1 Required env vars (server + workers)

ENVIRONMENT=production                           # gates §6 #1, #3, #16, #17
DATABASE_URL=postgresql+asyncpg://…               # production DB, not dev
REDIS_URL=redis://…                               # required for session storage in prod
SESSION_STORAGE_BACKEND=redis                     # `memory` is dev-only
JWT_SECRET_KEY=                                  # generate fresh; do NOT reuse
SESSION_SECRET_KEY=                              # generate fresh; do NOT reuse
REDCAP_ENCRYPTION_KEY=                           # 32+ urlsafe-base64 chars; distinct from JWT
# REDCAP_ENCRYPTION_KEY_PREVIOUS=…                # optional; set during a rotation window
ALLOWED_ORIGINS=https://app.metricis.app,…        # tighten for production
RATE_LIMIT_PER_MINUTE=60
AUTH_RATE_LIMIT_PER_MINUTE=10

Generate fresh secrets:

python -c "import secrets; print(secrets.token_urlsafe(32))"
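A pre-flight check along the following lines can catch a missing or reused secret before the restart. This is a minimal sketch: preflight_env and its messages are hypothetical names, not part of the Metricis codebase, and it only inspects the variables listed in §2.1.

```python
from typing import Mapping

# Variable names come from §2.1; the helper itself is a hypothetical illustration.
REQUIRED_VARS = (
    "ENVIRONMENT",
    "DATABASE_URL",
    "REDIS_URL",
    "SESSION_STORAGE_BACKEND",
    "JWT_SECRET_KEY",
    "SESSION_SECRET_KEY",
    "REDCAP_ENCRYPTION_KEY",
)


def preflight_env(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the check passes."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    if env.get("ENVIRONMENT") not in (None, "", "production"):
        problems.append("ENVIRONMENT must be 'production'")
    if env.get("SESSION_STORAGE_BACKEND") == "memory":
        problems.append("SESSION_STORAGE_BACKEND=memory is dev-only")
    # The REDCap encryption key must not reuse the JWT secret.
    key = env.get("REDCAP_ENCRYPTION_KEY")
    if key and key == env.get("JWT_SECRET_KEY"):
        problems.append("REDCAP_ENCRYPTION_KEY must differ from JWT_SECRET_KEY")
    return problems
```

Calling preflight_env(os.environ) inside each app and worker container would give one list of problems per container; any non-empty list maps to a ❌ on rows 1 or 2 of the matrix.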

2.2 Forbidden env vars in production

These must NOT be set (or must be false/0):

DEV_MODE                                          # bypasses every dev-mode gate (§6 #1, #3)
REDCAP_ENCRYPTION_ALLOW_PLAINTEXT_READS           # one-time migration bridge only (§6 #2)

If DEV_MODE=true slips through, every /api/dev/* and /api/testing/* route becomes reachable. The router-level production gate returns 404 only when ENVIRONMENT=production AND DEV_MODE is unset/falsy (§6 #3).
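The gate condition can be written down as a small predicate. The sketch below is an illustration only: dev_route_status is a hypothetical helper, and the set of spellings treated as truthy is an assumption rather than the application's actual parsing.

```python
from typing import Mapping

# Assumed truthy spellings for DEV_MODE; the real parser may differ.
_TRUTHY = {"1", "true", "yes", "on"}


def dev_route_status(env: Mapping[str, str]) -> int:
    """Status a /api/dev/* or /api/testing/* route would answer with.

    404 only when ENVIRONMENT=production AND DEV_MODE is unset/falsy;
    200 in every other combination.
    """
    in_production = env.get("ENVIRONMENT", "").strip().lower() == "production"
    dev_mode = env.get("DEV_MODE", "").strip().lower() in _TRUTHY
    return 404 if in_production and not dev_mode else 200
```

The third case in the truth table is the dangerous one: ENVIRONMENT=production combined with DEV_MODE=true yields 200, which is why DEV_MODE is on the forbidden list.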

2.3 Optional tunables

REDCAP_DET_RATE_LIMIT_PER_MINUTE=120              # default 120; raise if a busy project is dropping events
REDCAP_DET_TEST_RATE_LIMIT_PER_MINUTE=10          # never reachable in prod (404), but still rate-limited
REDCAP_DET_IDEMPOTENCY_TTL_SECONDS=86400          # default 24h; shorten if legitimate re-saves are deduped
REDCAP_CIRCUIT_BREAKER_FAILURE_THRESHOLD=5        # default 5 consecutive failures before open
REDCAP_CIRCUIT_BREAKER_COOLDOWN_SECONDS=60        # default 60s; raise if upstream takes longer to stabilize
REDCAP_CIRCUIT_BREAKER_ENABLED=true               # set to false ONLY for explicit operator override

Keep defaults unless there's evidence to change them.


3. Database Prerequisites

3.1 Migration head

cd server && source .venv/bin/activate
alembic current      # report
alembic upgrade head # apply

The required head as of M7 closeout: f0a1b2c3d4e5_add_redcap_project_id_index. Earlier heads relevant to M7:

  • d2e3f4a5b6c7_add_webhook_events: DET idempotency table (§6 #17)
  • e3f4a5b6c7d8_add_survey_task_fields: patient-portal survey task (§6 #4, #20)

3.2 Index verification

The _find_study_by_redcap_project lookup (§6 #24) is index-backed via a partial functional index. Verify:

SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'app' AND indexname = 'ix_studies_redcap_project_id';
Expected: one row, with the predicate WHERE ((integration_mode)::text = 'redcap'::text) in indexdef. Asserted by tests/test_redcap_project_id_lookup.py::TestProjectIdIndexExists.

3.3 Audit-log hash chain

Before going live, run:

python server/audit_integrity.py
Expected output: chain valid through the most recent row. Any tampering would print the offending row's id and previous_hash mismatch.
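For readers unfamiliar with hash chains, the verification idea can be sketched as follows. The hashing scheme here (SHA-256 over the previous hash concatenated with a canonical JSON payload, empty string as genesis) and the names AuditRow, compute_hash, and first_broken_row are assumptions for illustration; audit_integrity.py may use a different construction.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuditRow:
    id: int
    payload: dict
    previous_hash: str
    row_hash: str


def compute_hash(previous_hash: str, payload: dict) -> str:
    """Hash the previous row's hash together with a canonical payload encoding."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((previous_hash + body).encode("utf-8")).hexdigest()


def first_broken_row(rows: list[AuditRow]) -> Optional[int]:
    """Return the id of the first row that breaks the chain, or None if valid."""
    expected_previous = ""  # assumed genesis value
    for row in rows:
        if row.previous_hash != expected_previous:
            return row.id  # link to the preceding row is broken
        if row.row_hash != compute_hash(row.previous_hash, row.payload):
            return row.id  # row content no longer matches its stored hash
        expected_previous = row.row_hash
    return None
```

Because each hash depends on the one before it, editing any historical row invalidates that row and cannot be hidden without rewriting every later row as well.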


4. REDCap Project Configuration

4.1 Token + webhook secret

Store both via the portal's REDCap config screen, never directly in study.config:

  • The portal calls routers/redcap.py::update_redcap_config, which encrypts via encrypt_token() (§6 #2).
  • The webhook secret takes the same path via update_webhook_config (§6 #19).

After saving, verify the at-rest format is enc:v2:…:

psql -c "SELECT id, code,
  (config->'redcap'->>'api_token') LIKE 'enc:v2:%' AS api_v2,
  (config->'redcap'->>'webhook_secret') LIKE 'enc:v2:%' AS secret_v2
FROM app.studies WHERE integration_mode = 'redcap';"
Expected: both columns t for the sponsor study row.

If either is f, run the migration script:

python server/migrate_redcap_tokens.py --apply

4.2 DET webhook URL

In REDCap → Project Setup → Additional customizations → Data Entry Trigger:

URL: https://api.metricis.app/api/webhooks/redcap/det

Confirm ENVIRONMENT=production. The /api/webhooks/redcap/det/test endpoint returns 404 in production (§6 #16, #25) — REDCap's built-in DET test feature will fail, which is correct and expected; switch to live form save for end-to-end verification.

4.3 Event-instrument mapping sync

Pre-cutover, run an event sync to populate VisitWindow rows from REDCap:

# via REDCap admin API in the portal, or programmatically
POST /api/studies/{study_id}/redcap/sync-events
Expected: success=true, created>0 on first run; subsequent runs return created=0, updated=N (idempotency, §6 #21).
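The created/updated pattern in the expected output is what an upsert keyed on a stable identifier produces. The sketch below illustrates that behaviour with an in-memory dict; sync_events and SyncResult are hypothetical names, and keying on REDCap's unique_event_name is an assumption about how the real sync matches rows.

```python
from dataclasses import dataclass


@dataclass
class SyncResult:
    success: bool
    created: int
    updated: int


def sync_events(store: dict[str, dict], redcap_events: list[dict]) -> SyncResult:
    """Upsert events into `store`, counting inserts and updates separately."""
    created = updated = 0
    for event in redcap_events:
        key = event["unique_event_name"]  # assumed stable key
        if key in store:
            store[key].update(event)  # existing row: refresh fields in place
            updated += 1
        else:
            store[key] = dict(event)  # new row
            created += 1
    return SyncResult(success=True, created=created, updated=updated)
```

Running it twice with the same input never duplicates a row: the second call reports created=0 and updated equal to the number of events.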

4.4 Field mappings

Field-level mappings (participant_field_mapping, redcap_field_mapping) are study-specific. Two non-negotiable items:

  • record_id_field set to participant_code or external_id (must match the column REDCap uses for the record ID).
  • anchor_date_field set to whatever REDCap field carries the enrollment/baseline date for this protocol.

If participant_field_mapping is missing, the DET sync uses the default mapping in redcap_det_sync._get_default_field_mapping (§6 #18) — confirm that default is appropriate for the sponsor study or override.


5. Metricis Study Configuration

In the portal's Configure → Study Settings:

  • integration_mode = "redcap" (NOT inferred from redcap_enabled; set explicitly per §6 #15).
  • constraints.consent_mode chosen explicitly: digital, manual, waived, or implied.
  • constraints.require_consent_for_scheduling = true (default).
  • constraints.require_consent_for_messaging = true (default).

Cross-check via SQL:

SELECT code, integration_mode, redcap_enabled,
       config->'constraints'->>'consent_mode' AS consent_mode
FROM app.studies WHERE code = '<sponsor study code>';

5.2 Per-study admin / owner

The seven high-risk REDCap endpoints (including token rotation, init, dict push, form delete, participant import, and webhook secret update; see §6 #16 for the full list) require UserStudy.role ∈ {admin, owner} for the specific study. Global User.role = "admin" is not sufficient.

Verify:

SELECT u.email, us.role
FROM app.user_studies us
JOIN app.users u ON u.id = us.user_id
JOIN app.studies s ON s.id = us.study_id
WHERE s.code = '<sponsor study code>' AND us.role IN ('admin', 'owner');
Expected: at least one row, ideally two for redundancy.
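The rule that a global admin role is not enough can be illustrated with a short authorization check. This is a sketch under assumptions: can_manage_redcap and the simplified UserStudy dataclass are hypothetical stand-ins for whatever dependency the routers actually use.

```python
from dataclasses import dataclass

PRIVILEGED_ROLES = frozenset({"admin", "owner"})


@dataclass(frozen=True)
class UserStudy:
    user_id: str
    study_id: str
    role: str


def can_manage_redcap(
    user_id: str,
    study_id: str,
    memberships: list[UserStudy],
    global_role: str = "user",
) -> bool:
    """True only if the user holds admin/owner on THIS study.

    global_role is accepted but deliberately ignored, mirroring the rule
    that User.role = "admin" alone does not grant access.
    """
    return any(
        m.user_id == user_id and m.study_id == study_id and m.role in PRIVILEGED_ROLES
        for m in memberships
    )
```

A global admin with no UserStudy row for the sponsor study is rejected, which is the case step 7 of the cutover sequence exists to prevent.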

5.3 Anchor date policy

Configure via portal → Configure → Enrollment Date Policy. Confirm:

  • sources ordered by priority (typically consent_workflow, then manual_entry, for REDCap-managed studies relying on REDCap's enrollment instrument).
  • re_anchoring.completed_visit_handling = "flag_for_review" (default; preserves history per §6 #8).
  • permissions.can_override restricted appropriately.

Anchor-shift reconciliation is asserted by 12 tests in tests/test_anchor_shift_reconciliation.py.

5.4 Sites

Each REDCap DAG used by the project must map to a Metricis Site row:

SELECT code, name FROM app.sites WHERE study_id = '<sponsor study uuid>';

The DET sync stores DAG provenance in participant.extra_data['redcap_dag'] (§6 #18); coordinator filtering uses the Metricis site row.


6. Pre-Cutover Smoke Tests

Run these in order against a staging deployment that mirrors production env vars. Each one must pass before the next begins.

6.1 Compliance invariants (must never fail)

cd server && source .venv/bin/activate
pytest tests/test_compliance_invariants.py -v -m invariant
Includes consent gate, REDCap sync no-fallback, integration_mode consistency, time-simulation gate, dev-mode-disabled-in-production, and audit-log invariants.

6.2 Production gating

ENVIRONMENT=production pytest tests/test_production_gating.py -v
25 tests confirm that every /api/dev/*, /api/testing/*, and /api/webhooks/redcap/det/test endpoint returns 404 in production. Adding a new dev/test endpoint requires updating DEV_TEST_ROUTES in the test file, which is the single point of update.

6.3 REDCap-managed lifecycle E2E

pytest tests/test_redcap_managed_lifecycle.py -v
10 tests across 5 stages: event sync → visit scheduling → portal delivery → completion sync (success + failure) → portal stale-data signalling. REDCap I/O is mocked, so this runs offline.

6.4 Live DET webhook end-to-end (one-shot)

In a staging study mapped to a staging REDCap project:

  1. Configure the DET URL in REDCap to point at staging.
  2. Save a record in REDCap that triggers the configured enrollment_instrument.
  3. Verify within ~30s:

SELECT id, source, project_id, record_id, status, processed_at
FROM app.webhook_events ORDER BY received_at DESC LIMIT 1;
Expected: status='processed', processed_at populated.

  4. Verify the participant landed:

SELECT participant_code, status, enrollment_date FROM app.participants
WHERE external_id = '<the REDCap record id>';

  5. Verify a ScheduleVersion was created (§6 #18):

SELECT version_number, status, is_current, anchor_date_used FROM app.schedule_versions
WHERE participant_id = '<participant uuid>';
Expected: version_number=1, status='active', is_current=true.

6.5 Replay protection

Re-fire the same webhook payload. Expected: a second webhook_events row with status='duplicate' and duplicate_of_id pointing at the original. No second Participant/ScheduleVersion created (§6 #17).
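The expected rows can be reproduced with a small in-memory model of the idempotency check. This is only a sketch: WebhookLog is a hypothetical class, and using a SHA-256 digest of the raw payload as the dedupe key is an assumption; the real webhook_events table may key on other fields. The TTL default mirrors REDCAP_DET_IDEMPOTENCY_TTL_SECONDS.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class WebhookEvent:
    id: int
    payload_hash: str
    received_at: float
    status: str  # "processed" or "duplicate"
    duplicate_of_id: Optional[int] = None


class WebhookLog:
    def __init__(self, ttl_seconds: float = 86400) -> None:
        self.ttl_seconds = ttl_seconds
        self.events: list[WebhookEvent] = []

    def record(self, raw_payload: bytes, now: float) -> WebhookEvent:
        """Always store a row; mark it duplicate if an original exists within the TTL."""
        digest = hashlib.sha256(raw_payload).hexdigest()
        original = next(
            (
                e
                for e in self.events
                if e.payload_hash == digest
                and e.status != "duplicate"
                and now - e.received_at < self.ttl_seconds
            ),
            None,
        )
        event = WebhookEvent(
            id=len(self.events) + 1,
            payload_hash=digest,
            received_at=now,
            status="duplicate" if original else "processed",
            duplicate_of_id=original.id if original else None,
        )
        self.events.append(event)
        return event
```

Only events with status "processed" would go on to create a Participant or ScheduleVersion; the duplicate row exists purely as an audit trail of the replay.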

6.6 Failure-path smoke

Misconfigure the REDCap URL on the staging study so the API call will fail. Submit a Metricis cognitive-assessment session for a participant, then watch for:

  • Session.sync_status flips to failed.
  • CRITICAL log emitted with source: "redcap_sync_failure", requires_investigation: true, fallback_used: false (§6 #14).
  • Patient-portal /api/portal/data-status returns has_failed_sync: true.
  • After ≥5 consecutive failures (default REDCAP_CIRCUIT_BREAKER_FAILURE_THRESHOLD), subsequent submissions short-circuit with error_type='circuit_open' and PyCap is not called (§6 #23). Verify by counting httpx/requests mocks or by watching log frequency.

Restore the URL, wait REDCAP_CIRCUIT_BREAKER_COOLDOWN_SECONDS, submit again — the half-open trial succeeds, breaker closes, sync resumes.
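The open, half-open, and closed transitions described above follow the usual circuit-breaker pattern. The sketch below shows that pattern with the documented defaults (5 failures, 60 s cooldown); the CircuitBreaker class here is a hypothetical illustration and not the contents of app.services.redcap_circuit_breaker.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        cooldown_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    def allow_request(self) -> bool:
        """Closed: always allow. Open: allow a trial only after the cooldown."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        """A successful call (including a half-open trial) closes the breaker."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count consecutive failures; open (or re-open) at the threshold."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller would check allow_request() before touching REDCap and return a circuit_open error when it is False. This simplified version lets every request through once the cooldown has elapsed; a production breaker would normally admit a single trial at a time.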


7. Cutover Sequence

Recommended order. Each step assumes the previous one succeeded.

  1. Freeze writes on the staging study so QA state doesn't drift into prod.
  2. Apply migrations on the production DB: alembic upgrade head. Verify head matches §3.1.
  3. Set production env vars including REDCAP_ENCRYPTION_KEY. Restart server + Celery workers + Celery beat.
  4. Run Go/No-Go matrix (§1). Stop on the first ❌.
  5. Configure the production REDCap project: token, webhook secret, DET URL pointing at production.
  6. Sync events: POST /api/studies/{id}/redcap/sync-events. Confirm VisitWindow rows.
  7. Provision per-study admin (§5.2).
  8. Smoke a single participant end-to-end via DET (§6.4) using a real-but-non-PHI test record in REDCap.
  9. Enable the study for live enrolment (status flip draft → active if not already).
  10. Notify the coordinator that the study is open.

Total elapsed time on a healthy environment: ~30 minutes once §1 is green.


8. Post-Cutover Monitoring (first 7 days)

Watch these signals daily for the first week, then weekly:

| Signal | Where | Action when seen |
|--------|-------|------------------|
| Session.sync_status='failed' count > 0 | Coordinator dashboard "Failed Sync" tile (§6 #14) | Open a ticket; investigate the CRITICAL log; never auto-retry without root-cause |
| webhook_events.status='failed' | SELECT * FROM app.webhook_events WHERE status='failed' ORDER BY received_at DESC | Check error_message; manual replay only after fix |
| webhook_events.status='duplicate' count growing fast | Same query | Expected on REDCap retries; if growth is non-linear, check upstream behaviour |
| Portal stale-data banner persistent | /api/portal/data-status has_failed_sync=true | Tied to the number of failed sessions; clearing requires successful re-sync |
| Circuit breaker open | Log message REDCap circuit breaker opened for study <id> | Investigate REDCap reachability; breaker will half-open after cooldown automatically |
| Audit-log chain break | Nightly CI audit_integrity.py failure | Stop. Hash chain breaks are tamper signals; escalate to security |
| Time-simulation offset persisted | SELECT test_mode_config->>'time_simulation_offset_days' FROM app.studies WHERE … | Should be null/0 in production. get_effective_date ignores it (§6 #1) but a stale offset is still a smell |

9. Failure-Mode Runbook

9.1 Sync failed: sync_status='failed'

Symptom. Coordinator dashboard shows the "Failed Sync" tile with count > 0. Patient-portal banner reads "Your responses were saved on this device. The study team has been notified."

Triage.

  1. Find the failed session:

SELECT id, participant_id, completed_at, sync_status FROM app.sessions
WHERE sync_status = 'failed' ORDER BY completed_at DESC;

  2. Find the CRITICAL log row (filtered by source: "redcap_sync_failure"). Read error_type + error_message.
  3. Investigate per error type:
     • api_error → REDCap unreachable or returning 5xx. Check REDCap status page; check token validity.
     • validation → REDCap rejected the payload. Read the error message; fix the field mapping or the source data.
     • exception → Metricis-side bug. Capture the full traceback, file an issue.
     • circuit_open → upstream is currently unreachable enough to trip the breaker. Wait for cooldown; investigate root cause.
  4. Once the root cause is fixed, re-sync via POST /api/studies/{id}/sync for the affected sessions. The re-sync goes through the same canonical pipeline.

Do not. Disable the no-fallback invariant. Compute a Metricis-side substitute. Auto-retry without operator action. The §6 #14 invariant test refuses any service that branches on sync_status for a fallback decision — bypassing it requires an explicit allowlist edit which CODEOWNERS will flag.

9.2 Circuit breaker open

Symptom. Logs report REDCap circuit breaker opened for study <id>; subsequent sync attempts return success=false without touching PyCap.

Triage.

  1. Confirm REDCap reachability: curl -I <redcap_url>.
  2. If REDCap is down, wait for it to recover; the breaker will half-open after REDCAP_CIRCUIT_BREAKER_COOLDOWN_SECONDS and probe automatically.
  3. If REDCap is up but Metricis can't reach it (DNS, firewall, cert), fix the network.
  4. To force a manual reset (rarely needed):

# run inside an asyncio-aware REPL (e.g. python -m asyncio) so top-level await works
from app.services.redcap_circuit_breaker import circuit_breaker
await circuit_breaker.reset(study_id)  # study_id: id of the affected study
This is a Python REPL action on a server pod; by design it is not an exposed endpoint.

9.3 Webhook secret rotation

Use cases: routine rotation, suspected secret leak, change of REDCap admin.

  1. Generate a new secret: python -c "import secrets; print(secrets.token_urlsafe(32))".
  2. Update REDCap project's DET secret first (REDCap sees it immediately).
  3. Update Metricis: portal → REDCap config → "Rotate webhook secret" (calls update_webhook_config, which encrypts via encrypt_token, §6 #19).
  4. Confirm by firing a test DET in REDCap (admin tools); Metricis should accept the signature. Wrong-secret signatures return 401 (asserted by tests/test_webhook_secret_encryption.py).
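The accept/reject behaviour in step 4 can be illustrated with a shared-secret signature check. The source does not specify the signing scheme, so everything below is an assumption: HMAC-SHA256 over the raw request body, hex-encoded, with hypothetical helper names sign_body and det_status.

```python
import hashlib
import hmac


def sign_body(secret: str, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (assumed scheme, not confirmed by the source)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def det_status(secret: str, body: bytes, signature: str) -> int:
    """Return 200 for a matching signature, 401 otherwise."""
    expected = sign_body(secret, body).encode("ascii")
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of a forged signature were correct.
    provided = signature.encode("utf-8")
    return 200 if hmac.compare_digest(expected, provided) else 401
```

After a rotation, a signature made with the old secret fails this check while one made with the new secret passes, which is the behaviour the confirmation step looks for.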

If a key rotation is needed (REDCAP_ENCRYPTION_KEY itself), follow the dual-key bridge:

  1. Set REDCAP_ENCRYPTION_KEY_PREVIOUS=<current key>.
  2. Set REDCAP_ENCRYPTION_KEY=<new key>. Restart.
  3. Run python server/migrate_redcap_tokens.py --apply to re-encrypt every token onto the new key.
  4. Unset REDCAP_ENCRYPTION_KEY_PREVIOUS. Restart.

The MultiFernet decrypt chain handles the bridge window (§6 #2, #7).

9.4 Anchor date shift mid-study

The coordinator changes a participant's enrollment date after some visits have completed. Expected behaviour (§6 #8):

  • Old ScheduleVersion is marked superseded; old completed visits stay on it.
  • Old completed visits get anchor_reconciled=true + original_target_date snapshot; their scheduled_date and actual_visit_date are NOT rewritten.
  • Old pending visits become status='cancelled' (soft-deleted, audit-preserved).
  • New ScheduleVersion v2 is created; new visits use the new anchor date.
  • An AuditLog row records the reconciled count and old/new anchor dates.

Operator action: confirm the audit-log row exists and the participant's portal schedule reflects the new anchor. No manual cleanup needed.
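The expected behaviour listed in this section can be modelled in a few lines to make the invariants concrete. This is a sketch under assumptions: the dataclasses are simplified stand-ins for the real models, shift_anchor is a hypothetical function, original_target_date is taken to be the old anchor plus the visit offset, and only pending visits are regenerated on the new version.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional


@dataclass
class Visit:
    offset_days: int
    scheduled_date: date
    status: str  # "pending", "completed", or "cancelled"
    anchor_reconciled: bool = False
    original_target_date: Optional[date] = None


@dataclass
class ScheduleVersion:
    version_number: int
    anchor_date: date
    status: str = "active"
    is_current: bool = True
    visits: list[Visit] = field(default_factory=list)


def shift_anchor(old: ScheduleVersion, new_anchor: date) -> tuple[ScheduleVersion, int]:
    """Supersede `old`, build the next version, return it with the reconciled count."""
    new = ScheduleVersion(old.version_number + 1, new_anchor)
    reconciled = 0
    for visit in old.visits:
        if visit.status == "completed":
            # Flag and snapshot only; scheduled_date is never rewritten.
            visit.anchor_reconciled = True
            visit.original_target_date = old.anchor_date + timedelta(days=visit.offset_days)
            reconciled += 1
        elif visit.status == "pending":
            visit.status = "cancelled"  # soft delete, row is kept
            new_date = new_anchor + timedelta(days=visit.offset_days)
            new.visits.append(Visit(visit.offset_days, new_date, "pending"))
    old.status = "superseded"
    old.is_current = False
    return new, reconciled
```

The returned count corresponds to the reconciled count recorded in the AuditLog row, so an operator can compare the two when confirming the shift.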

9.5 Stuck webhook_events.status='processing'

A row left in processing means _process_det_webhook started but didn't finish — likely a process restart mid-flight.

SELECT id, source, project_id, record_id, status, received_at, retry_count
FROM app.webhook_events WHERE status = 'processing' AND received_at < NOW() - interval '15 minutes';

Treat as a retry candidate: hand back to a coordinator-facing retry endpoint (currently a future enhancement; for now, a manual replay via the REDCap admin tools is the approved workaround). Do not auto-retry from a worker.


10. Rollback

The cutover is mostly reversible if caught early.

| Action | Reversible by |
|--------|---------------|
| alembic upgrade head | alembic downgrade -1 (per migration); test on staging first |
| Token encryption rotation | REDCAP_ENCRYPTION_KEY_PREVIOUS bridge restores prior decrypt path |
| Webhook secret rotation | Re-enter the previous secret in REDCap + Metricis |
| Study status='active' | Revert to paused via portal; cleanly halts new enrolments without affecting existing participants |
| Bad participant created via DET | Participant.status='withdrawn' + audit log explanation; do NOT delete (audit invariants) |

A full DB rollback (point-in-time restore) is not an M7-scoped capability — that's a deployment-infra concern. Coordinate with the DB admin if this is needed.


11. Sign-off

When every Go/No-Go row is ✅, the engineer on standby and the study admin should each record a sign-off in the project's runbook log (or as an entry in the audit log via AuditLog(action="cutover_signoff")).

After sign-off, M7 ships and routine ops begin. Subsequent sponsor studies follow this same checklist; if any row needs to change for a different sponsor, update the checklist itself, not the per-study workflow.