Metricis Project Plan
Status: Active hub document — single source of truth for project vision, status, roadmap, and progress.
Last updated: 2026-05-08 (§6 #21 resolved — REDCapEventSyncService.sync_events_to_visit_windows now wraps its per-event loop in a db.begin_nested() SAVEPOINT and raises an internal _AtomicSyncFailure sentinel when any event errors, so partial state can no longer be persisted. The service no longer calls db.commit() directly — the FastAPI get_db dependency owns the outer transaction boundary, mirroring §6 #6. 7 atomicity tests in tests/test_redcap_event_sync_atomicity.py.)
Replaces: docs/project-plan/project-plan.md (archived 2026-05-06)
This is the hub. Deep specifications live in linked spoke documents and should be edited there; this document tracks status, sequencing, and progress.
1. Vision
Metricis is an AI-native, standards-first Electronic Data Capture (EDC) platform for regulated clinical trials, with particular strength in rare disease and pediatric research. It is ODM-aligned and assessment-centric, integrating jsPsych cognitive assessments, eConsent, REDCap interoperability, and governed AI co-pilots in a single backend that serves both site-facing (researcher) and participant-facing (patient/caregiver) experiences.
Core principles
- Single authoritative backend with versioned metadata and immutable audit trail
- Distinct frontends for site staff vs. participants/caregivers, sharing one API
- Strict role-based access control enforced at UI, API, and service layers
- ODM-informed data model with explicit version binding (battery, metadata, consent)
- AI agents as governed co-pilots — drafts only, human approval required
- Compliance: ICH-GCP, Health Canada, FDA, EMA, 21 CFR Part 11, HIPAA
2. Architecture at a Glance
┌──────────────────────────────────────────────────────────────────┐
│ Portal (React) │ Client (jsPsych) │ Patient Portal (React+Capac.)│
└────────┬───────────────┬────────────────────┬────────────────────┘
         └───────────────┼────────────────────┘
                         ▼
                  FastAPI Backend
   ┌──────────┬──────────┬──────────┬──────────┬───────────────┐
   │ Study    │ Metadata │ Consent  │ Assess-  │ RBAC / Audit  │
   │ Runtime  │ Version  │ Gate     │ ments    │(21 CFR Part11)│
   └──────────┴──────────┴──────────┴──────────┴───────────────┘
   ┌──────────┬──────────┬──────────┬──────────┬───────────────┐
   │ Schedule │ REDCap   │ Forms /  │ AI Agent │ Registries /  │
   │ (Unified)│ Sync     │ ODM      │ (Phase 2)│ Phenotypes    │
   └──────────┴──────────┴──────────┴──────────┴───────────────┘
                                 ▼
                    PostgreSQL + Redis + Celery
For detailed architecture, see CLAUDE.md and the spec spokes in §9.
3. Status by Subsystem
Legend: ✅ complete · 🔄 in progress · 🔜 planned · ⚠️ has known issues
| # | Subsystem | Status | Evidence |
|---|---|---|---|
| 1 | Study runtime (enrollment, visits, sessions, data entry) | ✅ | server/app/routers/, portal/src/pages/ |
| 2 | Metadata governance (versioned design, draft→approve→publish) | ✅ | server/app/services/metadata_service.py |
| 3 | jsPsych assessments (battery versioning, queue, reconciliation) | ✅ | server/app/services/assessment_queue_service.py |
| 4 | eConsent (versioning, signatures, gating, re-consent) | ✅ | consent_service.py, reconsent_service.py |
| 5 | RBAC + Audit (21 CFR Part 11) | ✅ | audit_integrity.py (hash chain), CODEOWNERS |
| 6 | SDTM/ODM/Define-XML export | ✅ | routers/regulatory.py, sdtm_validation_service.py |
| 7 | Validation services (Pinnacle 21-style) | ✅ | sdtm_validation_service.py |
| 8 | Researcher Portal (38+ pages) | ✅ | portal/src/pages/, docs/PAGES.md |
| 9 | Patient/Caregiver Portal (magic link, mobile-first) | ✅ | patient-portal/src/, routers/portal.py — REDCap-managed data ingestion landed 2026-05-07: task_type="survey" with external_url launcher, is_stale flag on visit schedule, GET /api/portal/randomization (§6 #4 + #20) |
| 10 | Rare disease & pediatric (Phases 1–4) | ✅ | rare-disease-integration.md |
| 11 | AI Agent Phase 1 (foundation: models, LLM, ChangeSet) | ✅ | llm_service.py, agent_service.py, changeset_validator.py |
| 12 | AI Agent Phase 2 (protocol ingestion, StudyAssistant UI) | ✅ | agent_tasks/protocol_ingestion.py, routers/ai_assistant.py, pages/StudyAssistant.tsx |
| 13 | Patient Registry (Phases 1–4: foundation, linkage, analytics, UI) | ✅ | participant-registry.md |
| 14 | Unified Scheduler (mode-aware: legacy vs EDC) | ✅ | unified_scheduler.py, consent_gate.py |
| 15 | Anchor date / enrollment date system | ✅ | enrollment_date_service.py, schedule_versioning_service.py |
| 16 | Form workflow state machine (NOT_STARTED → LOCKED) | ✅ | form_workflow_service.py |
| 17 | Quality flags & validation rules | ✅ | ValidationRule, ValidationResult models |
| 18 | REDCap sync (fail-safe, no-fallback policy) | ✅ | redcap_sync.py — invariant test enforces no-fallback; failed-sync surfaced in patient portal banner + coordinator dashboard alert (§6 #14, fixed 2026-05-07) |
| 19 | REDCap DET webhooks | ✅ | routers/webhooks.py — HMAC, RBAC/audit/rate-limit, production test-endpoint gate, idempotency + persistence (WebhookEvent) all landed 2026-05-07 |
| 20 | Randomization module (full stack, Metricis-managed mode) | ✅ ⚠️ deferred | randomization_*.py, pages/Randomization*.tsx — bugs in §6; deferred (M9) since REDCap-managed studies use REDCap randomization |
| 21 | Business Day Service (weekend/holiday) | ✅ ⚠️ deferred | business_day_service.py — see §6, deferred to M9 |
| 22 | Token encryption (REDCap tokens) | ✅ | token_encryption.py — production refuses missing key + plaintext reads; enc:v2: + key rotation + migrate_redcap_tokens.py; 15 tests (§6 #2 + #7, fixed 2026-05-07) |
| 23 | Time simulation (test mode) | ✅ | TimeSimulationBanner.tsx, routers/testing.py — service + router gates inert in production; admin + dev-mode required on every mutation with AuditLog; 3 invariant tests (§6 #1, fixed 2026-05-07) |
| 24 | Dev mode service & bypasses | ✅ | dev_mode.py, DevModeContext.tsx |
| 25 | Compliance invariant tests | ✅ | tests/test_compliance_invariants.py |
| 26 | Audit log integrity (SHA-256 hash chain) | ✅ | audit_integrity.py |
| 27 | Capacitor mobile (iOS/Android, push) | ✅ | capacitor.config.ts, FCM/APNs setup |
| 28 | CI/CD (GitHub Actions, path-based, nightly) | ✅ | .github/workflows/ |
| 29 | AI Agent Phase 3 (Battery & Assessment Configuration) | 🔄 | ai-agent-implementation-plan.md — battery_config task shipped 2026-05-10 (agent_tasks/battery_config.py, 9 tests); event_linking task + portal inline assist follow-ups |
| 30 | AI Agent Phase 4 (Amendment Impact Analysis) | 🔜 | ai-agent-implementation-plan.md |
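Row 9 refers to an is_stale flag on the visit schedule. §6 #4 describes it as true when redcap_sync_status="failed", when the visit has never been reconciled, or when the last successful sync is more than 24 hours old. The following is a minimal sketch of that predicate, not the actual _is_visit_stale() implementation: the function name, signature, and the use of timezone-aware UTC datetimes are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold taken from the §6 #4 description ("last successful sync >24h ago").
STALE_AFTER = timedelta(hours=24)


def is_visit_stale(
    sync_status: Optional[str],
    last_sync_at: Optional[datetime],
    now: Optional[datetime] = None,
) -> bool:
    """Hypothetical stand-in for the staleness check on a REDCap-managed visit."""
    now = now or datetime.now(timezone.utc)
    if sync_status == "failed":
        return True  # a failed sync is always stale
    if last_sync_at is None:
        return True  # never reconciled with REDCap
    return now - last_sync_at > STALE_AFTER  # last good sync is too old
```

A caller would only evaluate this for studies where is_redcap_managed(study) is true, since the plan states that Metricis-mode visits never carry these fields.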
4. Active Workstreams
4.1 REDCap-managed study production readiness (M7: engineering complete; sponsor sign-off pending)
Status (2026-05-10): every §6 item resolved; cutover checklist landed at docs/guides/sponsor-study-cutover.md. The engineering portion of M7 is closed. The remaining gate is operational: a sponsor-study cutover walked through the checklist with sign-off recorded against the audit log. Until that ships, this workstream is in cutover-review limbo, not in active development.
Goal: ship a production-grade Metricis deployment for a study where REDCap is the source of truth for randomization, data collection forms, events, and visit windows, and Metricis delivers the patient/caregiver experience (assessments, eConsent, task queue, reminders) on top.
What landed (cross-reference §6 entry numbers):
- Gating functions — consent gate, dev-mode gates, time-simulation gate all inspection-grade with router-level production 404s, audit-logged mutations, and structural invariants (§6 #1, #3).
- Token encryption for REDCap tokens AND webhook secrets: enc:v2: write format, no JWT-secret fallback in production, MultiFernet rotation, one-shot migration script (§6 #2, #7, #19).
- REDCap sync robustness: fail-safe no-fallback policy enforced by structural invariant; transactional boundaries on event sync and schedule versioning; per-study circuit breaker; single retry layer; DET webhook idempotency + replay protection (§6 #14, #6, #21, #22, #23, #17).
- Patient portal REDCap ingestion — survey task type with external launcher, stale-data flag on visit schedule, randomization read-only display with masked-label preference for blinded studies (§6 #4, #20).
- Test coverage — 10-test E2E lifecycle plus dedicated suites for retry/breaker, anchor-shift reconciliation, project_id index, webhook idempotency, RBAC, token encryption, and compliance invariants.
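The masked-label preference in the randomization display follows the source priority given in §6 #4: RandomizationAllocation first (preferring masked_label), then the Participant.arm fallback, and randomized=False when there is no assignment. The sketch below illustrates that ordering under stated assumptions: the Allocation dataclass, the function name, and the "metricis" source value are hypothetical, while source="redcap" for the fallback matches the E2E test description in §6.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Allocation:
    """Hypothetical stand-in for a RandomizationAllocation row."""
    arm_name: str
    masked_label: Optional[str] = None


def portal_randomization(
    allocation: Optional[Allocation], participant_arm: Optional[str]
) -> dict:
    # 1. Preferred source: an allocation row. Use the masked label when present
    #    so a blinded study never exposes the underlying arm name.
    if allocation is not None:
        label = allocation.masked_label or allocation.arm_name
        return {"randomized": True, "label": label, "source": "metricis"}
    # 2. Fallback: the arm recorded on the participant (REDCap randomized it).
    if participant_arm:
        return {"randomized": True, "label": participant_arm, "source": "redcap"}
    # 3. No assignment: report that plainly; the portal renders no placeholder.
    return {"randomized": False}
```

The function has no write path, which mirrors the read-only nature of the endpoint.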
4.2 AI Agent Phase 3 (active, M8)
Phase 2 (protocol ingestion) shipped on 2026-01-22. Phase 3 (battery configuration assistance) is the active workstream now that M7 engineering is closed. Spec: ai-agent-implementation-plan.md.
Status (2026-05-10):
- ✅ Task 1 of 2 — battery_config AI task: reads a battery plan + optional protocol, proposes Battery rows with ordered Module lists and event_battery_assignment items, anchors event names to existing VisitWindow.name values when available, surfaces burden warnings in the ChangeSet summary. Wired into AgentService.execute_task dispatch. 9 unit tests in tests/test_agent_battery_config.py covering happy path, visit anchoring on/off, missing required document, malformed LLM output, unexpected exceptions, and dispatch wiring. Read-only: produces ChangeSet items with artifact_type="battery_version" and "event_battery_assignment"; no mutation of live Battery/BatteryModule tables (the apply path is shared with Phase 2 and remains a TODO in routers/ai_assistant.py::apply_changeset).
- 🔜 Task 2 of 2 — event_linking task: takes existing visits + batteries, proposes optimal pairings, identifies timing conflicts and burden concerns. Builds on the same prompt + ChangeSet scaffolding as battery_config.
- 🔜 Portal inline assists: "Suggest module selection" button in BatteryBuilder.tsx; tab additions in StudyAssistant.tsx for the new task type.
4.3 Metricis-managed mode hardening (deferred, M9)
Randomization (full-stack), Business Day Service, and Schedule Versioning have outstanding bugs documented in §6 #9–13. They are deferred until Metricis-managed studies become a target. The first sponsor study runs in REDCap-managed mode where REDCap handles randomization and scheduling, so these issues do not block M7 cutover.
4.4 Documentation consolidation (ongoing)
Part of M6. CLAUDE.md condensed 2026-05-06; this hub document is the planning consolidation deliverable. M7-specific operator documentation landed 2026-05-10: sponsor-study-cutover.md.
5. Roadmap
| Horizon | Milestone | Focus |
|---|---|---|
| Now (current) | M8 — AI Agent Phase 3 (Battery Configuration) | Inline assists in BatteryBuilder, event-linking proposals, validation pipeline |
| Operational gate | M7 — REDCap-managed study production readiness (engineering complete; sponsor sign-off pending) | Walk the cutover checklist with the first sponsor study; record sign-off in the audit log |
| Deferred | M9 — Metricis-managed mode hardening | Randomization fixes, Business Day Service, schedule versioning transaction boundary — unblocked once Metricis-managed studies become a priority |
| Later | M10 — AI Agent Phase 4 (Amendment Impact Analysis) | Cutover planning, narrative generation per affected participant |
| Backlog | First sponsor study, multi-site rollout, expanded SDTM domains, additional languages | — |
6. Known Issues & Risks
Findings from the 2026-05-06 code review. Severity prefixes: 🔴 critical, 🟠 significant, 🟡 minor. Status as of 2026-05-10: every item that blocks M7 is resolved. Items are grouped by whether they blocked M7 (REDCap-managed study production readiness) or are deferred to M9 (Metricis-managed mode hardening).
Blocked M7: REDCap-managed production readiness (✅ all resolved)
🔴 Critical (gating + safety/security)
- Time simulation is not gated to dev mode and has no audit log. ✅ Fixed 2026-05-07
  server/app/services/scheduler.py:40-63 shifted the scheduler's "today" by Study.test_mode_config.time_simulation_offset_days regardless of ENVIRONMENT. The mutation endpoints in server/app/routers/testing.py:178,219 required only get_current_user: no admin role, no DevModeService.is_dev_mode_available() gate. A production user could have shifted any study's clock and broken participant-facing scheduling. Resolution:
  - Service-layer gate: services.scheduler._time_simulation_allowed() is consulted by get_effective_date(study). It honours the offset only when ENVIRONMENT != "production" or DEV_MODE is explicitly set, so a stale persisted offset cannot affect production participants. The routers/portal.py test-mode-info computation is gated by the same predicate.
  - Router-layer gate: every mutation in routers/testing.py (update_test_mode_config, update_time_simulation, reset_time_simulation, update_study_visit_statuses, generate_synthetic_data, clear_synthetic_data, set_bulk_anchor_dates, generate_longitudinal_data) requires _require_dev_mode_admin (admin role and dev-mode availability) and emits an AuditLog row via new _audit_testing_action helper (with audit_metadata.source = "dev_mode_bypass").
  - Router-level production gate: _testing_router_production_gate 404s every /api/testing/* call in production, applied as a router-level dependency. 404 (not 403) avoids disclosing route existence.
  - Invariant tests (tests/test_compliance_invariants.py::TestTimeSimulationProductionGate): (a) get_effective_date returns date.today() in production regardless of persisted offset; (b) sanity check that dev environments still observe the offset; (c) structural test walking the FastAPI dependency tree to assert every mutation in routers/testing.py is gated by _require_dev_mode_admin, not the looser get_current_user.
  - Test coverage: 3 invariant tests, all green.
- Token encryption silently falls back to plaintext and to JWT_SECRET_KEY. ✅ Fixed 2026-05-07
  server/app/services/token_encryption.py:31-37 derived the Fernet key from the JWT secret when REDCAP_ENCRYPTION_KEY was unset (warning only, not error). :62-65 returned plaintext when the stored value lacked the enc:v1: prefix. No migration path for legacy plaintext tokens. No tests despite commit message claiming coverage. Resolution:
  - Production refuses to operate without an explicit key: encrypt_token() raises TokenEncryptionConfigError when REDCAP_ENCRYPTION_KEY is unset and ENVIRONMENT=production. The JWT-secret fallback remains for non-production only (with a single WARNING per process). Reusing one key across two domains was a latent risk: a leak of either compromised both.
  - Plaintext reads rejected by default in production: decrypt_token() raises TokenEncryptionError for any stored value lacking an enc: prefix. A one-time bridge during migration is available via REDCAP_ENCRYPTION_ALLOW_PLAINTEXT_READS=true, which emits CRITICAL logs with requires_investigation: True so it cannot be quietly forgotten.
  - Format versioning: enc:v2: is the new write format; enc:v1: ciphertexts remain readable (same KDF, version reserved for a future format change). Idempotency: re-encrypting an already-prefixed input is a no-op.
  - Key rotation: REDCAP_ENCRYPTION_KEY_PREVIOUS adds a second key to the decrypt chain via cryptography.fernet.MultiFernet. Encryption always uses the current key; the optional previous key bridges the rotation window. New rotate_token(stored) re-encrypts onto the current key for migration runs.
  - One-shot migration script: server/migrate_redcap_tokens.py scans every Study.config['redcap']['api_token'], classifies via needs_rotation(), and rewrites legacy/plaintext/enc:v1: values to enc:v2:. Defaults to dry-run; --apply writes.
  - Test coverage: 15 tests in tests/test_token_encryption.py covering round-trip, idempotency, production-without-key raises, plaintext-read rejected/allowed paths, enc:v1: back-compat, key rotation (previous-key fallback + rotate_token upgrade + wrong-key error), and needs_rotation classifier (v2/v1/plaintext/empty).
- All gating functions need a unified production-readiness audit. ✅ Fixed 2026-05-07
  Beyond #1 and #2, every dev-mode bypass surface (dev_mode.py, routers/dev.py, routers/testing.py, force-consent-bypass endpoints) needed a single test that proves no path bypasses production guards. Resolution:
  - Router-level production gates added to routers/dev.py (_dev_router_production_gate) and routers/testing.py (_testing_router_production_gate). Both 404 the entire surface in production unless DEV_MODE=true is explicitly set. 404 (not 403) so a leaked URL list cannot confirm route existence.
  - Unified test suite: tests/test_production_gating.py (25 tests):
    - DEV_TEST_ROUTES table enumerates every /api/dev/*, /api/testing/*, and /api/webhooks/redcap/det/test endpoint. The router-level test parametrises over the table so every new dev/test endpoint must be added there to pass, making the table the single point of update.
    - Service-level tests assert defence in depth: DevModeService.is_dev_mode_available() / bypass_consent_check() return False; set_anchor_date_override / trigger_scheduled_message_now raise DevModeError; services.scheduler._time_simulation_allowed() returns False; get_effective_date ignores any persisted offset.
    - Token-encryption tests confirm encrypt_token() raises TokenEncryptionConfigError without REDCAP_ENCRYPTION_KEY and decrypt_token() rejects plaintext by default in production.
  - Autouse fixture _force_production_environment sets ENVIRONMENT=production, unsets DEV_MODE / REDCAP_ENCRYPTION_KEY / REDCAP_ENCRYPTION_ALLOW_PLAINTEXT_READS, and busts get_settings.cache_clear() so each test sees a clean production environment.
  - Test coverage: 25 tests, all green.
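The service-layer gate for time simulation described in the first item above can be sketched as follows. This is an illustration of the stated rule (the offset is honoured only when ENVIRONMENT is not "production" or DEV_MODE is explicitly set), not the code in services/scheduler.py. The function signatures, the env parameter, the treatment of an unset ENVIRONMENT as non-production, and reading DEV_MODE as the literal string "true" are all assumptions.

```python
import os
from datetime import date, timedelta
from typing import Mapping, Optional


def time_simulation_allowed(env: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical predicate: may a persisted time offset be honoured?"""
    env = os.environ if env is None else env
    if env.get("ENVIRONMENT", "").lower() != "production":
        return True  # assumption: anything other than "production" is a dev/test env
    # In production the offset applies only if dev mode was explicitly enabled.
    return env.get("DEV_MODE", "").lower() == "true"


def get_effective_date(
    offset_days: int, today: date, env: Optional[Mapping[str, str]] = None
) -> date:
    """Return the scheduler's "today", ignoring the offset when the gate is closed."""
    if offset_days and time_simulation_allowed(env):
        return today + timedelta(days=offset_days)
    return today
```

Passing the environment as a mapping keeps the predicate easy to exercise in tests, which is the same property the invariant tests rely on when they force a production environment.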
🟠 Significant (REDCap path)
- Patient portal needs REDCap data ingestion expansion. ✅ Fixed 2026-05-07 (with #20)
  Portal previously delivered only Metricis-side schedules; for REDCap-managed studies it now also surfaces REDCap-driven event schedules, REDCap survey invitations, completion state, and randomization assignment, with the REDCap fail-safe policy extended to the portal (stale-data indicators, never invented state). Resolution:
  - REDCap event schedules: VisitScheduleItem (routers/portal.py) carries redcap_event_name, redcap_sync_status, last_redcap_sync_at, is_stale. is_stale derives from _is_visit_stale(): true when redcap_sync_status="failed", never reconciled, OR last successful sync >24h ago. Computed only for studies where is_redcap_managed(study) is true; Metricis-mode visits never carry these fields.
  - REDCap survey invitations: new task_type="survey" on PortalTask with external_url, redcap_event_name, redcap_instrument_name, redcap_repeat_instance, redcap_record_id, last_synced_at, is_stale, and a generic extra_data JSONB (migration e3f4a5b6c7d8). PortalTaskService.create_survey_task() is idempotent on (participant, instrument, event_name, repeat_instance) so re-issuing an invitation refreshes the row. start_task redirects to the stored external_url; if absent the endpoint returns 503 Service Unavailable rather than a broken navigation. Wired to consume URLs from redcap.py:925 generate_survey_link (REDCapService.generate_survey_link via the REST endpoint).
  - Randomization read-only display: new GET /api/portal/randomization endpoint. Source priority: RandomizationAllocation (preferred, includes masked_label for blinding) → Participant.arm fallback (typical for REDCap-managed studies where REDCap performs randomization). Returns randomized=False when no assignment; the portal must not render a placeholder. The masked label is preferred when present so blinded studies never leak the underlying arm name to participants. No edit path.
  - Fail-safe extension: survey task without an external_url returns 503 (not redirect_url=null). Stale flag is sticky until next successful sync. The portal renders stale indicators rather than substituting Metricis-computed state for REDCap state.
  - Test coverage: 14 tests in tests/test_portal_redcap_ingestion.py covering survey task creation/idempotency/start, start_task 503 fail-safe, is_stale for failed/never-synced/recent/old sync, Metricis-mode visits never marked stale, randomization with RandomizationAllocation (blinded label preferred), Participant.arm fallback for REDCap-managed studies, unauthenticated rejection. All green.
- No end-to-end test for REDCap-managed study lifecycle. ✅ Fixed 2026-05-07
  redcap_sync.py, redcap_*.py services had unit coverage but no integration test exercised the full pipeline together. The first run of the E2E test surfaced two pre-existing critical defects in redcap_sync._build_submission_from_responses that had been latent since a model refactor; neither defect would have been caught without this test. Resolution:
  - tests/test_redcap_managed_lifecycle.py: 10 tests in 5 stages mirroring the production lifecycle:
    - Event sync: REDCap export_events + event-instrument mappings → VisitWindow rows via REDCapEventSyncService. Covers idempotency on re-sync.
    - Visit scheduling: UnifiedSchedulerService.schedule_visits_for_participant dispatches REDCap-managed studies to SchedulingMode.LEGACY and creates ScheduledVisit rows with REDCap event provenance.
    - Patient-portal delivery: /api/portal/schedule carries redcap_event_name / redcap_sync_status / is_stale; /api/portal/randomization shows Participant.arm fallback with source="redcap"; /api/portal/tasks surfaces survey tasks with external_url.
    - Completion sync (success path): patched REDCapService._get_project returns a MagicMock whose import_records returns success; Session.sync_status flips to "synced".
    - Completion sync (failure path): same mock raises RuntimeError on every retry; sync_status="failed", success=False, no Metricis fallback computed (§6 #14 invariant). Portal data-status then surfaces has_failed_sync=True; visit row shows is_stale=True.
  - REDCap I/O is mocked at two boundaries so the test runs offline in CI: app.services.redcap_event_sync.get_redcap_events / get_redcap_event_instruments (module-level fetchers) and REDCapService._get_project (lazy PyCap project handle).
  - Pre-existing defects discovered and fixed:
    - redcap_sync._build_submission_from_responses imported AssessmentMetadata from app.models.cognitive_data, but that class had been renamed to SessionMetadata. Any caller of REDCapSyncService.sync_session() would have raised ImportError in production. Fixed.
    - The same function constructed a default SimpleRTSummary without min_rt/max_rt, both of which became required in a later schema revision. Pydantic would have raised ValidationError and the sync path would have masked it as sync_status="failed". Fixed.
  - Test coverage: 10 tests, all green. Exercises every production codepath the upcoming sponsor study will travel.
- Schedule versioning lacks transaction boundary. ✅ Fixed 2026-05-08
  schedule_versioning_service.py issued multiple flush() calls without an atomic boundary. Worse, the legacy SchedulerService.schedule_visits_for_participant and EDC VisitService.schedule_visits_for_participant both ran await self.db.commit() mid-flow, meaning the just-inserted ScheduleVersion row was prematurely persisted before the subsequent linking, anchor back-reference, and audit log steps could complete. A failure in any later step left an orphan version pointing at zero visits or an audit log row with no version. Resolution:
  - scheduler.py and visit_service.py: schedule_visits_for_participant accepts auto_commit: bool = True. When False, the inner db.commit() is replaced with a db.flush() (and the post-commit refresh loop is skipped; IDs are already populated by the Python-side UUID default at flush time). Default behaviour is unchanged for the dozen-plus existing call sites in routers/schedules.py, workers/reminder_worker.py, etc.
  - unified_scheduler.py: schedule_visits_for_participant, _schedule_via_legacy, and _schedule_via_edc propagate the flag.
  - schedule_versioning_service.py: both public methods (create_schedule_version, regenerate_schedule) wrap their bodies in async with self.db.begin_nested(): (SAVEPOINT) and call the unified scheduler with auto_commit=False. A reported scheduling failure raises an internal _AtomicVersioningFailure so the SAVEPOINT rolls back without leaking the exception to the caller; the method then returns a structured VersioningResult(success=False, ...). Behaviour change: a failed scheduling result no longer persists a status="failed" ScheduleVersion row; orphan versions with zero visits were exactly the partial state §6 #6 set out to prevent. Code-wide scan found no consumer of ScheduleVersion.status == 'failed'.
  - The caller's outer commit boundary is preserved: FastAPI's get_db dependency still owns the top-level commit()/rollback(). Versioning runs inside the request transaction; the SAVEPOINT only protects intra-method atomicity.
  - Test coverage: tests/test_schedule_versioning_atomicity.py, 8 tests in 3 classes:
    - TestCreateScheduleVersionAtomicity: success path (version + visits + anchor link), scheduler-failure rollback (assert no orphan version, no visits, anchor current_schedule_version_id stays None), unexpected-exception rollback (RuntimeError propagates AND no rows persist), auto_commit=False contract assertion.
    - TestRegenerateScheduleAtomicity: success path (v1 superseded → v2 active, audit log written), scheduler-failure rollback (old v1 remains is_current=True / status="active", no new version, no audit log, anchor still points at v1), auto_commit=False contract assertion.
    - TestSchedulerAutoCommitContract: SchedulerService.schedule_visits_for_participant(auto_commit=False) invokes flush ≥ 1 and commit 0 times.
  - All 8 tests green. Existing scheduler/participant/dev-mode test suites unaffected (49 tests verified).
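The SAVEPOINT-plus-sentinel pattern used for schedule versioning (and, per the header note, for REDCap event sync) can be illustrated with the standard-library sqlite3 module. This is a sketch of the technique only: the real services use SQLAlchemy's begin_nested() on an async session, and the table, function, and exception names below are hypothetical.

```python
import sqlite3


class _AtomicFailure(Exception):
    """Internal sentinel: signals that the SAVEPOINT must be rolled back."""


def sync_events(conn: sqlite3.Connection, events: list) -> dict:
    """Insert all events or none; never commits (the caller owns the transaction)."""
    conn.execute("SAVEPOINT event_sync")
    try:
        for name, offset_days in events:
            if offset_days is None:  # stand-in for "this event failed to sync"
                raise _AtomicFailure(name)
            conn.execute(
                "INSERT INTO visit_windows (name, offset_days) VALUES (?, ?)",
                (name, offset_days),
            )
    except Exception as exc:
        # Undo every row written inside the savepoint, then discard it.
        conn.execute("ROLLBACK TO SAVEPOINT event_sync")
        conn.execute("RELEASE SAVEPOINT event_sync")
        if isinstance(exc, _AtomicFailure):
            # Reported failure: return a structured result instead of raising.
            return {"success": False, "failed_event": str(exc)}
        raise  # unexpected errors still propagate after the rollback
    conn.execute("RELEASE SAVEPOINT event_sync")
    return {"success": True, "synced": len(events)}


# isolation_level=None disables implicit transactions so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE visit_windows (name TEXT, offset_days INTEGER)")

conn.execute("BEGIN")  # outer transaction, owned by the caller
conn.execute("INSERT INTO visit_windows VALUES ('baseline', 0)")
result = sync_events(conn, [("week_4", 28), ("week_8", None)])
conn.execute("COMMIT")  # caller decides; the earlier 'baseline' row survives

rows = [r[0] for r in conn.execute("SELECT name FROM visit_windows ORDER BY name")]
```

In this run the second event fails, so the week_4 row written just before it is rolled back with the savepoint, while work done earlier in the caller's transaction is untouched. That separation is the point of keeping commit() out of the service.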
🟡 Minor
- No key versioning for token encryption. ✅ Fixed 2026-05-07 (folded into #2): enc:v2: is the current write format and shares the KDF with enc:v1: so existing ciphertexts remain readable; rotation via REDCAP_ENCRYPTION_KEY_PREVIOUS + rotate_token() is now supported.
- No reconciliation regression test for completed visits when anchor date shifts mid-study. ✅ Fixed 2026-05-10
  ScheduleVersioningService.regenerate_schedule already implemented the right contract (_reconcile_completed_visits preserves completed/missed visits with anchor_reconciled=True + original_target_date snapshot; _delete_pending_visits soft-cancels still-pending visits; _link_visits_to_version only links the freshly-generated visits to v2; old completed visits remain on v1), but no regression test pinned the contract. Drift in any one of _reconcile_completed_visits / _delete_pending_visits / supersession / audit-metadata wiring would silently corrupt an in-flight study's audit trail without raising. Resolution:
  - tests/test_anchor_shift_reconciliation.py: 12 tests across 4 classes:
    - TestCompletedVisitsPreservedAcrossAnchorShift (4): completed visit keeps status="completed" + actual_visit_date + scheduled_date; gets anchor_reconciled=True + anchor_reconciled_at + original_target_date snapshot of the pre-shift target_date; stays linked to v1 (NOT moved to v2, which would rewrite history); status="missed" visits get the same treatment (parity).
    - TestPendingVisitsCancelledOnAnchorShift (3): pending rows soft-deleted to status="cancelled" and remain queryable on v1 for audit; overdue visits also cancelled (and NOT stamped with the reconciled flag; overdue is a pending state, not a historical one); v2 visits' earliest scheduled_date ≥ new anchor date.
    - TestAnchorShiftVersionAndAuditWiring (2): v1 marked is_current=False / status="superseded" / superseded_by_id=v2.id; anchor's current_schedule_version_id moves to v2; schedule_regenerated audit row written with audit_metadata.reconciled_visits count + old_values.anchor_date + new_values.anchor_date.
    - TestAnchorShiftIdempotencyAndEdgeCases (3): second anchor shift does NOT overwrite original_target_date (the very first protocol-intended date is the source of truth, the audit trail of what the participant was originally scheduled for); zero-completed edge case yields result.visits_reconciled=0 and no visit carries the flag; result.visits_reconciled count matches DB-state count (guards against the count-vs-state divergence that would make audit metadata lie).
  - All 12 tests green; no regressions across test_schedule_versioning_atomicity.py (8), test_redcap_managed_lifecycle.py (10), test_redcap_det_sync_versioning.py (7), test_redcap_event_sync_atomicity.py (7), test_redcap_retry_and_circuit_breaker.py (11). 55 tests total verified.
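The reconciliation contract pinned by the anchor-shift tests can be summarised in a few lines. The sketch below is a simplified in-memory model, not the ScheduleVersioningService code: the Visit dataclass, the function name, and the "pending" status string are assumptions, and version linking and audit rows are omitted.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

HISTORICAL = {"completed", "missed"}   # preserved and stamped as reconciled
PENDING = {"pending", "overdue"}       # soft-cancelled; "pending" is an assumed name


@dataclass
class Visit:
    status: str
    target_date: date
    anchor_reconciled: bool = False
    original_target_date: Optional[date] = None


def reconcile_for_anchor_shift(visits: List[Visit]) -> int:
    """Apply the anchor-shift rules in place; return the number of reconciled visits."""
    reconciled = 0
    for visit in visits:
        if visit.status in HISTORICAL:
            visit.anchor_reconciled = True
            # Snapshot only once: the first protocol-intended date stays the
            # source of truth across repeated anchor shifts.
            if visit.original_target_date is None:
                visit.original_target_date = visit.target_date
            reconciled += 1
        elif visit.status in PENDING:
            # Soft delete: the row remains queryable for audit, without the flag.
            visit.status = "cancelled"
    return reconciled
```

Returning the count from the same loop that mutates state keeps the reported number and the stored flags from diverging, which is the property the last test class guards.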
🔴 Critical (REDCap path, discovered 2026-05-06 review #2)
- sync_status="failed" is set but never read by downstream consumers. ✅ Fixed 2026-05-07
  server/app/services/redcap_sync.py:341 marks failed sessions, but a repo-wide scan finds zero readers in notification, scheduling, portal, or worker code. The "no fallback" invariant holds today only because no fallback was ever written, which is fragile against future contributions. Resolution:
  - Structural invariant (tests/test_compliance_invariants.py::TestREDCapSyncNoFallbackInvariant), three tests: (a) AST-style scan rejects any unsanctioned reader of Session.sync_status outside an explicit allowlist (writers + display-only passthroughs); (b) confirms redcap_sync.py still emits fallback_used: False and requires_investigation: True on failure; (c) refuses any module in the REDCap sync failure path that flips fallback_used: True.
  - Patient portal banner: new GET /api/portal/data-status returns {has_failed_sync, failed_sync_count, last_successful_sync_at} scoped to the authenticated participant. New SyncFailureBanner component renders a calm, non-actionable amber banner on Home (wording: "Your responses were saved on this device. The study team has been notified … No action is needed from you."). Polls every 5 min so coordinator re-sync clears the banner automatically.
  - Coordinator dashboard alert: DashboardStats.failed_sync exposes the count of failed-sync sessions; portal dashboard renders a red failed-sync-alert tile (only when count > 0) labeled "Failed Sync — needs investigation".
  - Test coverage: 3 invariant tests + 7 portal endpoint tests (incl. cross-participant isolation, unauthenticated rejection, pending vs failed disambiguation) + 3 dashboard tests = 13 tests, all green.
-
Study.integration_modeis checked only inunified_scheduler.py. ✅ Fixed 2026-05-07server/app/db/models.py:131defines the column;unified_scheduler.pywas the only service that branched on it. Every other service (submit.py:212,redcap_sync.py:114,192,redcap_det_sync.py:402,webhooks.py:469, all 7 gates inrouters/redcap.py) gated onredcap_enabledinstead, leaving the architectural promise unenforced. Resolution:- Helper (
server/app/services/study_classification.py::is_redcap_managed) is the canonical gate. Precedence:integration_mode == "redcap"→ True;integration_mode == "metricis"→ False; else fall back toredcap_enabledwith a structured WARNING log so legacy/unset rows are visible. Also exportsis_metricis_managedas the negation. - Call sites migrated:
submit.py:212,redcap_sync.py:114,192,redcap_det_sync.py:402, all 7 gating sites inrouters/redcap.py.routers/webhooks.py::_find_study_by_redcap_projectnow filters onStudy.integration_mode == "redcap"directly. API passthroughs (routers/studies.py, response models inrouters/redcap.py) keep surfacingredcap_enabledfor backward compat — these are display-only and allowlisted. - Structural invariant (
tests/test_compliance_invariants.py::TestIntegrationModeRedcapEnabledConsistency): scansserver/app/**/*.pyforredcap_enabledreferences and refuses any reader outside an explicit allowlist (column def, helper, two API-surface routers). Adding a new entry requires a one-line justification recorded in the dict. - Row-level invariant (same class): asserts every
Studyrow satisfiesintegration_mode == "redcap"⇔redcap_enabled == True. Two existing test fixtures (test_redcap_rbac.py,test_webhook_idempotency.py) updated to setintegration_mode="redcap"alongsideredcap_enabled=Trueto satisfy the invariant. - Helper unit test (same class): documents precedence by example for all 8 combinations of
(integration_mode, redcap_enabled)plus theNonestudy case. - Test coverage: 3 new tests + 2 fixture migrations, all green.
- REDCap router has zero RBAC and zero rate limiting. ✅ Fully resolved 2026-05-07
  `server/app/routers/redcap.py`: all 40+ endpoints use only `Depends(get_current_user)`. Any authenticated user can rotate API tokens (:290), push data dictionaries (:612, :1410), delete REDCap forms (:1515), or rewrite webhook secrets (:2443) for any study, without study-membership check or admin role. No `@limiter.limit()` decorators on any endpoint, including the unauthenticated DET webhook in `routers/webhooks.py`. Resolution:
  - Reads (22 endpoints) gated to `require_role("admin", "researcher", "coordinator")`
  - Mutations (12 endpoints) gated to `require_role("admin", "researcher")`
  - High-risk endpoints (7: token rotation, init, dict push x2, form delete, participant import, webhook secret update) require explicit per-study admin/owner UserStudy membership via new `_require_study_admin_membership` helper. Stricter than the platform default `verify_study_access`: it does NOT honour the global `User.role == "admin"` bypass, so a typoed `study_id` by a system admin cannot rotate the wrong project's REDCap token.
  - All 7 high-risk endpoints emit `AuditLog` rows via new `_audit_redcap_action` helper; secret values (api_token, webhook_secret) are never logged, only `*_rotated: bool` flags
  - DET webhook rate-limited via per-router `webhook_limiter` (configurable, default 120/min)
  - DET test endpoint rate-limited (10/min default) and returns 404 when `ENVIRONMENT=production`
  - Test coverage: 25 tests in `tests/test_redcap_rbac.py` covering RBAC matrix (read/mutation/high-risk), per-study membership (admin/owner allowed; coordinator/data_entry/researcher/viewer blocked; admin-on-other-study blocked), audit log emission with secret-redaction assertions, DET production gate.
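The per-study membership rule for high-risk endpoints can be illustrated with a small pure-Python sketch. Everything here is hypothetical (the `UserStudy` dataclass, the exception type, and the function signature are not taken from the codebase); the point it shows is that the check consults only membership rows for the target study and never the caller's global role.

```python
from dataclasses import dataclass
from typing import Iterable

# Roles that may perform high-risk REDCap operations on a study (per the plan).
STUDY_ADMIN_ROLES = frozenset({"admin", "owner"})


@dataclass(frozen=True)
class UserStudy:
    """Hypothetical membership row linking a user to one study with a role."""
    user_id: int
    study_id: int
    role: str


class StudyAdminRequired(PermissionError):
    """Raised when the caller has no admin/owner membership on the target study."""


def require_study_admin_membership(
    user_id: int, study_id: int, memberships: Iterable[UserStudy]
) -> UserStudy:
    """Return the qualifying membership row or raise.

    The caller's global role is deliberately not a parameter: a system admin
    without an explicit membership on this study is rejected like anyone else.
    """
    for row in memberships:
        if (
            row.user_id == user_id
            and row.study_id == study_id
            and row.role in STUDY_ADMIN_ROLES
        ):
            return row
    raise StudyAdminRequired(f"user {user_id} is not admin/owner of study {study_id}")
```

Because membership is looked up by `(user_id, study_id)`, a mistyped `study_id` fails closed unless the caller happens to administer that other study as well.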
- DET webhook lacks idempotency and replay protection. ✅ Fixed 2026-05-07
  `server/app/routers/webhooks.py:311`: no `WebhookEvent` table, no nonce/timestamp check. REDCap retries on 5xx will re-fire `_process_det_webhook`; the visit creation path `redcap_det_sync.py:266-322` deduplicates only on `(participant_id, redcap_event_name)` and will double-create on schedule edits. Resolution:
  - New `WebhookEvent` model (`server/app/db/models.py`) with composite dedup index over `(source, project_id, record_id, instrument, event_name, redcap_repeat_instance, payload_hash, received_at)`. Migration `d2e3f4a5b6c7_add_webhook_events.py`.
  - DET endpoint computes a SHA-256 payload hash (canonicalized via sorted-key JSON) and looks for an existing event with matching dedup keys whose `received_at` is within a configurable TTL (`redcap_det_idempotency_ttl_seconds`, default 24h) AND `status` ∈ (pending, processing, processed). If found, the duplicate is persisted as `status="duplicate"` with `duplicate_of_id` pointing at the original, so duplicates are auditable, not silently dropped.
  - WebhookEvent is persisted BEFORE queuing the background task: it is the canonical record for retry/DLQ. The `BackgroundTasks` call is best-effort; a crash leaves the row in `processing`, and a future retry worker can resume by status.
  - `_process_det_webhook` now takes `webhook_event_id` and writes terminal status (processed/failed) with `processed_at`, `error_message`, `retry_count`. It runs three independent transactions (mark processing → run sync → record terminal status) so partial failure leaves observable state.
  - Outstanding: background processing still uses FastAPI `BackgroundTasks`. Moving to Celery for true at-least-once delivery and a coordinator-facing retry endpoint are separate enhancements.
  - Test coverage: 7 tests in `tests/test_webhook_idempotency.py` (first-receipt persistence, replay→duplicate, different record/instrument processed separately, different payload hash processed separately, payload-hash determinism under key reorder, payload-hash sensitivity to value changes).
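The payload-hash canonicalisation and the TTL/status dedup rule described for the DET endpoint can be sketched with the standard library alone. The function names and the exact JSON separators are assumptions for illustration; the plan only specifies SHA-256 over sorted-key JSON, a 24h default TTL, and the three statuses that count as a prior receipt.

```python
import hashlib
import json
from datetime import datetime, timedelta
from typing import Any, Mapping

# Statuses of a prior event that make a new identical delivery a duplicate.
DEDUP_STATUSES = frozenset({"pending", "processing", "processed"})


def canonical_payload_hash(payload: Mapping[str, Any]) -> str:
    """SHA-256 over a sorted-key JSON rendering, so key order does not matter."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_replay(
    prior_received_at: datetime,
    prior_status: str,
    now: datetime,
    ttl_seconds: int = 24 * 60 * 60,  # mirrors the documented 24h default
) -> bool:
    """True when a prior event with matching dedup keys should absorb this one."""
    within_ttl = (now - prior_received_at) <= timedelta(seconds=ttl_seconds)
    return within_ttl and prior_status in DEDUP_STATUSES
```

A prior event in `failed` status does not match, so a redelivery after a failure is processed as a fresh event rather than recorded as a duplicate.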
🟠 Significant (REDCap path — discovered 2026-05-06 review #2)¶
- `redcap_det_sync._create_visit_schedule` bypasses UnifiedScheduler and ConsentGate. ✅ Fixed 2026-05-08
  The dead-code path was even broken: `_create_visit_schedule` instantiated `ScheduledVisit(window_open=..., window_close=...)` against columns that don't exist on the model, so any call would have raised `TypeError`. The "fallback" `_create_visits_from_windows` used the right field names but skipped the ConsentGate, business-day rules, and schedule versioning entirely. The default `_get_default_field_mapping` also mapped `first_name`/`last_name` into the `Participant(**participant_data)` constructor, but those columns don't exist on `Participant` either, so every DET sync would have `TypeError`'d on participant construction before ever reaching the visit code. Resolution:
  - Removed both `_create_visit_schedule` and `_create_visits_from_windows`. Field-name drift can no longer recur because there is no second writer of `ScheduledVisit`.
  - Added `_seed_anchor_date_and_schedule(participant, redcap_config)` which: (a) idempotently creates a `ParticipantAnchorDate` row (status=finalized, source_type="redcap_det", unique on `participant_id`); (b) delegates to `ScheduleVersioningService.create_schedule_version()` which routes through `UnifiedSchedulerService` → `SchedulerService` → SAVEPOINT-atomic versioning (§6 #6). Visits inherit consent gate enforcement, business-day rules, REDCap event-name provenance from the synced `VisitWindow` rows, and audit trail.
  - Legacy `event_battery_mapping` config is detected and logged as a structured deprecation warning (`extra={"deprecation": "event_battery_mapping"}`) but no longer bypasses scheduling; visits come from `VisitWindow` rows seeded by `REDCapEventSyncService` (§6 #5).
  - Default field mapping fixed: removed `first_name`/`last_name` (not Metricis `Participant` columns; names live in `extra_data`). The mapping is now strictly columns Metricis recognises.
  - Webhook return contract enriched: `sync_participant_from_det` now surfaces a `scheduling: {success, visits_created, warnings, error}` block on the create path so the DET background processor and any future retry surface have observable outcomes. Update path is unchanged (no scheduling on update).
  - Test coverage: `tests/test_redcap_det_sync_versioning.py`, 7 tests in 5 classes:
    - TestDETSyncRoutesThroughVersioning (happy path): ParticipantAnchorDate (`source_type="redcap_det"`, `status="finalized"`), ScheduleVersion (`is_current=True`, `status="active"`), ScheduledVisit rows with canonical schema field names (`scheduled_date`, `window_start`, `window_end`), all linked to the version.
    - TestDETSyncRespectsConsentGate (`consent_mode="digital"`): scheduling returns `success=False` with consent error; participant still created (so coordinators see them); no orphan ScheduleVersion.
    - TestDETSyncIdempotency: second DET fire updates participant fields but creates no second anchor / version / visits.
    - TestDETSyncLegacyConfigDeprecated (`event_battery_mapping` injected): deprecation warning logged AND visits still come from VisitWindow rows (count matches the no-shortcut path).
    - TestDETSyncStructuralCleanup: asserts `_create_visit_schedule` and `_create_visits_from_windows` are gone; asserts `_seed_anchor_date_and_schedule` and `_ensure_anchor_date` exist.
    - TestDETSyncCallsUnifiedScheduler: spies `UnifiedSchedulerService.schedule_visits_for_participant` to verify the canonical pipeline is invoked with the correct participant.
    - All 7 tests green; no regressions across §6 #5/#6/#17/#18 + unified scheduler suites (45 tests verified).
- Webhook secret stored plaintext in `study.config`. ✅ Fixed 2026-05-08
  `server/app/routers/redcap.py:2727` previously wrote `webhook_secret` straight into the JSONB config without going through `encrypt_token()`, which was asymmetric with `api_token` (:405). Any operator with read access to the studies table or a database backup could recover the HMAC secret used to authenticate REDCap DET callbacks; rotating the secret left the previous value in plaintext until overwritten. Resolution:
  - Write site (`routers/redcap.py:update_webhook_config`): `update.webhook_secret`, when set, is wrapped through `encrypt_token()` before being stored. An empty payload string (treated as "clear the secret") falls through unchanged so the stored value never becomes `enc:v2:<empty-ciphertext>`. `encrypt_token()` is idempotent on `enc:v1:`/`enc:v2:` inputs, so a re-PATCH carrying an already-encrypted value (theoretical caller path) does not double-wrap.
  - Read site (`routers/webhooks.py::_process_det_webhook`): the stored `webhook_secret` is now decrypted via `decrypt_token()` before being passed to `_validate_webhook_signature`. A `TokenEncryptionError` (e.g. key rotated without `REDCAP_ENCRYPTION_KEY_PREVIOUS`) is logged and the handler falls through to the "no secret configured" branch, which rejects in production via the existing gate. Failing closed beats a 500 on a public endpoint.
  - Audit-log redaction unchanged: the existing `had_webhook_secret`/`has_webhook_secret`/`webhook_secret_rotated` keys remain bool-only; secret values (plaintext or ciphertext) never enter `old_values`/`new_values`. New regression test asserts neither plaintext nor `enc:v2:` strings appear in either blob.
  - Migration script (`server/migrate_redcap_tokens.py`): refactored around an explicit `_ROTATABLE_FIELDS = ("api_token", "webhook_secret")` allowlist. The same `needs_rotation`/`rotate_token` pipeline applies to both fields, dry-run by default; `--apply` writes. The allowlist is asserted by a unit test so a future contributor cannot silently drop `webhook_secret` and regress the invariant.
  - Test coverage: 7 tests in `tests/test_webhook_secret_encryption.py` covering (a) PATCH writes `enc:v2:`, plaintext absent from stored value, decrypt round-trips; (b) explicit-empty PATCH does not encrypt the empty string; (c) DET request with HMAC computed over the plaintext secret is accepted (proves decrypt-before-compare); (d) wrong-secret signature is rejected; (e) audit log contains neither plaintext nor `enc:v2:` ciphertext; (f) `rotate_token` upgrades a plaintext webhook_secret in place; (g) `_ROTATABLE_FIELDS` includes both fields. All green; no regressions across `test_redcap_rbac.py`, `test_webhook_idempotency.py`, `test_token_encryption.py` (47 tests verified).
- Patient portal needs `survey` task type and stale-data indicator. ✅ Fixed 2026-05-07 (folded into #4)
  Resolved with the same patch set as §6 #4. Specifically: `task_type="survey"` is now a first-class portal task (model + migration `e3f4a5b6c7d8`), `start_task` returns the stored `external_url` (or 503 if missing, as a fail-safe), `VisitScheduleItem` exposes `redcap_event_name`/`redcap_sync_status`/`last_redcap_sync_at`/`is_stale`, and `GET /api/portal/randomization` surfaces the assignment as read-only with masked-label preference for blinded studies. See §6 #4 above for the full resolution.
- `redcap_event_sync.sync_events_to_visit_windows` commits on partial failure. ✅ Fixed 2026-05-08
  `server/app/services/redcap_event_sync.py:222` previously ran `db.commit()` unconditionally at the end of the per-event loop, even when `result.errors[]` was populated. A failure mid-batch (for example, REDCap returning a malformed event mapping or a manual battery override pointing at a deleted battery) would leave a half-synced set of `VisitWindow` rows persisted, with no rollback option for the operator. Resolution:
  - Outer SAVEPOINT: the per-event loop now runs inside `async with self.db.begin_nested():`. If any event raises in the inner try/except, `result.errors` is populated and the method raises an internal `_AtomicSyncFailure` so the SAVEPOINT discards every row added/updated by the events that did succeed. Mirrors the §6 #6 versioning service pattern.
  - No service-level commit: `await self.db.commit()` removed from the body. The FastAPI `get_db` dependency owns the outer transaction commit, so a partial failure cannot leak past the request boundary; tests and routes that previously relied on the in-service commit instead see the rows via the session's dirty set until the request completes.
  - Counter reset on rollback: `result.created` and `result.updated` are zeroed when the SAVEPOINT rolls back, since the actual rows were discarded. `result.errors` and `result.details` are preserved so the caller can see what was attempted and what failed.
  - All-or-nothing semantics chosen over per-event SAVEPOINTs: the operator mental model is "sync this batch of REDCap events"; partial success would leave window A consistent against a stale view of event B and force the operator to reason about which subset committed. Re-running the sync after fixing the underlying issue is the cleaner workflow. Per-event SAVEPOINTs remain available as a future enhancement if a study with hundreds of events ever needs partial progress.
  - Test coverage: 7 tests in `tests/test_redcap_event_sync_atomicity.py`:
    - TestSyncEventsAtomicity: success path persists all 3 windows; mid-batch failure (forced via monkeypatched `_match_instruments_to_battery` raising on event 2) returns `success=False` with the failed event's name in `result.errors`, `result.created == 0`, and zero `VisitWindow` rows in the DB; mid-batch failure during update_existing leaves a pre-seeded window's sentinel `target_day=999` and `name` unchanged (asserts SAVEPOINT undoes mutations to existing rows, not just inserts); contract test counts `db.commit()` invocations and asserts zero.
    - TestSyncEventsExitsCleanlyOnPreFlightFailure: REDCap fetch failure (early return path) does not open a SAVEPOINT and leaves the session usable for subsequent queries.
    - TestAtomicSyncFailureSentinelExists (structural guard): `_AtomicSyncFailure` is importable as an Exception subclass; `inspect.getsource(sync_events_to_visit_windows)` contains both `begin_nested` and `_AtomicSyncFailure`, so an accidental revert of the SAVEPOINT pattern fails the test rather than silently regressing the §6 #21 contract.
    - All 7 tests green; no regressions across `test_redcap_managed_lifecycle.py` (10 tests), `test_redcap_det_sync_versioning.py` (7 tests), `test_schedule_versioning_atomicity.py` (8 tests), `test_webhook_secret_encryption.py` (7 tests); 39 tests total verified.
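The all-or-nothing batch pattern from the event-sync fix can be demonstrated with the standard-library `sqlite3` module. This is a stand-in sketch, not the service code: the real implementation uses SQLAlchemy's async `begin_nested()`, whereas here the SAVEPOINT is issued as raw SQL, the table and event shapes are invented, and collecting every per-event error before raising the sentinel is an assumption about the loop's behaviour.

```python
import sqlite3
from typing import Any, Dict, List


class AtomicSyncFailure(Exception):
    """Hypothetical sentinel: signals that the whole batch must be rolled back."""


def sync_batch(conn: sqlite3.Connection, events: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Insert every event or none. The caller owns BEGIN/COMMIT; this never commits."""
    result: Dict[str, Any] = {"success": True, "created": 0, "errors": []}
    conn.execute("SAVEPOINT sync_batch")
    try:
        for event in events:
            try:
                if event["target_day"] is None:
                    raise ValueError("missing target_day")
                conn.execute(
                    "INSERT INTO visit_windows (name, target_day) VALUES (?, ?)",
                    (event["name"], event["target_day"]),
                )
                result["created"] += 1
            except Exception as exc:  # record the failure, keep inspecting the batch
                result["errors"].append(f"{event['name']}: {exc}")
        if result["errors"]:
            raise AtomicSyncFailure()
        conn.execute("RELEASE SAVEPOINT sync_batch")
    except AtomicSyncFailure:
        # Discard rows written by the events that did succeed.
        conn.execute("ROLLBACK TO SAVEPOINT sync_batch")
        conn.execute("RELEASE SAVEPOINT sync_batch")
        result["success"] = False
        result["created"] = 0  # counters must not report discarded rows
    return result
```

To exercise it, open the connection with `sqlite3.connect(":memory:", isolation_level=None)`, create a `visit_windows (name, target_day)` table, and wrap each call in an explicit `BEGIN` / `COMMIT` issued by the caller. That outer pair plays the role the plan assigns to the FastAPI `get_db` dependency: even when the caller commits after a failed batch, no partial rows are persisted.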
🟡 Minor (REDCap path — discovered 2026-05-06 review #2)¶
- Double retry layering in REDCap sync. ✅ Fixed 2026-05-10
  `redcap_sync._sync_session` ran a 3× outer retry loop wrapping `REDCapService.import_cognitive_data`, which itself retried via `_retry_with_backoff` (3×): worst case 9 PyCap calls per user-visible sync attempt. The outer loop also masked validation errors behind transient retries. Resolution:
  - Outer loop removed: `_sync_session` now calls `import_cognitive_data` exactly once per sync. The inner `_retry_with_backoff` (`server/app/services/redcap.py:54`, `MAX_RETRIES=3`) is the sole retry layer; transient errors are absorbed there before the result reaches `_sync_session`.
  - Validation vs reachability split: when REDCap returns a validation error (`"validation"`/`"invalid"` substring on the error message), the session is still marked `failed` per the §6 #14 fail-safe, but the circuit breaker (#23) treats the response as a reachability success. REDCap is up and rejecting bad data, so it would be wrong to disable the project while operators fix the payload.
  - Structural guard: `tests/test_redcap_retry_and_circuit_breaker.py::TestSingleRetryLayer::test_outer_retry_loop_removed_in_source` greps `_sync_session` source for `for attempt in range(`. A future contributor restacking retries fails this test before a regression ships.
  - Test coverage: 2 tests in the new suite (call-count assertion + structural guard).
- No circuit breaker on REDCap API. ✅ Fixed 2026-05-10
  A study with a misconfigured URL or persistently unreachable project would retry on every submission indefinitely, amplifying load and burning the retry budget without operator visibility. Resolution:
  - New service `app/services/redcap_circuit_breaker.py`: per-study breaker keyed by `study_id`, in-process state, async-safe via per-study locks. State machine: `closed` → `open` after `failure_threshold` consecutive reachability failures (default 5) → `half_open` after `cooldown_seconds` (default 60) → `closed` on a successful trial or back to `open` on a failed trial. Singleton `circuit_breaker` instance is what `redcap_sync` imports; thresholds are env-tunable via `REDCAP_CIRCUIT_BREAKER_FAILURE_THRESHOLD`/`REDCAP_CIRCUIT_BREAKER_COOLDOWN_SECONDS`/`REDCAP_CIRCUIT_BREAKER_ENABLED`.
  - Wired into `_sync_session`: `await circuit_breaker.allow(study_id)` runs before the REDCap call. When closed, the call proceeds and the breaker is updated with `record_success`/`record_failure` based on the result. When open, the session is marked `sync_status="failed"` with `error_type="circuit_open"` exactly as a real REDCap failure would be, preserving the §6 #14 no-fallback invariant. The breaker MUST NOT compute or substitute Metricis-side data; the module never reads session contents.
  - Validation errors are not reachability failures (see #22 split). The breaker counter is unaffected by payload-rejection responses.
  - Per-study isolation: opening the breaker for Study A leaves Study B unaffected (independent `_BreakerState` entries).
  - Test coverage: 11 tests in `tests/test_redcap_retry_and_circuit_breaker.py` covering the full state machine (open after threshold, short-circuit without PyCap call, half-open success closes, half-open failure re-opens, per-study isolation, validation errors don't trip), plus standalone breaker unit tests (disabled-mode, threshold validation, reset). Manual-clock injection avoids real `sleep` in tests.
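The breaker's state machine can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions: the real service is described as async with per-study locks and an enable/disable switch, both omitted here; the class name and method bodies are hypothetical, and only `_BreakerState`, `allow`, `record_success`, and `record_failure` are names taken from the plan. The injected `clock` mirrors the manual-clock approach the tests use.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class _BreakerState:
    state: str = "closed"  # "closed" | "open" | "half_open"
    consecutive_failures: int = 0
    opened_at: Optional[float] = None


class StudyCircuitBreaker:
    """Per-study breaker: closed -> open -> half_open -> closed (or back to open)."""

    def __init__(
        self,
        failure_threshold: int = 5,
        cooldown_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be >= 1")
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._states: Dict[int, _BreakerState] = {}  # independent entry per study

    def _get(self, study_id: int) -> _BreakerState:
        return self._states.setdefault(study_id, _BreakerState())

    def allow(self, study_id: int) -> bool:
        """Return False while open; after the cooldown, admit a half-open trial."""
        s = self._get(study_id)
        if s.state == "open":
            if self._clock() - (s.opened_at or 0.0) >= self.cooldown_seconds:
                s.state = "half_open"
                return True
            return False
        return True  # closed, or a half-open trial already in progress

    def record_success(self, study_id: int) -> None:
        s = self._get(study_id)
        s.state, s.consecutive_failures, s.opened_at = "closed", 0, None

    def record_failure(self, study_id: int) -> None:
        """Count a reachability failure; validation rejections should not call this."""
        s = self._get(study_id)
        s.consecutive_failures += 1
        if s.state == "half_open" or s.consecutive_failures >= self.failure_threshold:
            s.state = "open"
            s.opened_at = self._clock()
```

One simplification worth noting: this sketch admits every caller while half-open, whereas a production breaker would usually limit the half-open state to a single trial call.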
- `_find_study_by_redcap_project` is O(N studies) scan. ✅ Fixed 2026-05-10
  `routers/webhooks.py::_find_study_by_redcap_project` previously read every REDCap-mode study, decoded each row's JSONB `config` in Python, and scanned for a matching `config['redcap']['project_id']`. Run on every DET webhook fire, the cost grew linearly with the number of REDCap-managed studies in the deployment. Resolution:
  - Partial functional B-tree index added via Alembic migration `f0a1b2c3d4e5`: `CREATE INDEX ix_studies_redcap_project_id ON app.studies ((config -> 'redcap' ->> 'project_id')) WHERE integration_mode = 'redcap'`. Partial because the only consumer also filters on `integration_mode='redcap'`, so re-indexing metricis-mode rows would be wasted space.
  - Query rewritten as a single SQL equality lookup: `WHERE integration_mode = 'redcap' AND config['redcap']['project_id'].astext = :project_id`. Returns the single matching row directly, with no Python-side iteration and no JSONB-decode-per-row.
  - Symmetry with §6 #15 preserved: the partial index's `WHERE integration_mode = 'redcap'` predicate matches the `is_redcap_managed(study)` gate exactly. A metricis-managed study carrying a stray `config.redcap.project_id` is excluded from both the index and the query.
  - Test coverage: 7 tests in `tests/test_redcap_project_id_lookup.py` covering correctness (matching project_id resolved, unknown returns None, metricis-mode same-id rejected, redcap+metricis coexistence picks redcap, missing project_id key returns None, int → str coercion) and a structural DDL invariant that `pg_indexes` still contains the partial functional index with the JSONB expression and `WHERE integration_mode = 'redcap'` predicate. The DDL check guards against a future migration silently dropping the index and regressing to a sequential scan only visible under load.
- Public test endpoint `/webhooks/redcap/det/test` reflects payloads back unauthenticated. ✅ Fixed 2026-05-07 (folded into #16): the endpoint now returns 404 in production (`routers/webhooks.py:430` checks `settings.environment == "production"` before any reflection) and is rate-limited at 10/min via `webhook_limiter`. Asserted by `tests/test_production_gating.py` and `tests/test_redcap_rbac.py::TestDETTestEndpointProductionGate`.
Deferred to M9 — Metricis-managed mode hardening¶
These are real bugs but the first production study runs in REDCap-managed mode where REDCap handles randomization and scheduling. They block Metricis-managed mode from being production-ready, not the upcoming REDCap-managed launch.
🟠 Significant (Metricis-managed only)¶
- Randomization stratum form lookup misses form filter.
  `randomization_service.py:198-210` queries by `participant_id` + `workflow_status` only; `form_oid` is read but not filtered in the SQL `where`. Wrong form → wrong stratum.
- Randomization uses non-cryptographic RNG.
  `randomization_service.py:299` uses `random.choice`. For interventional trials, allocation should use `secrets.SystemRandom` to prevent prediction from a known seed.
- Minimization silently degrades to simple randomization.
  `randomization_service.py:425` falls back with only a warning. For studies that selected minimization, this is a quiet protocol deviation.
- Ad-hoc RBAC in randomization router.
  `routers/randomization.py:363,435,529,582,678`: role checks are inline instead of using a `require_role()` dependency. Risk of inconsistent enforcement around blinded data.
- Business Day Service: no timezone awareness, no holiday import path.
  `business_day_service.py` operates on naive `date` objects. `_holiday_cache` field is declared but never populated → N+1 queries in `bulk_calculate`. Studies start with empty `Holiday` tables unless manually seeded, so statutory holidays are easy to miss silently.
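For the non-cryptographic RNG item, the intended direction can be sketched in a few lines. This is an illustration of simple (unstratified) allocation only, and `allocate_arm` is a hypothetical function name, not one from `randomization_service.py`. `secrets.SystemRandom` draws from the operating system's entropy source, so its output cannot be reproduced from a seed the way the default `random` generator's can.

```python
import secrets
from typing import Sequence

# OS-backed generator: no seedable internal state to recover or replay.
_rng = secrets.SystemRandom()


def allocate_arm(arms: Sequence[str]) -> str:
    """Pick one arm uniformly at random using the OS entropy source."""
    if not arms:
        raise ValueError("at least one arm is required")
    return _rng.choice(arms)
```

A call such as `allocate_arm(["treatment", "placebo"])` returns one of the two labels. Block, stratified, or minimization schemes would layer their own bookkeeping on top, but could draw from the same generator.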
Test coverage gaps¶
| Area | Status | Priority |
|---|---|---|
| `token_encryption.py` | ✅ 15 tests (2026-05-07) | M7 critical |
| `routers/testing.py` (time simulation) | ✅ Asserted via structural invariant + service-layer tests (2026-05-07) | M7 critical |
| Production gating (unified) | ✅ tests/test_production_gating.py, 25 tests (2026-05-07) | M7 critical |
| `sync_status` no-fallback invariant | ✅ Asserted (2026-05-07) | M7 critical |
| `integration_mode ⇄ redcap_enabled` consistency | ✅ Asserted (2026-05-07) | M7 critical |
| DET webhook idempotency / replay | ✅ 7 tests (2026-05-07) | M7 critical |
| REDCap router RBAC | ✅ 25 tests (2026-05-07) | M7 critical |
| REDCap-managed study integration (E2E) | ✅ 10 tests across 5 stages, tests/test_redcap_managed_lifecycle.py (2026-05-07) | M7 significant |
| `schedule_versioning_service.py` | ✅ 8 atomicity tests, tests/test_schedule_versioning_atomicity.py (2026-05-08) | M7 |
| REDCap retry + circuit breaker | ✅ 11 tests, tests/test_redcap_retry_and_circuit_breaker.py (2026-05-10) | M7 minor |
| Anchor-shift reconciliation regression | ✅ 12 tests, tests/test_anchor_shift_reconciliation.py (2026-05-10) | M7 minor |
| REDCap project_id lookup + index DDL | ✅ 7 tests, tests/test_redcap_project_id_lookup.py (2026-05-10) | M7 minor |
| `business_day_service.py` | Zero tests | M9 |
7. Milestones¶
Phased delivery record. Each milestone groups related deliverables with a date span and links to the resulting subsystems. Cross-reference §3 for current state of each subsystem.
M0 — Platform foundation (through 2026-01-04)¶
Status: ✅ Shipped · Span: ~ → 2026-01-04
Deliverables: initial repo scaffolding, jsPsych client, FastAPI server, PostgreSQL schema, base Researcher Portal, REDCap site config, magic-link patient auth.
Closing milestone: scheduler service for visit schedules and reminders (1bccbe7, 2026-01-04).
Subsystems landed: #1, #2, #3, #5, #8, #18.
M1 — Major foundational push (2026-01-16)¶
Status: ✅ Shipped · Span: 2026-01-16 (single commit consolidation, dbc144c)
Deliverables: WebSocket real-time, PDF reports (ReportLab), participant CSV/Excel import, study templates, access control, mobile services, comprehensive E2E (Playwright) and server (pytest) test suites, GitHub Actions CI/CD, baseline documentation.
Subsystems landed: #5 (audit groundwork), #25 (test infra), #28 (CI/CD), Capacitor base for #27.
M2 — EDC core feature build (2026-01-17 → 2026-01-22)¶
Status: ✅ Shipped · Span: 6 days, ~20 commits
Deliverables:
- Cognitive assessment module registry expansion
- Capacitor native config + unified error handling + session security
- Notification templates, REDCap event sync, participant link service
- REDCap survey link generation and participant management endpoints
- VisitService for EDC visit management; SDTM validation + export; Assessment ODM Exporter
- Portal task service (unified task queue)
- Patient/Caregiver Portal completes (magic link, mobile-first, i18n)
- Form workflow state machine + form templates management
- Registry Follow-up Cohort service
Closing milestone: AI Agent Phase 1 (LLM service + foundation) — 51b0c4f, 2026-01-21.
Subsystems landed: #3, #6, #7, #9, #11, #13, #16, #19, #27.
M3 — AI Agent Phase 2 (2026-01-22)¶
Status: ✅ Shipped · Span: single-day milestone (30653f7, 2026-01-22)
Deliverables: PDF document processing service, protocol ingestion task, system prompt templates, ChangeSet creation from extracted study structure (events + forms), StudyAssistant portal page (711 LOC) with upload + run + ChangeSet review workflow.
Subsystem landed: #12.
M4 — EDC operations hardening (2026-01-23 → 2026-01-28)¶
Status: ✅ Shipped (with known issues) · Span: 6 days
Deliverables:
- Source Data Verification (SDV) and Form Validation services
- Quality flags & validation rule infrastructure
- Comprehensive documentation refresh; CI improvements (path filters, ESLint v9)
- Auto-logout on token expiry; portal_base_url config
- Business Day Service for scheduling (weekend/holiday) — ⚠️ §6 #8
- Schedule Versioning + Unified Scheduling services — versioned schedule snapshots tied to anchor date
- Comprehensive tests for migrations, rate limits, security headers, unified scheduler
- Randomization module — full stack (DB, services, API, portal UI, alembic) — ⚠️ §6 #3, #4, #5, #6
Subsystems landed: #14, #15, #17, #20.
M5 — Infrastructure & polish (2026-02-06 → 2026-02-12)¶
Status: ✅ Shipped (with known issues) · Span: 1 week
Deliverables:
- Multi-stage Docker builds for Nginx + FastAPI server (da0792b)
- TimeSimulationBanner for test-mode simulated date — ⚠️ §6 #1 (not gated to dev mode)
- Digit Symbol Matching Task expanded to 9 symbols
- Development server port refresh across configs
Subsystems landed: #23 (with caveats).
M6 — Security & docs hygiene (2026-04 → 2026-05)¶
Status: ✅ Partially shipped · Span: ongoing
Deliverables:
- Token encryption service for REDCap tokens (Fernet); tests for ConsentSign, Login, Schedule (4e3a55d, 2026-04-21) — ⚠️ §6 #2 (token encryption itself has no tests; falls back to plaintext)
- CLAUDE.md condensed 1530 → 326 lines (17056a8, 2026-05-06)
- This consolidated project plan hub (2026-05-06)
- REDCap router RBAC + audit logging + rate limiting + DET test endpoint production gate (2026-05-07) — fully resolves §6 #16. Two-stage rollout same day:
1. Initial commit (b3d022b): 22 reads gated to admin/researcher/coordinator, 12 mutations to admin/researcher, 7 high-risk endpoints to admin global-role; 7 audit-logged high-risk operations with secret redaction; slowapi rate limiting on the unauthenticated DET surface; 404 gate on /webhooks/redcap/det/test in production; 16 tests.
2. Hardening commit: replaced global-admin gate with explicit per-study admin/owner UserStudy membership check (_require_study_admin_membership). Even system admins must be explicit study members to rotate REDCap tokens, push data dictionaries, delete forms, sync participants, or update webhook secrets. Test coverage expanded to 25 tests including per-study isolation (admin on study A can't mutate study B).
- DET webhook idempotency + persistence + DLQ scaffolding (2026-05-07) — resolves §6 #17. Adds WebhookEvent model and migration d2e3f4a5b6c7; payload hashing with deterministic canonicalization; dedup window via configurable TTL; duplicates persisted as audit rows pointing back to the original; receipt persisted before background processing so a crash leaves observable state for retry. 7 tests covering persistence, idempotency, and hash properties.
Subsystems landed: #22 (with caveats), #19 (RBAC and DET webhook idempotency complete, per §6 #16 and #17; Celery-backed retry remains a future enhancement), documentation hub.
M7 — REDCap-managed study production readiness (engineering complete; sponsor sign-off pending)¶
Status: ✅ §6 closed (every 🔴/🟠/🟡 resolved) · cutover checklist landed (sponsor-study-cutover.md) · M7 ships when the first sponsor study walks the checklist and records sign-off in the audit log · Span: Q2 2026 · Priority: OPERATIONAL GATE — active development moves to M8
Recent progress:
- Shipped 2026-05-07:
  - §6 #16 (REDCap router RBAC + audit + rate limit + DET test gate)
  - §6 #17 (DET webhook idempotency + WebhookEvent persistence)
  - §6 #14 (sync_status no-fallback invariant + portal banner + dashboard alert)
  - §6 #15 (integration_mode unification via is_redcap_managed(study) helper + dual structural/row-level invariant)
  - §6 #1 (time simulation gating + audit)
  - §6 #2 + #7 (token encryption hardening with enc:v2: + key rotation)
  - §6 #3 (unified production gating test suite)
  - §6 #4 + #20 (patient-portal REDCap data ingestion: task_type="survey", stale-data flag on visit schedule, randomization read-only display)
  - §6 #5 (REDCap-managed study lifecycle E2E test, 10 tests across 5 stages; fixed two pre-existing defects in redcap_sync._build_submission_from_responses)
- Shipped 2026-05-08:
  - §6 #6 (schedule versioning atomic via begin_nested SAVEPOINT + auto_commit=False plumbed through unified/legacy/EDC schedulers; 8 tests)
  - §6 #18 (DET sync routed through ScheduleVersioningService + UnifiedScheduler + ConsentGate; removed broken _create_visit_schedule + bypassing _create_visits_from_windows; fixed default field mapping that referenced non-existent Participant columns; 7 tests)
  - §6 #19 (webhook_secret encryption parity: encrypt_token() symmetric with api_token, DET handler decrypts before HMAC validation, migration script extended to both fields via _ROTATABLE_FIELDS; 7 tests)
  - §6 #21 (REDCap event sync transactional boundary: per-event loop wrapped in db.begin_nested() SAVEPOINT, partial failure raises _AtomicSyncFailure to roll back the entire batch, service-level commit removed in favour of the FastAPI request lifecycle; 7 tests including a structural guard)
- Shipped 2026-05-10:
  - §6 #22 (collapsed REDCap retry layers: single retry layer at the REDCapService boundary, outer 3× loop removed; structural guard prevents re-stacking) and §6 #23 (per-study REDCap circuit breaker with closed/open/half_open state machine; opens after threshold reachability failures, short-circuits without touching PyCap, recovers via half-open trial; validation errors don't count as reachability; per-study isolation), shipped together with 11 tests in tests/test_redcap_retry_and_circuit_breaker.py
  - §6 #8 (anchor-shift reconciliation regression suite: 12 tests in tests/test_anchor_shift_reconciliation.py pinning completed-visit preservation, pending-visit soft-cancellation, version supersession, audit-metadata wiring, and original_target_date idempotency across multiple anchor shifts)
  - §6 #24 (partial functional index ix_studies_redcap_project_id via Alembic f0a1b2c3d4e5 + SQL-level project_id lookup in _find_study_by_redcap_project; 7 tests including a pg_indexes DDL invariant)
- §6 #25 is also marked closed (folded into #16).

All M7 §6 items are now closed.
Goal: ship a production-grade Metricis deployment for the first sponsor study, where REDCap is the source of truth for randomization, data collection forms, events, and visit windows and Metricis delivers the patient/caregiver experience on top.
Critical deliverables (block production):
- ✅ Gating audit + test suite (§6 #3, fixed 2026-05-07) — tests/test_production_gating.py (25 tests) proves every /api/dev/*, /api/testing/*, and /api/webhooks/redcap/det/test endpoint returns 404/403 in production; service-level + token-encryption defence-in-depth assertions included
- ✅ Time simulation gated (§6 #1, fixed 2026-05-07) — service gate (_time_simulation_allowed) + router-level production 404 + _require_dev_mode_admin on every mutation + AuditLog row + 3 invariant tests including dependency-tree walk over routers/testing.py
- ✅ Token encryption hardened (§6 #2 + #7, fixed 2026-05-07) — TokenEncryptionConfigError on missing key in production; plaintext reads rejected by default (one-time bridge via REDCAP_ENCRYPTION_ALLOW_PLAINTEXT_READS with CRITICAL log); enc:v2: write format with enc:v1: back-compat; key rotation via REDCAP_ENCRYPTION_KEY_PREVIOUS + MultiFernet + rotate_token(); one-shot migration script migrate_redcap_tokens.py; 15 unit tests
- ✅ sync_status consumption invariant (§6 #14, fixed 2026-05-07) — invariant test asserts no service reads Session.sync_status for fallback decisions; patient portal banner + coordinator dashboard alert
- ✅ integration_mode unification (§6 #15, fixed 2026-05-07) — is_redcap_managed(study) is the canonical gate; every gating call site migrated; structural reader-allowlist invariant + row-level consistency invariant + helper unit tests
- ✅ REDCap router RBAC + audit + rate limiting (§6 #16, fixed 2026-05-07) — require_role per role-class; per-study admin/owner check on 7 high-risk endpoints; AuditLog rows with secret redaction; DET webhook rate-limited; /webhooks/redcap/det/test 404 in production; 25 tests
- ✅ DET webhook idempotency + replay protection (§6 #17, fixed 2026-05-07) — WebhookEvent model + composite dedup index + payload-hash canonicalisation + duplicate-row persistence + 7 tests; Celery transport remains a future enhancement
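As a rough illustration of the payload-hash canonicalisation and duplicate-row persistence mentioned for §6 #17, the sketch below hashes a DET payload in a key-order-independent way and records every delivery, flagging repeats. The composite key `(project_id, payload_hash)`, the `InMemoryDedup` class, and the row fields are assumptions for the example; the real `WebhookEvent` columns and index are not described in this hub.

```python
import hashlib
import json
from typing import Dict, List, Mapping, Set, Tuple


def canonical_payload_hash(payload: Mapping[str, str]) -> str:
    """Hash a payload so that key order and whitespace do not matter."""
    canonical = json.dumps(
        dict(payload), sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryDedup:
    """Stand-in for a table with a composite dedup index (hypothetical)."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()
        self.rows: List[Dict[str, object]] = []

    def record(self, project_id: str, payload: Mapping[str, str]) -> bool:
        """Persist the delivery; return True only if it should be processed."""
        payload_hash = canonical_payload_hash(payload)
        key = (project_id, payload_hash)
        is_duplicate = key in self._seen
        self._seen.add(key)
        # Duplicates are still written, so replays stay visible for audit.
        self.rows.append(
            {
                "project_id": project_id,
                "payload_hash": payload_hash,
                "duplicate": is_duplicate,
            }
        )
        return not is_duplicate
```

Sorting keys before hashing means two deliveries that differ only in field order map to the same hash, so a replayed webhook is detected even if the sender reorders its form fields.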
Significant deliverables (REDCap path):
- ✅ Patient portal REDCap data ingestion (§6 #4, #20, fixed 2026-05-07) — task_type="survey" with external_url launcher (idempotent on (participant, instrument, event, repeat_instance)), start_task returns 503 if URL missing (fail-safe); VisitScheduleItem carries redcap_event_name/redcap_sync_status/last_redcap_sync_at/is_stale (stale = failed sync OR never reconciled OR >24h); new GET /api/portal/randomization with masked-label preference for blinded studies; migration e3f4a5b6c7d8; 14 tests
- ✅ REDCap-managed study E2E test (§6 #5, fixed 2026-05-07) — tests/test_redcap_managed_lifecycle.py: 10 tests across 5 stages (event sync, visit scheduling, portal delivery, completion sync success+failure, portal stale-data signalling). REDCap I/O mocked at module-level fetchers and REDCapService._get_project. Surfaced and fixed two pre-existing critical defects in redcap_sync._build_submission_from_responses (AssessmentMetadata import drift + missing min_rt/max_rt defaults) — both would have raised at production runtime and silently classified syncs as failed.
- ✅ Route DET sync through UnifiedScheduler + ConsentGate (§6 #18, fixed 2026-05-08) — _create_visit_schedule and _create_visits_from_windows removed (the former referenced non-existent columns and would have TypeError'd at runtime); replaced with _seed_anchor_date_and_schedule that creates a ParticipantAnchorDate (source_type=redcap_det) and delegates to ScheduleVersioningService.create_schedule_version. Default field mapping pruned of non-existent Participant columns. Legacy event_battery_mapping logged as deprecation. 7 tests in tests/test_redcap_det_sync_versioning.py.
- ✅ Webhook secret encryption parity (§6 #19, fixed 2026-05-08) — encrypt_token() symmetric with api_token at the write site (routers/redcap.py::update_webhook_config); DET handler decrypts via decrypt_token() before HMAC validation, with graceful fallback to "no secret" → production reject on rotation/key errors; migrate_redcap_tokens.py extended via _ROTATABLE_FIELDS = ("api_token", "webhook_secret"); 7 tests in tests/test_webhook_secret_encryption.py.
- ✅ REDCap event sync transactional boundary (§6 #21, fixed 2026-05-08) — sync_events_to_visit_windows wraps its per-event loop in a db.begin_nested() SAVEPOINT and raises _AtomicSyncFailure on partial failure so the SAVEPOINT discards every row added/updated by the events that did succeed; service no longer calls db.commit() directly (FastAPI get_db owns the outer commit, mirroring §6 #6). 7 tests in tests/test_redcap_event_sync_atomicity.py including a structural guard that fails if begin_nested or _AtomicSyncFailure is removed from the implementation.
- ✅ Schedule versioning transaction boundary (§6 #6, fixed 2026-05-08) — create_schedule_version and regenerate_schedule wrapped in db.begin_nested() SAVEPOINTs; auto_commit=False plumbed through UnifiedScheduler/SchedulerService/VisitService so inner commits no longer break the boundary; orphan versions on partial failure now impossible. 8 atomicity tests in tests/test_schedule_versioning_atomicity.py.
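The SAVEPOINT pattern behind §6 #21 and §6 #6 can be sketched with the standard-library `sqlite3` module in place of the SQLAlchemy session. This is a minimal illustration under stated assumptions: the `visit_windows` table, the `sync_events` function, and the "empty name is an error" rule are invented for the example, and the real services use `db.begin_nested()` rather than raw SQL.

```python
import sqlite3
from typing import Iterable


class _AtomicSyncFailure(Exception):
    """Sentinel raised inside the SAVEPOINT to force a full rollback."""


def sync_events(conn: sqlite3.Connection, events: Iterable[str]) -> bool:
    """Insert all events or none. Never commits: the caller owns the
    outer transaction, as the request-scoped dependency does in the app."""
    conn.execute("SAVEPOINT event_sync")
    try:
        for name in events:
            if not name:
                raise _AtomicSyncFailure("event has no name")
            try:
                conn.execute(
                    "INSERT INTO visit_windows (event_name) VALUES (?)", (name,)
                )
            except sqlite3.Error as exc:
                raise _AtomicSyncFailure(str(exc)) from exc
    except _AtomicSyncFailure:
        # Discard every row written by the events that did succeed.
        conn.execute("ROLLBACK TO SAVEPOINT event_sync")
        conn.execute("RELEASE SAVEPOINT event_sync")
        return False
    conn.execute("RELEASE SAVEPOINT event_sync")
    return True
```

A caller would open the connection with `isolation_level=None`, issue `BEGIN`, call `sync_events`, and then `COMMIT` itself. A failed sync leaves the outer transaction usable and free of partial rows.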
Subsystems affected: #5, #9, #14, #15, #18, #19, #22, #23, #24, #25.
M8 — AI Agent Phase 3 (active)¶
Status: 🔄 In progress · Priority: PRIMARY (active development) · Spec: ai-agent-implementation-plan.md
Recent progress: battery_config AI task shipped 2026-05-10 — task class at server/app/services/agent_tasks/battery_config.py, prompt template + JSON output schema at server/app/services/agent_prompts/battery_config.py, dispatch wired in AgentService.execute_task. 9 unit tests in tests/test_agent_battery_config.py cover happy path (5-item ChangeSet with 2 batteries + 3 event assignments), existing-VisitWindow anchoring, missing-document error, non-dict LLM output rejection, unexpected exception wrapping, and AgentService dispatch contract.
Goal: AI assistance for jsPsych battery and assessment configuration.
Deliverables:
- ✅ Battery configuration agent task (battery_config)
- 🔜 Event-linking agent task (event_linking)
- 🔜 Portal inline assists ("Suggest module selection" in BatteryBuilder)
- 🔜 Validation pipeline for battery proposals
- 🔜 Apply path implementation (shared with Phase 2)
M9 — Metricis-managed mode hardening (deferred)¶
Status: 🔜 Deferred · Priority: LOW (until Metricis-managed studies become a target)
Rationale: the first production study runs in REDCap-managed mode where REDCap handles randomization and scheduling. The bugs in §6 #9–13 do not block the M7 launch. They will block any future Metricis-managed mode study, so they remain on the roadmap.
Deliverables:
- Randomization fixes: stratum form lookup (§6 #9), secrets.SystemRandom (#10), minimization integrity (#11), require_role() dependency (#12)
- Business Day Service: timezone awareness, holiday import path, populate _holiday_cache (§6 #13)
- Tests for randomization, business day service
M10 — AI Agent Phase 4 (later)¶
Status: 🔜 Planned · Priority: LATER
Goal: Amendment Impact Analysis & Cutover Planning agent capabilities.
8. Document Conventions¶
- This hub stays high-level: status, sequencing, progress, known issues. Detailed designs go in spoke documents.
- Update §3 Status table when a subsystem changes state.
- Add or update a milestone in §7 Milestones whenever a coherent body of work ships. Group related commits under a single dated milestone rather than one entry per commit.
- Add to §6 Known Issues whenever a code review surfaces something — link to the file:line. Move resolved items to a "Resolved" subsection or delete if minor.
- Archive superseded plans to docs/archive/completed-plans/ with a date-stamped filename.
9. Spec Spokes¶
Specifications referenced from this hub. Edit the spec, not the hub, for design changes.
Active specs¶
| Spec | Purpose |
|---|---|
| ai-assistant.md | Full AI Agent Facilitation specification (1333 lines) |
| ai-agent-implementation-plan.md | Phased AI Agent implementation plan (Phases 1–4) |
| patient-caregiver-portal.md | Patient/Caregiver Portal specification (✅ implemented; spec header is stale) |
| participant-registry.md | Patient registry framing and architecture |
| rare-disease-integration.md | Rare disease & pediatric adaptations (✅ all phases shipped) |
| edc-implementation-plan.md | EDC core implementation detail |
| redcap-det-webhook-implementation.md | REDCap Data Entry Trigger integration |
| ux-reviewer.md | UX review process for the portal |
Reference¶
- CLAUDE.md — engineering context and architectural concepts
- docs/PAGES.md — page-by-page reference for both portals
- docs/api/ — API reference (Swagger UI live at /docs)
- docs/architecture/ — architecture diagrams and design records
- docs/guides/ — developer and administrator guides
- docs/features/ — user-facing feature documentation
Archived¶
- docs/archive/completed-plans/project-plan-2026-05-06.md — previous top-level project plan, superseded by this hub
- Other completed/archived plans in docs/archive/completed-plans/
10. How to Update This Document¶
When you ship something:
- Update the relevant row in §3 Status by Subsystem (state, evidence link).
- Either extend an active milestone in §7 or open a new one if it's a distinct phase of work.
- Update the "Last updated" stamp at the top.
When you find a bug or risk during review:
- Add it to §6 Known Issues with severity, file:line reference, and a one-line "Fix:" stub.
- Once fixed, move it to a "Resolved" subsection with the resolving commit hash, or delete if minor.
When you write a new spec:
- Place it in docs/project-plan/ (feature specs) or docs/plans/ (implementation plans).
- Add a row to §9 Spec Spokes pointing to it.
- Cross-link from the spec back to this hub via a **Parent Document:** line.