Executive Summary
Top 3 Priorities
1. Replace X-User-Id header auth with real tokens before any real tenant data is ingested
2. Lock down the CORS fallback from wildcard to an explicit allowlist
3. Add structured request logging and PHI-safe observability (nothing keeps PHI out of logs today)
What's Working Well
- Parameterized queries throughout — no SQL injection surface
- Zod validation on all write endpoints
- PHI scanning middleware with a 10-pattern regex suite
- Immutable versioning with full snapshot history
- 42 CFR Part 2 SUD constraint enforced at both schema and Zod levels
- Credit ledger is event-sourced with tamper-evident receipt hashes
- Comprehensive test coverage for happy paths and auth flows
Finding Summary
- Critical: 2 findings
- Major: 6 findings
- Minor: 8 findings
- Note: 5 findings
🚨 Security Flags — Critical First
S-1 (Critical) `src/middleware/tenant.ts:16` The entire API is authenticated by a single header value: `X-User-Id`. Any caller who obtains or guesses a valid UUID can impersonate that user. This is acknowledged in the codebase as "Phase 1 auth" pending Auth0/Clerk, but it must be flagged explicitly because the platform is designed to hold PHI-adjacent clinical data and is nearing multi-tenant production readiness.
The risk is compounded by several adjacent issues: (a) there is no token expiry, (b) there is no device binding, (c) there is no MFA surface, and (d) there is no IP-based gating. An attacker who intercepts or reads a single logged user ID has full authenticated access. This needs to be the first thing replaced before real tenants are onboarded.
What to do: Complete Phase 2 (Auth0 or Clerk). In the interim, add an API key layer as a bridge — even a static per-tenant bearer token checked against the database is vastly better than a bare UUID header. Consider gating all non-demo traffic behind VPN or IP allowlist until real auth ships.
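As a sketch of that interim bridge, assuming a stored per-tenant token hash (the `tokenMatches` helper and the hash storage scheme are illustrative, not existing code), the comparison should at least be constant-time:

```typescript
import { createHash, timingSafeEqual } from 'crypto';

// Constant-time comparison of a presented bearer token against a stored
// SHA-256 hash. Hashing first normalizes length, which timingSafeEqual
// requires, and means the raw token never needs to sit in the database.
function tokenMatches(presented: string, storedHashHex: string): boolean {
  const presentedHash = createHash('sha256').update(presented).digest();
  const storedHash = Buffer.from(storedHashHex, 'hex');
  return (
    presentedHash.length === storedHash.length &&
    timingSafeEqual(presentedHash, storedHash)
  );
}
```

A middleware would look up the tenant's stored hash and call `tokenMatches` against the presented `Authorization: Bearer` value before falling through to the existing user resolution.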
S-2 (Critical) `src/server.ts:15-18` The CORS config reads `process.env.CORS_ORIGIN || '*'`. If CORS_ORIGIN is not set — in staging, CI, or any deployment that forgets the env var — the API accepts cross-origin requests from any domain. The CloudFormation template correctly sets a production origin (https://platform.stratumcollective.co), but nothing enforces that this env var is set in non-production deployments.
On a healthcare platform, a wildcard CORS policy invites a malicious page hosted anywhere to make requests against the API. In practice, browsers refuse to send credentials when `credentials: true` is combined with a wildcard origin, which blunts the worst outcome, but the configuration's intent is wrong and a wildcard remains overly permissive for a public API fronting a sensitive data store.
What to do: Default CORS origin to https://platform.stratumcollective.co in the code. Only allow localhost variants when NODE_ENV === 'development'. Never allow * for a credentialed API. Fail the server startup if CORS_ORIGIN is missing in production.
S-3 (Major) `src/routes/precedents.ts:43`, `src/routes/marketplace.ts:124` Error handlers log raw error objects via `console.error('Create precedent error:', err)`. If a database error, Zod parse error, or upstream exception happens to contain user-supplied text (which in this system can include clinical language), that data ends up in whatever logging sink is consuming stdout.
More directly: if someone submits a request containing PHI that slips past the regex guards (e.g., a name-only or context-dependent identifier), it will be logged to console. HIPAA's 2025 proposed updates eliminate the "addressable vs. required" distinction, making log sanitation a hard requirement rather than an addressable one.
What to do: Replace bare console.error with a structured logger (Winston or Pino) configured with a PHI sanitizer transform. Log error codes and stack traces, never request bodies. Apply this to the entire codebase.
S-4 (Major) `src/db/connection.ts:16` The production SSL config uses `ssl: { rejectUnauthorized: false }`. This disables certificate chain validation, leaving the database connection vulnerable to man-in-the-middle attacks. Any adversary who can intercept traffic between the Node.js process and the RDS instance could present a forged certificate and read all database traffic, including clinical precedent data.
AWS App Runner connects to RDS via a VPC connector, so in the specific CloudFormation deployment the attack surface is reduced — but the policy is wrong and will cause problems if the database is ever accessed from outside that VPC boundary (e.g., admin scripts, data pipelines, disaster recovery procedures).
What to do: Set rejectUnauthorized: true and provide the RDS CA bundle (downloadable from AWS) as the ca property. AWS publishes regional RDS CA certificates specifically for this purpose.
S-5 (Major) `src/middleware/audit.ts`, `src/routes/precedents.ts:229-241` The export audit log call is wrapped in `try/catch { /* non-critical */ }`. The audit middleware itself has no retry logic, no dead-letter queue, and no alerting. If the audit table write fails for any reason (DB connection blip, constraint violation, disk pressure), the event is silently swallowed.
HIPAA audit log requirements (45 CFR §164.312(b)) are not optional — you must be able to prove who accessed what and when. A silently failing audit log is worse than no audit log because it creates a false sense of compliance. The 7-year retention schema exists, but retention of records that were never written is zero.
What to do: Audit writes should fail loudly via structured logging (never silently catch). For the export endpoint specifically, consider moving audit logging before the response is sent (with a circuit breaker pattern) rather than fire-and-forget after.
S-6 (Major) `src/server.ts:22-27` The rate limiter is applied globally at 500 requests per 15-minute window. For a healthcare API serving a small number of known tenants, that is generous enough that an abusive tenant, or an attacker holding a valid user ID, can do substantial damage without ever hitting the limit.
There is also no differentiation between the paid marketplace routes (which burn credits) and the free registry routes. A caller could hammer /api/registries/payers 499 times without any rate limiting consequence.
What to do: Implement tiered rate limiting: stricter limits on write endpoints (50/15min per tenant), moderate on reads (200/15min), and lenient on public registry endpoints (1000/15min with no auth). Use keyGenerator in express-rate-limit to key by tenant_id (from the auth context) rather than IP address.
S-7 `src/middleware/phi-validation.ts:6-16` The PHI guard covers 10 patterns. HIPAA Safe Harbor (45 CFR §164.514(b)(2)) defines 18 identifier types. Missing coverage includes: geographic data smaller than state (zip codes), dates other than year (birth dates, admission dates), ages over 89, vehicle and device identifiers, biometric identifiers (finger/voice prints), and full-face photographs. A field containing a zip code or date of birth would pass through undetected.
The existing patterns also have some fragility: the Phone pattern would match many non-PHI numeric sequences, and the "Date" pattern for MM/DD/YYYY would block date ranges that appear in clinical policy references (e.g., "see §4.2 dated 01/15/2024").
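A sketch of patterns for three of the missing Safe Harbor categories: zip codes, ISO-format dates (which the existing MM/DD/YYYY pattern would miss), and ages over 89. The names, regexes, and thresholds below are illustrative, not drawn from the existing phi-validation.ts suite, and would need the same context-tuning flagged above:

```typescript
// Illustrative additions to the PHI pattern suite — these names and regexes
// are sketches, not the project's actual phi-validation.ts contents.
const additionalPhiPatterns: Record<string, RegExp> = {
  // Geographic data smaller than state: 5-digit or ZIP+4 codes
  zipCode: /\b\d{5}(?:-\d{4})?\b/,
  // Dates more specific than a year, in ISO format (YYYY-MM-DD)
  isoDate: /\b\d{4}-\d{2}-\d{2}\b/,
  // Ages over 89 must be aggregated under Safe Harbor
  ageOver89: /\b(?:9\d|1[0-9]\d)[-\s]?(?:year[-\s]?old|y\/?o|yrs?)\b/i,
};

// Return the names of all patterns that fire on the given text.
function findPhiMatches(text: string): string[] {
  return Object.entries(additionalPhiPatterns)
    .filter(([, re]) => re.test(text))
    .map(([name]) => name);
}
```

Like the existing suite, these would still produce false positives (a zip-shaped number in a policy ID, for example), so they belong behind the same review workflow rather than a hard block.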
S-8 `src/server.ts:14` `helmet()` is applied with no configuration. This installs Helmet's defaults, including a Content Security Policy, but the default CSP still permits inline styles and `data:` image URIs. For a pure JSON API (no HTML responses), Helmet should be configured explicitly: disable features irrelevant to an API (`crossOriginEmbedderPolicy`, etc.) and set a minimal CSP (`default-src 'none'`). Additionally, `hsts` should be explicitly configured with a long max-age and `includeSubDomains`.
S-9 `src/registries/index.ts:61-64` Registry files are loaded with `readFileSync` inside the lazily-called `loadRegistry` function. If the sibling stratum-corpus-data repo is missing in a deployment, the first request to `/api/registries/*` throws an unhandled exception that propagates through the route handler. The public registry routes have no specific error handling for this case; they'd return a 500 with the generic error handler's message. A startup check would give a clearer operational signal.
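A minimal boot-time presence check, assuming the corpus path documented in CLAUDE.md (the function name and the error wiring are illustrative):

```typescript
import { existsSync } from 'fs';
import { resolve } from 'path';

// Fail fast at boot if the corpus checkout is missing, instead of throwing
// on the first /api/registries/* request. The relative path mirrors the one
// the registry loader uses; adjust if the layout differs.
function corpusDirPresent(baseDir: string): boolean {
  return existsSync(resolve(baseDir, '../stratum-corpus-data/registries'));
}

// At server boot, before listen():
// if (!corpusDirPresent(process.cwd())) {
//   throw new Error('stratum-corpus-data checkout not found — see CLAUDE.md');
// }
```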
Debugging Findings — All Categories
API / UX Design
A-1 `src/routes/precedents.ts:244` The bulk export endpoint is registered as `POST /precedents/export`, which comes after `DELETE /precedents/:id` in the route file. Because the POST verb differs from DELETE/GET, Express routes this safely — but the inline comment "// IMPORTANT: This route must come AFTER..." in marketplace.ts shows this pattern has already caused real confusion. The same anti-pattern appears in the precedents router: mixing static path segments (`/export`) with dynamic ones (`/:id`) in the same router without an explicit route-ordering convention is fragile as the codebase grows.
Additionally, POST /export is a non-REST pattern for what is effectively a filtered GET. Consider GET /export with query params, or a dedicated export endpoint namespace.
A-2 `src/routes/precedents.ts:152-159` `GET /precedents/:id/versions` returns a raw array. All other list endpoints in this codebase return `{ data: [...], total, limit, offset }`. This inconsistency means API consumers need different handling for this one endpoint. For a heavily versioned precedent object, this endpoint could also return an unbounded number of rows with no pagination.
A-3 `src/routes/precedents.ts:163-171` `POST /precedents/:id/apply` returns only `{ success: true }` after incrementing `reuse_count`. The caller must make a second GET request to see the updated state; in the test suite, this is exactly what happens (lines 373-388). REST convention is to return the mutated resource. A minor but recurring DX friction point.
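One way to return the mutated resource in a single round-trip is `UPDATE ... RETURNING *`. The SQL below is a sketch against the column names used elsewhere in this report, not the project's actual query:

```typescript
// One statement: increment reuse_count and return the updated row,
// so the caller does not need a follow-up GET. Column names are
// illustrative, based on the fields this report references.
const applyPrecedentSql = `
  UPDATE precedent_objects
     SET reuse_count = reuse_count + 1
   WHERE id = $1 AND tenant_id = $2
  RETURNING *`;

// Sketch of the handler wiring:
// router.post('/:id/apply', async (req, res) => {
//   const row = await queryOne(applyPrecedentSql, [req.params.id, req.auth.tenant_id]);
//   if (!row) return res.status(404).json({ error: 'Not found' });
//   res.json(row);
// });
```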
A-4 The API has no versioning prefix (`/v1/`). For an early-stage product this is understandable, but the absence of versioning means any breaking schema or route change will require coordinated deploys with all API consumers. Given that the CLAUDE.md notes "Phase 2 will use Auth0/Clerk" — a breaking auth change — the migration will be easier if there's a `/api/v1/` prefix already in place. The health check is at `/health` (un-versioned), which is correct convention.
A-5 `src/routes/registries.ts` The public registry routes use manual query param destructuring rather than Zod schemas. A typo (`?famly=Anthem` instead of `?family=Anthem`) silently returns unfiltered full results. This is not a security issue (the data is public and read-only), but it is a developer experience problem: callers get no indication that their filter was ignored.
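The underlying fix, rejecting unknown query keys so typos fail loudly, can be sketched without any library (Zod's `.strict()` object schemas give the same behavior). The filter whitelist here is illustrative, not the route's real parameter list:

```typescript
// Whitelist of filters the payer registry endpoint understands (illustrative).
const allowedFilters = new Set(['family', 'state', 'plan_type']);

// Reject unknown query keys instead of silently ignoring them, so a typo
// like ?famly=Anthem becomes a 400 rather than an unfiltered result set.
function parseRegistryQuery(
  query: Record<string, unknown>
): Record<string, string> {
  const unknown = Object.keys(query).filter((k) => !allowedFilters.has(k));
  if (unknown.length > 0) {
    throw new Error(`Unknown query parameter(s): ${unknown.join(', ')}`);
  }
  return Object.fromEntries(
    Object.entries(query).map(([k, v]): [string, string] => [k, String(v)])
  );
}
```

The route handler would catch the error and respond 400 with the message, mirroring how the Zod-validated write endpoints already report bad input.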
Performance
P-1 `src/pipeline/runner.ts:119-125` The full pipeline runner uses a sequential `for...of` loop over all clusters. Each iteration calls `processCluster()`, which issues multiple DB queries and transactions. At scale (hundreds of clusters), this will run for many minutes inside a synchronous HTTP request triggered by `POST /marketplace/pipeline/trigger`. The HTTP response will either time out (App Runner has a 120s default) or hold the request open for the duration.
The admin/allocate-monthly endpoint has the same problem: it runs one await query() per tenant in a serial loop with no batching, no transaction wrapping the whole operation, and no idempotency guard against running twice.
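A sketch of making the allocation both batched and idempotent: one INSERT ... SELECT keyed on an allocation period. This assumes a `period` column and a UNIQUE (tenant_id, period) constraint, neither of which exists in the schema today, plus illustrative tenant column names:

```typescript
// Derive a stable period key so a re-run within the same month is a no-op.
function allocationPeriod(d: Date): string {
  return `${d.getUTCFullYear()}-${String(d.getUTCMonth() + 1).padStart(2, '0')}`;
}

// Single statement for all active tenants; ON CONFLICT makes double-triggering
// safe. Assumes a UNIQUE (tenant_id, period) constraint on the ledger and the
// tenant columns shown — additions, not something in the current schema.
const allocateMonthlySql = `
  INSERT INTO credit_ledger (tenant_id, amount, reason, period)
  SELECT id, monthly_credit_allowance, 'monthly_allocation', $1
    FROM tenants
   WHERE active = true
  ON CONFLICT (tenant_id, period) DO NOTHING`;
```

One round-trip replaces the per-tenant loop, and the unique constraint is the idempotency guard the current endpoint lacks.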
P-2 `src/routes/precedents.ts:253` The bulk export endpoint overrides the limit to 1000 and loads all rows into memory before streaming the CSV. For a platform accumulating precedents across many tenants, 1000 rows of JSONB-heavy records (each with full evidence kits, narrative templates, and traceability ledgers) will create significant memory pressure. Note that the 1mb JSON body limit only bounds inbound requests; nothing bounds the export assembled in memory, so a sufficiently large export can OOM the instance.
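A driver-agnostic sketch of batched export: `fetchPage` is an injected page loader (it could be backed by keyset queries or pg-query-stream), so only one batch of rows sits in memory at a time. Names are illustrative:

```typescript
// Stream rows in fixed-size batches instead of loading 1000 JSONB-heavy
// records at once. `fetchPage` abstracts the DB access so the loop itself
// is independent of the driver.
async function* exportInBatches<T>(
  fetchPage: (limit: number, offset: number) => Promise<T[]>,
  batchSize = 100
): AsyncGenerator<T> {
  let offset = 0;
  while (true) {
    const page = await fetchPage(batchSize, offset);
    for (const row of page) yield row;
    if (page.length < batchSize) break; // short page: no more rows
    offset += page.length;
  }
}
```

The CSV writer would consume the generator row by row and flush to the response stream, keeping peak memory proportional to one batch rather than the whole export.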
P-3 `src/routes/marketplace.ts:80`, `src/middleware/marketplace-access.ts:34` The `getBalance()` function issues a `SUM(amount)` aggregate query. In the `/profiles/:payer` endpoint, balance is checked once pre-deduction and the ledger INSERT happens in a separate transaction. This is a read-before-write without a locking mechanism: a highly concurrent tenant could burn more credits than they have if two requests race to the balance check before either commits. The cluster detail flow in marketplace-access.ts does use a transaction for the deduct-then-read pattern, which is correct. The profile endpoint does not.
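One way to close the race is to serialize check-then-deduct per tenant with a transaction-scoped advisory lock. The sketch below assumes `exec` is bound to a single open transaction; the lock key derivation and ledger columns are illustrative:

```typescript
type Exec = (sql: string, params?: unknown[]) => Promise<{ rows: any[] }>;

// Serialize balance check + deduction per tenant inside one transaction.
// pg_advisory_xact_lock blocks a concurrent deduction for the same tenant
// until this transaction commits, closing the read-before-write race.
async function deductCredits(
  exec: Exec,
  tenantId: string,
  amount: number
): Promise<boolean> {
  await exec('SELECT pg_advisory_xact_lock(hashtext($1))', [tenantId]);
  const res = await exec(
    'SELECT COALESCE(SUM(amount), 0) AS balance FROM credit_ledger WHERE tenant_id = $1',
    [tenantId]
  );
  if (Number(res.rows[0].balance) < amount) return false; // insufficient
  await exec(
    'INSERT INTO credit_ledger (tenant_id, amount, reason) VALUES ($1, $2, $3)',
    [tenantId, -amount, 'profile_access']
  );
  return true;
}
```

Wrapping the same SUM-check-then-INSERT the endpoint already performs in a lock-guarded transaction changes no business logic, only its atomicity.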
P-4 `src/models/precedent.ts:266-279` The search function runs a COUNT query followed by a paginated SELECT. These could be combined into a single round-trip using a window function (`COUNT(*) OVER ()`), halving the round-trips for list operations. Minor at current data volumes but worth noting as the corpus grows.
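The combined-query pattern, with a small helper to peel the window-function total off the first row (column and helper names are illustrative):

```typescript
// COUNT(*) OVER () repeats the pre-LIMIT total on every returned row,
// collapsing the COUNT + SELECT pair into one round-trip.
const searchSql = `
  SELECT p.*, COUNT(*) OVER () AS total_count
    FROM precedent_objects p
   WHERE tenant_id = $1
   ORDER BY created_at DESC
   LIMIT $2 OFFSET $3`;

// Split the repeated total off the row set for the list envelope.
function splitTotal<T extends { total_count?: string | number }>(
  rows: T[]
): { data: T[]; total: number } {
  return {
    data: rows,
    total: rows.length ? Number(rows[0].total_count) : 0,
  };
}
```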
Code Quality / Architecture
C-1 `src/routes/precedents.ts:238` The `logAuditExport` function uses `require('../middleware/audit')` at runtime instead of a static import. This bypasses TypeScript module resolution, removes type checking on the imported function, and could fail silently if the path changes. It appears to have been done to avoid a circular dependency, but a static import would work fine here: the audit middleware doesn't import from the routes.
C-2 `src/middleware/marketplace-access.ts:11` The Express Request augmentation declares `marketplaceData?: any`. This means the cluster detail route at `marketplace.ts:60` spreads untyped data directly into the response. It should be typed as `MarketplaceCluster | null`.
C-3 `src/db/startup.ts:34` startup.ts calls `await pool.end()` in the finally block. If startup.ts is ever integrated into the main server boot sequence (which the CLAUDE.md description implies it might be), ending the shared pool would break all subsequent database queries. Currently it appears to be run as a separate script, not imported by server.ts, so this is not an active bug — but it's a trap waiting to be triggered.
C-4 `src/db/seed.ts:14` The admin user is seeded with patrick@stratumcollective.co as the email, and the seed script prints the user ID to console. While seed data is not production data, this pattern will be replicated if anyone uses this file as a template for a production seed, and the console output leaks a real email address into CI logs.
Content / Developer Experience
D-1 There is no OpenAPI/Swagger spec, no Postman collection, and no route-level JSDoc. The CLAUDE.md describes the routes at a high level, but a new developer (or a Development Partnership client building against this API) has no machine-readable contract. This becomes significant when the frontend (web/) evolves separately from the API — type drift is hard to catch without a shared contract. Given that the platform already has comprehensive Zod schemas, generating an OpenAPI spec from them (e.g., via `@asteasolutions/zod-to-openapi`) is low-effort and high-value.
D-2 `src/server.ts:30-32` The health check returns `{ status: 'ok', timestamp: '...' }` but performs no actual health assertion. If the database connection pool is exhausted or the DB is unreachable, the health check still returns 200. Load balancers and App Runner health checks use this endpoint to decide whether to route traffic, so a false positive here means a dead instance continues to receive requests. The check should perform a `SELECT 1` against the database and report the result.
Product / Operator Gap Analysis
Missing API Endpoints for a Denial Intelligence Platform
| Missing Endpoint | Why It Matters | Severity |
|---|---|---|
| `GET /api/precedents/:id/similar` | Core product value: "what other precedents match this denial scenario?" The pipeline computes Jaccard similarity but never exposes it via the API. Without this, clients must download the full dataset and compute similarity client-side. | Major |
| `GET /api/precedents/stats` | Aggregate win rate, denial type distribution, top clusters by reuse — the analytics layer that drives operator dashboards. Currently requires raw queries or client-side aggregation over paginated results. | Major |
| `POST /api/precedents/classify` | Given a raw denial description, return matching cluster(s) and suggested precedents. The pipeline has the building blocks (cluster matching, Jaccard similarity) but no inference endpoint exists yet. | Major |
| `GET /api/audit/log` | There's a 7-year audit retention schema but no API to query it. Compliance officers need to retrieve audit records for breach response and OCR investigations. Also needed for the HIPAA "access report" right. | Major |
| `GET /api/tenants/me` | Clients need to know their own tenant profile, available features, and SUD data handling agreements. Currently the only way to know tenant state is via the balance endpoint on the marketplace. | Minor |
| Admin user management endpoints | POST/GET/PATCH `/api/users` do not exist. Adding new billers to a tenant requires direct DB access. This will become a support bottleneck the moment the first real customer needs to add a team member. | Major |
| `GET /api/marketplace/clusters/:id/drift` | The schema has a `drift_signals` JSONB column and `[]` is hardcoded in the pipeline ("deferred"). Without drift signals, the platform can't alert operators when a payer changes their denial behavior and a previously-winning precedent becomes stale. | Minor |
Where the Pipeline Would Break Under Real-World Load
The pipeline's serial cluster processing (P-1 above) is the primary load risk. Additionally: the monthly credit allocation endpoint iterates all active tenants with individual queries inside a single HTTP request. With 20 tenants this is fine; with 200 it will time out. Neither the pipeline trigger nor the monthly allocation is idempotent, so double-triggering will mint duplicate credits. The `receipt_hash` column could serve as a backstop, but it carries no uniqueness constraint in the schema today. A cron job or queue-backed task runner (SQS + Lambda, or even a pg_cron job) would be more appropriate than an admin HTTP endpoint for these operations.
Observability Gaps
| Missing | Impact |
|---|---|
| Structured logging (request ID, tenant ID, duration, status) | Cannot correlate errors to specific tenants or time windows. Console.log/error with no structure means no log query, no dashboards, no alerting. |
| Metrics (request rate, error rate, DB query timing) | No way to know if the platform is degrading before customers report it. App Runner provides basic CPU/memory but no application-level metrics. |
| Database health check in /health | Load balancer routes traffic to dead instances (see D-2 above). |
| Pipeline run tracking in DB (not just console.error) | When pipeline triggers fail for a cluster, errors are logged to console and swallowed. No persistent record of failed pipeline runs means no alerting and no retry mechanism. |
| Audit failure alerting | Silently failing audit writes (S-5) with no alerting means HIPAA gaps accumulate invisibly. |
Cold-Start: What a Fresh Deployment Needs That Isn't Obvious
- The stratum-corpus-data sibling repo must exist at `../stratum-corpus-data/registries/` relative to the platform. This is documented in CLAUDE.md but not checked at startup. A missing corpus dir will crash the registry routes on first request with an unhandled error.
- Two separate schema files must be run in order: `schema.sql`, then `marketplace-schema.sql`. The migrate.ts script only runs the first. The marketplace schema applies a table alteration (`ALTER TABLE precedent_objects ADD COLUMN...`) that will fail if run before the base schema.
- The production DB role separation (a pipeline role with restricted permissions) is commented out in marketplace-schema.sql with "run manually in production." This will be forgotten. It needs to be part of the migration sequence or a Terraform resource.
- No environment variable validation at startup. The server will start even if `DATABASE_URL`/`PGHOST` is missing; it will fail on the first DB query instead of at boot. A startup validation pass over all required env vars prevents confusing first-connection errors in production.
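A minimal boot-time validation pass, parameterized on the env object so it is testable. The required-variable list is an assumption to be replaced with the real deployment inventory:

```typescript
// Names here are illustrative; derive the real list from the deployment docs.
const requiredEnvVars = ['DATABASE_URL', 'CORS_ORIGIN'];

// Return the names of required variables that are unset or empty.
function missingEnvVars(env: Record<string, string | undefined>): string[] {
  return requiredEnvVars.filter((name) => !env[name]);
}

// At boot, before binding the port:
// const missing = missingEnvVars(process.env);
// if (missing.length) {
//   throw new Error(`Missing required env vars: ${missing.join(', ')}`);
// }
```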
HIPAA Audit Trail Gaps
The audit log table and middleware are solid in concept. Current gaps beyond what's covered in the security findings:
- No audit event for failed access attempts. If a tenant tries to access a precedent they don't own (returns 404), that event is not recorded. HIPAA access reports should track both successful and failed access attempts.
- No audit event for bulk export. The single-record export logs to audit. The bulk POST /export endpoint does not.
- No user context captured in server-level errors. If a request fails at the middleware level before reaching the route, no audit record exists even though a user was identified.
- Audit log rows reference users(id) with no explicit ON DELETE behavior. The version_history table specifies ON DELETE CASCADE, but the audit_log's user_id FK does not, so a user deletion will be blocked by the constraint. Note that CASCADE would be the wrong fix here (it would destroy audit history on user deletion); ON DELETE SET NULL with the user identity denormalized into the audit row, or soft-deleting users, preserves the trail.
- No 7-year retention policy enforcement in the database (e.g., pg_partman with time-based partitioning, or a delete-blocking trigger; a `CHECK` constraint cannot enforce retention on DELETE). The HIPAA requirement exists in the comments but not in the schema mechanics.
Research References
- [HIPAA] HIPAA Compliance for API Developers in Healthcare: Best Practices and Checklist (Accountable HQ). Covers Safe Harbor identifier categories, audit log requirements, and 2025 proposed rule changes eliminating the addressable/required distinction. accountablehq.com/post/hipaa-compliance-for-api-developers...
- [HIPAA] HIPAA Compliance for API Integration in Healthcare (Censinet). Emphasizes the minimum-necessary standard for all endpoints, ePHI encryption in transit and at rest, and comprehensive audit logging requirements. censinet.com/perspectives/hipaa-compliance-api-integration-healthcare
- [HIPAA] Key Privacy and Security Considerations for Healthcare Application Programming Interfaces (ONC/HealthIT.gov). Official HHS guidance on healthcare API security, consent, minimum necessary data, and audit logging. healthit.gov/sites/default/files/privacy-security-api.pdf
- [Node.js] How to Architect a Scalable and HIPAA-Compliant HealthTech Application (Node.js + AWS Guide) (DEV Community). Covers Express security patterns, structured logging with PHI redaction, and AWS HIPAA-eligible service configurations. dev.to/rank_alchemy
- [Node.js] Best Practices for Creating a HIPAA-Compliant NodeJS Host (Atlantic.Net). Covers environment hardening, secure logging, connection pooling security, and dependency vulnerability management. atlantic.net/hipaa-compliant-hosting/...
- [Security] API Security in Healthcare: Protecting Health Data from API Attacks (Cequence Security). Covers per-tenant rate limiting, behavioral analysis, and API abuse patterns specific to healthcare workloads. cequence.ai/blog/api-security/api-security-healthcare
- [RCM] Denial Prevention & Compliance: RCM Strategy for 2026 (PENA4). Covers the IMMP framework (Identify, Manage, Monitor, Prevent), predictive denial scoring, and real-time payer rule integration patterns relevant to the Stratum intelligence layer. pena4.com/blogs/denial-prevention-strategy...
- [FHIR] RESTful FHIR API (HL7.org). Official specification for FHIR resource interactions, HTTP status code conventions, pagination, versioning (the `_history` pattern), and operation naming. Particularly relevant to Stratum's version history and search patterns. hl7.org/fhir/http.html
- [FHIR] A Practical Guide to HL7/FHIR API Integration (VE3). Architecture patterns for versioning, event-driven resource updates, and interoperability patterns worth borrowing for the Stratum intelligence export layer. ve3.global/a-practical-guide-to-hl7-fhir-api-integration...
Improvement Vectors
Refine What's There
R1: Two configuration changes that can be made in under 30 minutes, eliminating two significant vulnerabilities. The CORS change is a one-line fix plus a dev-mode conditional; the SSL fix requires sourcing the RDS CA bundle.
```ts
// Wide-open fallback, wrong default
app.use(cors({
  origin: process.env.CORS_ORIGIN || '*',
  credentials: true,
}));
```

```ts
const allowedOrigins =
  process.env.NODE_ENV === 'development'
    ? ['http://localhost:3088', 'http://localhost:3000']
    : [
        process.env.CORS_ORIGIN ??
          (() => {
            throw new Error('CORS_ORIGIN env var required in production');
          })(),
      ];

app.use(cors({
  origin: allowedOrigins,
  credentials: true,
}));
```
```ts
// Certificate validation disabled
...(isProduction && {
  ssl: { rejectUnauthorized: false },
}),
```

```ts
// Validate the RDS CA certificate
// e.g. ca: readFileSync('rds-ca-bundle.pem', 'utf8')
...(isProduction && {
  ssl: {
    rejectUnauthorized: true,
    ca: process.env.RDS_CA_BUNDLE,
  },
}),
```
R2: A one-function change that prevents load balancer routing to dead instances and gives ops a meaningful signal during incidents.
```ts
import { pool } from './db/connection';

app.get('/health', async (_req, res) => {
  let dbOk = false;
  let dbLatencyMs = 0;
  try {
    const start = Date.now();
    await pool.query('SELECT 1');
    dbLatencyMs = Date.now() - start;
    dbOk = true;
  } catch {
    dbOk = false;
  }
  const status = dbOk ? 200 : 503;
  res.status(status).json({
    status: dbOk ? 'ok' : 'degraded',
    timestamp: new Date().toISOString(),
    db: { ok: dbOk, latency_ms: dbLatencyMs },
    version: process.env.npm_package_version,
  });
});
```
R3: Key the rate limiter by authenticated tenant ID rather than IP, and apply different limits to write vs. read vs. public routes. This aligns with healthcare API governance guidance on preventing API abuse.
```ts
import rateLimit from 'express-rate-limit';

// Public registry — generous, no auth key
const publicLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 1000,
  standardHeaders: true,
  legacyHeaders: false,
});

// Authenticated reads — keyed by tenant
const readLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 200,
  keyGenerator: (req) => req.auth?.tenant_id ?? req.ip,
  standardHeaders: true,
  legacyHeaders: false,
});

// Authenticated writes — strict
const writeLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 50,
  keyGenerator: (req) => req.auth?.tenant_id ?? req.ip,
  standardHeaders: true,
  legacyHeaders: false,
});

app.use('/api/registries', publicLimiter);
app.use('/api/precedents', readLimiter); // apply writeLimiter per-route on mutations
app.use('/api/marketplace', readLimiter);
```
R4: The current pattern of `catch { /* non-critical */ }` is the single biggest practical HIPAA audit risk. Audit failures should surface in structured logs and, eventually, an alerting channel. A non-blocking but loud pattern:
```ts
async function logAuditExport(
  precedentId: string,
  userId: string,
  tenantId: string,
  format: string
) {
  try {
    const { logAuditEvent } = await import('../middleware/audit');
    await logAuditEvent(precedentId, 'exported', userId, tenantId, { format });
  } catch (err) {
    // CRITICAL: Audit failure is not silent — it must be logged.
    // Replace with your structured logger (pino/winston).
    console.error('[AUDIT_FAILURE] Export audit write failed', {
      precedent_id: precedentId,
      user_id: userId,
      tenant_id: tenantId,
      event: 'exported',
      error: (err as Error).message,
    });
    // TODO: emit to alerting channel (PagerDuty, SNS, etc.)
  }
}
```
R5: The pipeline computes Jaccard similarity for anti-gaming purposes but never surfaces it through the API. A `GET /api/precedents/:id/similar` endpoint is arguably the core value delivery mechanism of the platform: it is what lets a biller find "what worked before for this exact scenario." The aggregator already has the building blocks. The query below is the core pattern: find cluster siblings, sort by outcome and evidence quality.
```ts
// Fetch cluster siblings, ranked by outcome quality
export async function getSimilarPrecedents(
  precedentId: string,
  tenantId: string,
  limit = 5
): Promise<PrecedentObject[]> {
  const source = await queryOne<{ cluster_id: string }>(
    `SELECT cluster_id FROM precedent_objects
      WHERE id = $1 AND tenant_id = $2`,
    [precedentId, tenantId]
  );
  if (!source) return [];

  // Rank: Won outcomes first, then by reuse_count, then recency
  return query<PrecedentObject>(
    `SELECT * FROM precedent_objects
      WHERE cluster_id = $1
        AND id != $2
        AND status != 'Archived'
        AND data_origin_type != 'SUD'
      ORDER BY
        CASE outcome WHEN 'Won' THEN 1 WHEN 'Partial' THEN 2 ELSE 3 END,
        reuse_count DESC,
        last_validated_date DESC NULLS LAST
      LIMIT $3`,
    [source.cluster_id, precedentId, limit]
  );
}
```
Architectural Upgrades
A1: The pipeline trigger endpoint (`POST /marketplace/pipeline/trigger`) and the monthly allocation endpoint (`POST /marketplace/admin/allocate-monthly`) run expensive, multi-step operations synchronously within the HTTP request. At scale this will cause timeouts and resource contention. The correct pattern for a healthcare data platform is to decouple execution from the HTTP response using a job queue.
The simplest path given the current AWS stack (App Runner + RDS) is to use pg-boss or graphile-worker — both run inside Postgres and require no additional infrastructure. The HTTP endpoint creates a job record and returns immediately; a worker pool processes jobs asynchronously. This also gives you built-in retry logic, job history, and visibility into failed runs.
```ts
// Instead of: const result = await runFullPipeline();
// Enqueue the job and return immediately
router.post('/pipeline/trigger', requireRole('admin'), async (req, res) => {
  const jobId = await queue.send('run-full-pipeline', {
    triggered_by: req.auth!.user.id,
    tenant_id: req.auth!.tenant_id,
    triggered_at: new Date().toISOString(),
  });
  res.status(202).json({
    accepted: true,
    job_id: jobId,
    status_url: `/api/marketplace/pipeline/jobs/${jobId}`,
  });
});

// Worker (runs in a separate process or on a schedule):
// queue.work('run-full-pipeline', async (job) => {
//   await runFullPipeline(job.data);
// });
```
A2: Replace all console.log/error calls with a structured logger (Pino is the fastest; Winston is more ecosystem-familiar). The key requirement for a HIPAA-adjacent system is that request bodies never appear in logs. Use a serializer that strips body fields and keeps only safe request metadata, and give every log line a correlation ID traceable to a specific request and tenant.
```ts
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  redact: {
    // Never log these fields — HIPAA requirement
    paths: [
      'req.body',
      'req.headers.authorization',
      'req.headers["x-user-id"]',
      '*.email',
      '*.denial_trigger',
      '*.notes',
    ],
    censor: '[REDACTED]',
  },
  serializers: {
    req: (req) => ({
      method: req.method,
      url: req.url,
      tenant_id: req.auth?.tenant_id,
      user_id: req.auth?.user.id,
      request_id: req.id, // from express-request-id or similar
    }),
    err: pino.stdSerializers.err,
  },
});
```
A3: The validation schemas in src/utils/validation.ts and src/utils/marketplace-validation.ts are comprehensive and well-typed. Rather than writing API documentation by hand, use `@asteasolutions/zod-to-openapi` to generate an OpenAPI 3.1 spec from the existing schemas. Serve it at `GET /api/openapi.json` and add a Swagger UI middleware for a browsable API explorer. This also creates a machine-readable contract from which the Next.js frontend can generate typed API clients.
```ts
import {
  OpenAPIRegistry,
  OpenApiGeneratorV31,
} from '@asteasolutions/zod-to-openapi';
import { CreatePrecedentSchema, SearchQuerySchema } from './utils/validation';

const registry = new OpenAPIRegistry();

registry.registerPath({
  method: 'post',
  path: '/api/precedents',
  summary: 'Create a new precedent object',
  tags: ['Precedents'],
  request: {
    body: {
      content: { 'application/json': { schema: CreatePrecedentSchema } },
    },
  },
  responses: { 201: { description: 'Precedent created' } },
});

export function generateOpenApiSpec() {
  const generator = new OpenApiGeneratorV31(registry.definitions);
  return generator.generateDocument({
    openapi: '3.1.0',
    info: { title: 'Stratum Platform API', version: '0.1.0' },
    servers: [{ url: '/api' }],
  });
}
```
Recommended Next Steps (Prioritized)
| # | Action | Effort | Severity | Finding |
|---|---|---|---|---|
| 1 | Fix CORS default to never allow wildcard; add production env var guard | 30 min | Critical | S-2, R1 |
| 2 | Enable SSL certificate verification for RDS connections (`rejectUnauthorized: true`) | 1 hr | Major | S-4, R1 |
| 3 | Make the `/health` endpoint test DB connectivity and return 503 on failure | 30 min | Major | D-2, R2 |
| 4 | Replace all console.error audit catch blocks with structured failure logging; no more silent swallowing | 1–2 hrs | Major | S-5, R4 |
| 5 | Implement per-tenant tiered rate limiting with `keyGenerator` | 2 hrs | Major | S-6, R3 |
| 6 | Add user management endpoints (POST/GET `/api/users`) before first real tenant onboarding | 1 day | Major | Product Gap |
| 7 | Add `GET /api/precedents/:id/similar` endpoint — core product value delivery | 1 day | Major | Product Gap, R5 |
| 8 | Add audit log query endpoint (`GET /api/audit/log`) with tenant scoping and date range filters | 1 day | Major | HIPAA Gap |
| 9 | Move pipeline trigger and monthly allocation to a background job queue (graphile-worker or pg-boss) | 2–3 days | Major | P-1, A1 |
| 10 | Implement structured logging with Pino + PHI-safe request serializer; remove bare console calls | 1 day | Major | S-3, A2 |
| 11 | Complete Auth0/Clerk Phase 2 integration — replace X-User-Id with JWT bearer tokens | 3–5 days | Critical | S-1 |
| 12 | Add startup env var validation and corpus-data directory check; fail fast rather than at first request | 2 hrs | Minor | Cold-Start Gap |
| 13 | Generate OpenAPI spec from Zod schemas; serve at `/api/openapi.json` | 1 day | Minor | D-1, A3 |
| 14 | Fix dynamic `require()` in export audit function to static import | 5 min | Minor | C-1 |
| 15 | Add Zod validation to all registry query params for consistent developer error messages | 2 hrs | Note | A-5 |