Skip to main content

Mongo / Mongoose Hardening v1

Overview​

This plan addresses high-severity correctness, performance, and tenant-safety issues uncovered during a Mongo/Mongoose audit of attunelogic-api. Every phase is designed so code and data ship together β€” no maintenance windows, no big-bang migrations. Each phase is independently revertible and gated on telemetry from the prior phase.

Canonical decisions agreed up front:

  • parentCompany will be unified to Schema.Types.ObjectId with ref: "Customer" across every model.
  • Phase 6 (the unification) pilots on a single internal tenant first, then expands per-tenant, then globally.
  • Phase 3 (cascading-hook removal) uses MongoDB transactions for billing/leg/job atomicity; transaction boundaries are designed in Phase 3a as part of the recalc service.
  • This document is the source of truth for sequencing; phase numbers are referenced from PR titles.

Operating Principles​

  1. Additive first, destructive last. Pattern: add the new path β†’ dual-read/dual-write β†’ backfill β†’ cut over β†’ remove the old path.
  2. No offline migrations. All backfills are online, idempotent, resumable scripts with checkpoints in a migrations collection.
  3. Feature flags for risky cutovers. Use the existing tenant feature-flag plumbing (src/services/config/default-configs/feature-flags.js) to flip behavior instantly.
  4. Atlas rolling index builds only. Never call syncIndexes() on app boot in production. Index rollouts use the scripts/manage-indexes.js path during off-peak windows.
  5. Release gates. Each phase ships behind its own PR + release notes. No phase merges until the prior phase has been in production β‰₯ 24h with green telemetry.

Phase Summary​

PhaseGoalEst.Depends onRiskReversible
0Foundation & safety nets1dβ€”LowYes
1Query-safety hot patchesΒ½d0LowYes
2Index coverage1–2d0LowYes
3Kill cascading Leg/Job hooks1w0, 2MediumYes (flag)
4Remove Invoice.post('find')2–3d3MediumYes
5Retire mongoose-autopopulate1–2w0Low–Medium per modelYes
6Unify parentCompany to ObjectId3–4w0, 5HighYes (dual-read)
7Tenant-scoped uniques2d6 (partial)LowYes
8Ergonomics & hygiene1wβ€”LowYes

Total: ~6–8 engineering weeks. Any single week's output can ship to production on its own.


Phase 0 β€” Foundation & safety nets​

Pure additive scaffolding the rest of the plan needs. No data changes.

Changes:

  • src/db/index.js
    • Set mongoose.set("strictQuery", true) explicitly.
    • Set mongoose.set("sanitizeFilter", true) globally to defend user-supplied filters.
    • Add pool/timeout options: maxPoolSize, serverSelectionTimeoutMS, socketTimeoutMS, maxIdleTimeMS.
    • Add SIGTERM handler mirroring the existing SIGINT path so containers drain cleanly.
  • New src/utils/tenantQuery.js
    • normalizeTenantFilter(value) returns { $in: [stringForm, objectIdForm] } when the value is a valid 24-char hex. Not yet wired up; load-bearing for Phase 6.
  • New src/models/plugins/tenantPlugin.js
    • Adds parentCompany index if missing, logs/throws when a non-super-admin query omits parentCompany. Modes: off | warn | throw. Default off; enabled per env via TENANT_QUERY_STRICT.
  • Expand tests/security/tenantIsolation.test.js to cover Invoice, Payout, Manifest, Route, Media, Maintenance getById endpoints.

Backward compat: fully additive. No schema or data impact. Rollback: single PR revert.


Phase 1 β€” Query-safety hot patches​

Real security bugs that should not wait for the rest of the plan.

Changes:

  • src/controllers/legs/index.js β€” the customQuery merge currently lets a client override parentCompany via JSON.parse. Fix:
    • Whitelist allowed keys from the parsed object.
    • Re-apply { parentCompany } after the merge.
  • Audit every other JSON.parse / spread-merge-into-Mongo-filter (start with src/controllers/reports.js).
  • Confirm sanitizeFilter from Phase 0 is active per query (Mongoose 8 honors it).

Backward compat: legitimate client filters continue to work; only malicious overrides stop working. Rollback: trivial.


Phase 2 β€” Index coverage​

Indexes ship before code that depends on them. Atlas builds compounds online.

Changes:

  • Extend scripts/manage-indexes.js (or add scripts/sync-mongoose-indexes.js) that runs Model.syncIndexes() for the listed models, with explicit dry-run, progress logging, and per-model gating. Do not wire syncIndexes() into app boot.
  • Add the following indexes to the schemas (additive only; keep existing ones):
    • Leg:
      • { parentCompany: 1, driver: 1, appointmentDate: -1 }
      • { parentCompany: 1, jobId: 1, sortKey: 1 }
      • { parentCompany: 1, manifestId: 1 }
      • { parentCompany: 1, payoutId: 1 }
      • { parentCompany: 1, status: 1, appointmentDate: -1 }
    • Job:
      • { parentCompany: 1, status: 1, transactionDate: -1 }
      • { parentCompany: 1, client: 1 }
      • { parentCompany: 1, invoiceId: 1 }
    • Invoice:
      • { parentCompany: 1, status: 1, dueDate: -1 }
      • { parentCompany: 1, client: 1 }
    • Maintenance, Payout, Media, Accessory: at minimum { parentCompany: 1, dateCreated: -1 }.
    • Client: { parentCompany: 1, active: 1, name: 1 }.
  • Remove parentCompany from the Client text-index weights (no type change, just a weight cleanup).

Backward compat: pure additive. If a build is slow or expensive we abort and drop the partially-built index. Rollback: drop the new indexes; existing query paths continue to work (just slower). Gate: Atlas slow-query log shows reduction before proceeding to Phase 3.


Phase 3 β€” Kill cascading hooks on Leg / Job​

The highest-ROI correctness fix. Three sub-phases so each deploy is independently safe.

3a. Introduce explicit recalc services (additive)​

  • New src/services/jobs/recalculateJobDuration.js (pure function, accepts jobId and optional session).
    • Source of truth should be the Leg collection (Leg.find({ jobId, parentCompany })), not job.legs, because some flows can create or update legs without keeping the embedded job leg array perfectly synchronized.
    • Query narrowly: select only fields needed for duration math and populate only route timing fields (durationSeconds, fullData.duration). Prefer durationSeconds first, matching scheduler timing behavior.
    • Run only when duration-affecting data changes: leg create/delete, leg route changes, or route timing changes. Do not run on driver/status/note-only leg updates.
    • Expected write-path cost: one indexed leg query, narrow route timing populate, and one Job.updateOne. This should be no slower than the current hidden hook chain and more predictable as long as Phase 2 indexes are in place.
    • If high-leg-count jobs show measurable latency, add a leg-level duration snapshot so job duration can be summed without route populate.
  • New src/services/legs/recomputeLegTotals.js.
  • New src/services/invoices/recalculateInvoiceTotals.js (also feeds Phase 4).
  • Define transaction boundaries:
    • Leg write + dependent Job recalc β†’ one transaction.
    • Invoice write + Invoice totals recalc β†’ one transaction.
    • Payout creation + Leg payoutId assignment β†’ one transaction.
    • Use session.withTransaction() (Atlas replica set required β€” already true).
  • Unit tests; integration test that proves each new service produces identical results to the pre-hook version against seeded fixtures.

3b. Feature-flag the hooks off and call services explicitly​

  • Introduce features.legacyMongoHooks flag (default true, per tenant).
  • Wrap every Leg.pre("save"|"findOneAndUpdate"|"create") and jobMixin hook in if (!flags.legacyMongoHooks) return next();.
  • Update every controller/service that writes Legs/Jobs to call the explicit recalc inside the transaction.
  • QA: run tests/controllers/legs/* with the flag both on and off. Both must pass.

3c. Rollout & removal​

  • Flip flag to false on the internal tenant first, monitor 48h.
  • Flip per-tenant for a small pilot, then all.
  • After 1 week clean, delete the hooks entirely.

Backward compat: flag flips behavior instantly. Data is unchanged β€” the hooks were only maintaining derived fields the new services also maintain. Rollback: flip flag back on. Gate: zero divergence between pre-flag and post-flag job.estimatedDuration in shadow comparison on a sampled tenant.


Phase 4 β€” Remove Invoice.post('find' | 'findOne')​

4a. Backfill calculatedTotalCharges (one-time migration)​

  • New scripts/backfill-invoice-calculated-totals.js. Iterates invoices in tenant-scoped batches, populates jobs with charges.*, writes calculatedTotalCharges. Idempotent (skip when already set), supports --dry-run, resume token in migrations collection.

4b. Read code change​

  • Controllers always read calculatedTotalCharges. If missing (0/null), call recalculateInvoiceTotals(invoiceId) lazily β€” but only inside the write path, never inside a GET.

4c. Remove the post('find') and post('findOne') hooks​

  • Only after backfill is 100% complete and 48h of green telemetry on the new read path.

Backward compat: during 4a/4b both the hook and the backfill produce the same value, so double-compute is harmless. During 4b old clients still work. Only 4c removes behavior. Rollback: revert 4c; the hook returns.


Phase 5 β€” Retire mongoose-autopopulate​

Model by model. Each model is one PR.

Order (lowest blast-radius first):

  1. Route
  2. Location
  3. Manifest
  4. Payout
  5. Invoice
  6. Leg
  7. User
  8. Customer

Per-model checklist:

  1. Search every populate site for the model's references.
  2. Convert each to explicit .populate({ path, select }).
  3. Add mongoose-lean-virtuals to the schema (so lean({ virtuals: true }) works).
  4. Remove autopopulate: true from schema paths and remove the plugin registration.
  5. Run full test suite + relevant QA matrix before next model.

Backward compat: per-model. Missing explicit populate yields an unpopulated field rather than wrong data; caught in tests before merge. Rollback: revert the PR; plugin flips back on.


Phase 6 β€” Unify parentCompany to ObjectId​

The trickiest phase. Pattern: dual-read, single-write-canonical, background migrate, flip schema. Uses normalizeTenantFilter() from Phase 0 as the load-bearing abstraction.

6a. Dual-read helper, canonical-write (one PR)​

  • Adopt normalizeTenantFilter(value) in every query that filters by parentCompany. Audit by grepping for { parentCompany and converting each call site.
  • Introduce a write helper that always writes parentCompany as an ObjectId. Existing string docs stay strings until the per-collection migration runs.
  • No schema type changes yet.
  • Add metric: tenant_query.fallback_hits β€” counter of queries where the dual-read fallback returned a string-shaped match.

6b. Per-collection background migration​

  • One migration script per affected collection (Leg, Job, Client, Route, Address, Location, Task, Accessory, Driver, Equipment, History, InventoryItem, Media, Message, Organization, Recipe, Shift, TaskTemplate, TaskTemplateGroup, TimelineItem, ChatConversation).
  • Each script:
    1. Walks the collection in batches.
    2. Skips already-ObjectId docs.
    3. Updates via updateOne with { $set: { parentCompany: ObjectId(...) } }.
    4. Tracks progress in migrations collection: { name, lastId, completedAt }.
  • Pilot order (least downtime):
    1. Internal tenant only (--tenant=ferda-tech). Validate end-to-end for 48h.
    2. Smallest 25% of tenants by document count. Validate 48h.
    3. Next 50%. Validate 24h.
    4. Largest tenants.
  • Per-collection order within each tenant batch: leaf collections first (Address, Accessory, Media), then mid-tier (Route, Location, Client), then high-volume (Leg, Job, Task).

6c. Flip schema type​

  • Once a collection reports 100% ObjectId for all tenants AND tenant_query.fallback_hits == 0 for 48h, change the schema type from String to Schema.Types.ObjectId, ref: "Customer".
  • Keep the dual-read helper in place for one more release in case of stragglers, then simplify it to a straight equality match.

6d. Standardize ref​

  • While in there, ensure every model declares ref: "Customer" on its parentCompany so populate and $lookup work uniformly.

Backward compat: the dual-read helper makes shape irrelevant to query results. Writes are canonical from 6a forward, so no drift accumulates. Rollback: at any point, revert the most recent PR; the dual-read helper keeps prior data readable. Gates:

  • 6a ships only when every production query path uses normalizeTenantFilter.
  • 6b marks a collection "done" only when migrations.completedAt is set for every tenant and tenant_query.fallback_hits == 0 for 24h.
  • 6c ships per-collection, never bulk.

Phase 7 β€” Tenant-scoped uniques​

Blocked on Phase 6c for the affected models, but Organization work can start during 6b.

Changes:

  • Pre-flight collision scan: scripts/check-tenant-unique-collisions.js for each (parentCompany, field) pair. Resolve collisions (rename) before applying the constraint.
    • Organization: (parentCompany, name)
    • Client: (parentCompany, slug)
  • Build the non-unique compound index first; verify zero duplicates via aggregation.
  • In a separate PR: drop the global unique: true, rebuild as { unique: true } compound index.
  • Slug/name generators stay tenant-aware (Organization is user-supplied; Client.slug already tenant-scoped semantically).

Backward compat: non-unique β†’ validate β†’ unique is fully safe. Old global unique is dropped only after the new compound unique is confirmed built. Rollback: drop the new unique, restore the old one if collisions appear (after another rename pass).


Phase 8 β€” Ergonomics & hygiene​

Can interleave with later phases. Low risk.

Changes:

  • src/models/Leg.js β€” remove duplicate plugin registrations.
  • New src/models/plugins/baseSchemaPlugin.js that applies the standard plugins (paginate, aggregate-paginate, lean-virtuals) once. Migrate every model to use it.
  • Standardize timestamps: keep dateCreated/dateUpdated (renaming is too expensive across consumers), but add explicit indexes where list endpoints sort by them.
  • Remove deprecated useNewUrlParser / useUnifiedTopology from scripts/*.js.
  • Schema.Types.Mixed audit: switch in-place mutations to assignment (doc.metadata = { ...doc.metadata, ... }) or add markModified calls. Focus models: Job.metadata, Invoice.totals, Leg.calculationMetadata, Config.configs, Customer.metadata.
  • Extract typed sub-schemas for stable shapes: Job.approval.history, Leg.statusHistory, Leg.statusExceptions.

Observability prerequisites​

Must be in place before Phase 3:

  • Mongoose query timing histogram (extend src/middlewares/performanceMonitor.js).
  • Counter: tenant_query.fallback_hits per collection (Phase 6).
  • Counter: mongoose_autopopulate.activations per collection (Phase 5). Wrap the plugin during the transition.
  • Atlas slow-query alerts on legs, jobs, invoices.
  • Feature-flag state snapshot exported with every request log.

Rollback Strategy by Phase​

PhaseRollback action
0PR revert
1PR revert
2Drop new indexes via scripts/manage-indexes.js --drop
3Flip features.legacyMongoHooks back to true
4Revert 4c PR; hook returns; backfill remains correct
5Revert per-model PR; autopopulate plugin returns
6Revert most recent PR; dual-read helper keeps both shapes readable
7Drop new compound unique; restore prior global unique
8PR revert

Open Items​

  • Tenant pilot list for Phase 6: confirm ferda-tech as the canary, plus 2 friendly customers willing to be in the smallest 25% batch.
  • Transaction support: confirm Atlas tier supports cross-document transactions on the legs/jobs/invoices collections (M10+ replica set β€” already true).
  • CI gate: decide whether to add a CI check that fails when a new schema introduces parentCompany: String (so we don't regress during Phase 6 rollout).

References​