
7 Months to Modular: Re-Architecting a Multi-Tenant SaaS Platform Without Stopping Delivery

A multi-tenant SaaS platform where every new feature touched core code, tenant isolation was one missed filter away from a breach, and new modules took months to build. What the re-architecture looked like, the decisions that proved right, and the ones that required re-strategization.

Arko IT Services

The system before

The platform was a multi-tenant SaaS application serving organizations that needed financial ledger, identity, and reporting capabilities in one product. It had been built and extended over several years by teams with different priorities at different moments. The result was predictable: a system that worked but could not evolve quickly.

The constraints that triggered the re-architecture were specific.

Adding a new module took months. The core application had no extension points. A new capability meant modifying the shared data model, the monolithic application layer, and the test suite. Engineers had to navigate the entire existing system just to understand what they might break. Estimates for a new module ran 2 to 4 months.

Tenant isolation depended on developer discipline. Tenant scoping was enforced at the application layer, which meant developers had to remember to add the right filter to every query. Security reviews had caught multiple cases where the filter was missing. There was no structural guarantee. The risk was real and it kept coming back.

Deployment was a coordinated event. Every deployment touched the full system. A change to the reporting module required deploying everything, including the financial ledger module it had nothing to do with. Deployment frequency was low and falling. Batching changes was producing larger, riskier releases.

Module communication was hardcoded. When the ledger module needed to notify the reporting module, it called it directly, a hard in-process dependency. Moving to async communication later would mean rewriting both sides.


The first hypothesis

The initial approach was a standard layered modular monolith: extract modules with clear boundaries, keep them in one deployable, communicate in-process. That would address the "adding a module takes months" constraint and give teams independent ownership of their code.

The hypothesis held for the first phase. But it exposed a second problem. The in-process communication pattern, simple as it was, was wiring modules together in a way that would make independent deployment impossible later. The way module A called module B looked the same whether the two were "logically separate" or not.

The re-strategization at month three was a big one. The communication layer had to be abstracted before it got too embedded to change. The dispatch abstraction (a single interface that sends commands and publishes events without the caller knowing or caring whether the transport is in-process or a message bus) moved from a Phase 4 concern to a Phase 2 decision.

That meant rework. It was the right call.


The architecture that emerged

The module contract

Every module follows the same structure:

Module/
├── Contracts       (commands, queries, integration events - shared across modules)
├── Domain          (entities, value objects, domain events - zero infrastructure)
├── Application     (handlers, validators, DTOs - consumes domain and contracts)
├── Infrastructure  (database context, repositories, external clients)
└── Api             (IModuleStartup implementation - wires DI and HTTP endpoints)

The Contracts layer is the only public surface. Other modules import contracts, never domain or application code. The domain layer has zero infrastructure dependencies, enforced at build time by architecture tests. A domain layer that imports Entity Framework Core fails the build.
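
As an illustration, that kind of rule can be expressed with an architecture-test library such as NetArchTest. The sketch below assumes NetArchTest.Rules and xUnit, with illustrative assembly and namespace names; it is not the project's actual test suite.

using System.Reflection;
using NetArchTest.Rules;
using Xunit;

public class ModuleBoundaryTests
{
    [Fact]
    public void Domain_layer_has_no_infrastructure_dependencies()
    {
        // Fails the build if any Domain type references EF Core or the module's own infrastructure.
        var result = Types.InAssembly(Assembly.Load("Ledger.Domain"))
            .That().ResideInNamespace("Ledger.Domain")
            .ShouldNot().HaveDependencyOnAny(
                "Microsoft.EntityFrameworkCore",
                "Ledger.Infrastructure")
            .GetResult();

        Assert.True(result.IsSuccessful);
    }

    [Fact]
    public void Modules_import_only_other_modules_contracts()
    {
        // Cross-module references must go through Contracts, never Domain or Application.
        var result = Types.InAssembly(Assembly.Load("Reporting.Application"))
            .That().ResideInNamespace("Reporting")
            .ShouldNot().HaveDependencyOnAny("Ledger.Domain", "Ledger.Application")
            .GetResult();

        Assert.True(result.IsSuccessful);
    }
}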

The IModuleStartup contract is how the platform discovers and loads modules:

IModuleStartup
  ├── ModuleName         - identity
  ├── ConfigureServices  - register all module dependencies
  └── MapEndpoints       - declare HTTP routes under /api/v1/{module}/*

At application startup, the module loader scans assemblies, discovers every IModuleStartup implementation via reflection, and invokes them in dependency order. Adding a new module means two things: implement the interface, register the assembly. Nothing else in the platform changes.
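
A minimal sketch of what that contract and discovery pass might look like. The member names come from the outline above; the signatures, ASP.NET Core types, and loader details are assumptions rather than the platform's actual code, and dependency ordering is omitted.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using Microsoft.AspNetCore.Routing;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

public interface IModuleStartup
{
    string ModuleName { get; }
    void ConfigureServices(IServiceCollection services, IConfiguration configuration);
    void MapEndpoints(IEndpointRouteBuilder endpoints);   // routes under /api/v1/{module}/*
}

public static class ModuleLoader
{
    // Scan the registered module assemblies, instantiate every IModuleStartup,
    // and let each module wire its own services.
    public static IReadOnlyList<IModuleStartup> ConfigureModules(
        IServiceCollection services, IConfiguration configuration, params Assembly[] moduleAssemblies)
    {
        var modules = moduleAssemblies
            .SelectMany(assembly => assembly.GetTypes())
            .Where(type => typeof(IModuleStartup).IsAssignableFrom(type) && !type.IsAbstract)
            .Select(type => (IModuleStartup)Activator.CreateInstance(type)!)
            .ToList();

        foreach (var module in modules)
            module.ConfigureServices(services, configuration);

        return modules;
    }

    // Called once the HTTP pipeline is being built.
    public static void MapModuleEndpoints(IEndpointRouteBuilder endpoints, IEnumerable<IModuleStartup> modules)
    {
        foreach (var module in modules)
            module.MapEndpoints(endpoints);
    }
}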

The transport abstraction

The dispatch layer is the most consequential architectural decision in the whole system.

graph TD
    subgraph CALLER["Handler (any module)"]
        CMD[Command or Event]
    end

    subgraph DISPATCHER["IModuleDispatcher"]
        ROUTE{Transport Config}
    end

    subgraph INPROC["In-Process (Phase 1-3)"]
        MT_MED[MassTransit Mediator]
        HANDLER_A[Target Handler - same thread]
    end

    subgraph BUS["Out-of-Process (Phase 4+)"]
        MT_BUS[MassTransit Bus]
        RABBIT[RabbitMQ]
        HANDLER_B[Target Handler - remote consumer]
    end

    CMD --> DISPATCHER
    ROUTE -->|inproc| MT_MED
    ROUTE -->|bus| MT_BUS
    MT_MED --> HANDLER_A
    MT_BUS --> RABBIT
    RABBIT --> HANDLER_B

The calling module uses IModuleDispatcher.SendAsync or PublishAsync. Whether the message travels in-process or over the bus is a configuration value:

"Dispatch": {
  "Ledger → Reporting": "inproc",
  "Identity → Notifications": "bus"
}

No handler code changes when a module edge moves from in-process to bus. The abstraction isolates the transport detail. It proved its value when two module edges moved to async bus communication in month six without anyone touching either module's business logic.
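
A sketch of what the dispatcher boundary can look like, assuming MassTransit (v8 namespaces) as both transports. The per-message-type configuration lookup below simplifies the per-edge routing shown above, and the type names are illustrative, not the platform's actual code.

using System;
using System.Threading;
using System.Threading.Tasks;
using MassTransit;
using Microsoft.Extensions.Configuration;

public interface IModuleDispatcher
{
    Task SendAsync<TCommand>(TCommand command, CancellationToken ct = default) where TCommand : class;
    Task PublishAsync<TEvent>(TEvent @event, CancellationToken ct = default) where TEvent : class;
}

public sealed class ConfiguredModuleDispatcher : IModuleDispatcher
{
    private readonly IMediator _mediator;            // in-process transport
    private readonly IBus _bus;                      // RabbitMQ-backed transport
    private readonly IConfiguration _configuration;

    public ConfiguredModuleDispatcher(IMediator mediator, IBus bus, IConfiguration configuration)
    {
        _mediator = mediator;
        _bus = bus;
        _configuration = configuration;
    }

    // Simplified routing: look up the transport per message type instead of per module edge.
    private bool UseBus<TMessage>() =>
        string.Equals(_configuration[$"Dispatch:{typeof(TMessage).Name}"], "bus",
            StringComparison.OrdinalIgnoreCase);

    public Task SendAsync<TCommand>(TCommand command, CancellationToken ct = default) where TCommand : class =>
        UseBus<TCommand>()
            ? _bus.Publish(command, ct)              // a fuller implementation would resolve a send endpoint
            : _mediator.Send(command, ct);

    public Task PublishAsync<TEvent>(TEvent @event, CancellationToken ct = default) where TEvent : class =>
        UseBus<TEvent>()
            ? _bus.Publish(@event, ct)
            : _mediator.Publish(@event, ct);
}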

Tenant isolation: defense in depth

The tenant isolation design took the most explaining to stakeholders who wanted to ship faster. It is also the part that delivered the most durable value.

The system enforces tenant isolation at two independent layers.

The first layer is an application-level ambient context. When an HTTP request arrives with a JWT carrying a tenant_id claim, middleware extracts it and stashes it in an AsyncLocal variable. Every database query in that async call tree automatically picks up WHERE tenant_id = [current] through a global query filter. Handlers cannot forget the filter because they never apply it. The context does.
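
A minimal sketch of that first layer, assuming EF Core; the class, middleware, and entity names are illustrative.

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.EntityFrameworkCore;

// Ambient tenant context: set once per request, visible to the whole async call tree.
public static class TenantContext
{
    private static readonly AsyncLocal<Guid?> Current = new();
    public static Guid? TenantId
    {
        get => Current.Value;
        set => Current.Value = value;
    }
}

// Middleware (registered after authentication): lift the tenant_id claim into the ambient context.
public sealed class TenantContextMiddleware
{
    private readonly RequestDelegate _next;
    public TenantContextMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext httpContext)
    {
        var claim = httpContext.User.FindFirst("tenant_id")?.Value;
        if (Guid.TryParse(claim, out var tenantId))
            TenantContext.TenantId = tenantId;
        await _next(httpContext);
    }
}

public sealed class LedgerEntry
{
    public Guid Id { get; set; }
    public Guid TenantId { get; set; }
    public decimal Amount { get; set; }
}

// Each module's DbContext captures the current tenant and applies it as a global query filter,
// so handlers never write the WHERE clause themselves.
public sealed class LedgerDbContext : DbContext
{
    private readonly Guid? _tenantId = TenantContext.TenantId;

    public LedgerDbContext(DbContextOptions<LedgerDbContext> options) : base(options) { }

    public DbSet<LedgerEntry> Entries => Set<LedgerEntry>();

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // EF Core parameterizes the instance field, so the filter picks up the
        // request's tenant each time a scoped context is created.
        modelBuilder.Entity<LedgerEntry>()
            .HasQueryFilter(e => e.TenantId == _tenantId);
    }
}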

The second layer is database row-level security. Before executing SQL, a connection interceptor issues SET LOCAL app.tenant_id = '[guid]' for the current transaction. A Postgres row security policy enforces USING (tenant_id = current_setting('app.tenant_id')::uuid). Even if the application filter is missing, the database returns zero rows for a cross-tenant query.
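
A sketch of the second layer, assuming Npgsql behind EF Core. The interceptor name and the policy shown in the comment are illustrative, and where the real implementation issues SET LOCAL inside the transaction, this sketch uses the simpler connection-opened hook with a session-scoped set_config.

using System;
using System.Data.Common;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore.Diagnostics;

// Sets the Postgres variable that the row security policy reads, e.g.:
//   ALTER TABLE ledger_entries ENABLE ROW LEVEL SECURITY;
//   CREATE POLICY tenant_isolation ON ledger_entries
//       USING (tenant_id = current_setting('app.tenant_id')::uuid);
public sealed class TenantConnectionInterceptor : DbConnectionInterceptor
{
    public override async Task ConnectionOpenedAsync(
        DbConnection connection,
        ConnectionEndEventData eventData,
        CancellationToken cancellationToken = default)
    {
        if (TenantContext.TenantId is not Guid tenantId)
            return;

        await using var command = connection.CreateCommand();
        command.CommandText = "SELECT set_config('app.tenant_id', @tenant_id, false)";
        var parameter = command.CreateParameter();
        parameter.ParameterName = "tenant_id";
        parameter.Value = tenantId.ToString();
        command.Parameters.Add(parameter);
        await command.ExecuteNonQueryAsync(cancellationToken);
    }
}

// Registration, inside the module's AddDbContext call:
//   options.UseNpgsql(connectionString).AddInterceptors(new TenantConnectionInterceptor());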

Both layers have to fail at the same time for a cross-tenant data exposure. The combination turns a per-developer discipline problem into a structural guarantee. Security review stopped flagging it after month four.


Load testing as hypothesis validation

The system's performance claims were not assumed. They were tested before cutover. The load testing strategy was built to answer specific questions, not to spit out a pass/fail number.

The questions that drove the load test design:

  1. Does the dispatch abstraction add meaningful latency when running in-process versus direct calls?
  2. At what request volume does the in-process transport need to move to async bus?
  3. Does the tenant context propagation hold correctly under concurrent multi-tenant load?

SLO targets established before testing:

Endpoint                        Target
POST transaction (write)        p95 under 150ms
GET account balance (read)      p95 under 80ms
Token endpoint (auth)           p95 under 250ms
Saga completion (multi-step)    p99 under 2 seconds

The k6 load tests ran against a production-equivalent environment with representative multi-tenant data volumes. The dispatch abstraction overhead was measurable but under 5ms in-process, well within acceptable bounds. Tenant context propagation held under concurrent load, confirmed by cross-tenant isolation checks in the test suite.

The load tests also turned up something we did not expect: outbox delivery lag under high write volume. The transactional outbox pattern (write the message to a database table atomically with the business data, then deliver to the bus asynchronously) was the right choice for reliability, but the delivery service needed tuning at sustained high write rates. We found it and fixed it before cutover, not in production.
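
For context, if the outbox is MassTransit's EF Core outbox (an assumption; the case study does not name the specific implementation), the delivery cadence is a configuration knob rather than custom code. Names below are illustrative.

using System;
using MassTransit;
using Microsoft.Extensions.DependencyInjection;

public static class LedgerBusConfiguration
{
    public static void AddLedgerMessaging(this IServiceCollection services)
    {
        services.AddMassTransit(x =>
        {
            // Outbox rows are written in the same transaction as the business data,
            // then swept to the bus by a background delivery loop.
            x.AddEntityFrameworkOutbox<LedgerDbContext>(o =>
            {
                o.UsePostgres();
                o.UseBusOutbox();                        // hold publishes until the commit succeeds
                o.QueryDelay = TimeSpan.FromSeconds(1);  // sweep interval, the lever tuned for write bursts
            });

            x.UsingRabbitMq((context, cfg) => cfg.ConfigureEndpoints(context));
        });
    }
}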


The re-strategization moments

Seven months of architectural work does not run in a straight line. Here is where the plan changed.

Month 3: the transport abstraction moved earlier. The original plan deferred the dispatch abstraction to Phase 4. We caught the risk of embedding direct in-process calls early enough to fix it without a full rewrite.

Month 4: saga compensation scope shrank. The original saga design included compensation workflows for every conceivable failure mode. Stakeholder review showed that some failure modes had acceptable manual resolution paths. We scoped the saga implementation to the flows where automated compensation was genuinely necessary, which saved 3 weeks of implementation and test work.

Month 5: migration sequencing changed. The original plan migrated tenants alphabetically, for simplicity. Integration testing revealed that two tenants had unusual configuration patterns that would stress the new system in ways the standard test suite did not cover. We moved those two earlier in the sequence, so problems showed up while there was still time to fix them.

Month 6: architecture test coverage expanded. The initial build-time enforcement caught the obvious boundary violations. After a code review found a subtle dependency violation the tests had missed, we expanded coverage to include cross-module contract usage patterns. Extra work, not in the original plan, but it closed off a class of risk.

Every one of these required explaining the change to stakeholders. The explanation always had the same shape: here is what we planned, here is what we learned, here is the specific risk the change addresses, here is the timeline impact.


The results

After seven months of incremental work, parallel running, and a phased cutover:

New module delivery time. A new module implementing the bootstrap contract, with its own schema, handlers, and tests, can be added without modifying any existing module. Delivery for a new capability module went from 2 to 4 months (navigating entangled code) down to 5 to 10 days (implementing against a known interface).

Tenant isolation. Structural now. Two independent enforcement layers. Security review confirmed no missing filters in the new architecture, because the filters cannot be missing: individual queries do not apply them.

Deployment. Module boundaries are enforced at the code level. A change to the reporting module does not touch the ledger module's deployment artifact. The blast radius is now the modules actually changing.

Transport flexibility. Two module edges have moved to async bus communication since cutover. Zero handler rewrites. Configuration change only.


What it required beyond the architecture

The technical architecture was the solvable part. The harder part was holding stakeholder confidence over seven months of work that was incrementally valuable but not fully visible until the parallel running phase.

What worked was weekly summaries that translated each phase's technical progress into the business constraint it addressed. Not "we implemented the dispatch abstraction this week" but "the communication layer is now configuration, so the two module edges we discussed can move to async without rewriting handlers, which unblocks the reporting scale requirement from Q2."

The team that delivered this understood both why the architecture had to change and what business outcome the change was supposed to produce. That combination, not the specific technology choices, is what made the re-strategization moments navigable instead of disruptive.

Free strategy call

Thirty minutes.
Three concrete recommendations.

We review your current technology landscape, identify your top three risks, and tell you what to do next. No deck, no commitment — just senior judgement, on the record.