AWS SaaS Multi-Tenant Architecture Guide - Tenant Isolation, Pool and Silo Models, Onboarding, and Metering

First Published: 2026-06-18
Last Updated: 2026-07-26

The hardest part of building software-as-a-service (SaaS) on AWS is not adding a tenant_id column. It is being able to say, with confidence, that a bug in your application code can never let one tenant read or write another tenant's data. That property is called tenant isolation, and it is the spine of every serious multi-tenant architecture.

This guide is a Level 400 implementation walkthrough of one named reference architecture for a multi-tenant SaaS application on AWS. Instead of surveying every option, it follows a single request from sign-in to data access and shows where the tenant boundary is established, how it is enforced at runtime by the AWS platform (not just by careful coding), how new tenants are onboarded, and how per-tenant usage is metered. The constituent services — Amazon Cognito, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS KMS, AWS Step Functions, AWS IAM, and AWS STS — are each treated at implementation depth, but the focus is on how they compose into a system whose isolation guarantees hold even when application logic is wrong.

Deep selection between individual services is delegated to existing Decision Guides; this article concentrates on the build. Throughout, the emphasis is deliberately conservative: enabling a feature or attaching a policy is necessary but never sufficient. Isolation must be designed, tested, and continuously observed.

Note: This article does not include pricing. Multi-tenant cost allocation and metering are discussed only as usage (requests, items, storage, compute time), never as money. For current rates, always consult the official AWS Pricing pages for each service.

1. Introduction: Tenant Isolation Is the Architecture

A multi-tenant SaaS application serves many customers (tenants) from shared infrastructure to gain cost and operational efficiency. The moment infrastructure is shared, a new risk appears: one tenant reaching the resources of another. AWS frames the constructs that prevent this as tenant isolation — your architecture introduces controls that tightly scope access to resources and block any attempt to cross a tenant boundary.

It is worth being precise about what isolation is not. It is not the same as authentication (proving who a user is) or coarse authorization (proving a user may call an API). A user can be perfectly authenticated, fully authorized to call GET /documents, and still — through a missing WHERE clause or a forged identifier — receive another tenant's documents. Isolation is the additional layer that makes that outcome structurally impossible, ideally by pushing the boundary down into the AWS control plane where application bugs cannot reach it.

Consider the difference concretely. A handler reads tenantId from a query string and runs SELECT * FROM documents WHERE tenant = :tenantId. Authentication passed; the API was authorized; yet a caller who substitutes another tenant's identifier reads that tenant's documents, because nothing below the application verified that the caller belongs to the tenant it named. Real isolation removes the application's ability to make that mistake: the credentials the code runs with are already constrained to one tenant, so the equivalent query against another tenant's data is denied by the platform regardless of what the parameter says.

This guide covers three pillars of a multi-tenant build:

Isolation — the pool, silo, and bridge models, and how IAM, STS session tags, and ABAC enforce the boundary at request time.
Onboarding — the control-plane workflow that registers a tenant and provisions its identity, policies, data, and routing.
Metering — capturing per-tenant usage as a dimension on logs and metrics so the business can understand consumption (and hand it to a billing system) without leaking cost figures into the architecture itself.

For the individual building blocks, this article delegates to existing in-depth guides rather than repeating them: identity provider integration to the Amazon Cognito Federation Complete Implementation Guide, the order in which IAM policy types are evaluated to IAM Policy Evaluation Logic Step-by-Step, DynamoDB data modeling to the Amazon DynamoDB Single Table Design Complete Guide, account-level separation to AWS Multi-Account Operational Patterns, and fine-grained application authorization to the AWS Verified Permissions and Cedar Policy Language Complete Guide.

2. The Reference Architecture at a Glance

Every SaaS system decomposes into two planes, a separation AWS makes explicit in the SaaS Architecture Fundamentals whitepaper:

Control plane — the global, non-multi-tenant services that onboard, authenticate, manage, operate, and analyze tenants. It includes the tenant registry, the onboarding/provisioning workflow, tiering policy, and the operational and metering aggregation. There is exactly one control plane, and it is shared administrative machinery, not tenant workload.
Application plane — the multi-tenant application itself: the request path that runs tenant business logic and provisions or accesses per-tenant resources. This is where tenant isolation must be enforced on every request.

Our named reference architecture — call it the Tenant-Scoped Request architecture — wires the application plane as a serverless request path and keeps the control plane as a separate, event-driven set of services.

Application plane (the request path):

Service	Role in one line
Amazon Cognito	Authenticates the user and issues tokens that carry the tenant identity (`TenantID`).
Amazon API Gateway	Validates the token, establishes tenant context at the edge, and applies per-tenant rate limits.
AWS Lambda	Runs tenant-scoped business logic; exchanges the token for tenant-scoped temporary credentials.
AWS STS	Vends short-lived, tenant-scoped credentials by mapping the tenant tag into the session.
Amazon DynamoDB	Stores tenant data; the partition key is constrained to the caller's tenant via `dynamodb:LeadingKeys`.
AWS KMS	Encrypts tenant data with per-tenant keys and an encryption context bound to the tenant.

Control plane (administration and operations):

Service	Role in one line
AWS Step Functions	Orchestrates tenant onboarding and provisioning as a durable workflow.
Amazon DynamoDB (tenant registry)	Holds tenant metadata, tier, status, and the mapping to per-tenant resources.
AWS Lambda	Provisioning steps (create identity, policies, data, routing) and metering aggregation.
Amazon CloudWatch / AWS CloudTrail	Per-tenant usage metrics, structured logs, and an audit trail of who accessed what.

Figure: SaaS multi-tenant reference architecture (control plane and application plane)

Two design decisions are visible in this picture and worth stating up front. First, the boundary that matters most lives between Lambda and the data tier: the Lambda function never uses its own broad execution-role permissions to touch tenant data directly. It first narrows itself to a single tenant by acquiring tenant-scoped credentials, and only those scoped credentials reach DynamoDB and KMS. Second, the control plane is deliberately separate from the request path so that onboarding logic, which runs with powerful provisioning permissions, is never reachable from tenant-facing code.

The remainder of this guide walks a single request through this architecture, then steps into onboarding and metering, and finishes with the failure modes you must be able to diagnose.

3. Tenant Isolation Models: Pool, Silo, and Bridge

Before walking the request, you must choose how tenants share — or do not share — each resource. AWS SaaS guidance describes three models. They are usually discussed for the data tier, but the same vocabulary applies to compute, identity, and routing.

Thinking of the three models as a per-resource choice, rather than a single system-wide setting, is what unlocks a pragmatic design. The compute tier can be pooled (one fleet of Lambda functions serving all tenants) while the data tier is bridged (pooled for basic tenants, siloed tables for premium). Identity can be a single shared user pool while storage is per-tenant prefixed. Each resource gets the model that matches its isolation requirement and its efficiency pressure, and the tenant context (Sections 4 and 5) is the common thread that makes every one of those per-resource decisions enforceable.

Silo — each tenant gets dedicated resources. In the data tier this is a separate database (or DynamoDB table, or even a separate AWS account) per tenant. Isolation is the strongest and simplest to reason about because the boundary is a resource boundary. The trade-offs are higher operational overhead, a larger management surface, and less efficient resource pooling.
Pool — tenants share the same resources, and isolation is enforced logically at the row/item level. In DynamoDB this means a shared table where the partition key carries the tenant identifier and access is constrained with dynamodb:LeadingKeys. Pool is the most efficient and the most scalable to operate, but isolation now depends on a correctly enforced runtime control rather than a physical boundary — which is exactly why Sections 5 and 6 matter.
Bridge — a mixture. Some resources are siloed, others pooled. The AWS Guidance for Multi-Tenant Architectures frames the database flavor of bridge as a separate schema per tenant inside a shared database. More generally, bridge lets you silo what must be isolated for compliance or noisy-neighbor reasons while pooling the rest for efficiency.

Figure: Tenant isolation models compared (pool, silo, and bridge)

A useful way to choose is by tier. Many SaaS providers run a pooled model for their basic tier and a siloed model for premium tenants that require dedicated capacity, a dedicated encryption key, or a data-residency guarantee. The decision is rarely all-or-nothing:


Dimension	Silo	Pool	Bridge
Isolation boundary	Physical (resource per tenant)	Logical (runtime-enforced per item)	Mixed
Blast radius of a defect	One tenant	Potentially all pooled tenants	Limited to pooled subset
Operational overhead	High (many resources)	Low (shared)	Medium
Resource efficiency	Lower	Higher	Medium
Typical fit	Premium / regulated tenants	Basic / high-density tiers	Tiered products
Cost characteristics	See official Pricing	See official Pricing	See official Pricing

The critical engineering insight is that pool isolation is only as strong as its runtime enforcement. A silo fails closed almost by construction — there is no path from tenant A's compute to tenant B's table because the table simply is not referenced. A pool fails closed only if every code path acquires tenant-scoped credentials before touching shared data and the platform rejects out-of-scope access. The rest of this guide is largely about making the pooled path fail closed.

4. Identity and Tenant Context with Amazon Cognito

Isolation starts with a trustworthy answer to one question: which tenant is this request for? That answer must originate from the identity layer and be carried, tamper-proof, all the way to the data tier. Amazon Cognito is the identity provider in this reference architecture; the deep treatment of federation and provider integration is in the Amazon Cognito Federation Complete Implementation Guide, so here we focus only on representing tenant identity.

4.1 Representing tenant identity

Cognito supports several documented multi-tenancy patterns, each a point on an isolation-versus-effort curve:

User pool per tenant — maximum isolation and per-tenant configuration (separate sign-up, hosted UI, MFA, threat-protection policy), at the cost of higher automation and operational effort. Best when tenants need genuinely different authentication behavior.
Shared user pool with a custom attribute — a single pool with a custom:tenantID attribute distinguishing tenants. Best when the differences are surface-level (branding, layout); it shifts tenant-based authorization into the application and the tokens.
Shared pool with SAML/external IdPs per tenant — each tenant authenticates through its own corporate IdP while a custom claim (for example tenantName) marks the tenant in the issued token.

There is no single correct choice. The pattern you pick affects sign-up flow, blast radius, and quota math, which is why AWS publishes it as guidance rather than a default.

Note on quotas: Amazon Cognito quotas are applied per AWS account and AWS Region and are shared across all tenants in a shared-pool model. If you pool tenants into one user pool, model your expected request volume against the Cognito service quotas (additional capacity can be purchased). Splitting tenants across accounts or Regions gives each its own quota and the highest isolation, at the price of replicating configuration. Treat this as a capacity-planning input, not an afterthought.

4.2 Putting the tenant into the token

The application plane should never have to look the tenant up from a side channel on every request; the tenant identity should be a verified claim inside the token. Cognito's pre token generation Lambda trigger customizes claims before a token is issued.

To embed the TenantID as something AWS STS can later turn into a session tag, add the https://aws.amazon.com/tags claim. Because that claim is a JSON object (a map of principal_tags), it requires the version two (V2_0) pre token generation event, which supports complex data types (arrays, maps, JSON) and access-token customization. Per the current Cognito documentation, the V2_0 and V3_0 events are available in the Essentials or Plus feature plan (V3_0 adds machine-to-machine client-credentials customization). Configure the trigger event version to Basic features + access token customization for user identities (or set LambdaVersion in the user pool's LambdaConfig).

# Cognito pre token generation Lambda (V2_0 event).
# Adds the https://aws.amazon.com/tags claim so STS can map TenantID to a session tag.
def lambda_handler(event, context):
    tenant_id = event["request"]["userAttributes"].get("custom:tenantID")
    if not tenant_id:
        # Fail closed: never issue a token without a tenant when one is required.
        raise Exception("Missing tenant assignment for user")

    event["response"]["claimsAndScopeOverrideDetails"] = {
        "idTokenGeneration": {
            "claimsToAddOrOverride": {
                "https://aws.amazon.com/tags": {
                    "principal_tags": {"TenantID": [tenant_id]}
                }
            }
        },
        "accessTokenGeneration": {
            "claimsToAddOrOverride": {"tenant_id": tenant_id}
        }
    }
    return event

Two practices keep this honest. First, the trigger fails closed: if a user has no tenant assignment, no token is issued. Second, the TenantID comes from a Cognito-managed user attribute (custom:tenantID) set during onboarding, not from anything the client can influence — the client never gets to assert its own tenant.

4.3 Establishing tenant context at the edge

API Gateway is the first place the token is checked. A REST or HTTP API can validate the Cognito token (a JWT authorizer for HTTP APIs, or a Cognito user pool authorizer / Lambda authorizer for REST APIs) and reject anything unauthenticated before it reaches Lambda. The selection between API Gateway flavors and authorizer types is covered in the messaging and API guides; what matters here is the contract: by the time a request reaches your business logic, the tenant identity is a verified value, and the raw, untrusted client input has not been allowed to set it.

A Lambda authorizer can also enrich the request context with the resolved tenantId, which downstream integrations and access logs can read. Even so, treat the authorizer's output as convenience, not as the isolation boundary — the boundary is enforced one layer deeper, in Section 5, where the tenant tag is bound into temporary credentials that the AWS platform itself enforces.
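As one possible shape for that enrichment step, the sketch below reads the TenantID back out of claims that have already been signature-verified, using the claim layout from Section 4.2. The function and exception names are hypothetical, and signature verification is assumed to have happened upstream (in the authorizer).

```python
from collections.abc import Mapping
from typing import Any

TAGS_CLAIM = "https://aws.amazon.com/tags"


class MissingTenantError(Exception):
    """Raised when verified claims do not carry exactly one tenant."""


def extract_tenant_id(claims: Mapping) -> str:
    """Return the TenantID from already-verified ID-token claims, or fail closed."""
    tags: Any = claims.get(TAGS_CLAIM)
    if not isinstance(tags, Mapping):
        raise MissingTenantError("tags claim is missing")
    principal_tags = tags.get("principal_tags")
    if not isinstance(principal_tags, Mapping):
        raise MissingTenantError("principal_tags is missing")
    values = principal_tags.get("TenantID")
    # The claim stores each tag as a list; accept exactly one non-empty string.
    if not isinstance(values, list) or len(values) != 1:
        raise MissingTenantError("expected exactly one TenantID value")
    tenant_id = values[0]
    if not isinstance(tenant_id, str) or not tenant_id:
        raise MissingTenantError("TenantID must be a non-empty string")
    return tenant_id
```

Rejecting a missing or multi-valued tag, rather than picking a default, keeps this convenience path consistent with the fail-closed behavior of the token trigger.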

Two operational notes follow from this. First, validate the token's signature, issuer, audience, and expiry at the edge (the JWT or Cognito authorizer does this for you) so that a malformed or expired token never reaches business logic. Second, acquiring tenant-scoped credentials has a cost per call, so cache them for the life of the request (and, carefully, across requests for the same tenant within their short validity window) rather than re-assuming the role for every downstream operation. The cache key must be the tenant identity; never share a credential object across tenants, which would silently widen the boundary.
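The caching rule in the second note can be sketched as a small in-memory cache keyed strictly by tenant. This is a minimal illustration, not a prescribed component: the class name, the 60-second safety margin, and the injected `fetch` callable (which would wrap the STS call) are all assumptions. The only detail taken from STS is that the credentials dictionary carries an `Expiration` datetime.

```python
import time
from datetime import datetime, timezone
from typing import Callable, Dict, Tuple


class TenantCredentialCache:
    """Caches STS-style credential dicts, strictly one entry per tenant."""

    def __init__(self, clock: Callable[[], float] = time.time,
                 safety_margin_s: float = 60.0):
        self._clock = clock
        self._margin = safety_margin_s
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, tenant_id: str, fetch: Callable[[], dict]) -> dict:
        """Return cached credentials for tenant_id, calling fetch() when stale."""
        if not tenant_id:
            raise ValueError("tenant_id is required as the cache key")
        cached = self._entries.get(tenant_id)
        # Reuse only while the credentials are comfortably inside their validity.
        if cached is not None and cached[0] - self._margin > self._clock():
            return cached[1]
        creds = fetch()
        expiration = creds["Expiration"]
        if not isinstance(expiration, datetime):
            raise TypeError("Expiration must be a datetime")
        if expiration.tzinfo is None:
            # Treat a naive timestamp as UTC rather than local time.
            expiration = expiration.replace(tzinfo=timezone.utc)
        self._entries[tenant_id] = (expiration.timestamp(), creds)
        return creds
```

Because the tenant identity is the only key, there is no code path in this sketch through which one tenant's request can receive another tenant's credential object.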

5. Runtime Isolation with IAM Session Tags and ABAC

This is the core of the architecture. A pooled Lambda function is deployed with an execution role broad enough to serve any tenant — it can read the shared DynamoDB table. On its own, that role is the opposite of isolation. The job of this section is to ensure that on each request the function narrows itself to exactly one tenant before touching shared data, and that the narrowing is enforced by IAM and STS rather than by application logic.

5.1 The three IAM isolation methods

AWS describes three primary ways to use IAM for tenant isolation:

Dynamically generated IAM policies — generate a fresh, fully scoped policy per request at runtime (the "token vending machine" pattern: a policy template with the tenant context filled in, used to mint scoped credentials). Powerful, but you own the correctness of the templates and the generation mechanism.
Role-based access control (RBAC) — a distinct IAM role per tenant. Conceptually simple, but the number of roles and policies grows with tenants and eventually collides with IAM limits, so it scales poorly for large tenant counts.
Attribute-based access control (ABAC) — a single role whose policy references the tenant as an attribute (aws:PrincipalTag/TenantID). One role serves all tenants; the tenant value is supplied per session as a session tag. This is the most scalable approach and the one this guide implements.

ABAC wins for pooled, high-tenant-count systems because the policy count does not grow with the tenant count. There is exactly one isolation policy; the tenant is a variable.

The trade-offs between the three are mostly about scale and flexibility. RBAC is the most familiar but the least scalable: a role and policy per tenant runs into IAM entity limits and becomes a management burden once you have thousands of tenants. Dynamically generated policies remove that ceiling by keeping policies transient (generated per request from a template), but you take on the correctness and security of the generation mechanism, and the per-request generation adds latency and moving parts. ABAC sits in between for most pooled designs: one policy, one role, and the tenant supplied as a session attribute — reach for dynamic generation only when a single ABAC policy cannot express the isolation rule (for example, when per-tenant resource names, not just a key prefix, must be injected).
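For the dynamically generated option, the following is a minimal sketch of filling a policy template with the tenant context, assuming the result is passed as a session policy (the Policy parameter of an STS AssumeRole call). The function name, the tenant-identifier pattern, and the action list are illustrative assumptions rather than a prescribed design.

```python
import json
import re

# Assumption: tenant identifiers are short and restricted to a safe alphabet.
TENANT_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")
MAX_SESSION_POLICY_CHARS = 2048


def build_tenant_session_policy(tenant_id: str, table_arn: str) -> str:
    """Render a per-request session policy scoped to one tenant's partition key."""
    # Validate before interpolating: the generator owns the template's correctness.
    if not TENANT_ID_PATTERN.fullmatch(tenant_id):
        raise ValueError("tenant_id contains unexpected characters")
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:PutItem"],
                "Resource": table_arn,
                "Condition": {
                    "ForAllValues:StringEquals": {"dynamodb:LeadingKeys": [tenant_id]}
                },
            }
        ],
    }
    text = json.dumps(policy, separators=(",", ":"))
    if len(text) > MAX_SESSION_POLICY_CHARS:
        raise ValueError("rendered session policy is too long")
    return text
```

Validating the tenant identifier before it reaches the template is the part you own in this pattern; with ABAC the equivalent substitution is performed by IAM through the policy variable.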

5.2 Binding the tenant to a session with STS

A session tag is a key-value attribute passed when you assume a role or federate a user in AWS STS. Once set, it appears in the request context as aws:PrincipalTag/<key>, and policies can reference it in Condition elements. There are two ways to get the TenantID into the session, and both are legitimate:

AssumeRole with explicit session tags — the calling code passes Tags=[{Key: "TenantID", Value: tenant}]. Used when the trusted backend already knows the tenant and assumes a role itself.
AssumeRoleWithWebIdentity with a JWT tags claim — STS verifies the Cognito JWT, reads the https://aws.amazon.com/tags claim, maps TenantID into a session tag, and returns tenant-scoped credentials. This is the path our pre token generation trigger in Section 4 set up, and it keeps the tenant assertion cryptographically tied to the identity provider.

Which to use depends on where the tenant assertion is most trustworthy. AssumeRoleWithWebIdentity is attractive because STS itself verifies the token signature and extracts the tag, so a tampered or replayed token cannot inject a different tenant; the trust is anchored in the identity provider. Plain AssumeRole with explicit tags puts the responsibility on your backend to have established the tenant correctly before it sets the tag, which is fine for a trusted control-plane service but a larger surface to get wrong on the request path. For the web-identity path, you first register the identity provider in IAM so STS will trust its tokens:

aws iam create-open-id-connect-provider \
  --url https://cognito-idp.ap-northeast-1.amazonaws.com/ap-northeast-1_EXAMPLE \
  --client-id-list EXAMPLECLIENTID \
  --thumbprint-list 0000000000000000000000000000000000000000

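For an OIDC provider whose certificate chains to a trusted root certificate authority, as Amazon Cognito's does, IAM secures the connection through its library of trusted CAs and does not actually validate the thumbprint, so the placeholder value above is never checked. Whether the create-open-id-connect-provider API still requires the parameter has changed over time (recent versions appear to treat the thumbprint list as optional), so confirm its current status in the IAM API reference rather than relying on either behavior.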

One subtlety to know before you lean on tags in deeper call graphs: session tags can be marked transitive, in which case they persist across role chaining (one assumed role assuming another). For tenant isolation this is useful when a request crosses more than one role boundary and the TenantID must follow it, but it also means a transitive tenant tag cannot be changed downstream — which is exactly the property you want for a tenant identifier, and exactly the wrong property if you ever reuse the same tag key for something mutable. Pick a dedicated, never-reused key such as TenantID.
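For the explicit-tag path, a request that marks the tenant tag as transitive could be assembled as in the sketch below. Tags and TransitiveTagKeys are parameters of the STS AssumeRole API; the helper name and the default session name are assumptions made for illustration.

```python
from typing import Any, Dict


def build_assume_role_kwargs(role_arn: str, tenant_id: str,
                             session_name: str = "tenant-request") -> Dict[str, Any]:
    """Build keyword arguments for an STS AssumeRole call with a transitive TenantID tag."""
    if not tenant_id:
        raise ValueError("tenant_id is required")
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        # The session tag that policies read as aws:PrincipalTag/TenantID.
        "Tags": [{"Key": "TenantID", "Value": tenant_id}],
        # Transitive: the tag follows the session through role chaining
        # and cannot be overridden by a later AssumeRole in the chain.
        "TransitiveTagKeys": ["TenantID"],
    }


# Usage (requires boto3 and AWS credentials, so it is not executed here):
#   sts = boto3.client("sts")
#   creds = sts.assume_role(**build_assume_role_kwargs(role_arn, tenant_id))["Credentials"]
```

Keeping the kwargs construction in a pure function makes the tag contract easy to unit test without calling AWS.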

Either way, the role's trust policy must grant sts:TagSession, or the tagging operation fails:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Federated": "arn:aws:iam::111122223333:oidc-provider/cognito-idp.ap-northeast-1.amazonaws.com/ap-northeast-1_EXAMPLE" },
      "Action": ["sts:AssumeRoleWithWebIdentity", "sts:TagSession"],
      "Condition": {
        "StringEquals": {
          "cognito-idp.ap-northeast-1.amazonaws.com/ap-northeast-1_EXAMPLE:aud": "EXAMPLECLIENTID"
        }
      }
    }
  ]
}

The backend then exchanges the verified JWT for tenant-scoped credentials:

import boto3

sts = boto3.client("sts")

def get_tenant_scoped_credentials(jwt_token: str):
    # STS verifies the JWT, maps the TenantID tag from the
    # https://aws.amazon.com/tags claim into a session tag, and returns
    # credentials whose aws:PrincipalTag/TenantID is pinned to this tenant.
    resp = sts.assume_role_with_web_identity(
        RoleArn="arn:aws:iam::111122223333:role/tenant-scoped-data-access",
        RoleSessionName="tenant-request",
        WebIdentityToken=jwt_token,
    )
    return resp["Credentials"]

Be deliberate about session tag limits when you design the claim set, because exceeding them fails the call rather than silently truncating. Per the IAM documentation, an STS tagging operation fails if you pass more than 50 session tags, a tag key exceeds 128 characters, a tag value exceeds 256 characters, or the combined session policy plaintext exceeds 2,048 characters. Tenant isolation needs only one tag, so this is generous — but if you also carry tier, region, or feature flags as tags, keep the budget in mind.
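A pre-flight check for the tag limits quoted above might look like the following sketch. The function name is hypothetical, and it covers only the tag count, key length, and value length; the session policy size and the packed-size limit are outside its scope.

```python
from typing import Dict, List

MAX_SESSION_TAGS = 50
MAX_TAG_KEY_CHARS = 128
MAX_TAG_VALUE_CHARS = 256


def session_tag_problems(tags: Dict[str, str]) -> List[str]:
    """Return human-readable limit violations; an empty list means none were found."""
    problems: List[str] = []
    if len(tags) > MAX_SESSION_TAGS:
        problems.append(f"{len(tags)} tags exceeds the limit of {MAX_SESSION_TAGS}")
    for key, value in tags.items():
        if not key:
            problems.append("tag key must not be empty")
        elif len(key) > MAX_TAG_KEY_CHARS:
            problems.append(f"tag key of {len(key)} characters exceeds {MAX_TAG_KEY_CHARS}")
        if len(value) > MAX_TAG_VALUE_CHARS:
            problems.append(
                f"value for tag {key[:32]!r} has {len(value)} characters, "
                f"exceeding {MAX_TAG_VALUE_CHARS}"
            )
    return problems
```

Running a check like this where the claim set is assembled turns a runtime STS failure into a deploy-time or test-time failure.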

5.3 The tenant-scoped policy

The role assumed above carries a single policy that constrains DynamoDB access to the caller's tenant by requiring the partition key (dynamodb:LeadingKeys) to equal the session's TenantID tag:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TenantScopedItemAccess",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:BatchGetItem",
        "dynamodb:Query",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": "arn:aws:dynamodb:ap-northeast-1:111122223333:table/AppData",
      "Condition": {
        "ForAllValues:StringEquals": {
          "dynamodb:LeadingKeys": ["${aws:PrincipalTag/TenantID}"]
        }
      }
    }
  ]
}

A few details are easy to get wrong and worth stating exactly. dynamodb:LeadingKeys represents the first (partition) key attribute of the table; the key name is plural even for single-item actions, and you must use the ForAllValues set-operator modifier with it. (dynamodb:FirstPartitionKeyValues is interchangeable with LeadingKeys.) The policy variable ${aws:PrincipalTag/TenantID} resolves to the session tag at evaluation time, so the same one policy enforces a different boundary for every session. With this in place, a Query that omits or forges the tenant partition key is denied by IAM — not by your code.
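To make the set-operator semantics concrete, here is a small pure-Python model of how ForAllValues:StringEquals behaves for this condition. It is a simplified illustration of the documented operator semantics, not a reimplementation of IAM, and the function names are assumptions.

```python
from typing import Iterable, List


def for_all_values_string_equals(request_values: Iterable[str],
                                 policy_values: List[str]) -> bool:
    """True when every value in the request matches at least one policy value.

    As with the IAM operator, an empty request set is vacuously true.
    """
    return all(value in policy_values for value in request_values)


def leading_keys_allowed(request_partition_keys: Iterable[str], tenant_tag: str) -> bool:
    """Model the LeadingKeys condition after ${aws:PrincipalTag/TenantID} is resolved."""
    resolved_policy_values = [tenant_tag]
    return for_all_values_string_equals(request_partition_keys, resolved_policy_values)
```

In this model a request for the caller's own key passes, a request for another tenant's key fails, and a batch that mixes both fails as a whole. The vacuous-truth case is worth noting: an operation that presents no partition key values would satisfy the condition, which is a reason to keep the Allow list limited to key-bearing actions such as those in the policy above.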

5.4 The runtime-scoping pattern and tenant context propagation

Putting Sections 4 and 5 together produces a single repeatable pattern, often described with an isolation manager (sometimes paired with a token vending machine): the microservice never accesses tenant data with its own broad scope; it first acquires tenant-scoped credentials for the current request, then uses only those.

Figure: One tenant-scoped request, from login through Cognito, API Gateway, Lambda, and STS to DynamoDB LeadingKeys enforcement

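import boto3

def resolve_tenant_id(event) -> str:
    # Assumption: a Lambda authorizer on an HTTP API (payload format 2.0) placed the
    # verified tenant in the request context under the illustrative key "tenantId".
    # The STS credentials themselves do not carry the tenant identifier.
    return event["requestContext"]["authorizer"]["lambda"]["tenantId"]

def handler(event, context):
    # 1. Tenant context is a verified claim from the authorizer / token, never client input.
    #    This must be the Cognito ID token, because the tags claim is added to the ID token.
    jwt_token = event["headers"]["authorization"].removeprefix("Bearer ")

    # 2. Narrow to one tenant: acquire tenant-scoped credentials for THIS request
    #    (get_tenant_scoped_credentials is the function from Section 5.2).
    creds = get_tenant_scoped_credentials(jwt_token)

    # 3. Use only the scoped credentials. The function's own broad role does not touch tenant data.
    ddb = boto3.client(
        "dynamodb",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

    # A Query with a different tenant's partition key is rejected by IAM via
    # dynamodb:LeadingKeys, even if application logic is wrong.
    return ddb.query(
        TableName="AppData",
        KeyConditionExpression="pk = :t",
        ExpressionAttributeValues={":t": {"S": resolve_tenant_id(event)}},
    )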

The discipline here is a convention: every code path must acquire scoped credentials before touching tenant resources. The platform enforces the boundary only for calls made with those scoped credentials; a call made with the function's own execution role is evaluated against that broader role instead. The convention must therefore be airtight and tested.

5.5 Why "configured" is not "isolated"

It is tempting to treat a green checkmark — trust policy has sts:TagSession, data policy has LeadingKeys — as proof of isolation. It is not. Isolation is a property of the whole request path and must be verified, not assumed:

Pass attribute restrictions through, not around. DynamoDB attribute-level conditions (dynamodb:Attributes) are only evaluated against attributes the request names; if a request supplies no ProjectionExpression, all attributes are returned regardless of the policy. To actually hide attributes you must also constrain dynamodb:Select and dynamodb:ReturnValues. A policy that "looks" attribute-scoped can leak everything if this is missed.
Default to deny and prove the negative. Write automated tests that attempt cross-tenant access with one tenant's session against another tenant's keys and assert an AccessDenied. A boundary you have not tried to break is a boundary you do not know works.
Beware unscoped paths. Any code that reaches DynamoDB with the function's own execution role — a cache warmer, an admin path, a "just this once" query — bypasses the entire mechanism. Keep the execution role's direct data permissions minimal and force everything through the isolation manager.
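The "prove the negative" practice can be sketched as a test helper that tries every caller and target pair and reports any pair that was not denied. Everything here is hypothetical scaffolding: `AccessDenied` stands in for the platform error (with boto3, a wrapper would translate the client error whose code indicates access denial), and `query_as` is an injected callable that performs the query with the caller's scoped credentials.

```python
from typing import Callable, Iterable, List, Tuple


class AccessDenied(Exception):
    """Stand-in for the platform's access-denied error."""


def find_isolation_leaks(tenants: Iterable[str],
                         query_as: Callable[[str, str], object]) -> List[Tuple[str, str]]:
    """Attempt every cross-tenant read and return (caller, target) pairs that succeeded."""
    tenant_list = list(tenants)
    leaks: List[Tuple[str, str]] = []
    for caller in tenant_list:
        for target in tenant_list:
            if caller == target:
                continue
            try:
                query_as(caller, target)
            except AccessDenied:
                continue  # expected: the boundary held for this pair
            leaks.append((caller, target))
    return leaks
```

A test suite would assert that the returned list is empty; any entry identifies a concrete pair of tenants for which the boundary failed open.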

6. Data-Tier Isolation with DynamoDB and KMS

Section 5 established who may touch the data; this section is about how the data tier itself is partitioned and encrypted. Data modeling details — key design, indexes, single-table technique — are delegated to the Amazon DynamoDB Single Table Design Complete Guide; here the focus is the isolation surface.

6.1 Pool: item-level isolation with the leading key

In the pooled model, all tenants share one table and the partition key carries the tenant identifier. Combined with the dynamodb:LeadingKeys condition from Section 5.3, this gives item-level (horizontal) isolation: a principal can only read or write items whose partition key equals its tenant. AWS calls this fine-grained access control, and it is the canonical way to make a shared table behave as if each tenant had its own.

A practical schema keeps the tenant prefix in the partition key so it is impossible to query the table without naming a tenant:

pk (partition key)            sk (sort key)         attributes...
TENANT#<tenantId>             ORDER#<orderId>       status, total, ...
TENANT#<tenantId>             USER#<userId>         email, role, ...

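Because the tenant identifier is the leading key, every access is naturally tenant-qualified, and the IAM condition has a clean attribute to bind to. Note that dynamodb:LeadingKeys is compared against the full partition key value: if you adopt the TENANT# prefix shown here, the condition value in the Section 5.3 policy must carry the same prefix (for example "TENANT#${aws:PrincipalTag/TenantID}"), or the session tag itself must hold the prefixed value.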
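One way to keep the key layout consistent is to build keys only through small helpers, sketched below. The helper names and the choice to reject the `#` delimiter inside identifiers are assumptions, not part of the schema above.

```python
DELIMITER = "#"


def _checked(identifier: str, label: str) -> str:
    """Reject empty identifiers and ones that could spoof the key structure."""
    if not identifier:
        raise ValueError(f"{label} must not be empty")
    if DELIMITER in identifier:
        raise ValueError(f"{label} must not contain {DELIMITER!r}")
    return identifier


def tenant_pk(tenant_id: str) -> str:
    """Partition key for every item owned by a tenant."""
    return f"TENANT{DELIMITER}{_checked(tenant_id, 'tenant_id')}"


def order_sk(order_id: str) -> str:
    """Sort key for an order item."""
    return f"ORDER{DELIMITER}{_checked(order_id, 'order_id')}"


def user_sk(user_id: str) -> str:
    """Sort key for a user item."""
    return f"USER{DELIMITER}{_checked(user_id, 'user_id')}"
```

Centralizing key construction does not replace the IAM condition, but it removes a class of accidental mismatches between the key the code writes and the value the policy expects.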

One caveat applies to secondary indexes. dynamodb:LeadingKeys constrains the base table's partition key. A local secondary index shares that partition key, so the same condition protects it — which is why AWS's own fine-grained-access examples list both the table ARN and the index ARN under a single LeadingKeys condition. A global secondary index is different: it can be partitioned on another attribute, and a Query on that GSI is keyed on the index's own partition key, which LeadingKeys does not constrain. Granting a tenant-scoped role access to a GSI whose partition key is not the tenant identifier can therefore read across tenants even when the base-table policy looks airtight. Keep the tenant identifier as the GSI's partition key, or exclude non-tenant-keyed GSIs from the tenant-scoped role's resources — and prove the boundary holds for index queries with the same cross-tenant denial tests you use for the base table.
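The index caveat lends itself to an automated check. The sketch below inspects a table description in the shape returned by the DynamoDB DescribeTable API (the `Table` dictionary) and lists global secondary indexes whose partition key attribute is not the tenant-bearing attribute. The function name and the default attribute name `pk` are assumptions, and matching on attribute name is only a heuristic.

```python
from typing import Any, Dict, List


def find_unscoped_gsis(table_description: Dict[str, Any],
                       tenant_attribute: str = "pk") -> List[str]:
    """Return names of GSIs whose HASH key is not the tenant-bearing attribute."""
    risky: List[str] = []
    for index in table_description.get("GlobalSecondaryIndexes", []):
        hash_keys = [
            element["AttributeName"]
            for element in index.get("KeySchema", [])
            if element.get("KeyType") == "HASH"
        ]
        if hash_keys != [tenant_attribute]:
            risky.append(index["IndexName"])
    return risky
```

Any index this reports deserves either a tenant-keyed redesign or exclusion from the tenant-scoped role's resources, followed by the same cross-tenant denial tests used for the base table.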

6.2 Silo: table or account separation

In the siloed model, each tenant has its own table (named, for example, AppData-<tenantId>), and the tenant-scoped role's Resource is the tenant's table ARN rather than a shared table with a LeadingKeys condition. The boundary becomes a resource boundary, which is simpler to reason about and fails closed by construction. At the far end of the silo spectrum sits an entire AWS account per tenant; that variation, and when it is worth the operational weight, is covered in Section 10 and in AWS Multi-Account Operational Patterns.
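In a bridge design that pools basic tenants and silos premium ones, the table a request targets can be resolved from the tenant's tier. The following is a minimal sketch under that assumption; the `Tenant` type, the tier names, and the function name are hypothetical, while the table names follow the AppData and AppData-<tenantId> convention used in this guide.

```python
from dataclasses import dataclass

POOLED_TABLE = "AppData"


@dataclass(frozen=True)
class Tenant:
    tenant_id: str
    tier: str  # assumed values: "basic" (pooled) or "premium" (siloed)


def resolve_table_name(tenant: Tenant) -> str:
    """Map a tenant to its DynamoDB table according to its tier."""
    if not tenant.tenant_id:
        raise ValueError("tenant_id is required")
    if tenant.tier == "premium":
        return f"{POOLED_TABLE}-{tenant.tenant_id}"
    if tenant.tier == "basic":
        return POOLED_TABLE
    # Fail closed: an unknown tier must not fall through to the shared table.
    raise ValueError(f"unknown tier: {tenant.tier!r}")
```

Routing is a convenience only; the tenant-scoped role still has to name the matching table ARN so that IAM, not this function, is what prevents access to another tenant's table.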

6.3 Per-tenant encryption with AWS KMS

Encryption adds a second, cryptographic boundary that is valuable for compliance and for limiting blast radius. The two building blocks are per-tenant keys and encryption context.

Per-tenant customer managed keys (CMKs). A common, scalable pattern is one symmetric CMK per tenant, reused across the services that hold that tenant's data, rather than one key per tenant-per-service (which multiplies key count). A per-tenant key means a tenant's data can be cryptographically isolated, and revoking or scheduling deletion of that one key renders all of that tenant's ciphertext unreadable.
Encryption context as tenant-bound AAD. An encryption context is an optional set of non-secret key-value pairs supplied to a symmetric KMS operation. It is additional authenticated data (AAD): it is cryptographically bound to the ciphertext, must be supplied identically to decrypt, appears in plaintext in CloudTrail for auditability, and can be referenced in key policies and grants. Binding the tenant into the encryption context (for example {"tenant": "<tenantId>"}) means a decrypt attempt with the wrong tenant context simply fails.
Grants scoped by encryption context. A KMS grant can be constrained with EncryptionContextEquals or EncryptionContextSubset (these constraints apply only to symmetric keys), so a principal may only use the key when the request carries the matching tenant context:

aws kms create-grant \
  --key-id arn:aws:kms:ap-northeast-1:111122223333:key/<tenant-cmk> \
  --grantee-principal arn:aws:iam::111122223333:role/tenant-scoped-data-access \
  --operations Encrypt Decrypt GenerateDataKey \
  --constraints EncryptionContextEquals={tenant=<tenantId>}

Two further KMS considerations matter at the tenant level. Key rotation is per key, so per-tenant keys let you rotate (or, for higher tiers, support customer-managed bring-your-own-key arrangements) on a tenant-by-tenant basis without touching other tenants. And because every KMS operation that names an encryption context records it in CloudTrail, the encryption context doubles as an audit signal: you can see which tenant context decrypted which data, and an operation that supplies the wrong context shows up as a failure rather than a silent cross-tenant read.
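The difference between the two grant constraints is easy to get wrong, so here is a small local model of it. This is not KMS code: grant_permits and tenant_encryption_context are hypothetical names, and the real check is performed by KMS when the grant is used. The sketch only illustrates the matching rules described above and the habit of deriving the context from the verified tenant.

```python
def grant_permits(constraint_type: str, constraint: dict, request_context: dict) -> bool:
    """Local model of KMS grant constraints; KMS itself performs the real check."""
    if constraint_type == "EncryptionContextEquals":
        # The request context must match the constraint pair for pair.
        return request_context == constraint
    if constraint_type == "EncryptionContextSubset":
        # The request context must contain every constraint pair; extras are allowed.
        return all(request_context.get(k) == v for k, v in constraint.items())
    raise ValueError(f"unknown constraint type: {constraint_type}")

def tenant_encryption_context(verified_tenant_id: str) -> dict:
    """Derive the context from the verified tenant, never from caller input."""
    if not verified_tenant_id:
        raise ValueError("no verified tenant; refusing to build a context")
    return {"tenant": verified_tenant_id}

grant = {"tenant": "acme"}
same_tenant = grant_permits("EncryptionContextEquals", grant, tenant_encryption_context("acme"))
other_tenant = grant_permits("EncryptionContextEquals", grant, tenant_encryption_context("globex"))
extra_pair = {"tenant": "acme", "purpose": "export"}
subset_ok = grant_permits("EncryptionContextSubset", grant, extra_pair)
equals_extra = grant_permits("EncryptionContextEquals", grant, extra_pair)
print(same_tenant, other_tenant, subset_ok, equals_extra)  # True False True False
```

EncryptionContextEquals is the stricter choice for a tenant-only context; EncryptionContextSubset fits when services add their own pairs alongside the tenant pair.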

6.4 Isolating objects in Amazon S3

Tenant data is rarely only in DynamoDB; documents, exports, and attachments usually live in Amazon S3, and the same ABAC discipline applies there. The classic pooled pattern is a per-tenant key prefix — s3://app-tenant-data/<tenantId>/... — with an IAM policy that pins the accessible prefix to the session's tenant tag. Two condition surfaces are involved: the object operations are scoped by putting the tag in the resource ARN, and ListBucket is scoped by constraining the s3:prefix request condition.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TenantObjectAccess",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::app-tenant-data/${aws:PrincipalTag/TenantID}/*"
    },
    {
      "Sid": "TenantScopedList",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::app-tenant-data",
      "Condition": {
        "StringLike": { "s3:prefix": ["${aws:PrincipalTag/TenantID}/*"] }
      }
    }
  ]
}

Because the same aws:PrincipalTag/TenantID resolves the prefix at evaluation time, this one policy gives every session access to only its own tenant's prefix, just as dynamodb:LeadingKeys did for items. (Amazon S3 also supports tag-based ABAC, where buckets and the principal carry matching tags; that is an alternative to prefix scoping when you prefer to express isolation as resource tags rather than key layout.) The siloed equivalent is a bucket per tenant, scoping the resource ARN to the tenant's bucket. As with DynamoDB, the boundary is enforced by IAM against the scoped credentials — never by string-concatenating a tenant prefix in application code, which a bug could omit.
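For unit tests it can help to have a local model of how that policy variable resolves. The sketch below is only a model for reasoning and test fixtures: object_allowed is a hypothetical helper, it handles just the single trailing wildcard used in the policy above, and it is not a substitute for IAM, which remains the enforcement point.

```python
POLICY_RESOURCE = "arn:aws:s3:::app-tenant-data/${aws:PrincipalTag/TenantID}/*"
TENANT_VARIABLE = "${aws:PrincipalTag/TenantID}"

def object_allowed(session_tags: dict, bucket: str, key: str) -> bool:
    """Local model only: does the resolved Resource cover this object ARN?"""
    tenant = session_tags.get("TenantID")
    if not tenant:
        # In this model an untagged session matches nothing (fail closed).
        return False
    resolved = POLICY_RESOURCE.replace(TENANT_VARIABLE, tenant)
    prefix = resolved[:-1]  # drop the trailing '*' wildcard
    return f"arn:aws:s3:::{bucket}/{key}".startswith(prefix)

acme_session = {"TenantID": "acme"}
print(object_allowed(acme_session, "app-tenant-data", "acme/report.pdf"))    # True
print(object_allowed(acme_session, "app-tenant-data", "globex/report.pdf"))  # False
print(object_allowed(acme_session, "app-tenant-data", "acme-evil/x.pdf"))    # False
```

The third case shows why the slash after the tenant ID belongs in the policy: without it, a tenant whose ID is a prefix of another tenant's ID would match that tenant's objects.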

6.5 Deleting a tenant's data (offboarding)

Isolation is not only a runtime concern; it includes the end of a tenant's life cycle. When a tenant cancels, you must be able to delete its data completely. Two complementary techniques apply: deleting the tenant's items or table (straightforward in a silo, a scoped delete in a pool), and crypto-shredding — scheduling deletion of the tenant's KMS key so that any residual ciphertext, including in backups, becomes permanently undecryptable. Design offboarding into the control plane from the start; retrofitting reliable deletion is painful, and "we cannot actually delete a tenant" is a compliance problem, not a backlog item.
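A minimal sketch of an offboarding step for a siloed tenant follows. The function, the record shape, and the RecordingClient stand-in are assumptions for illustration; in a real control plane the two clients would be boto3 DynamoDB and KMS clients, whose delete_table and schedule_key_deletion operations are the calls made here. The pooled scoped delete (query by tenant partition, then batch delete) is deliberately left out.

```python
class RecordingClient:
    """Stand-in for a boto3 client so the sketch runs without AWS access."""
    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Record any method call instead of contacting AWS.
        def call(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return call

def offboard_tenant(record: dict, dynamodb, kms, pending_days: int = 30) -> list:
    """Delete a tenant's data resources, then schedule its key for deletion."""
    if not 7 <= pending_days <= 30:
        raise ValueError("KMS accepts a pending window of 7 to 30 days")
    steps = []
    if record["model"] == "silo":
        dynamodb.delete_table(TableName=record["table"])
        steps.append(f"deleted table {record['table']}")
    else:
        # Pooled tenants share a table, so a scoped item delete is needed instead.
        steps.append("pooled: scoped item delete required (not shown)")
    if record.get("kmsKey"):
        # Crypto-shredding: after the window, residual ciphertext is undecryptable.
        kms.schedule_key_deletion(KeyId=record["kmsKey"], PendingWindowInDays=pending_days)
        steps.append(f"scheduled key deletion in {pending_days} days")
    return steps

record = {"tenantId": "acme", "model": "silo",
          "table": "AppData-acme", "kmsKey": "arn:...:key/acme"}
dynamodb_stub, kms_stub = RecordingClient(), RecordingClient()
steps = offboard_tenant(record, dynamodb_stub, kms_stub)
print(steps)
```

Passing the clients in as parameters keeps the step testable and makes it obvious that only a control-plane role should ever hold them.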

7. Tenant Onboarding in the Control Plane

Onboarding is where a tenant becomes real: its identity, its policies, its data resources, and its routing are all created here. The AWS Well-Architected SaaS Lens (operational excellence pillar) is explicit that onboarding should be centralized in the control plane and fully automated — a single, repeatable mechanism rather than a pile of one-off steps. There are typically two entry points into that one mechanism: a self-service sign-up (a public landing page) and an internal admin portal; both funnel into the same workflow.

7.1 The tenant registry

The control plane needs a durable record of every tenant: who they are, which tier and model they use, what status they are in, and how they map to their resources. A small DynamoDB table — the tenant registry — is the system of record that onboarding writes and the rest of the control plane reads:

tenantId   tier      model   status        resources
acme       premium   silo    ACTIVE        table=AppData-acme, kmsKey=arn:...:key/acme
globex     basic     pool    ACTIVE        table=AppData (shared)
initech    basic     pool    PROVISIONING  -

The registry is control-plane data, not tenant data, so it is not subject to the per-tenant isolation policy of the application plane; instead it is guarded by being reachable only from control-plane roles. Keeping the resource mapping here (rather than hard-coded) is what lets the same application code serve a pooled tenant and a siloed tenant: the request path looks up the tenant's table or key from its verified tenantId.
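The lookup described above can be sketched with an in-memory stand-in for the registry table. TenantRecord, REGISTRY, and resolve_table are hypothetical names, and a real implementation would read the DynamoDB registry rather than a dictionary; the point is only that the table is resolved from the verified tenantId and that non-active tenants are refused.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TenantRecord:
    tenant_id: str
    tier: str
    model: str
    status: str
    resources: dict = field(default_factory=dict)

# In-memory stand-in for the tenant registry table shown above.
REGISTRY = {
    "acme": TenantRecord("acme", "premium", "silo", "ACTIVE", {"table": "AppData-acme"}),
    "globex": TenantRecord("globex", "basic", "pool", "ACTIVE", {"table": "AppData"}),
    "initech": TenantRecord("initech", "basic", "pool", "PROVISIONING"),
}

def resolve_table(registry: dict, verified_tenant_id: str) -> str:
    """Map a verified tenantId to its table; refuse unknown or inactive tenants."""
    record = registry.get(verified_tenant_id)
    if record is None:
        raise LookupError(f"unknown tenant: {verified_tenant_id!r}")
    if record.status != "ACTIVE":
        raise PermissionError(f"tenant {verified_tenant_id!r} is {record.status}")
    return record.resources["table"]

print(resolve_table(REGISTRY, "acme"))    # AppData-acme
print(resolve_table(REGISTRY, "globex"))  # AppData
```

The same call serves both models: the siloed tenant resolves to its own table and the pooled tenant to the shared one, with no branching in the caller.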

7.2 The onboarding workflow

AWS Step Functions is a natural fit because onboarding is a durable, multi-step workflow with steps that can fail and must be retried or compensated. A representative flow:

Register tenant — write the tenant record (status PROVISIONING) to the tenant registry.
Resolve tier — read the tiering policy: pooled (basic) or siloed (premium)?
Provision identity — create and configure the Cognito resource and the custom:tenantID assignment.
Provision data — pooled: nothing to create; siloed: create the tenant table and a per-tenant KMS key.
Provision policies — register the tenant context used by the isolation manager and the ABAC role.
Configure routing — map the tenant to its endpoints and usage plan.
Activate — set status ACTIVE and emit a "tenant onboarded" event.

Because tier drives provisioning, the workflow branches: a basic-tier tenant in a pooled model may create almost no new infrastructure, while a premium-tier tenant in a siloed model provisions a dedicated table and key. The tiering policy lives in the control plane, so product decisions ("premium gets a dedicated key") become workflow branches rather than scattered conditionals.
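One way to keep those product decisions in a single place is to express the tiering policy as data and derive the provisioning plan from it. The sketch below assumes hypothetical step names and a two-tier policy table; it illustrates the branching idea and is not the workflow definition itself.

```python
# Hypothetical tiering policy: product decisions live here as data.
TIERING_POLICY = {
    "basic": {"model": "pool", "dedicated_table": False, "dedicated_key": False},
    "premium": {"model": "silo", "dedicated_table": True, "dedicated_key": True},
}

def provisioning_plan(tenant_id: str, tier: str) -> list:
    """Return the ordered provisioning steps for a tenant of the given tier."""
    policy = TIERING_POLICY.get(tier)
    if policy is None:
        raise ValueError(f"unknown tier: {tier!r}")
    plan = ["register_tenant", "provision_identity"]
    if policy["dedicated_table"]:
        plan.append(f"create_table:AppData-{tenant_id}")
    if policy["dedicated_key"]:
        plan.append(f"create_kms_key:{tenant_id}")
    plan += ["provision_policies", "configure_routing", "activate"]
    return plan

print(provisioning_plan("globex", "basic"))
print(provisioning_plan("acme", "premium"))
```

Changing what a tier includes then means editing one table entry rather than hunting for conditionals across provisioning code.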

A skeletal Amazon States Language definition makes the shape concrete:

{
  "Comment": "Tenant onboarding",
  "StartAt": "RegisterTenant",
  "States": {
    "RegisterTenant": { "Type": "Task", "Resource": "arn:aws:lambda:...:register", "Next": "ResolveTier" },
    "ResolveTier": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.tier", "StringEquals": "premium", "Next": "ProvisionSilo" }
      ],
      "Default": "ProvisionPool"
    },
    "ProvisionSilo": { "Type": "Task", "Resource": "arn:aws:lambda:...:provisionSilo", "Catch": [ { "ErrorEquals": ["States.ALL"], "Next": "RollbackTenant" } ], "Next": "ConfigureRouting" },
    "ProvisionPool": { "Type": "Task", "Resource": "arn:aws:lambda:...:provisionPool", "Catch": [ { "ErrorEquals": ["States.ALL"], "Next": "RollbackTenant" } ], "Next": "ConfigureRouting" },
    "ConfigureRouting": { "Type": "Task", "Resource": "arn:aws:lambda:...:configureRouting", "Next": "ActivateTenant" },
    "ActivateTenant": { "Type": "Task", "Resource": "arn:aws:lambda:...:activate", "End": true },
    "RollbackTenant": { "Type": "Task", "Resource": "arn:aws:lambda:...:rollback", "End": true }
  }
}

7.3 Why the control plane stays separate

The provisioning Lambdas in this workflow hold powerful permissions — they create tables, keys, and identity resources. That is precisely why the control plane is isolated from the request path: tenant-facing application code must never be able to invoke onboarding logic or assume its roles. Keeping onboarding in a separate plane (and ideally a separate, well-guarded account) means a compromise of the application plane cannot provision, re-tier, or delete tenants.

8. Metering and Per-Tenant Observability

To run a SaaS business you must understand per-tenant consumption — which tenant is driving load, which is approaching a tier limit, which is a candidate for a different model. The architecture must therefore carry a tenant dimension through its logs and metrics. This section is deliberately about usage, not money: capture requests, items, storage, and compute time per tenant; convert that to invoices in a separate billing system, and keep dollar figures out of the architecture.

8.1 Tenant as a first-class dimension

Three mechanisms make per-tenant usage observable:

Structured logs with tenantId. Every log line on the request path should include the resolved tenantId (from the verified context, never from raw input). This makes per-tenant queries and troubleshooting possible after the fact.
Metrics with a tenant dimension. Emit custom metrics — request counts, item counts, processing time — with TenantId as a dimension (the CloudWatch embedded metric format lets you emit metrics from structured logs without extra API calls). Be intentional about cardinality: a tenant dimension can be high-cardinality, so aggregate or sample where appropriate rather than emitting unbounded unique dimensions.
An audit trail via CloudTrail. Because the tenant arrives at the data tier as a session tag, the aws:PrincipalTag/TenantID is recorded in CloudTrail for tagged sessions. That gives you an authoritative, platform-level record of which tenant context accessed which resource — invaluable for both billing reconciliation and security investigation.

The embedded metric format (EMF) is the lowest-friction way to emit a tenant-dimensioned metric: you write a specially structured JSON log line, and CloudWatch extracts the metric from it asynchronously, so the request path pays no extra synchronous API call. A handler can emit a per-request usage metric like this:

import json
import time

def emit_usage(tenant_id: str, operation: str, items: int):
    # CloudWatch reads this structured log line and extracts a metric
    # with TenantId as a dimension - no PutMetricData call on the hot path.
    print(json.dumps({
        "_aws": {
            # The EMF specification requires a Timestamp in epoch milliseconds.
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "SaaSApp/Usage",
                "Dimensions": [["TenantId", "Operation"]],
                "Metrics": [{"Name": "ItemsProcessed", "Unit": "Count"}]
            }]
        },
        "TenantId": tenant_id,
        "Operation": operation,
        "ItemsProcessed": items
    }))

Keep dimension cardinality in mind: with many thousands of tenants, a raw per-tenant dimension can multiply the number of metric streams. A common compromise is to emit fine-grained usage to logs (queryable on demand) and reserve metric dimensions for tier or a bounded set of high-value tenants, aggregating the rest. The point is not to meter every tenant as a custom metric, but to make per-tenant usage recoverable from logs, metrics, and the audit trail together.
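That compromise can be sketched as follows. The allowlist, the MetricTenant dimension name, and the usage_record function are assumptions made for illustration. Tenants outside the allowlist are aggregated under "other" for metrics, while the real tenantId stays in the log line as an ordinary property, so it remains queryable without creating a metric stream.

```python
import json
import time

# Hypothetical bounded set of tenants that get their own metric dimension.
DIMENSIONED_TENANTS = {"acme"}

def usage_record(tenant_id: str, tier: str, operation: str, items: int, now_ms=None) -> dict:
    """Build an EMF log record whose dimensions have bounded cardinality."""
    metric_tenant = tenant_id if tenant_id in DIMENSIONED_TENANTS else "other"
    return {
        "_aws": {
            "Timestamp": now_ms if now_ms is not None else int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "SaaSApp/Usage",
                "Dimensions": [["Tier", "MetricTenant", "Operation"]],
                "Metrics": [{"Name": "ItemsProcessed", "Unit": "Count"}],
            }],
        },
        "Tier": tier,
        "MetricTenant": metric_tenant,
        "Operation": operation,
        # Not a dimension: kept only as a log property for on-demand queries.
        "tenantId": tenant_id,
        "ItemsProcessed": items,
    }

print(json.dumps(usage_record("globex", "basic", "query", 12, now_ms=0)))
```

With this shape the number of metric streams grows with tiers and the allowlist, not with the total tenant count.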

8.2 From usage to billing - and where to stop

A metering pipeline typically aggregates raw usage events (from logs, metrics, or a stream) into per-tenant, per-period counters in the control plane, then hands those counters to a billing or rating system. This guide stops at the boundary of that handoff. The design of the billing system itself — and any integration with a marketplace billing mechanism — is a separate concern and is only mentioned, not implemented, here. The architectural rule to remember is simple: meter usage in the platform, price it elsewhere. For current rates of any AWS service referenced in this guide, consult the official AWS Pricing pages.
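The aggregation step can be sketched in a few lines. The event shape (tenantId, operation, items, and an ISO 8601 timestamp with an explicit offset) and the monthly period are assumptions; a real pipeline would read from logs or a stream and persist the counters in the control plane. Note that the output is usage counts only, with no prices attached.

```python
from collections import Counter
from datetime import datetime, timezone

def aggregate_usage(events: list) -> dict:
    """Roll raw usage events up into (tenant, period, operation) counters."""
    counters = Counter()
    for event in events:
        # Normalize to UTC so period boundaries are the same for every tenant.
        ts = datetime.fromisoformat(event["timestamp"]).astimezone(timezone.utc)
        period = ts.strftime("%Y-%m")
        counters[(event["tenantId"], period, event["operation"])] += event["items"]
    return dict(counters)

events = [
    {"tenantId": "acme", "operation": "read", "items": 3, "timestamp": "2026-07-01T00:00:00+00:00"},
    {"tenantId": "acme", "operation": "read", "items": 2, "timestamp": "2026-08-01T08:00:00+09:00"},
    {"tenantId": "acme", "operation": "read", "items": 1, "timestamp": "2026-08-01T00:00:00+00:00"},
    {"tenantId": "globex", "operation": "write", "items": 5, "timestamp": "2026-07-15T12:00:00+00:00"},
]
usage = aggregate_usage(events)
print(usage)
```

The second event is stamped 1 August in a +09:00 offset but falls on 31 July in UTC, so it is counted in the July period; fixing one time zone for period boundaries avoids disputes at month end.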

9. Failure Modes, Blast Radius, and Diagnostics

A Level 400 architecture is judged by how it behaves when something goes wrong. Each failure below is given as symptom, root cause, triage, and remediation.

9.1 Suspected cross-tenant access

Symptom: a report (or an alarm on AccessDenied patterns) suggests one tenant may have seen another's data, or a penetration test reaches across the boundary.
Root cause candidates: a data-access code path that uses the function's broad execution role instead of tenant-scoped credentials; an attribute policy undermined by a missing ProjectionExpression/Select constraint (Section 5.5); a forged or mis-set tenant claim accepted because the authorizer trusted client input.
Triage: search CloudTrail for the resource access and inspect the aws:PrincipalTag/TenantID on the session that performed it; confirm whether the call used scoped credentials or the execution role; replay the request in a test account with a deliberately mismatched tenant and confirm AccessDenied.
Remediation: route the offending path through the isolation manager; tighten the execution role's direct data permissions to near-zero; add an automated cross-tenant denial test to CI so the regression cannot recur.
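An automated cross-tenant denial test can be sketched as a small harness that tries every caller against every other tenant's partition and reports any read that was not denied. Everything here is hypothetical scaffolding: in a real suite, read_item would obtain tenant-scoped credentials for the caller and perform the actual data call, translating the service's access-denied error into AccessDenied. The two fakes only model a correctly scoped path and an unscoped one.

```python
class AccessDenied(Exception):
    """Raised by a read path when the platform refuses the request."""

def cross_tenant_leaks(read_item, tenants: list) -> list:
    """Return every (caller, owner) pair where a cross-tenant read was NOT denied."""
    leaks = []
    for caller in tenants:
        for owner in tenants:
            if caller == owner:
                continue
            try:
                read_item(caller, owner)
            except AccessDenied:
                continue  # expected: the boundary held
            leaks.append((caller, owner))
    return leaks

def fake_scoped_read(session_tenant: str, partition_tenant: str) -> dict:
    # Models a path that uses tenant-scoped credentials.
    if session_tenant != partition_tenant:
        raise AccessDenied(f"{session_tenant} may not read {partition_tenant}")
    return {"pk": partition_tenant}

def fake_unscoped_read(session_tenant: str, partition_tenant: str) -> dict:
    # Models a path that mistakenly uses the broad execution role.
    return {"pk": partition_tenant}

print(cross_tenant_leaks(fake_scoped_read, ["acme", "globex"]))    # []
print(cross_tenant_leaks(fake_unscoped_read, ["acme", "globex"]))
```

A CI job would assert that the returned list is empty; a non-empty list names exactly which pairs crossed the boundary.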

9.2 Noisy neighbor

Symptom: one tenant's traffic degrades latency or causes throttling for others in a pooled tier.
Root cause: shared capacity with no per-tenant ceiling — a single tenant consuming a disproportionate share of API throughput, Lambda concurrency, or DynamoDB capacity.
Triage: use the per-tenant metrics from Section 8 to identify the heavy tenant; correlate with throttling metrics on the shared resources.
Remediation: apply per-tenant rate limits at API Gateway (usage plans / throttling); consider reserved or provisioned concurrency for protected paths; for chronically heavy tenants, move them to a siloed (or higher) tier so their load cannot affect the pool — a tiering decision, not just a tuning knob.
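To make the idea of a per-tenant ceiling concrete, here is an in-process token bucket keyed by tenant. It is an illustration of the concept only, not a replacement for API Gateway usage plans, and the class name, parameters, and injectable clock are assumptions. Each tenant draws from its own bucket, so one tenant exhausting its allowance does not consume another's.

```python
class TenantTokenBucket:
    """Per-tenant token bucket: each tenant has its own rate and burst allowance."""
    def __init__(self, rate_per_sec: float, burst: int, clock):
        self.rate = float(rate_per_sec)
        self.burst = float(burst)
        self.clock = clock   # injectable so tests can control time
        self.state = {}      # tenant_id -> (tokens, last_refill_time)

    def allow(self, tenant_id: str) -> bool:
        now = self.clock()
        tokens, last = self.state.get(tenant_id, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.state[tenant_id] = (tokens - 1, now)
            return True
        self.state[tenant_id] = (tokens, now)
        return False

fake_now = [0.0]
limiter = TenantTokenBucket(rate_per_sec=1, burst=2, clock=lambda: fake_now[0])
acme_results = [limiter.allow("acme") for _ in range(3)]
globex_result = limiter.allow("globex")
fake_now[0] = 1.0
acme_after_refill = limiter.allow("acme")
print(acme_results, globex_result, acme_after_refill)  # [True, True, False] True True
```

The heavy tenant is throttled on its third request while the other tenant is unaffected, which is the property a pooled tier needs from whatever mechanism enforces the limit.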

9.3 Onboarding failure

Symptom: a tenant is stuck in PROVISIONING, or partially provisioned (identity created, data resources not).
Root cause: a provisioning step failed midway: a transient API error, an exhausted service quota, or a permissions gap in a provisioning Lambda.
Triage: inspect the Step Functions execution history to see exactly which state failed and why; check the tenant registry status.
Remediation: rely on the workflow's Catch/compensation path (the RollbackTenant state) to undo partial work and leave the tenant in a clean, retryable state; make each provisioning step idempotent so retries are safe; never leave a tenant half-provisioned and ACTIVE.
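Idempotency of a provisioning step can be sketched with a completion marker that is checked before any work is done. The function name, the marker store (a plain dictionary standing in for durable state such as the tenant registry), and the injected create_table callable are all assumptions for illustration.

```python
def provision_table_step(tenant_id: str, completed_steps: dict, create_table) -> str:
    """Create the tenant table at most once per recorded completion, so retries are safe."""
    step_key = (tenant_id, "create_table")
    if step_key in completed_steps:
        # A retry after success: return the recorded result, do no new work.
        return completed_steps[step_key]
    table_name = f"AppData-{tenant_id}"
    create_table(table_name)
    completed_steps[step_key] = table_name
    return table_name

created = []
markers = {}
first = provision_table_step("acme", markers, created.append)
second = provision_table_step("acme", markers, created.append)
print(first, second, created)  # AppData-acme AppData-acme ['AppData-acme']
```

The marker alone does not cover a crash between creating the table and recording the marker, so the underlying create call should also tolerate an "already exists" outcome.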

9.4 Key mix-up or decryption failure

Symptom: reads of a tenant's data fail with a KMS error, or (worse) a provisioning bug points a tenant at the wrong key.
Root cause: an encryption-context mismatch (the decrypt request did not supply the same tenant context used to encrypt), a grant that does not cover the operation, or a tenant-to-key mapping error during onboarding.
Triage: check CloudTrail for the KMS Decrypt call and compare the supplied encryption context and key ID against the tenant's expected values; verify the grant constraints.
Remediation: correct the tenant-to-key mapping in the registry; ensure the encryption context is always derived from the verified tenant context, not passed in by callers; treat a decryption failure as a fail-closed outcome (no data returned) rather than something to work around — a wrong-key success would be far worse than a clean failure.

9.5 Missing or wrong tenant context

Symptom: requests fail authorization unexpectedly, or a user is mapped to the wrong tenant after a federation or onboarding change.
Root cause: the custom:tenantID attribute was never set (or set late) during onboarding; the pre token generation trigger did not run or was on the wrong event version, so the https://aws.amazon.com/tags claim is absent; or the trust policy lacks sts:TagSession so the session carries no tenant tag.
Triage: decode the issued token and confirm the tags claim is present and well-formed; confirm the user pool's pre token generation trigger is configured for the V2_0 event; check the role trust policy for sts:TagSession; inspect the failing call's session in CloudTrail to see whether aws:PrincipalTag/TenantID is populated.
Remediation: make the tenant attribute a required part of onboarding and fail closed when it is missing (Section 4.2); add a guard that rejects a request whose verified context has no tenant rather than defaulting to one; never substitute a fallback tenant, which would be a silent isolation failure.
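The guard described in the remediation can be sketched as a function that either returns exactly one tenant or raises. The claim layout assumed here (a principal_tags object whose values are single-element lists) and the names require_tenant and MissingTenantContext are assumptions; the claims are expected to come from a token whose signature has already been verified.

```python
TAGS_CLAIM = "https://aws.amazon.com/tags"

class MissingTenantContext(Exception):
    """Raised when a verified token does not carry exactly one tenant."""

def require_tenant(claims: dict) -> str:
    """Return the tenant from verified claims, or fail closed. Never default."""
    tags = claims.get(TAGS_CLAIM)
    principal_tags = tags.get("principal_tags") if isinstance(tags, dict) else None
    values = principal_tags.get("TenantID") if isinstance(principal_tags, dict) else None
    valid = (
        isinstance(values, list)
        and len(values) == 1
        and isinstance(values[0], str)
        and values[0].strip() != ""
    )
    if not valid:
        raise MissingTenantContext("verified token carries no single TenantID tag")
    return values[0]

good_claims = {TAGS_CLAIM: {"principal_tags": {"TenantID": ["acme"]}}}
print(require_tenant(good_claims))  # acme
```

There is intentionally no default argument and no fallback branch: every malformed shape ends in the same exception, so a missing tenant can never be mistaken for a valid one.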

The throughline of all five is the same: per-tenant observability (Section 8) is what makes any of these diagnosable. Without a tenant dimension on your logs, metrics, and audit trail, every one of these incidents is a guessing game.

10. Variations: Per-Tenant Account Silos and When to Use Them

The strongest silo is an entire AWS account per tenant. It maximizes isolation (a hard account boundary), simplifies per-tenant cost visibility, and can satisfy strict compliance or data-residency requirements. The cost is operational: you now run many accounts and need automation to provision, govern, and observe them consistently.

This is usually reserved for premium or regulated tenants, or for products where a single very large tenant justifies dedicated infrastructure — a bridge model where most tenants are pooled and a few are account-siloed. The deep treatment of governing many accounts (Organizations, organizational units, service control policies, centralized guardrails) belongs to AWS Multi-Account Operational Patterns, so this guide only signals the decision: reach for account-per-tenant when the isolation, compliance, or blast-radius requirements clearly outweigh the operational multiplication — and keep pooling everything that does not need it.

At scale, the account-per-tenant model turns isolation into an account-provisioning and governance problem: new tenants mean new accounts created and baselined automatically, organization-wide guardrails applied uniformly, and routing that resolves a tenant to its account-specific endpoints. That is a different operational discipline from pooled multi-tenancy, and it is why account silos are usually the exception (premium, regulated, or unusually large tenants) layered on top of a pooled core rather than the default for every tenant.

For application-level authorization within a tenant — once isolation has confirmed the request is correctly scoped to a tenant, deciding whether this user may perform this action on this resource — a policy-based engine such as Amazon Verified Permissions with Cedar is the right tool; see the AWS Verified Permissions and Cedar Policy Language Complete Guide. Tenant isolation and in-tenant authorization are complementary layers, not substitutes.

11. Frequently Asked Questions

Should I start with pool or silo?

Most SaaS products start pooled for efficiency and density, and introduce silos for premium or regulated tenants as the need appears — a bridge model. Pool demands more rigor at runtime (Sections 5 and 6) because isolation is logical, not physical; silo trades efficiency for a simpler, resource-level boundary. Tier-based provisioning (Section 7) lets one onboarding workflow serve both.

How do I stop one tenant from reading another tenant's data even if my code has a bug?

Push the boundary below your code. Bind the tenant to the session as a session tag (sts:TagSession), and write an ABAC policy that constrains access with aws:PrincipalTag/TenantID — for DynamoDB, via dynamodb:LeadingKeys with the ForAllValues modifier. Then IAM, not your application, denies an out-of-scope request. The remaining discipline is ensuring every data path actually uses the scoped credentials.

Where does the tenant identity come from?

From the identity layer, not the client. In this architecture, Cognito stores custom:tenantID per user and a pre token generation (V2_0) Lambda trigger places it in the https://aws.amazon.com/tags claim, which STS maps to a session tag. The client never asserts its own tenant.

Is enabling these features enough to be "isolated"?

No. Configuration is necessary but not sufficient. Attribute restrictions can be bypassed if ProjectionExpression/Select/ReturnValues are not also constrained; any unscoped code path bypasses the whole mechanism; and a boundary you have not tried to break is unproven. Treat isolation as something to test continuously (automated cross-tenant denial tests), not a checkbox.

Do I need a separate KMS key per tenant?

Not always, but it is a strong pattern. One symmetric CMK per tenant (reused across that tenant's services) gives a cryptographic boundary and makes crypto-shredding on offboarding possible, while avoiding the key explosion of one key per tenant-per-service. Bind the tenant into the encryption context so a wrong-tenant decrypt fails closed.

How do I meter tenants without putting pricing in the architecture?

Capture usage — requests, items, storage, compute time — as a tenant dimension on logs, metrics (embedded metric format), and the CloudTrail aws:PrincipalTag audit trail. Aggregate per tenant in the control plane and hand those counters to a separate billing system. Keep dollar amounts out of the platform.

How do I completely delete a tenant when it cancels?

Design offboarding into the control plane. Delete the tenant's items or table, and schedule deletion of the tenant's KMS key (crypto-shredding) so residual ciphertext — including in backups — becomes undecryptable. Verify deletion is actually achievable before you onboard your first tenant.

12. Summary

Multi-tenant SaaS on AWS succeeds or fails on tenant isolation. This guide assembled one reference architecture — Cognito for identity, API Gateway for edge validation, Lambda for tenant-scoped logic, STS session tags and ABAC for runtime enforcement, DynamoDB with dynamodb:LeadingKeys and KMS per-tenant keys for the data tier, and Step Functions for control-plane onboarding — and walked a single request through it to show where the boundary lives and how the AWS platform enforces it even when application code is wrong.

The recurring lessons: choose pool, silo, or bridge by tier and blast-radius requirement; make the tenant identity a verified claim that flows from the token into a session tag; let IAM and STS enforce the boundary rather than your code; treat "configured" as the start of verification, not the end; and carry a tenant dimension through observability so usage can be metered and incidents can be diagnosed — all without putting cost figures into the architecture.

From here, the natural next steps in this series are the event-driven backbone that connects these services in Event-Driven Serverless Architecture on AWS and the network-and-identity perimeter that complements tenant isolation in the AWS Zero-Trust Network Architecture Guide. For the constituent building blocks, follow the delegated guides referenced throughout: Cognito federation, IAM policy evaluation, DynamoDB single-table design, multi-account patterns, and Verified Permissions with Cedar.

Written by Hidekazu Konishi