AWS Step Functions Distributed Map - Practical Patterns and Pitfalls for Large-Scale Parallel Workloads


A practitioner's guide to AWS Step Functions Distributed Map: Item Source taxonomy (S3 Inventory, CSV, JSON, JSONL, Parquet), ItemBatcher and ResultWriter, MaxConcurrency tuning, five production patterns, cost math against Standard and Express child workflows, and the pitfalls that bite at million-item scale.

I have previously published a series of articles on AWS Step Functions approval-flow patterns. This article shifts the focus from human-in-the-loop orchestration to the other extreme — fully unattended, large-scale parallel processing using the Distributed Map state.


If you would like to validate or visualise the ASL fragments in this article, the Step Functions ASL Validator and Visualizer tool I built may be useful.

1. Introduction — Why Distributed Map Exists

The original Step Functions Map state — now called Inline Map — runs all iterations inside the parent execution. That design imposes a set of hard limits that anyone who has tried to scale Map past a few thousand items has hit:

  • All iterations share the parent execution's 25,000-event history limit; once that ceiling is hit, the workflow is forced to terminate.
  • Inline Map caps MaxConcurrency at 40 — if you need more than 40 simultaneous iterations, Distributed Map is the only option.
  • The iterated array must fit inside the 256 KiB state input.
  • There is no native way to source items directly from Amazon S3 — the array has to be assembled in-memory upstream.

Distributed Map was announced on 2022-12-01 to remove these ceilings. Each iteration is launched as a separate child execution with its own event history, the dataset can be sourced directly from Amazon S3 (no need to fit in state input), and concurrency scales up to 10,000 simultaneous child executions per Map Run.

This article skips Step Functions fundamentals and goes straight into the design decisions that matter once you commit to Distributed Map: choosing the item source, sizing concurrency, batching, aggregating results, controlling cost, and handling the failure modes that only appear at million-item scale.

2. Distributed Map Anatomy

2.1 Map Run as a First-Class Resource

When a Distributed Map state is reached, Step Functions creates a Map Run — an AWS-managed resource with its own ARN that lives alongside the parent execution:
arn:aws:states:<region>:<account>:mapRun:<state-machine-name>/<map-label-or-uuid>:<run-uuid>
The Map Run aggregates child execution status, exposes its own CloudWatch metrics (with the Map Run ARN as the dimension), and shows up as a dedicated page in the Step Functions console with a SUCCEEDED / FAILED / PENDING breakdown of children. From the parent state machine's perspective, the Map state itself contributes only its own enter and exit transitions to the parent's execution history — the per-item state transitions of the children are recorded in each child execution's own history (Standard) or CloudWatch Logs (Express), not in the parent. This keeps the parent execution well below the 25,000-event history ceiling that bites Inline Map.

2.2 Item Source Taxonomy

Distributed Map can pull items from eight different sources. Choosing the right one is usually the most important design decision.
  • State input JSON array — Resource: none (uses ItemsPath); InputType: n/a. Up to 256 KiB worth of items already produced by an earlier state.
  • S3 ListObjectsV2 — Resource: arn:aws:states:::s3:listObjectsV2; InputType: n/a (object metadata). Process every object under a prefix — small to medium prefixes.
  • S3 GetObject, JSON — Resource: arn:aws:states:::s3:getObject; InputType: JSON. A single JSON file containing an array (use ItemsPointer for nested arrays).
  • S3 GetObject, JSONL — Resource: arn:aws:states:::s3:getObject; InputType: JSONL. One JSON object per line; ideal for streaming exports.
  • S3 GetObject, CSV — Resource: arn:aws:states:::s3:getObject; InputType: CSV. Tabular data, header in FIRST_ROW or GIVEN.
  • S3 GetObject, PARQUET — Resource: arn:aws:states:::s3:getObject; InputType: PARQUET. Analytics exports; supports Snappy/GZIP/ZSTD compression.
  • S3 Inventory manifest — Resource: arn:aws:states:::s3:getObject; InputType: MANIFEST (item type assumed CSV). Hundreds of millions of S3 objects — this is the killer source.
  • Athena UNLOAD output — Resource: arn:aws:states:::s3:getObject; InputType: MANIFEST (see note below). Use Athena to filter/shape the input set first.

Note on Athena UNLOAD: The InputType field itself is set to MANIFEST, exactly as for the S3 Inventory case. The Athena-specific behaviour is opted into through two additional ReaderConfig fields: ManifestType set to ATHENA_DATA, and a separate InputType-shaped field declaring the data file format the Athena query produced (CSV, JSONL, or PARQUET). Together this reads the Athena-generated manifest as the index and parses the referenced data files using the declared format. See the ItemReader documentation for the exact field schema.

Two limits to memorize:
  • A single item read from CSV/JSON/JSONL can be up to 8 MB, but the input passed to a child execution is still capped at 256 KiB — use ItemSelector to project a smaller payload before fan-out.
  • MaxItems can cap the dataset at up to 100,000,000 items. The default is to read everything.
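The second limit is easy to guard offline: serialise the projected item exactly as ItemSelector would emit it and check the byte count against the cap. A minimal sketch — the projected field names mirror the S3 Inventory projection used later in this article; adjust to your own:

```python
import json

CHILD_INPUT_CAP = 256 * 1024  # Step Functions caps each child execution's input at 256 KiB


def project_item(item: dict) -> dict:
    """Mimic an ItemSelector projection: keep only the fields the child needs."""
    return {"bucket": item["Bucket"], "key": item["Key"], "size": item["Size"]}


def fits_child_input(item: dict) -> bool:
    """True if the UTF-8 serialised projected payload stays under the 256 KiB cap."""
    return len(json.dumps(project_item(item)).encode("utf-8")) < CHILD_INPUT_CAP


# An inventory-style row: extra columns are dropped by the projection
row = {"Bucket": "my-input-bucket", "Key": "incoming/a.csv", "Size": 123,
       "ETag": "x" * 32, "StorageClass": "STANDARD"}
assert fits_child_input(row)
```

Running this over a sample of your dataset before the first Map Run is cheaper than discovering an oversized row ten hours into production.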

2.3 Standard vs Express Child Workflows

Distributed Map is the only place where you choose the workflow type per child execution rather than per state machine:
"ItemProcessor": {
  "ProcessorConfig": {
    "Mode": "DISTRIBUTED",
    "ExecutionType": "EXPRESS"
  },
  ...
}
Pick EXPRESS for short, idempotent, high-volume work — at-least-once semantics, up to 5 minutes maximum duration, charged per request and duration. The cost reduction versus Standard is dramatic at scale (see section 10). Express child executions also have no native execution history in Step Functions: the canonical record is in CloudWatch Logs, so plan for log-based observability from day one.

Pick STANDARD when any one of the following applies: you need exactly-once semantics, executions longer than five minutes, the Wait for Callback pattern (.waitForTaskToken), or the Run a Job pattern (.sync / .sync:2). Express Workflows are limited to the Request Response integration pattern and explicitly do not support .sync or .waitForTaskToken, per the Standard vs Express comparison — so any child task that uses those patterns forces Standard.

3. Setup — Minimum Viable Distributed Map

The skeleton ASL below is the smallest fragment that runs. Note that Mode: DISTRIBUTED and ExecutionType are both required inside ProcessorConfig.
{
  "Comment": "Minimum Distributed Map skeleton",
  "StartAt": "ProcessAll",
  "States": {
    "ProcessAll": {
      "Type": "Map",
      "Label": "ProcessAll",
      "ItemReader": {
        "Resource": "arn:aws:states:::s3:listObjectsV2",
        "Parameters": { "Bucket": "my-input-bucket", "Prefix": "incoming/" }
      },
      "ItemProcessor": {
        "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS" },
        "StartAt": "DoWork",
        "States": {
          "DoWork": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {
              "FunctionName": "arn:aws:lambda:ap-northeast-1:123456789012:function:process-object",
              "Payload.$": "$"
            },
            "End": true
          }
        }
      },
      "MaxConcurrency": 200,
      "End": true
    }
  }
}
The state machine role must allow the Map Run to launch and inspect child executions, plus read/write the relevant S3 locations:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["states:StartExecution"],
      "Resource": "arn:aws:states:ap-northeast-1:123456789012:stateMachine:my-state-machine"
    },
    {
      "Effect": "Allow",
      "Action": ["states:DescribeExecution", "states:StopExecution"],
      "Resource": "arn:aws:states:ap-northeast-1:123456789012:execution:my-state-machine:*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-input-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-input-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListMultipartUploadParts",
        "s3:AbortMultipartUpload"
      ],
      "Resource": "arn:aws:s3:::my-results-bucket/*"
    }
  ]
}
The four S3 actions on the results bucket are all required: Step Functions writes ResultWriter output via S3 multipart upload and reads back its own manifest, so s3:GetObject, s3:ListMultipartUploadParts, and s3:AbortMultipartUpload must accompany s3:PutObject — this is documented in the ResultWriter IAM section. Granting only s3:PutObject works at the smoke-test scale but breaks once a result file crosses the 5 GB rollover threshold (see section 11) and Step Functions starts a multipart upload.

Two scoping nuances in this policy that often trip people up:

  • states:StartExecution is scoped to the same state machine. The Map Run launches each child as a new execution under the same state machine ARN as the parent — the child starts at the ItemProcessor's StartAt state (not the parent's StartAt) and receives the per-item ItemSelector payload as its input. Because the child execution ARN belongs to the same state machine, the principal needs states:StartExecution on its own stateMachine:my-state-machine ARN, not on a different state machine. Scope the resource incorrectly — for example, by mistake, to a state machine name that does not exist — and every child execution fails with States.TaskFailed.
  • Express child executions use a different ARN prefix. When ExecutionType is EXPRESS, child execution ARNs use the express: prefix instead of execution:. The minimum policy above is sufficient for Standard children; if you fan out to Express children and you need to call DescribeExecution on them from outside the Map (for example, from a sibling Lambda), add a second statement with "Resource": "arn:aws:states:ap-northeast-1:123456789012:express:my-state-machine:*". Note that Express execution history is not stored in Step Functions itself — DescribeExecution on an Express ARN is limited compared to Standard, and the canonical record lives in CloudWatch Logs (see section 13).

If you use a customer-managed KMS key for either bucket, also add kms:Decrypt and kms:GenerateDataKey for that key — this is the single most common cause of "AccessDenied" errors after a successful first deploy.

4. Pattern 1 — Process S3 Inventory at Million-Object Scale

ListObjectsV2 works but issues one paged API call per 1,000 keys. For buckets with tens of millions of objects, S3 Inventory is dramatically faster: a single GetObject for the manifest, then one GetObject per inventory file. Schedule the inventory daily/weekly and let Distributed Map consume it.
{
  "Type": "Map",
  "Label": "ProcessInventory",
  "ItemReader": {
    "Resource": "arn:aws:states:::s3:getObject",
    "ReaderConfig": {
      "InputType": "MANIFEST"
    },
    "Parameters": {
      "Bucket": "my-inventory-bucket",
      "Key": "config-id/2026-04-25T00-00Z/manifest.json"
    }
  },
  "ItemSelector": {
    "bucket.$": "$.Bucket",
    "key.$": "$.Key",
    "size.$": "$.Size"
  },
  "ItemProcessor": {
    "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS" },
    "StartAt": "ScanObject",
    "States": {
      "ScanObject": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {
          "FunctionName": "arn:aws:lambda:ap-northeast-1:123456789012:function:scan-object",
          "Payload.$": "$"
        },
        "End": true
      }
    }
  },
  "MaxConcurrency": 1000,
  "ToleratedFailurePercentage": 1,
  "End": true
}
ItemSelector is doing real work here: each inventory row contains a long list of columns, and we project only the three fields the Lambda actually needs. This keeps the child input well below the 256 KiB cap and shrinks Lambda invoke payload size — relevant once you are paying per-millisecond.
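The call-count arithmetic behind "dramatically faster" is easy to make concrete. A back-of-envelope sketch — the inventory data-file count of 25 is illustrative, not measured:

```python
import math


def list_objects_calls(object_count: int, page_size: int = 1000) -> int:
    """ListObjectsV2 returns at most 1,000 keys per page — one API call per page."""
    return math.ceil(object_count / page_size)


def inventory_calls(inventory_data_files: int) -> int:
    """S3 Inventory: one GetObject for the manifest, plus one per inventory data file."""
    return 1 + inventory_data_files


# 50 million objects: ~50,000 sequential paged API calls vs a handful of reads
assert list_objects_calls(50_000_000) == 50_000
assert inventory_calls(25) == 26
```

The paged calls are also sequential (each page needs the previous continuation token), so the gap in wall-clock time is even wider than the gap in call counts.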

5. Pattern 2 — CSV Batch Processing with ItemBatcher

Invoking a Lambda once per CSV row is wasteful when each row's work takes only a few milliseconds. ItemBatcher packs N rows into one child invocation and amortises both Step Functions state-transition cost and Lambda cold-start cost.
{
  "Type": "Map",
  "Label": "BatchedCsv",
  "ItemReader": {
    "Resource": "arn:aws:states:::s3:getObject",
    "ReaderConfig": {
      "InputType": "CSV",
      "CSVHeaderLocation": "FIRST_ROW"
    },
    "Parameters": {
      "Bucket": "my-input-bucket",
      "Key": "ratings/ratings-2026-04.csv"
    }
  },
  "ItemBatcher": {
    "MaxItemsPerBatch": 500,
    "MaxInputBytesPerBatch": 262144,
    "BatchInput": { "runId.$": "$$.Execution.Name" }
  },
  "ItemProcessor": {
    "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS" },
    "StartAt": "ProcessBatch",
    "States": {
      "ProcessBatch": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {
          "FunctionName": "arn:aws:lambda:ap-northeast-1:123456789012:function:process-batch",
          "Payload.$": "$"
        },
        "End": true
      }
    }
  },
  "MaxConcurrency": 100,
  "End": true
}
MaxInputBytesPerBatch defends the 256 KiB child input limit — Distributed Map will close a batch early if adding another item would exceed the byte cap. BatchInput adds a constant payload (here, the parent execution name as a runId) to every batch, which is useful for traceability and idempotency keys downstream.
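The close-early behaviour can be simulated offline to predict batch counts. A sketch of our reading of the documented behaviour — the real service also accounts for BatchInput and envelope overhead, so actual batches may be slightly smaller:

```python
import json


def close_batches(items, max_items=500, max_bytes=262_144):
    """Greedy batching that approximates ItemBatcher: a batch closes when
    adding the next item would cross either the item cap or the byte cap."""
    batches, current, current_bytes = [], [], 0
    for item in items:
        size = len(json.dumps(item).encode("utf-8"))
        if current and (len(current) >= max_items or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(item)
        current_bytes += size
    if current:
        batches.append(current)  # flush the final partial batch
    return batches


rows = [{"id": i, "rating": i % 5} for i in range(1_200)]
batches = close_batches(rows)
assert [len(b) for b in batches] == [500, 500, 200]  # item cap binds for tiny rows
```

For small rows the item cap binds; for fat rows the byte cap closes batches early, which is exactly why setting only MaxItemsPerBatch is unsafe.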

6. Pattern 3 — Paginated API Mirroring with an Inline Driver

Sometimes the dataset is not in S3 yet — it lives behind a paginated REST API. The clean separation is to use an Inline Map (or Choice loop) as a driver that pages through the API, writes each page as a JSONL file to S3, then invokes a Distributed Map per page. This keeps the slow, sequential paging out of the hot fan-out path.
{
  "Comment": "Driver writes pages, Distributed Map fans out per file",
  "StartAt": "FetchPage",
  "States": {
    "FetchPage": {
      "Type": "Task",
      "Resource": "arn:aws:states:::http:invoke",
      "Parameters": {
        "ApiEndpoint": "https://api.example.com/orders",
        "Method": "GET",
        "Authentication": {
          "ConnectionArn": "arn:aws:events:ap-northeast-1:123456789012:connection/example-orders-api/abc123"
        },
        "QueryParameters": {
          "cursor.$": "$.cursor"
        }
      },
      "ResultPath": "$.page",
      "Next": "PersistPage"
    },
    "PersistPage": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:ap-northeast-1:123456789012:function:write-jsonl",
        "Payload.$": "$.page"
      },
      "ResultPath": "$.persisted",
      "Next": "FanOutPage"
    },
    "FanOutPage": {
      "Type": "Map",
      "Label": "FanOutPage",
      "ItemReader": {
        "Resource": "arn:aws:states:::s3:getObject",
        "ReaderConfig": { "InputType": "JSONL" },
        "Parameters": {
          "Bucket": "my-staging-bucket",
          "Key.$": "$.persisted.Payload.key"
        }
      },
      "ItemProcessor": {
        "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS" },
        "StartAt": "ProcessOrder",
        "States": {
          "ProcessOrder": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {
              "FunctionName": "arn:aws:lambda:ap-northeast-1:123456789012:function:process-order",
              "Payload.$": "$"
            },
            "End": true
          }
        }
      },
      "MaxConcurrency": 200,
      "ResultPath": null,
      "Next": "MorePages"
    },
    "MorePages": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.page.NextCursor", "IsPresent": true, "Next": "FetchPage" }
      ],
      "Default": "Done"
    },
    "Done": { "Type": "Succeed" }
  }
}
The pattern leans on three things: HTTP endpoints (Step Functions calls third-party APIs natively), JSONL as a streaming-friendly intermediate format, and ResultPath: null on the Map state so the Choice can still see the cursor.

Initial cursor is the bear trap. The first call to FetchPage happens with whatever input the state machine was started with, so $.cursor must already exist or the http:invoke Parameters block fails at runtime with a States.Runtime path error. Either start the execution with {"cursor": null} and let the API treat null as "first page", or insert a one-shot Pass state ahead of FetchPage that merges a default into the input — for example "Parameters": { "merged.$": "States.JsonMerge(States.StringToJson('{\"cursor\": null}'), $, false)" } with "OutputPath": "$.merged", defaults on the left and input on the right so a caller-supplied cursor still wins. The cleanest production form is the explicit Pass initialiser — relying on a "the API treats absent as first page" assumption breaks the first time you swap the upstream API.

http:invoke requires an EventBridge Connection to manage the API credentials — the Authentication.ConnectionArn field is mandatory, not optional. The state machine role needs three additional permissions to use it: events:RetrieveConnectionCredentials on the connection ARN, plus secretsmanager:GetSecretValue and secretsmanager:DescribeSecret on the auto-created secret that backs the connection (arn:aws:secretsmanager:<region>:<account>:secret:events!connection/*). Forgetting these is the most common reason an HTTP Task fails with Events.ConnectionResource.InvalidConnectionState on the very first run.
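For completeness, a sketch of what the write-jsonl Lambda in the driver could look like. The bucket name, key scheme, and the assumption that each page payload carries an orders array are all hypothetical — only the JSONL shape (one JSON object per line) is fixed by the ItemReader:

```python
import json


def to_jsonl(records: list) -> bytes:
    """Serialise one API page as JSONL: one UTF-8 encoded JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records).encode("utf-8")


def handler(event, context=None):
    """Lambda entry point: persist one page of orders and return the staged key
    so the FanOutPage ItemReader can pick it up via $.persisted.Payload.key."""
    import boto3  # imported here so to_jsonl stays testable without AWS credentials

    key = f"orders/pages/{event.get('cursor') or 'first'}.jsonl"  # hypothetical key scheme
    boto3.client("s3").put_object(
        Bucket="my-staging-bucket", Key=key, Body=to_jsonl(event["orders"])
    )
    return {"key": key}
```

Keeping the serialisation in a pure function makes the format testable offline, independent of S3.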

7. Pattern 4 — Multi-Tenant Fan-Out with ItemSelector

When the same workflow runs for many tenants, you want one parent execution per business event but you do not want the tenant context to be re-fetched in every child. ItemSelector injects context at fan-out time:
{
  "Type": "Map",
  "Label": "PerTenantBackup",
  "ItemReader": {
    "Resource": "arn:aws:states:::s3:listObjectsV2",
    "Parameters": {
      "Bucket": "my-tenants-config",
      "Prefix": "tenants/"
    }
  },
  "ItemSelector": {
    "tenantId.$": "$$.Map.Item.Value.Key",
    "auditRunId.$": "$$.Execution.Name",
    "snapshotDate.$": "$$.Execution.Input.snapshotDate",
    "kmsKeyArn.$": "$$.Execution.Input.kmsKeyArn"
  },
  "ItemProcessor": {
    "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "STANDARD" },
    "StartAt": "BackupTenant",
    "States": {
      "BackupTenant": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
        "Parameters": {
          "FunctionName": "arn:aws:lambda:ap-northeast-1:123456789012:function:long-backup",
          "Payload": {
            "context.$": "$",
            "taskToken.$": "$$.Task.Token"
          }
        },
        "HeartbeatSeconds": 300,
        "End": true
      }
    }
  },
  "MaxConcurrency": 50,
  "ToleratedFailurePercentage": 0,
  "End": true
}
Three things to note:

  • STANDARD is required because the child uses .waitForTaskToken, which Express Workflows do not support (Express is restricted to the Request Response integration pattern, per the Standard vs Express comparison).
  • The Map iterates over tenant config files, but ItemSelector flattens the tenant key plus parent context into a clean per-child payload.
  • MaxConcurrency is deliberately low — backup work tends to be I/O-heavy on shared resources.

For long-running .waitForTaskToken tasks, HeartbeatSeconds is essential: without it, a worker that crashes silently after receiving the task token leaves the child execution waiting indefinitely until the parent execution times out. Set HeartbeatSeconds to a value slightly larger than the expected interval between worker heartbeat calls so Step Functions can detect and fail a stalled child promptly.

Path scope inside ItemSelector is a subtle source of bugs. Inside ItemSelector, the bare $ refers to the Map state's own input — whatever the previous state passed in — while the current item is only reachable through the Context Object as $$.Map.Item.Value (and its position as $$.Map.Item.Index). To pull values from the parent execution's initial input regardless of how upstream states have reshaped the data, use the Context Object form $$.Execution.Input.<field>, as snapshotDate and kmsKeyArn do above. Writing "$.snapshotDate" here would resolve against the Map state's input instead: it happens to work while the execution input flows through untouched, then silently returns null or fails with States.Runtime the first time an upstream state rewrites the payload.

8. Pattern 5 — ML Inference Bulk Job (Express Children + Bedrock)

For embarrassingly parallel inference — embedding generation, classification, summarisation — Distributed Map with Express children and a model invocation in each child is hard to beat. ToleratedFailureCount lets the run absorb a small number of throttling failures without failing the whole job.
{
  "Type": "Map",
  "Label": "BulkSummarise",
  "ItemReader": {
    "Resource": "arn:aws:states:::s3:getObject",
    "ReaderConfig": { "InputType": "JSONL" },
    "Parameters": {
      "Bucket": "my-input-bucket",
      "Key": "documents/2026-04/manifest.jsonl"
    }
  },
  "ItemProcessor": {
    "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS" },
    "StartAt": "Summarise",
    "States": {
      "Summarise": {
        "Type": "Task",
        "Resource": "arn:aws:states:::bedrock:invokeModel",
        "Parameters": {
          "ModelId": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
          "Body": {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [
              { "role": "user", "content.$": "States.Format('Summarise: {}', $.text)" }
            ]
          }
        },
        "Retry": [
          {
            "ErrorEquals": [
              "Bedrock.ThrottlingException",
              "ThrottlingException",
              "Bedrock.ModelTimeoutException"
            ],
            "IntervalSeconds": 2,
            "MaxAttempts": 6,
            "BackoffRate": 2.0,
            "JitterStrategy": "FULL"
          },
          {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 5,
            "MaxAttempts": 2,
            "BackoffRate": 2.0,
            "JitterStrategy": "FULL"
          }
        ],
        "Catch": [
          {
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",
            "Next": "RecordDeadLetter"
          }
        ],
        "End": true
      },
      "RecordDeadLetter": {
        "Type": "Task",
        "Resource": "arn:aws:states:::sqs:sendMessage",
        "Parameters": {
          "QueueUrl": "https://sqs.ap-northeast-1.amazonaws.com/123456789012/distmap-dlq",
          "MessageBody": {
            "item.$": "$",
            "runId.$": "$$.Execution.Name",
            "stateEnteredTime.$": "$$.State.EnteredTime"
          }
        },
        "End": true
      }
    }
  },
  "TimeoutSeconds": 86400,
  "MaxConcurrency": 50,
  "ToleratedFailureCount": 100,
  "ResultWriter": {
    "Resource": "arn:aws:states:::s3:putObject",
    "Parameters": {
      "Bucket": "my-results-bucket",
      "Prefix": "summaries/2026-04/"
    },
    "WriterConfig": { "OutputType": "JSONL", "Transformation": "FLATTEN" }
  },
  "End": true
}
MaxConcurrency: 50 is deliberately conservative — Bedrock per-account model RPS is the real ceiling, not Step Functions. Always tune downward from your model's quotas, not upward from Step Functions defaults.

The two-tier Retry is intentional. Service-integration error names follow the <ServiceName>.<ExceptionName> convention but are not exhaustively contractual — the safest pattern is to enumerate the known throttle and timeout names with aggressive backoff, then add a narrower States.TaskFailed catch-all with fewer attempts so a genuinely-broken model invocation does not retry forever. Adjust MaxAttempts and IntervalSeconds based on your provisioned throughput and acceptable per-item latency.
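Before settling on MaxAttempts and IntervalSeconds, it is worth computing the worst-case wait they imply. A small sketch — modelling FULL jitter as a uniform draw from [0, base] is our reading of the documented behaviour:

```python
import random


def backoff_delays(interval=2, max_attempts=6, backoff_rate=2.0,
                   jitter=True, rng=random.random):
    """Delay before each retry under the first Retry tier above.
    Base for attempt n is interval * backoff_rate**n; FULL jitter
    (as we read it) draws each delay uniformly from [0, base]."""
    delays = []
    for attempt in range(max_attempts):
        base = interval * backoff_rate ** attempt
        delays.append(rng() * base if jitter else base)
    return delays


# Without jitter the bases are 2, 4, 8, 16, 32, 64 — up to ~126 s of waiting
assert backoff_delays(jitter=False) == [2, 4, 8, 16, 32, 64]
assert sum(backoff_delays(jitter=False)) == 126
```

Two minutes of cumulative backoff per item is fine for a batch job with Express children under the 5-minute cap only if the model call itself is fast — budget both together.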

The Catch + dedicated RecordDeadLetter state is the missing half of most Distributed Map error-handling. Retry alone only buys time; once retries are exhausted Distributed Map records the failure against ToleratedFailureCount and moves on, but it does not preserve the original item payload anywhere a human can later inspect it. Catching States.ALL with ResultPath: "$.error" preserves both the original item and the error envelope, and forwarding the pair to an SQS dead-letter queue (or DynamoDB, or an EventBridge bus) gives operators a durable record without inflating Map Run failure metrics. This pattern pairs well with Redrive (section 12) — redrive handles transient infrastructure failures, the DLQ handles data-shaped failures.

TimeoutSeconds: 86400 on the Map state caps the entire Map Run at 24 hours; without it a stuck child can hold the parent execution open until the Standard 1-year limit. Use a value comfortably above your expected end-to-end runtime, but not so generous that a runaway run goes unnoticed. HeartbeatSeconds at the child task level (shown in Pattern 4) handles the orthogonal case of an individual child stalling.

The ModelId uses the us. Cross-Region Inference (CRIS) profile, which routes inference traffic across multiple US regions for capacity. From a Tokyo (ap-northeast-1) state machine this means the request leaves the Asia Pacific region; if your data is subject to residency constraints, swap to a region-bound model ID such as anthropic.claude-3-5-haiku-20241022-v1:0 in ap-northeast-1 or use the apac. CRIS profile. The latency cost of cross-region routing is usually negligible for batch summarisation but can matter for synchronous use cases.

9. Concurrency Tuning

Three knobs control fan-out behaviour:

  • MaxConcurrency — hard ceiling on simultaneously running child executions. If unset or set to 0, Step Functions runs up to 10,000 children in parallel — the per-Map-Run hard quota. Set it explicitly, even when you intend "as fast as possible", because it is the only contract that protects downstream services.
  • ToleratedFailurePercentage (0–100) — proportion of failed children to tolerate before the Map Run is marked FAILED with States.ExceedToleratedFailureThreshold. Default is 0, which is almost always the wrong production value — with the default, a single transient failure (a Lambda cold-start timeout, a DynamoDB hot-partition throttle) anywhere in millions of items aborts the entire Map Run. Set this explicitly to a non-zero number for any idempotent retry-safe workload, even if it is only 1.
  • ToleratedFailureCount — absolute count alternative to the percentage. Use one or the other; if both are set, the Map Run fails when either threshold is crossed.

Dynamic sizing. Each of the three above has a JsonPath sibling that reads the value from the parent input at runtime: MaxConcurrencyPath, ToleratedFailurePercentagePath, ToleratedFailureCountPath. ItemBatcher exposes the same pair (MaxItemsPerBatchPath, MaxInputBytesPerBatchPath). Use them when the same state machine has to size itself per tenant or per dataset without redeploying.

A practical tuning recipe:

1. Start from the downstream limit, not from Step Functions. If the child invokes a Lambda with reserved concurrency 200, set MaxConcurrency to 150–180 to leave headroom.
2. Cap percent failure for idempotent retry-safe work (e.g. 5%) so a few retriable errors do not blow up a multi-hour run.
3. Use absolute count for safety-critical jobs where a single missed item is meaningful.
4. Add Retry blocks at the child task level for transient errors; ToleratedFailure* only counts children that exhausted their retries.
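Step 1 of the recipe reduces to a one-liner worth keeping next to your deploy scripts. A sketch with an assumed 15% headroom factor — tune the factor to your own retry and burst profile:

```python
def safe_max_concurrency(downstream_limit: int, headroom: float = 0.15) -> int:
    """Leave a slice of the downstream concurrency limit (default 15%) free
    for retries, other callers of the same resource, and burst overshoot."""
    return max(1, int(downstream_limit * (1 - headroom)))


# Lambda reserved concurrency 200 -> MaxConcurrency 170, inside the 150-180 band
assert safe_max_concurrency(200) == 170
assert safe_max_concurrency(10) == 8
```

Pair this with MaxConcurrencyPath to feed the computed value in at StartExecution time instead of hard-coding it in the ASL.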

Be aware that the parent's own Retry policy on the Map state, if defined, will start a new Map Run and re-execute every child — successful and failed alike. This is rarely what you want. Prefer per-child Retry plus per-Map-Run ToleratedFailureCount, and use Redrive (section 12) to restart only the children that actually failed.

10. Cost Analysis — State Transitions Add Up Fast

Step Functions Standard Workflows are billed at $0.025 per 1,000 state transitions in us-east-1; other regions start at the same base rate and may be slightly higher. The counting basis matters: a state transition is incurred each time a state starts, so the bill tracks the number of states executed in the run — including Pass, Choice, and Wait states, not just Task states.

Express Workflows are billed per request plus a duration-and-memory fee: $1.00 per million requests plus $0.00001667 per GB-second. The workflow memory dimension here is computed from the state machine definition size and per-step execution data, not from a configurable setting like Lambda's memory; Step Functions rounds the consumed memory up to the next 64 MB increment, so a 100 MB workload is charged as 128 MB, and tightening per-state payload size lowers the per-execution bill at high item counts. (Pricing verified 2026-04 on the AWS Step Functions pricing page; cross-check for your specific region, as rates vary and change over time.)

For a Distributed Map with N children, each child counts its own state transitions. A child workflow of three states processed once per item over 10 million items is 30 million transitions (3 state-starts × 10M children) — in Standard that is roughly $750 just for the orchestration, before any Lambda or downstream cost. The same job with EXPRESS children replaces transitions with request charges and is typically one to two orders of magnitude cheaper, depending on per-child duration. The parent's own Map state contributes only one or two transitions per Map Run regardless of N, so the dominant cost driver is the per-child state count.

Three rules of thumb for cost-controlled Distributed Map design:

1. Default to EXPRESS children unless you genuinely need Standard semantics.
2. Batch with ItemBatcher — 500 items per batch turns 10M transitions into 20K.
3. Keep child workflows shallow. Each Task in the child runs once per item; collapse glue states into a single Lambda when you can.
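The cost arithmetic from the 10-million-item example can be sketched as follows (rates as quoted above for us-east-1; Express duration-and-memory charges are deliberately excluded, so the Express figure is a floor, not a total):

```python
def standard_orchestration_cost(items: int, states_per_child: int,
                                per_1000_transitions: float = 0.025) -> float:
    """Standard children: one billable transition per state start, per child."""
    return items * states_per_child / 1000 * per_1000_transitions


def express_request_cost(items: int, per_million_requests: float = 1.00) -> float:
    """Express children: the request charge only — duration x memory billing
    comes on top and depends on payload size and per-child runtime."""
    return items / 1_000_000 * per_million_requests


# 10M children x 3 states: ~$750 of Standard transitions vs $10 of Express requests
assert abs(standard_orchestration_cost(10_000_000, 3) - 750.0) < 1e-6
assert abs(express_request_cost(10_000_000) - 10.0) < 1e-6
```

Batching compounds this: at 500 items per batch the same job becomes 20,000 children, and both columns shrink by the same factor.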

11. Result Aggregation with ResultWriter

Without ResultWriter, Distributed Map returns the array of all child outputs as the state output — and trips the 256 KiB output cap almost immediately on real workloads.

ResultWriter consolidates outputs to S3 and is the only sensible default at scale:
"ResultWriter": {
  "Resource": "arn:aws:states:::s3:putObject",
  "Parameters": {
    "Bucket": "my-results-bucket",
    "Prefix": "summaries/2026-04/"
  },
  "WriterConfig": {
    "OutputType": "JSONL",
    "Transformation": "FLATTEN"
  }
}
Step Functions writes a manifest.json plus result files per status — SUCCEEDED_0.json, FAILED_0.json, and (only if the Map Run was stopped or aborted while children were still in-flight) PENDING_0.json — each containing the input, output, ARN, and status of every child execution with that status. Result files carry the .json extension; the trailing index increments (_1, _2, …) once the rolling file size exceeds 5 GB. (Verified 2026-04-26 at ResultWriter documentation.) Downstream you typically read the manifest and process only the failed-execution files for redrive.
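Reading the manifest downstream is a few lines. The field names below (ResultFiles keyed by status, each entry carrying a Key) follow our reading of the documented manifest shape — verify against a real manifest.json before building a runbook on them:

```python
def failed_result_keys(manifest: dict) -> list:
    """S3 keys of the FAILED result files listed in a ResultWriter manifest.
    Field names are assumptions based on the documented manifest shape."""
    return [f["Key"] for f in manifest.get("ResultFiles", {}).get("FAILED", [])]


# A hand-built sample in the assumed shape
sample = {
    "DestinationBucket": "my-results-bucket",
    "ResultFiles": {
        "FAILED": [{"Key": "summaries/2026-04/FAILED_0.json", "Size": 1024}],
        "SUCCEEDED": [{"Key": "summaries/2026-04/SUCCEEDED_0.json", "Size": 4096}],
    },
}
assert failed_result_keys(sample) == ["summaries/2026-04/FAILED_0.json"]
```

A nightly job that reads the manifest, counts FAILED entries, and pages someone above a threshold is a cheap safety net on top of CloudWatch alarms.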

OutputType may be JSON or JSONL. Transformation accepts three values:

  • NONE — returns each child output wrapped with its full workflow metadata (ARN, status, input, output); the default when ResultWriter is set with an S3 destination but WriterConfig is omitted.
  • COMPACT — returns just the child outputs; the default when ResultWriter is omitted entirely (state-output mode — nothing is written to S3).
  • FLATTEN — like COMPACT, but additionally unrolls any array each child returns so the destination has one child output per line.

Most downstream consumers want FLATTEN when each child returns an array of records and COMPACT when each child returns a single record — choose NONE only when you need the original execution metadata downstream (for example, audit logs). (Verified 2026-04-26 against the ResultWriter documentation.)

12. Redrive — Restarting Failed Children Without Re-Running the Whole Map

The pre-2024 way to deal with partial failure was either to put a Retry on the Map state itself (re-runs every child, not just the failed ones — doubles the bill) or to script a custom redrive job that read the ResultWriter manifest and re-fed the failed items into a new execution. Redrive (GA 2023-11-15) replaces both: it restarts only the unsuccessful child executions of an existing Map Run, in place, with the same parent execution ARN and the same Map Run ARN. (Verified 2026-04-26 at Redrive Map Run documentation.)

Trigger it via the RedriveExecution API on the parent execution:
aws stepfunctions redrive-execution \
  --execution-arn arn:aws:states:ap-northeast-1:123456789012:execution:my-state-machine:run-id
What gets re-executed depends on the child workflow type:

  • EXPRESS children — failed and timed-out children are restarted as new executions via StartExecution. Successful children are skipped. Individual redrive attempts share the same ARN; enable CloudWatch Logs on the state machine if you need to distinguish them.
  • STANDARD children — failed, timed-out, or aborted children are resumed via RedriveExecution from the state that failed, not from the beginning. Each redrive attempt is recorded distinctly in GetExecutionHistory.

Eligibility rules worth memorising before you bake redrive into a runbook:

  • Parent execution must have started on or after 2023-11-15.
  • The parent must still be redrivable: at most 1,000 cumulative redrives, execution not yet expired (Standard executions are retained for 90 days).
  • Standard child workflows must not have exceeded the 25,000-event history limit.
  • Map Runs that failed with States.DataLimitExceeded or JSON interpolation errors are not redrivable — treat those failures as terminal and fix the data or the ASL.

The required IAM action is states:RedriveExecution on the parent execution ARN (note: the resource is the parent execution, not the Map Run):
{
  "Effect": "Allow",
  "Action": ["states:RedriveExecution"],
  "Resource": "arn:aws:states:ap-northeast-1:123456789012:execution:my-state-machine:*"
}
In the Step Functions console, the Map Run page shows a Redrive count column and the execution history records a MapRunRedriven event each time the API is called. Children left pending because of MaxConcurrency are placed in a "Pending redrive" state and pick up automatically as concurrency frees up.

Operational pattern: emit an EventBridge rule on the Map Run FAILED status, pipe it to a small Lambda that reads FAILED_n.json from the ResultWriter output, decides whether the failure mode is transient (throttling, timeout) or terminal (validation, schema), and either calls RedriveExecution or escalates. This pattern replaces a surprising amount of bespoke retry-and-DLQ scaffolding.
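
The decision step in that Lambda reduces to classifying error names. A sketch of the transient-vs-terminal logic, assuming the entries in FAILED_n.json carry an "Error" field and using illustrative error names — match both to what your workers actually emit:

```python
# Error names treated as transient (worth a redrive). Illustrative set;
# extend it with the throttle/timeout errors your own tasks produce.
TRANSIENT_ERRORS = {
    "Lambda.TooManyRequestsException",
    "Lambda.ServiceException",
    "States.Timeout",
}

def should_redrive(failed_entries: list[dict]) -> bool:
    """Redrive only if every failure looks transient; a single terminal
    error (validation, schema) means a redrive would just fail again."""
    errors = [e.get("Error", "") for e in failed_entries]
    return bool(errors) and all(err in TRANSIENT_ERRORS for err in errors)

entries = [
    {"ExecutionArn": "arn:...:execution:child-1", "Error": "Lambda.TooManyRequestsException"},
    {"ExecutionArn": "arn:...:execution:child-2", "Error": "States.Timeout"},
]
print(should_redrive(entries))                      # True -> call RedriveExecution
print(should_redrive([{"Error": "ValidationError"}]))  # False -> escalate
```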

13. Observability — CloudWatch Map Run Metrics

Step Functions emits a dedicated set of Map Run metrics with the Map Run ARN as the dimension. The metrics worth alarming on:

  • ExecutionsStarted / ExecutionsSucceeded / ExecutionsFailed / ExecutionsAborted — per Map Run
  • ItemsProcessed — total items consumed from the source
  • ExecutionThrottled — children rejected because the account-level execution quota was exceeded
  • ChildWorkflowsTimedOut — common signal of downstream throttling
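
An alarm on ExecutionThrottled is the one most teams wire up first. A sketch of the put_metric_alarm parameters — the namespace and dimension name ("AWS/States" / "MapRunArn") are assumptions here, so verify them against the metrics CloudWatch actually shows for one of your Map Runs before deploying:

```python
def throttle_alarm_params(map_run_arn: str, threshold: int = 1) -> dict:
    """Build put_metric_alarm parameters that fire when child executions
    are throttled. Namespace/dimension names are assumed, not verified --
    check them in the CloudWatch console for a real Map Run first."""
    return {
        "AlarmName": "distributed-map-execution-throttled",
        "Namespace": "AWS/States",                      # assumed namespace
        "MetricName": "ExecutionThrottled",
        "Dimensions": [{"Name": "MapRunArn", "Value": map_run_arn}],  # assumed dimension
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
    }

# Usage: boto3.client("cloudwatch").put_metric_alarm(**throttle_alarm_params(arn))
```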

In the Step Functions console, every Map Run gets a dedicated page that lists children with status and lets you drill into any individual execution history. This is by far the fastest way to debug a partial failure: filter for FAILED, click into one, look at the cause string. AWS X-Ray tracing can be enabled on the parent state machine for both Standard and Express child workflows. With Standard children the trace follows into the child execution history natively; with Express children, traces are surfaced through the CloudWatch Logs that Express writes to. Express does not retain its own execution history, so log-based observability is the canonical record — see the CloudWatch Logs note below.

Express children require CloudWatch Logs to be debuggable. Unlike Standard executions, Express child executions have no execution history in Step Functions itself; GetExecutionHistory returns nothing. The only way to inspect what happened inside an Express child is the CloudWatch Logs group attached to the state machine via LoggingConfiguration. Set Level: ALL and IncludeExecutionData: true for at least the first few weeks of a new Distributed Map deployment; once you are confident in the Lambda-side logs you can dial back to ERROR to control log volume.

14. Integration Choices and Infrastructure as Code

14.1 Optimised vs SDK Service Integrations

Step Functions exposes two flavours of service integration. The choice affects ASL ergonomics, parameter shape, and the failure modes you have to plan for, especially inside a Distributed Map child where retries multiply.

  • Optimised integrationsarn:aws:states:::lambda:invoke, arn:aws:states:::s3:getObject, arn:aws:states:::bedrock:invokeModel, etc. Step Functions provides parameter normalisation, retries on transport-level errors, and a predictable error-name vocabulary (Lambda.ServiceException, Bedrock.ThrottlingException, etc.). Inputs use the higher-level shapes (e.g. FunctionName + Payload), and the result is unwrapped automatically. Use these whenever the optimised version exists for the call you need.
  • AWS SDK integrations — arn:aws:states:::aws-sdk:<service>:<action>. These pass the parameters straight through to the AWS SDK, with no normalisation. Use them when you need an action that has no optimised variant (most Bedrock control-plane APIs, IAM, EC2's DescribeInstances, etc.) or when you need a parameter that the optimised integration does not surface.

A concrete contrast for Lambda inside a Distributed Map child:
{
  "Optimised": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
      "FunctionName": "arn:aws:lambda:ap-northeast-1:123456789012:function:process",
      "Payload.$": "$"
    },
    "Retry": [
      { "ErrorEquals": ["Lambda.TooManyRequestsException", "Lambda.ServiceException"], "IntervalSeconds": 1, "MaxAttempts": 6, "BackoffRate": 2.0, "JitterStrategy": "FULL" }
    ],
    "End": true
  },
  "Sdk": {
    "Type": "Task",
    "Resource": "arn:aws:states:::aws-sdk:lambda:invoke",
    "Parameters": {
      "FunctionName": "process",
      "Payload.$": "$",
      "InvocationType": "RequestResponse"
    },
    "End": true
  }
}
The optimised lambda:invoke automatically wraps the response, retries on the throttle exceptions Lambda actually returns, and emits errors under the Lambda.* namespace that other state machines elsewhere in the codebase will already have Retry blocks for. The SDK form is more flexible but you have to write the retry vocabulary yourself; at million-item scale the inconsistency between two sibling Map definitions becomes a real maintenance burden. Default to optimised; reach for SDK only when the optimised form does not expose what you need.

14.2 Map-Level TimeoutSeconds as a Safety Valve

Distributed Map inherits the parent state machine's overall timeout, but it also accepts TimeoutSeconds on the Map state itself, capping the entire Map Run independently of how long individual children take. The pattern that prevents the worst surprises:

{
  "Type": "Map",
  "Label": "BulkProcess",
  "TimeoutSeconds": 14400,
  "ItemReader": { "...": "..." },
  "ItemProcessor": {
    "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS" },
    "StartAt": "Work",
    "States": {
      "Work": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": { "FunctionName": "process", "Payload.$": "$" },
        "TimeoutSeconds": 60,
        "End": true
      }
    }
  },
  "MaxConcurrency": 100,
  "End": true
}
Two layers, two purposes:

  • Map-level TimeoutSeconds bounds the whole Map Run. Pick a number that is "obviously too long" for a healthy run but short enough that a stuck Map Run gets noticed within an on-call rotation — for a 30-minute job, 4 hours (14,400 s) is a reasonable safety net.
  • Task-level TimeoutSeconds bounds each child Task. Pair this with HeartbeatSeconds on .waitForTaskToken tasks (Pattern 4) so a stalled worker fails fast. Without these, a child can stay RUNNING until the Standard 1-year ceiling.

Map Runs that hit the Map-level timeout fail with States.Timeout, are eligible for Redrive, and emit the ChildWorkflowsTimedOut metric. Both are infinitely better signals than "the run is still going six hours later, no idea why".
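
The "obviously too long" guidance can be turned into arithmetic. A sketch of one way to size the Map-level timeout from the run's ideal wall-clock time, purely illustrative numbers (the 4x factor mirrors the 30-minute-job / 4-hour-timeout example above):

```python
import math

def suggest_map_timeout(items: int, batch_size: int, seconds_per_batch: float,
                        max_concurrency: int, safety_factor: float = 4.0) -> int:
    """Estimate a Map-level TimeoutSeconds: ideal wall-clock duration of the
    run multiplied by a generous safety factor. Back-of-envelope only --
    tune the factor to how quickly on-call should notice a stuck run."""
    batches = math.ceil(items / batch_size)            # child executions
    waves = math.ceil(batches / max_concurrency)       # sequential waves
    ideal_seconds = waves * seconds_per_batch
    return math.ceil(ideal_seconds * safety_factor)

# 1,000,000 items, 100 per batch, 30 s per batch, 100 concurrent children:
# 10,000 batches -> 100 waves -> 3,000 s ideal -> 12,000 s with 4x headroom.
print(suggest_map_timeout(1_000_000, 100, 30, 100))  # 12000
```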

14.3 Defining Distributed Map in CDK and SAM

ASL is the source of truth, but most production state machines are deployed via Infrastructure as Code. Two patterns work well for Distributed Map:

AWS CDK (TypeScript) — raw ASL via DefinitionBody:

import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import * as iam from "aws-cdk-lib/aws-iam";
import * as logs from "aws-cdk-lib/aws-logs";

const stateMachine = new sfn.StateMachine(this, "BulkProcessor", {
  stateMachineType: sfn.StateMachineType.STANDARD,
  definitionBody: sfn.DefinitionBody.fromFile("asl/bulk_processor.asl.json"),
  tracingEnabled: true,
  logs: {
    destination: new logs.LogGroup(this, "BulkProcessorLogs", {
      retention: logs.RetentionDays.ONE_MONTH,
    }),
    level: sfn.LogLevel.ALL,
    includeExecutionData: true,
  },
});

// Distributed Map fans out by starting child executions of the same state machine.
stateMachine.grantStartExecution(stateMachine.role);
// DescribeExecution / StopExecution on child executions.
stateMachine.role.addToPrincipalPolicy(new iam.PolicyStatement({
  actions: ["states:DescribeExecution", "states:StopExecution"],
  resources: [`arn:aws:states:${this.region}:${this.account}:execution:${stateMachine.stateMachineName}:*`],
}));
The two-step IAM grant captures exactly the policy from section 3, expressed as CDK. Keeping the ASL in a separate .asl.json file (as opposed to sfn.DefinitionBody.fromString(...)) means you can validate it against the Step Functions ASL Validator and Visualizer in CI before deployment.

AWS SAM — AWS::Serverless::StateMachine with DefinitionUri:

Resources:
  BulkProcessor:
    Type: AWS::Serverless::StateMachine
    Properties:
      Type: STANDARD
      DefinitionUri: asl/bulk_processor.asl.json
      DefinitionSubstitutions:
        InputBucket: !Ref InputBucket
        ResultsBucket: !Ref ResultsBucket
      Tracing: { Enabled: true }
      Logging:
        Level: ALL
        IncludeExecutionData: true
        Destinations:
          - CloudWatchLogsLogGroup: { LogGroupArn: !GetAtt LogGroup.Arn }
      Policies:
        - StepFunctionsExecutionPolicy: { StateMachineName: !Ref AWS::StackName }
        - S3ReadPolicy: { BucketName: !Ref InputBucket }
        - S3CrudPolicy: { BucketName: !Ref ResultsBucket }
        - Statement:
            - Effect: Allow
              Action: ["states:DescribeExecution", "states:StopExecution"]
              Resource: !Sub "arn:aws:states:${AWS::Region}:${AWS::AccountId}:execution:${AWS::StackName}-BulkProcessor*:*"
DefinitionSubstitutions is the SAM equivalent of the CDK definitionSubstitutions prop — it lets the ASL stay a static file with placeholders that resolve at deploy time. Either pattern keeps the ASL source-controlled, IDE-formattable, and visualisable in the Step Functions console workflow studio without round-tripping through generated code.

15. Common Pitfalls

Lambda concurrency starvation. A MaxConcurrency of 1,000 will happily try to run 1,000 Lambda invocations in parallel. If your function does not have reserved concurrency configured, it will compete with every other function in the account for the unreserved pool. The fix is reserved concurrency on the function plus MaxConcurrency set just below it — and an alarm on Throttles for the function.
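
"MaxConcurrency set just below reserved concurrency" is a one-line calculation, but it is easy to forget that retries and multi-invoke children also consume concurrency. A sketch, where the 0.9 headroom factor is a starting point rather than a documented value:

```python
def safe_max_concurrency(reserved_concurrency: int, invokes_per_item: int = 1,
                         headroom: float = 0.9) -> int:
    """Pick a Map MaxConcurrency that stays under the function's reserved
    concurrency, leaving headroom for overlapping retries. If each child
    invokes the function more than once concurrently, divide accordingly."""
    return max(1, int(reserved_concurrency * headroom) // invokes_per_item)

print(safe_max_concurrency(1000))                      # 900
print(safe_max_concurrency(1000, invokes_per_item=2))  # 450
```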

256 KiB child input vs 8 MB item read. You can read items up to 8 MB from a JSON/CSV/JSONL source, but the input passed to a child execution is still capped at 256 KiB. Always pair ItemReader with an ItemSelector that projects only the fields the child actually needs.
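
A minimal ItemSelector doing that projection might look like the following — the field names (key, tenant_id) are placeholders for whatever your source items actually carry:

```json
"ItemSelector": {
  "key.$": "$$.Map.Item.Value.key",
  "tenantId.$": "$$.Map.Item.Value.tenant_id"
}
```

$$.Map.Item.Value is the context-object path for the current item inside a Map state; everything not listed here is dropped before the child execution is started.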

Same-account, same-Region buckets only. ItemReader and ResultWriter require S3 buckets in the same AWS account and Region as the state machine. Cross-account fan-in/out has to be staged via copy. (Verified 2026-04-26 against the AWS Step Functions developer guide.)

Customer-managed KMS keys. If either bucket uses SSE-KMS with a CMK, the state machine role needs kms:Decrypt on the read side and kms:GenerateDataKey on the write side. Distributed Map will not surface this clearly — you will see generic AccessDenied per child.
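
Assuming separate customer-managed keys on the input and results buckets, the extra statements on the state machine role look like this (the key ARNs are placeholders for your own CMKs):

```json
[
  {
    "Effect": "Allow",
    "Action": ["kms:Decrypt"],
    "Resource": "arn:aws:kms:ap-northeast-1:123456789012:key/INPUT-BUCKET-CMK-ID"
  },
  {
    "Effect": "Allow",
    "Action": ["kms:GenerateDataKey"],
    "Resource": "arn:aws:kms:ap-northeast-1:123456789012:key/RESULTS-BUCKET-CMK-ID"
  }
]
```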

Map-level Retry re-runs every child. A Retry block on the Map state itself launches an entirely new Map Run on retry, re-executing every item, succeeded or failed. Use it only when the failure mode is genuinely "the whole batch was bad". For per-item resilience, put Retry on the child task and use ToleratedFailurePercentage/ToleratedFailureCount at the Map level — and reach for Redrive (section 12) when you only want to restart the children that actually failed.

Parquet has its own constraints. Parquet input requires row-group ≤ 256 MB and footer ≤ 5 MB; VersionId is not supported. If your existing Parquet pipeline writes giant row groups, repartition before Distributed Map can read it. (Verified 2026-04-26 against the AWS Step Functions developer guide.)

Account-level child execution quotas. The 10,000 default MaxConcurrency does not exempt you from per-account StartExecution rate limits or open-execution caps. At very high concurrency you will see ExecutionThrottled — request a quota increase in advance for predictable bulk jobs. Exact current quotas (verified 2026-04 on docs.aws.amazon.com/step-functions/latest/dg/service-quotas.html): open Standard Workflow executions limit is 1,000,000 per account per region; StartExecution rate in us-east-1/us-west-2/eu-west-1 is 1,300 burst / 300 per second sustained; Distributed Map: max 1,000 open Map Runs and max 10,000 parallel child executions (hard quota).
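
Those StartExecution numbers translate directly into how long a very large fan-out takes just to launch. A simple token-bucket-style estimate using the us-east-1 figures quoted above — a rough model for capacity planning, not an SLA:

```python
def seconds_to_start(children: int, burst: int = 1300, sustained: int = 300) -> float:
    """Rough time to launch N child executions given a StartExecution quota
    modelled as a burst bucket refilled at a sustained rate. Defaults are
    the us-east-1 numbers quoted in the text; adjust per Region."""
    if children <= burst:
        return 0.0                            # whole fan-out fits in the burst
    return (children - burst) / sustained     # remainder drains at the sustained rate

# Launching 100,000 Express children at 1,300 burst / 300 per second sustained:
print(round(seconds_to_start(100_000)))  # 329
```

In other words, a 100,000-child Map Run spends roughly five and a half minutes just starting children even before any work happens — worth knowing before you alarm on "slow start".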

16. Summary

Distributed Map turns Step Functions into a viable orchestrator for workloads that previously needed a custom queue-and-fleet design. The five patterns here cover the bulk of what teams build in practice — S3 inventory sweep, batched CSV processing, paginated API mirror, multi-tenant fan-out, and bulk inference. Combine them with ItemBatcher for cost control, ItemSelector for payload hygiene, ResultWriter for clean S3 outputs, and ToleratedFailurePercentage for resilience, and you have a serverless replacement for most of what used to require Step Functions plus SQS plus a lot of glue.

The patterns that go wrong tend to do so silently: a missing KMS permission shows up as per-child AccessDenied, a default MaxConcurrency exhausts Lambda's unreserved pool, a Map-level Retry doubles your bill on a flaky run. Treat the section 15 pitfalls as a pre-deploy checklist for anything past a thousand items, and adopt Redrive (section 12) as the standard way to recover from partial failures rather than rerunning the entire job.

Written by Hidekazu Konishi