Lambda + SQS: idempotency, DLQ, partial batch failure

The consumer in Article 09 only logs, and it ignores two facts of real asynchronous processing. First, distributed messaging systems usually guarantee at-least-once delivery, so the same event can arrive twice; if you increment the counter on every receipt, you overcount. Second, when processing fails, where does the event go, how many times is it retried, and does it vanish if it still fails? This article turns the consumer into a real aggregator that handles both.

Goal

Insert an SQS queue between EventBridge and Lambda to get batching, retries, and a dead-letter queue (DLQ). Count clicks in DynamoDB idempotently so that a duplicate event is counted only once, attach a DLQ so failed events aren't lost, and report failures per message so only the broken one is retried. We test all three for real: correct counting, a duplicate counted once, and a failing event landing in the DLQ.

Why insert SQS in the middle

Article 09 wired EventBridge straight into Lambda. Adding an SQS queue in the middle gives us three things the straight wire doesn't have out of the box. SQS gathers messages into a batch so Lambda processes several at once, which is cheaper. It holds messages and retries on a schedule if processing fails, instead of losing them. And it has a dead-letter queue: after a number of failed attempts, the message is moved to a separate queue for investigation later, rather than stuck forever on the main queue.

   resolve ──PutEvents──▶ EventBridge bus
                              │ rule matches link.clicked
                              ▼
                          ClickQueue (SQS) ──▶ ClickAggregator (Lambda, batch)
                              │                      │ writes DynamoDB (transaction)
                              │ after 3 failures     │
                              ▼                      ▼
                          ClickDLQ (SQS)        META.clicks += 1
                          (investigate later)  STAT#<day>.count += 1
                                                CLICK#<eventId> (idempotency marker)

In SAM, the EventBridge rule points its target to SQS, and Lambda receives from SQS:

ClickQueue:
  Type: AWS::SQS::Queue
  Properties:
    VisibilityTimeout: 30
    RedrivePolicy:
      deadLetterTargetArn: !GetAtt ClickDLQ.Arn
      maxReceiveCount: 3

maxReceiveCount: 3 means a message is delivered to the consumer at most three times; if it still hasn't been processed and deleted after the third receive, SQS moves it to ClickDLQ instead of delivering it a fourth time.
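The interaction between maxReceiveCount and VisibilityTimeout can be modeled in a few lines. This is a simplified simulation, not SQS itself: the name simulateRedrive and its parameters are made up for illustration, and it assumes that every failed receive hides the message for the full visibility timeout and that the next receive happens the instant the message is visible again.

```typescript
// Simplified model of SQS redrive. Assumptions (not SQS itself):
// each failed receive hides the message for the full visibility timeout,
// and the next receive happens the instant it becomes visible again.
type RedriveResult = {
  outcome: "processed" | "dlq";
  receives: number;       // how many times the consumer saw the message
  elapsedSeconds: number; // time from first receive to the final outcome
};

function simulateRedrive(
  maxReceiveCount: number,
  visibilityTimeoutSeconds: number,
  succeedsOnAttempt: number | null, // null = the consumer never succeeds
): RedriveResult {
  let elapsedSeconds = 0;
  for (let receives = 1; receives <= maxReceiveCount; receives++) {
    if (succeedsOnAttempt !== null && receives === succeedsOnAttempt) {
      return { outcome: "processed", receives, elapsedSeconds };
    }
    // Failed attempt: the message stays invisible until the timeout expires.
    elapsedSeconds += visibilityTimeoutSeconds;
  }
  // The receive count would now exceed maxReceiveCount, so the message
  // goes to the dead-letter queue instead of being delivered again.
  return { outcome: "dlq", receives: maxReceiveCount, elapsedSeconds };
}

const alwaysFails = simulateRedrive(3, 30, null); // never succeeds
const secondTry = simulateRedrive(3, 30, 2);      // succeeds on the second receive
```

With the queue settings above (3 receives, 30 seconds), a message that always fails reaches the DLQ about 90 seconds after its first delivery in this model; real timings also include polling and invocation latency.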

Idempotent counting with a transaction

Idempotent counting is the part that needs the most care. Every EventBridge event carries a unique id. The idea is to write a marker item, CLICK#<id>, that records "this event has been processed", and to allow that write only if the marker doesn't already exist. Putting the marker write and the counter increments in the same transaction means they either all happen or none do:

await ddb.send(
  new TransactWriteCommand({
    TransactItems: [
      {
        Update: {
          TableName: TABLE,
          Key: { PK: `LINK#${code}`, SK: "META" },
          UpdateExpression: "ADD clicks :one",
          ExpressionAttributeValues: { ":one": 1 },
        },
      },
      {
        Update: {
          TableName: TABLE,
          Key: { PK: `LINK#${code}`, SK: `STAT#${date}` },
          UpdateExpression: "ADD #c :one",
          ExpressionAttributeNames: { "#c": "count" },
          ExpressionAttributeValues: { ":one": 1 },
        },
      },
      {
        Put: {
          TableName: TABLE,
          Item: { PK: `CLICK#${id}`, SK: "CLICK", ttl },
          ConditionExpression: "attribute_not_exists(PK)",
        },
      },
    ],
  })
);

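This transaction does three things: it increments the total counter on the META item, increments the day counter on the STAT#<day> item, and writes the marker with the condition attribute_not_exists(PK). If the event arrives a second time, the marker already exists, the condition fails, and DynamoDB cancels the entire transaction, so no counter is incremented. The command then throws TransactionCanceledException. That exception is also raised for other causes, such as throttling or a conflict with a concurrent transaction, so the handler treats it as a duplicate only when one of the cancellation reasons is ConditionalCheckFailed; any other cause should be handled as a real failure so the message is retried:

const cancel = e as { name?: string; CancellationReasons?: { Code?: string }[] };
if (
  cancel.name === "TransactionCanceledException" &&
  cancel.CancellationReasons?.some((r) => r.Code === "ConditionalCheckFailed")
) {
  console.log("duplicate, skip", { id });
  continue;
}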

The marker carries a ttl attribute set 24 hours out, stored as a Unix epoch timestamp in seconds, and the table has Time To Live enabled on that attribute, so DynamoDB deletes expired markers itself and the table doesn't grow forever. TTL deletion runs in the background and is not instantaneous, so a marker can outlive its ttl for a while; that only widens the deduplication window. This is the same idea that Powertools for AWS Lambda packages in its Idempotency utility; here we do it directly with a transaction to see the mechanism clearly, while Powertools offers a shorter way to write it if you don't want to manage the marker yourself.
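The transaction above uses a ttl variable without showing how it is computed. A minimal sketch of one way to build the marker follows; buildClickMarker and MARKER_LIFETIME_SECONDS are hypothetical names, and the injectable clock is there only to make the example reproducible.

```typescript
// Hypothetical helper: builds the idempotency marker item for one event.
// DynamoDB TTL expects a Number attribute holding Unix epoch time in seconds.
const MARKER_LIFETIME_SECONDS = 24 * 60 * 60; // 24 hours, as in the article

type ClickMarker = { PK: string; SK: "CLICK"; ttl: number };

function buildClickMarker(eventId: string, nowMs: number = Date.now()): ClickMarker {
  const nowSeconds = Math.floor(nowMs / 1000); // Date.now() is in milliseconds
  return {
    PK: `CLICK#${eventId}`,
    SK: "CLICK",
    ttl: nowSeconds + MARKER_LIFETIME_SECONDS,
  };
}

// Fixed clock for a reproducible example: 2026-05-26T10:00:00Z
const marker = buildClickMarker("fixed-test-id-xyz", Date.UTC(2026, 4, 26, 10, 0, 0));
```

Passing milliseconds straight from Date.now() into ttl is an easy mistake to make; DynamoDB would read such a value as a date far in the future and the marker would never expire.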

Reporting failures per message

When Lambda receives a batch of several messages from SQS, if one message fails and you report the whole batch as failed, the good messages also get retried, producing duplicate processing. The correct approach is to report each failed message. The docs describe this mechanism: the function "catches failures for individual messages and returns their identifiers in a batchItemFailures response, signaling Lambda to retry only the failed messages rather than the entire batch."

const batchItemFailures: SQSBatchItemFailure[] = [];
for (const record of event.Records) {
  try {
    // ...process record...
  } catch (err) {
    batchItemFailures.push({ itemIdentifier: record.messageId });
  }
}
return { batchItemFailures };

To enable this mechanism, the SQS event source on the function declares FunctionResponseTypes: [ReportBatchItemFailures]. A message reported as failed stays on the queue and is retried once its visibility timeout expires; the other messages in the batch are considered done and are deleted.
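The loop can be tried out without AWS by injecting the per-record work. The sketch below is an illustration, not the article's handler: processBatch, the local SqsRecord type, and stubProcessor are hypothetical stand-ins (the real event types come from the aws-lambda type package).

```typescript
// Local stand-ins for the SQS event types; only the fields used are modeled.
type SqsRecord = { messageId: string; body: string };
type BatchItemFailure = { itemIdentifier: string };
type BatchResponse = { batchItemFailures: BatchItemFailure[] };

// Hypothetical helper: runs processOne on every record and collects the
// messageId of each record whose processing threw.
async function processBatch(
  records: SqsRecord[],
  processOne: (record: SqsRecord) => Promise<void>,
): Promise<BatchResponse> {
  const batchItemFailures: BatchItemFailure[] = [];
  for (const record of records) {
    try {
      await processOne(record);
    } catch {
      // Only this message is retried; the others count as done.
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures };
}

// Example: the second record has no "code", so the stub processor rejects it.
const sampleRecords: SqsRecord[] = [
  { messageId: "m-1", body: '{"detail":{"code":"7knfT08"}}' },
  { messageId: "m-2", body: '{"detail":{}}' },
  { messageId: "m-3", body: '{"detail":{"code":"7knfT08"}}' },
];

const stubProcessor = async (record: SqsRecord): Promise<void> => {
  const parsed = JSON.parse(record.body) as { detail?: { code?: string } };
  if (!parsed.detail?.code) throw new Error("missing code");
};

const batchResult = processBatch(sampleRecords, stubProcessor);
```

Here only m-2 ends up in batchItemFailures. An empty list tells Lambda that the whole batch succeeded.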

Real test: counting and per-day analysis

Create a link and open it three times via the API. The events go EventBridge to SQS to the aggregator, and the counter rises:

$ for i in 1 2 3; do curl -s -o /dev/null "$API/$CODE"; done
$ aws dynamodb get-item --table-name url-shortener \
    --key '{"PK":{"S":"LINK#7knfT08"},"SK":{"S":"META"}}' --query 'Item.clicks.N'
"3"

Because the aggregator also increments STAT#<day>, per-day numbers are ready too. The table below is a snapshot of the link's item collection taken after the idempotency test in the next section, which sends one more event dated 2026-05-26, so the total counter is already 4 and there is an extra STAT#2026-05-26 row:

$ aws dynamodb query --table-name url-shortener \
    --key-condition-expression "PK = :pk" \
    --expression-attribute-values '{":pk":{"S":"LINK#7knfT08"}}' \
    --query 'Items[].{SK:SK.S,clicks:clicks.N,count:count.N}' --output table
+------------------+----------+--------+
|        SK        | clicks   | count  |
+------------------+----------+--------+
|  META            |  4       |  None  |
|  STAT#2026-05-25 |  None    |  3     |
|  STAT#2026-05-26 |  None    |  1     |
+------------------+----------+--------+

The day counter uses the UTC date part of the click time as the sort key, so a later dashboard that charts clicks over time only needs to query the link's item collection for sort keys beginning with STAT#.
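Deriving that sort key from the click timestamp might look like the sketch below. statSortKey is a hypothetical helper name, and the input is assumed to be an ISO 8601 string like the at field in the test messages.

```typescript
// Hypothetical helper: derives the per-day sort key from a click timestamp.
function statSortKey(at: string): string {
  const clickTime = new Date(at);
  if (Number.isNaN(clickTime.getTime())) {
    throw new Error(`invalid click time: ${at}`);
  }
  // toISOString() is always UTC, so its first 10 characters are the UTC date.
  return `STAT#${clickTime.toISOString().slice(0, 10)}`;
}

const sameDay = statSortKey("2026-05-26T10:00:00Z");
// A click at 01:00 in a +03:00 zone still belongs to the previous day in UTC.
const previousUtcDay = statSortKey("2026-05-26T01:00:00+03:00");
```

Bucketing by UTC keeps the key independent of where the Lambda runs or where the visitor is; a dashboard that wants local-time days would have to convert when it reads.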

Real test: idempotency

To force the "same event arrives twice" situation, send a message with a fixed id straight into the queue, twice:

BODY='{"id":"fixed-test-id-xyz","detail":{"code":"7knfT08","at":"2026-05-26T10:00:00Z"}}'
aws sqs send-message --queue-url "$QURL" --message-body "$BODY"
aws sqs send-message --queue-url "$QURL" --message-body "$BODY"

The total counter rises by exactly one, from 3 to 4, even though we sent it twice. And in the table above, STAT#2026-05-26 is 1, not 2. The second send is blocked by the CLICK#fixed-test-id-xyz marker right inside the transaction, so it doesn't increment. Double-counting is eliminated at the root.
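To see why the second send has no effect, here is an in-memory model of the transaction's all-or-nothing behavior. It is not DynamoDB code: the Store type and applyClick are hypothetical, and the model only mimics "apply every write, or none if the marker already exists".

```typescript
// In-memory stand-in for the three-item transaction. Not DynamoDB code:
// it only mimics "apply everything, or nothing if the marker exists".
type Store = {
  markers: Set<string>;          // CLICK#<id> items
  counters: Map<string, number>; // "<PK>|<SK>" -> counter value
};

function applyClick(
  store: Store,
  id: string,
  code: string,
  date: string,
): "counted" | "duplicate" {
  const markerKey = `CLICK#${id}`;
  // Mirrors attribute_not_exists(PK): a present marker cancels the whole write.
  if (store.markers.has(markerKey)) return "duplicate";

  for (const key of [`LINK#${code}|META`, `LINK#${code}|STAT#${date}`]) {
    store.counters.set(key, (store.counters.get(key) ?? 0) + 1);
  }
  store.markers.add(markerKey);
  return "counted";
}

const store: Store = {
  markers: new Set<string>(),
  counters: new Map<string, number>(),
};
const first = applyClick(store, "fixed-test-id-xyz", "7knfT08", "2026-05-26");
const second = applyClick(store, "fixed-test-id-xyz", "7knfT08", "2026-05-26");
const total = store.counters.get("LINK#7knfT08|META");
```

The first call counts and the second is rejected as a duplicate, so both counters stay at one increment for this event id.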

Real test: dead-letter queue

Send a malformed event, missing code, to make the handler throw:

aws sqs send-message --queue-url "$QURL" \
  --message-body '{"id":"bad-event-1","detail":{"at":"2026-05-25T10:00:00Z"}}'

Polling the DLQ shows the failed message arriving after about 90 seconds, which matches three attempts, each followed by the 30-second visibility timeout:

  t=80s: DLQ has 0 messages
  t=90s: DLQ has 1 message
$ aws sqs receive-message --queue-url "$DLQURL" --query 'Messages[0].Body'
{"id":"bad-event-1","detail":{"at":"2026-05-25T10:00:00Z"}}

The malformed event isn't stuck forever on the main queue, nor does it vanish. After three failed attempts, SQS moves it intact to the DLQ, where we can read it back to find the cause, fix it, then re-push if needed. And the good events keep being processed normally throughout, because failures are reported per message.
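The throw that sends the malformed message down this path could come from a validation step like the one sketched here. parseClickEvent and the ClickEvent type are hypothetical; the field names are taken from the test messages above, and the real handler may validate differently.

```typescript
// Hypothetical shape of the message body, based on the test messages above.
type ClickEvent = { id: string; code: string; at: string };

// Parses an SQS message body and throws if a required field is missing,
// so the caller can report the message as failed.
function parseClickEvent(body: string): ClickEvent {
  const raw = JSON.parse(body) as {
    id?: unknown;
    detail?: { code?: unknown; at?: unknown };
  };
  const id = raw.id;
  const code = raw.detail?.code;
  const at = raw.detail?.at;
  if (typeof id !== "string" || id === "") throw new Error("missing id");
  if (typeof code !== "string" || code === "") throw new Error("missing detail.code");
  if (typeof at !== "string" || at === "") throw new Error("missing detail.at");
  return { id, code, at };
}

const good = parseClickEvent(
  '{"id":"fixed-test-id-xyz","detail":{"code":"7knfT08","at":"2026-05-26T10:00:00Z"}}',
);

let malformedError = "";
try {
  parseClickEvent('{"id":"bad-event-1","detail":{"at":"2026-05-25T10:00:00Z"}}');
} catch (err) {
  malformedError = (err as Error).message; // "missing detail.code"
}
```

Failing early, before any DynamoDB call, means a malformed message never touches the counters and never writes a marker, so fixing and re-sending it later counts normally.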

🧹 Cleanup

# delete the demo link + its STAT items
for sk in META STAT#2026-05-25 STAT#2026-05-26; do
  aws dynamodb delete-item --table-name url-shortener \
    --key "{\"PK\":{\"S\":\"LINK#$CODE\"},\"SK\":{\"S\":\"$sk\"}}"
done
aws cognito-idp admin-delete-user --user-pool-id "$POOL" --username agg@example.com
aws sqs purge-queue --queue-url "$DLQURL"

The remaining CLICK#<id> markers expire on their own via TTL. Keep the stack for the next article.

Wrap-up

The consumer is now a real aggregator. Inserting SQS in the middle gives us batching, retry, and a DLQ. Counting with a transaction ties the counter increment to an idempotency marker, so an event delivered twice is counted once. Reporting failures per message means only the broken one is retried, and after a set number of attempts the failing event lands in the DLQ instead of vanishing. The total counter and the per-day counter both sit in the link's item collection, ready for a dashboard.

Click data is now accurate and per-day, but the dashboard still has to ask the server again to learn the new numbers. The next article does the realtime part: using API Gateway's WebSocket API so that when a click is counted, the server pushes the new number down to a browser that has the dashboard open, with no page reload.
