CloudWatch alarms, dashboard, SLOs for a serverless API

The previous article gave the system logs, traces, and metrics. But observability data only has value if someone looks at it, and nobody sits watching a dashboard 24/7. You need to turn those indicators into automatic alerts: when something crosses a threshold, the system raises an alert on its own, instead of waiting for users to complain. This article defines those thresholds via SLOs, builds the alarms, and pushes one into a real alarm state.

Goal

Define a few SLOs for the API, build CloudWatch alarms wired to SNS that alert when errors or throttles cross a threshold, gather the metrics into a dashboard, and run a Logs Insights query to compute an indicator from logs. Then push a real alarm into ALARM. Alarm and dashboard costs at this scale are negligible.

SLO and SLI: setting the threshold for "good"

Before alerting, you have to define what a healthy system is. An SLI (Service Level Indicator) is a measure of service quality, for example the success rate of requests, or p99 latency. An SLO (Service Level Objective) is the target set for that SLI, for example "99.9% of link opens succeed" or "p99 resolve latency under 200 ms". The SLO gives you an objective boundary: when the SLI misses the target, action is needed, and that boundary is where the alarm threshold goes.
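To make a target like 99.9% concrete, it can help to work out how much failure it actually tolerates. The sketch below is illustrative arithmetic only, not part of the stack; the function names, the 30-day window, and the one-million-request volume are assumptions chosen for the example.

```shell
# Minutes of total failure that an availability SLO tolerates in a window.
# Usage: error_budget_minutes SLO_PERCENT WINDOW_DAYS
error_budget_minutes() {
  awk -v slo="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (100 - slo) / 100 * days * 24 * 60 }'
}

# Number of requests allowed to fail out of a given total.
# Usage: allowed_failures SLO_PERCENT TOTAL_REQUESTS
allowed_failures() {
  awk -v slo="$1" -v total="$2" \
    'BEGIN { printf "%.0f\n", (100 - slo) / 100 * total }'
}

error_budget_minutes 99.9 30     # prints 43.2
allowed_failures 99.9 1000000    # prints 1000
```

Read this way, "99.9% of link opens succeed" leaves room for roughly one failed open in a thousand, which is the margin the alarms are there to protect.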

For the URL shortener, two reasonable SLOs for the hot path are: reliability (resolve doesn't error, isn't throttled) and latency (p99 within an acceptable range). We build alarms reflecting them.

Alarms wired to SNS

An alarm watches a metric and compares it against a threshold over a window. When the metric crosses the threshold, the alarm switches to the ALARM state and fires an action. Here the action publishes to an SNS topic (where email, Slack, or PagerDuty get wired in later):

AlarmTopic:
  Type: AWS::SNS::Topic

ResolveThrottleAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: url-shortener-resolve-throttles  # the name queried with describe-alarms below
    Namespace: AWS/Lambda
    MetricName: Throttles
    Dimensions:
      - { Name: FunctionName, Value: !Ref ResolveLinkFunction }
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    TreatMissingData: notBreaching
    AlarmActions: [!Ref AlarmTopic]

The TreatMissingData: notBreaching setting matters. When there are no requests, the throttle metric has no data, and we want the alarm to treat that as normal rather than alerting. A second alarm watches the resolver's Errors the same way. Each alarm is an SLO encoded into a concrete threshold.
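The effect of TreatMissingData is easier to see in a toy model. The function below is a simplified sketch of how a one-period alarm with a "greater than or equal" comparison could be evaluated. It is not CloudWatch's actual implementation, the name evaluate_period is hypothetical, and it ignores multi-period (M of N) evaluation.

```shell
# Simplified model of one evaluation period (EvaluationPeriods: 1,
# ComparisonOperator: GreaterThanOrEqualToThreshold).
# Usage: evaluate_period DATAPOINT THRESHOLD TREAT_MISSING CURRENT_STATE
# Pass an empty string as DATAPOINT when the period has no data.
evaluate_period() {
  datapoint="$1"; threshold="$2"; treat_missing="$3"; current="$4"
  if [ -z "$datapoint" ]; then
    case "$treat_missing" in
      notBreaching) echo OK ;;                 # no data counts as healthy
      breaching)    echo ALARM ;;              # no data counts as a breach
      ignore)       echo "$current" ;;         # keep whatever state we had
      *)            echo INSUFFICIENT_DATA ;;  # "missing", the default
    esac
  elif [ "$datapoint" -ge "$threshold" ]; then
    echo ALARM
  else
    echo OK
  fi
}

evaluate_period 23 1 notBreaching OK   # prints ALARM
evaluate_period "" 1 notBreaching OK   # prints OK
evaluate_period "" 1 missing OK        # prints INSUFFICIENT_DATA
```

With the default setting, a quiet period with no traffic would leave the alarm in INSUFFICIENT_DATA; notBreaching keeps it in OK instead.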

Dashboard

A dashboard gathers metrics in one place for a quick look. It is declared as JSON in the template, with widgets for invocations, errors, and throttles, for p50/p99 latency, and for the custom LinkResolved metric from the previous article:

Dashboard:
  Type: AWS::CloudWatch::Dashboard
  Properties:
    DashboardName: url-shortener
    DashboardBody: !Sub |
      { "widgets": [
        { "type":"metric", ..., "metrics":[
          ["AWS/Lambda","Invocations","FunctionName","${ResolveLinkFunction}",{"stat":"Sum"}],
          ["AWS/Lambda","Errors",...], ["AWS/Lambda","Throttles",...] ]},
        { ..., "metrics":[["AWS/Lambda","Duration",...,{"stat":"p99"}]] },
        { ..., "metrics":[["UrlShortener","LinkResolved",{"stat":"Sum"}]] }
      ]}

Putting the dashboard in the template means it is versioned with the code, rather than built by hand in the console, where nobody remembers who changed what.

Triggering a real alarm

To see the alarm work rather than just exist, we create the condition for it. This account has a Lambda concurrency limit of 10 (seen in Article 06), so firing 60 link-open requests nearly simultaneously will cause throttles. After firing, watch the alarm state:

waiting for the throttle alarm to go ALARM...
  t=15s: INSUFFICIENT_DATA
  t=30s: ALARM

$ aws cloudwatch describe-alarms --alarm-names url-shortener-resolve-throttles \
    --query 'MetricAlarms[0].{State:StateValue,Reason:StateReason}'
Threshold Crossed: 1 datapoint [23.0 (25/05/26 17:31:00)] was greater than or
equal to the threshold (1.0).    ALARM

The alarm switches to ALARM within about half a minute, with the reason spelled out: one datapoint of 23 throttles crossed the threshold of 1. At the same time it sends a notification to the SNS topic. When load drops and throttles stop, the alarm returns to OK on its own. This is the closed loop: a real incident (exceeding concurrency) produces a metric, the metric crosses the SLO threshold, the alarm changes state and notifies SNS.
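The burst and the wait loop above might be scripted roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, API_URL and SHORT_CODE are placeholders for your own endpoint and an existing link, and xargs -P (parallel execution) is assumed to be available, as it is in GNU and BSD xargs.

```shell
# Fire COUNT requests at URL in parallel and print one HTTP status per line.
# Usage: fire_requests URL COUNT
fire_requests() {
  url="$1"; count="$2"
  seq "$count" | xargs -P "$count" -I{} \
    curl -s -o /dev/null -w '%{http_code}\n' "$url"
}

# Summarize status codes read from stdin as CODE=COUNT lines.
tally_codes() {
  sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Poll an alarm until it reaches TARGET_STATE or MAX_TRIES is exhausted.
# Usage: wait_for_alarm_state NAME TARGET_STATE MAX_TRIES INTERVAL_SECONDS
wait_for_alarm_state() {
  name="$1"; target="$2"; tries="$3"; interval="$4"
  i=1
  while [ "$i" -le "$tries" ]; do
    state="$(aws cloudwatch describe-alarms --alarm-names "$name" \
      --query 'MetricAlarms[0].StateValue' --output text)"
    echo "  try $i: $state"
    [ "$state" = "$target" ] && return 0
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}

# Example (placeholders, not run here):
#   fire_requests "$API_URL/$SHORT_CODE" 60 | tally_codes
#   wait_for_alarm_state url-shortener-resolve-throttles ALARM 12 15
```

Tallying the status codes gives a quick client-side view of how many requests were rejected, which can then be compared with the throttle count reported in the alarm's reason.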

Logs Insights: computing indicators from logs

Not every SLI exists as a ready-made metric. With the structured logs from the previous article, Logs Insights lets you query and compute directly over the logs. For example, count resolve logs by level:

$ aws logs start-query --log-group-name "/aws/lambda/...ResolveLinkFunction..." \
    --query-string 'fields @message | filter ispresent(level) | stats count(*) as n by level'
...
level   INFO
n       8

Eight INFO lines appear in the query window. Only requests that actually reached the handler produce a log; the throttled ones never get there, so they don't appear. The count is much smaller than the total number of requests fired, exactly as expected when most are rejected. Because the logs are structured, you can filter by level or by code, correlate entries through xray_trace_id, and compute indicators that built-in metrics don't have. Logs Insights is how you answer specific questions about system behavior without adding a new metric.
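A Logs Insights query runs asynchronously: start-query returns a query ID, and the results are fetched separately with get-query-results once the query finishes. start-query also needs an explicit time window (--start-time and --end-time, in epoch seconds). A minimal sketch of the full round trip, assuming a hypothetical helper name and a window expressed in minutes:

```shell
# Run a Logs Insights query over the last WINDOW_MINUTES and print the results.
# Usage: run_insights_query LOG_GROUP QUERY_STRING WINDOW_MINUTES
run_insights_query() {
  log_group="$1"; query="$2"; minutes="$3"
  end="$(date +%s)"
  start=$((end - minutes * 60))

  # Start the query and keep only its ID.
  query_id="$(aws logs start-query --log-group-name "$log_group" \
    --start-time "$start" --end-time "$end" \
    --query-string "$query" --query 'queryId' --output text)"

  # Poll until the query leaves the Scheduled/Running states.
  while :; do
    status="$(aws logs get-query-results --query-id "$query_id" \
      --query 'status' --output text)"
    case "$status" in
      Scheduled|Running) sleep 1 ;;
      *) break ;;
    esac
  done

  # Print the final result set (or the failure status) as JSON.
  aws logs get-query-results --query-id "$query_id" --output json
}

# Example (LOG_GROUP is a placeholder for the resolver's log group):
#   run_insights_query "$LOG_GROUP" \
#     'fields @message | filter ispresent(level) | stats count(*) as n by level' 15
```

The window size is a judgment call: it only needs to be wide enough to cover the burst of requests fired in the test.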

🧹 Cleanup

aws dynamodb scan --table-name url-shortener --query 'Items[].{PK:PK.S,SK:SK.S}' --output text | \
  while read -r pk sk; do aws dynamodb delete-item --table-name url-shortener \
    --key "{\"PK\":{\"S\":\"$pk\"},\"SK\":{\"S\":\"$sk\"}}"; done
aws cognito-idp admin-delete-user --user-pool-id "$POOL" --username cw@example.com

The alarm returns to OK on its own when throttles stop. Keep the stack for the next article.

Wrap-up

The system now alerts itself when there's a problem. SLOs define what healthy means, alarms encode those thresholds and fire to SNS when crossed, the dashboard gathers metrics for a quick look, and Logs Insights computes specific indicators from structured logs. The test showed a real alarm switching to ALARM when resolve was throttled, completing the loop from incident to alert.

That throttle alarm just reiterated a pain point: concurrency limits and cold start. The next article goes back to exactly that topic to optimize for real. We'll measure resolve's cold start, then compare ways to reduce it: SnapStart (now supporting Python, Java, and .NET), provisioned concurrency, and cutting package size, to see what each one trades off.
