CloudWatch Metric Guide
Every important CloudWatch metric explained — what it measures, recommended alarm thresholds, common failures, and how to debug it when it fires. Covers 20 metrics across 6 AWS services.
20 metrics covered · 6 AWS services · 9 verification checks
Amazon RDS
4 metrics covered
DatabaseConnections
DatabaseConnections counts the number of client network connections open to the RDS instance at the time of sampling. It reflects both active and idle connections held by your application's connection pool.
Threshold
≥ 80% of your RDS instance's max_connections parameter value
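This threshold can be turned into keyword arguments for boto3's `put_metric_alarm`, as in the minimal sketch below. It assumes you already know the instance's effective `max_connections`, which the database engine reports. That value changes with instance size when the parameter group uses the default memory-based formula. The helper name `db_connections_alarm`, the instance name `orders-db`, the 1-minute period and the 5 evaluation periods are illustrative assumptions, not values prescribed by this guide.

```python
def db_connections_alarm(instance_id: str, max_connections: int, percent: int = 80) -> dict:
    """Build put_metric_alarm kwargs for DatabaseConnections >= percent% of max_connections.

    Hypothetical helper: the name, period, and evaluation periods are assumptions.
    """
    # Smallest whole connection count that is at least `percent`% of max_connections,
    # computed with integer ceiling division to avoid floating-point rounding.
    threshold = (max_connections * percent + 99) // 100
    return {
        "AlarmName": f"{instance_id}-database-connections-high",
        "Namespace": "AWS/RDS",
        "MetricName": "DatabaseConnections",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,  # assumed 1-minute granularity
        "EvaluationPeriods": 5,  # assumed: the breach must persist for 5 periods
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


params = db_connections_alarm("orders-db", max_connections=151)
print(params["Threshold"])  # 121.0
```

Passing the dict to `boto3.client("cloudwatch").put_metric_alarm(**params)` would create the alarm. Add `AlarmActions` with an SNS topic ARN if you want notifications.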
FreeableMemory
FreeableMemory reports the amount of available random access memory on the RDS instance, in bytes. It includes memory in the OS free pool plus reclaimable cached and buffered memory.
Threshold
< 25% of total instance RAM (or a specific byte floor for your instance class)
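FreeableMemory is reported in bytes, so a percentage threshold has to be converted before it goes into an alarm. The small sketch below assumes you look up the instance class's total RAM in GiB yourself. The function name is hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB; FreeableMemory is reported in bytes


def freeable_memory_threshold_bytes(total_ram_gib: int, percent: int = 25) -> int:
    """Byte value below which FreeableMemory should alarm (hypothetical helper)."""
    return total_ram_gib * GIB * percent // 100


# An instance class with 16 GiB of RAM alarms below 4 GiB of freeable memory.
threshold = freeable_memory_threshold_bytes(16)
print(threshold)  # 4294967296
```

Use the result with `ComparisonOperator="LessThanThreshold"` on the `AWS/RDS` namespace, with the `DBInstanceIdentifier` dimension.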
CPUUtilization
CPUUtilization measures the percentage of CPU capacity consumed by the RDS instance, aggregated across all of its vCPUs. As a result, a single saturated vCPU on a multi-vCPU instance can hide behind a moderate average.
Threshold
> 80% sustained for 5 minutes or more
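"Sustained for 5 minutes" maps onto two alarm settings: the period length and the number of consecutive periods that must breach. The minimal sketch below assumes 1-minute RDS datapoints, and the helper names are hypothetical. Setting `DatapointsToAlarm` equal to `EvaluationPeriods` is one reasonable choice; an M-of-N alarm would tolerate brief dips.

```python
def evaluation_periods_for(sustained_minutes: int, period_seconds: int) -> int:
    """Number of consecutive periods that cover the sustained window (hypothetical helper)."""
    window = sustained_minutes * 60
    if period_seconds <= 0 or window % period_seconds:
        raise ValueError("sustained window must be a whole number of periods")
    return window // period_seconds


def rds_cpu_alarm(instance_id: str, threshold_percent: float = 80.0,
                  sustained_minutes: int = 5, period_seconds: int = 60) -> dict:
    """put_metric_alarm kwargs for RDS CPUUtilization above a threshold for a sustained window."""
    periods = evaluation_periods_for(sustained_minutes, period_seconds)
    return {
        "AlarmName": f"{instance_id}-cpu-high",
        "Namespace": "AWS/RDS",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period_seconds,
        "EvaluationPeriods": periods,
        "DatapointsToAlarm": periods,  # every datapoint in the window must breach
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }


print(rds_cpu_alarm("orders-db")["EvaluationPeriods"])  # 5
```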
FreeStorageSpace
FreeStorageSpace reports the available storage capacity on the RDS instance's EBS volume, in bytes. When this reaches zero, the database stops accepting writes.
Threshold
< 20% of total allocated storage, or < 5 GB absolute floor (whichever is larger)
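The "whichever is larger" rule is easy to get backwards, so the sketch below spells it out in bytes, which is FreeStorageSpace's unit. It assumes allocated storage is given in GiB, as RDS reports it, and treats the 5 GB floor as 5 GiB. The function name is hypothetical.

```python
GIB = 1024 ** 3  # FreeStorageSpace is reported in bytes


def free_storage_threshold_bytes(allocated_gib: int, percent: int = 20, floor_gib: int = 5) -> int:
    """Alarm when FreeStorageSpace drops below the larger of percent% of storage or the floor."""
    percent_bytes = allocated_gib * GIB * percent // 100
    floor_bytes = floor_gib * GIB  # assumption: the 5 GB floor is interpreted as 5 GiB
    return max(percent_bytes, floor_bytes)


print(free_storage_threshold_bytes(20) // GIB)   # 5: 20% of 20 GiB is 4 GiB, so the floor wins
print(free_storage_threshold_bytes(100) // GIB)  # 20: 20% of 100 GiB beats the 5 GiB floor
```

Use the result with `LessThanThreshold`. If storage autoscaling is enabled, the allocated size grows over time, so the threshold should be recomputed after each increase.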
AWS Lambda
4 metrics covered
Errors
Errors counts the number of Lambda invocations that resulted in a function error — including exceptions thrown by the function code and runtime errors (timeout, out-of-memory, handler not found).
Threshold
> 0 in any 5-minute window for critical functions; > N errors/minute for high-volume functions (set N based on your acceptable error rate)
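For high-volume functions, an alternative to picking a fixed N is to alarm on the error rate directly. CloudWatch metric math can divide `Errors` by `Invocations`. The sketch below shows that alternative; it is not a method this guide prescribes. The helper name, the 5-minute period and treating missing data as not breaching are all assumptions.

```python
def lambda_error_rate_alarm(function_name: str, max_error_rate_percent: float,
                            period_seconds: int = 300) -> dict:
    """put_metric_alarm kwargs alarming when Errors / Invocations exceeds a percentage."""

    def stat(metric_id: str, metric_name: str) -> dict:
        # One input series; ReturnData=False because only the expression is alarmed on.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
                },
                "Period": period_seconds,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{function_name}-error-rate",
        "Metrics": [
            stat("errors", "Errors"),
            stat("invocations", "Invocations"),
            {
                "Id": "error_rate",
                "Expression": "100 * errors / invocations",
                "Label": "Error rate (%)",
                "ReturnData": True,  # the alarm evaluates this series
            },
        ],
        "EvaluationPeriods": 1,
        "Threshold": max_error_rate_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }
```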
Duration
Duration measures the elapsed wall-clock time, in milliseconds, from when the Lambda function handler begins executing to when it returns or times out. CloudWatch publishes minimum, maximum, average, and percentile statistics.
Threshold
p99 Duration > 80% of the function's configured timeout
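One unit trap applies here: the function timeout is configured in seconds, while Duration is reported in milliseconds. The minimal sketch below does the conversion and uses the `p99` extended statistic. The helper name, the example function name and the period are assumptions.

```python
def lambda_duration_alarm(function_name: str, timeout_seconds: int,
                          percent: int = 80, period_seconds: int = 300) -> dict:
    """put_metric_alarm kwargs for p99 Duration above percent% of the timeout (hypothetical helper)."""
    threshold_ms = timeout_seconds * 1000 * percent / 100  # Duration is in milliseconds
    return {
        "AlarmName": f"{function_name}-duration-near-timeout",
        "Namespace": "AWS/Lambda",
        "MetricName": "Duration",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "ExtendedStatistic": "p99",  # percentile statistics go in ExtendedStatistic, not Statistic
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
    }


print(lambda_duration_alarm("resize-images", timeout_seconds=30)["Threshold"])  # 24000.0
```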
Throttles
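Throttles counts invocation requests that Lambda rejected because no concurrency was available, either because the function's reserved concurrency or the account's Regional limit was reached. A throttled invocation is not executed. For synchronous invocations, the caller receives an HTTP 429 TooManyRequestsException; for asynchronous invocations, Lambda returns the event to its queue and retries it. Throttled requests are not counted in Invocations or Errors.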
Threshold
> 0 for critical functions; for high-volume functions, a low absolute count threshold based on your acceptable drop rate
ConcurrentExecutions
ConcurrentExecutions reports the number of Lambda function instances actively processing events at any given time, across the entire account or per function when filtered by function name.
Threshold
> 800 concurrent executions at the account level (80% of the default Regional quota of 1,000 concurrent executions)
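At the account level, ConcurrentExecutions is queried without any dimensions. The sketch below derives the threshold from your Regional quota. The defaults (1,000 and 80%) follow the threshold above; the helper name, period and evaluation periods are assumptions. If your quota has been raised, boto3's `lambda` client can report it through `get_account_settings()`, under `AccountLimit["ConcurrentExecutions"]`.

```python
def account_concurrency_alarm(region_quota: int = 1000, percent: int = 80) -> dict:
    """put_metric_alarm kwargs for account-wide ConcurrentExecutions in one Region (hypothetical helper)."""
    return {
        "AlarmName": "lambda-account-concurrency-high",
        "Namespace": "AWS/Lambda",
        "MetricName": "ConcurrentExecutions",
        # No Dimensions key: the undimensioned metric is the account-wide aggregate.
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": float(region_quota * percent // 100),
        "ComparisonOperator": "GreaterThanThreshold",
    }


print(account_concurrency_alarm()["Threshold"])  # 800.0
```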
Amazon ECS
3 metrics covered
CPUUtilization
CPUUtilization for ECS measures the percentage of the CPU units reserved by a service's tasks that is actually in use. It equals the CPU units used by all running tasks divided by the total CPU units reserved across those tasks.
Threshold
> 85% sustained for 3 minutes
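The service-level figure is a ratio of totals. The CPU units used by all running tasks are divided by the CPU units the task definition reserves per task, multiplied by the task count. The sketch below uses a hypothetical function and illustrative numbers. It also shows how the 85% service average can hide a single saturated task.

```python
from typing import Sequence


def ecs_service_cpu_utilization(cpu_units_used: Sequence[float], task_cpu_units: int) -> float:
    """Service CPUUtilization (%) from per-task CPU usage and the task definition's CPU reservation."""
    if not cpu_units_used:
        raise ValueError("service has no running tasks")
    reserved = task_cpu_units * len(cpu_units_used)  # total CPU units reserved across running tasks
    return 100.0 * sum(cpu_units_used) / reserved


# Two tasks, each reserving 512 CPU units (0.5 vCPU).
print(ecs_service_cpu_utilization([460, 480], 512))  # 91.796875: above 85%, the alarm fires
print(ecs_service_cpu_utilization([500, 100], 512))  # 58.59375: one task near 98%, yet the average looks fine
```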
MemoryUtilization
MemoryUtilization for ECS measures the percentage of memory reserved by the tasks in a service that is in use, averaged across the running tasks.
Threshold
> 85% averaged across the service
RunningTaskCount
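RunningTaskCount reports the number of tasks in the RUNNING state for an ECS service. It is a Container Insights metric (namespace ECS/ContainerInsights), so Container Insights must be enabled on the cluster. Tasks in any other lifecycle state, such as PROVISIONING, PENDING, STOPPING, or STOPPED, are not counted.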
Threshold
< desired task count for the service
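The desired count can change with autoscaling. One way to express "below desired" is therefore a metric-math alarm on the difference between the DesiredTaskCount and RunningTaskCount Container Insights metrics. The sketch below assumes Container Insights is enabled for the cluster. It also assumes a 5-minute grace period is enough to ride out normal deployments; the helper name and that value are assumptions.

```python
def ecs_task_shortfall_alarm(cluster: str, service: str, minutes: int = 5) -> dict:
    """put_metric_alarm kwargs alarming while running tasks stay below desired (hypothetical helper)."""
    dims = [
        {"Name": "ClusterName", "Value": cluster},
        {"Name": "ServiceName", "Value": service},
    ]

    def stat(metric_id: str, metric_name: str) -> dict:
        # Input series from Container Insights; not returned, only used by the expression.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "ECS/ContainerInsights",
                    "MetricName": metric_name,
                    "Dimensions": dims,
                },
                "Period": 60,
                "Stat": "Average",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{cluster}-{service}-running-below-desired",
        "Metrics": [
            stat("desired", "DesiredTaskCount"),
            stat("running", "RunningTaskCount"),
            {"Id": "shortfall", "Expression": "desired - running",
             "Label": "Missing tasks", "ReturnData": True},
        ],
        "EvaluationPeriods": minutes,  # 1-minute periods, so this is the grace window in minutes
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```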
Application Load Balancer
3 metrics covered
HTTPCode_ELB_5XX_Count
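HTTPCode_ELB_5XX_Count counts HTTP 5XX response codes generated by the load balancer itself, not by the registered targets. They indicate that the ALB could not get a valid response from a healthy target. Common examples are 502 (the target closed the connection or sent a malformed response), 503 (the target group has no registered targets), and 504 (the target did not accept the connection or respond in time).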
Threshold
> 0 in any 5-minute window
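The sketch below expresses this alarm as `put_metric_alarm` arguments, and covers two details that are easy to miss. First, the `LoadBalancer` dimension value is the tail of the load balancer ARN (`app/<name>/<id>`), not the full ARN. Second, ALB metrics are only reported when there is data, so missing datapoints are treated as not breaching. The helper names and the example ARN are hypothetical.

```python
def lb_dimension_from_arn(lb_arn: str) -> str:
    """Return the LoadBalancer dimension value (app/<name>/<id>) from an ALB ARN."""
    _, sep, tail = lb_arn.partition(":loadbalancer/")
    if not sep:
        raise ValueError(f"not a load balancer ARN: {lb_arn}")
    return tail


def elb_5xx_alarm(lb_arn: str) -> dict:
    """put_metric_alarm kwargs for any ALB-generated 5XX in a 5-minute window (hypothetical helper)."""
    return {
        "AlarmName": "alb-elb-5xx",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_ELB_5XX_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": lb_dimension_from_arn(lb_arn)}],
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no datapoint means no 5XX responses were reported
    }


arn = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/web/50dc6c495c0c9188"
print(lb_dimension_from_arn(arn))  # app/web/50dc6c495c0c9188
```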
TargetResponseTime
TargetResponseTime measures the time elapsed, in seconds, from when the ALB sent the request to a registered target until the target started sending a response. You can query it as Average or as any percentile statistic, such as p50, p90, p95, or p99.
Threshold
p99 TargetResponseTime > your defined SLA threshold (typically 1s for APIs, 3s for pages)
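TargetResponseTime is reported in seconds, so an SLA written in milliseconds has to be divided by 1,000. The minimal sketch below assumes you pass the `LoadBalancer` dimension value (`app/<name>/<id>`) directly. The helper name, period and evaluation periods are illustrative.

```python
def target_response_time_alarm(lb_dimension: str, sla_ms: float, percentile: str = "p99",
                               period_seconds: int = 60, periods: int = 5) -> dict:
    """put_metric_alarm kwargs for a latency percentile above an SLA (hypothetical helper)."""
    return {
        "AlarmName": f"alb-target-response-{percentile}",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "TargetResponseTime",
        "Dimensions": [{"Name": "LoadBalancer", "Value": lb_dimension}],
        "ExtendedStatistic": percentile,
        "Period": period_seconds,
        "EvaluationPeriods": periods,
        "Threshold": sla_ms / 1000.0,  # the metric is in seconds, not milliseconds
        "ComparisonOperator": "GreaterThanThreshold",
    }


print(target_response_time_alarm("app/web/50dc6c495c0c9188", sla_ms=1000)["Threshold"])  # 1.0
```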
UnHealthyHostCount
UnHealthyHostCount reports the number of targets (EC2 instances, ECS tasks, Lambda functions) registered with the ALB target group that are currently failing health checks.
Threshold
> 0 (any unhealthy host in a production target group)
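UnHealthyHostCount is reported with a `TargetGroup` dimension alongside `LoadBalancer`, so the alarm needs both. In the sketch below, the target group's dimension value is the ARN suffix `targetgroup/<name>/<id>`. The helper name and example ARNs are hypothetical.

```python
def unhealthy_host_alarm(lb_dimension: str, target_group_arn: str) -> dict:
    """put_metric_alarm kwargs for any unhealthy target in one target group (hypothetical helper)."""
    tg_dimension = target_group_arn.rsplit(":", 1)[-1]  # "targetgroup/<name>/<id>"
    return {
        "AlarmName": f"{tg_dimension.split('/')[1]}-unhealthy-hosts",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "UnHealthyHostCount",
        "Dimensions": [
            {"Name": "TargetGroup", "Value": tg_dimension},
            {"Name": "LoadBalancer", "Value": lb_dimension},
        ],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
    }


tg_arn = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web-tg/73e2d6bc24d8a067"
params = unhealthy_host_alarm("app/web/50dc6c495c0c9188", tg_arn)
print(params["Dimensions"][0]["Value"])  # targetgroup/web-tg/73e2d6bc24d8a067
```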
Amazon EC2
4 metrics covered
CPUUtilization
CPUUtilization measures the percentage of allocated EC2 compute units (vCPUs) that are in use on the instance, as reported by the hypervisor.
Threshold
> 80% for 15 consecutive minutes
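How many evaluation periods "15 consecutive minutes" means depends on the instance's monitoring mode. Basic monitoring publishes EC2 metrics in 5-minute periods; detailed monitoring publishes them every minute. The sketch below is written under that assumption, and the helper name is hypothetical.

```python
BASIC_PERIOD = 300     # basic monitoring: 5-minute datapoints
DETAILED_PERIOD = 60   # detailed monitoring: 1-minute datapoints


def ec2_cpu_alarm(instance_id: str, detailed_monitoring: bool,
                  sustained_minutes: int = 15, threshold: float = 80.0) -> dict:
    """put_metric_alarm kwargs for EC2 CPUUtilization above a threshold for a sustained window."""
    period = DETAILED_PERIOD if detailed_monitoring else BASIC_PERIOD
    if (sustained_minutes * 60) % period:
        raise ValueError("sustained window must be a whole number of periods")
    return {
        "AlarmName": f"{instance_id}-cpu-high",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period,
        "EvaluationPeriods": sustained_minutes * 60 // period,  # 3 with basic, 15 with detailed
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```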
NetworkIn
NetworkIn measures the number of bytes received by the instance on all network interfaces during the CloudWatch measurement period.
Threshold
Anomaly detection recommended over a fixed threshold: alert when NetworkIn exceeds 3 standard deviations above the historical baseline for the same time of day
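CloudWatch anomaly detection alarms compare a metric with a band produced by the `ANOMALY_DETECTION_BAND` metric-math function. Its second argument is the band width in standard deviations. The sketch below builds the `put_metric_alarm` arguments for the 3-standard-deviation rule, and the same helper works for NetworkOut by changing `metric_name`. The helper name, period and evaluation periods are assumptions. The model also needs some metric history before the band is meaningful.

```python
def network_anomaly_alarm(instance_id: str, metric_name: str = "NetworkIn", band_width: int = 3,
                          period_seconds: int = 300, periods: int = 2) -> dict:
    """put_metric_alarm kwargs alarming when an EC2 network metric rises above its anomaly band."""
    return {
        "AlarmName": f"{instance_id}-{metric_name}-anomaly",
        "Metrics": [
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/EC2",
                        "MetricName": metric_name,
                        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                    },
                    "Period": period_seconds,
                    "Stat": "Sum",  # bytes transferred per period
                },
                "ReturnData": True,
            },
            {
                "Id": "ad1",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": f"{metric_name} expected range",
                "ReturnData": True,
            },
        ],
        "ThresholdMetricId": "ad1",  # compare m1 with the band instead of a fixed Threshold
        "ComparisonOperator": "GreaterThanUpperThreshold",  # only spikes above the band alarm
        "EvaluationPeriods": periods,
    }
```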
NetworkOut
NetworkOut measures the number of bytes sent by the instance on all network interfaces during the CloudWatch measurement period.
Threshold
Anomaly detection recommended — alert when NetworkOut exceeds 3 standard deviations above baseline for the same time of day and day of week
StatusCheckFailed
StatusCheckFailed combines the results of both the instance status check (instance software and network configuration) and the system status check (underlying AWS host hardware). A value of 1 means at least one of these checks has failed.
Threshold
> 0 — alarm immediately
Amazon DynamoDB
2 metrics covered
ConsumedReadCapacityUnits
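ConsumedReadCapacityUnits reports the number of read capacity units consumed over the specified time period for a DynamoDB table or global secondary index. It is a total for the period, so divide the Sum statistic by the period length in seconds to get consumed RCU per second before comparing it with provisioned capacity.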
Threshold
> 80% of provisioned read capacity (for provisioned mode tables)
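ConsumedReadCapacityUnits is a sum over the period, while provisioned capacity is a per-second rate, so the two have to be put on the same footing before they are compared. The sketch below uses hypothetical function names and assumes the `Sum` statistic.

```python
def consumed_rcu_percent(consumed_sum: float, period_seconds: int, provisioned_rcu: float) -> float:
    """Percent of provisioned read capacity used, from the Sum of ConsumedReadCapacityUnits over one period."""
    per_second = consumed_sum / period_seconds  # convert the per-period sum into RCU per second
    return 100.0 * per_second / provisioned_rcu


def rcu_alarm_threshold(provisioned_rcu: float, period_seconds: int, percent: float = 80.0) -> float:
    """Threshold for an alarm on the Sum statistic: percent% of provisioned RCU over the whole period."""
    return provisioned_rcu * period_seconds * percent / 100


# 100 provisioned RCU, 5-minute period: a Sum of 24,000 means 80 RCU/s, i.e. 80% utilization.
print(consumed_rcu_percent(24000, 300, 100))  # 80.0
print(rcu_alarm_threshold(100, 300))          # 24000.0
```

If auto scaling changes the provisioned capacity, a fixed Sum threshold goes stale. One way to avoid that is a metric-math alarm that divides by the `ProvisionedReadCapacityUnits` metric.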
ThrottledRequests
ThrottledRequests counts requests to DynamoDB that were throttled because the request rate exceeded the provisioned throughput limits for the table or index.
Threshold
> 0 for provisioned capacity tables; > 0 as a cost and latency alert for on-demand tables
Audit your setup
Not sure which of these you’re missing?
The free Nuberio Audit scans your CloudWatch setup in 5 minutes and identifies missing alarms, noisy alarms, and unmonitored resources across all the services on this page. No credit card. Read-only access.