cloudwatch
💡 摘要
一个用于使用CloudWatch监控AWS资源的综合技能,涵盖日志、指标、警报和仪表板。
🎯 适合人群
🤖 AI 吐槽: “这本质上是一份组织良好的CloudWatch命令速查表,证明有时最好的技能就是知道如何阅读手册。”
该技能执行AWS CLI命令和boto3调用,需要具有广泛CloudWatch权限的IAM凭证。主要风险是如果代理环境被破坏,会导致凭证泄露或权限提升。缓解措施:使用临时的、范围受限的最小权限IAM策略,而非长期访问密钥。
name: cloudwatch description: AWS CloudWatch monitoring for logs, metrics, alarms, and dashboards. Use when setting up monitoring, creating alarms, querying logs with Insights, configuring metric filters, building dashboards, or troubleshooting application issues. last_updated: "2026-01-07" doc_source: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/
AWS CloudWatch
Amazon CloudWatch provides monitoring and observability for AWS resources and applications. It collects metrics, logs, and events, enabling you to monitor, troubleshoot, and optimize your AWS environment.
Table of Contents
Core Concepts
Metrics
Time-ordered data points published to CloudWatch. Key components:
- Namespace: Container for metrics (e.g.,
AWS/Lambda) - Metric name: Name of the measurement (e.g.,
Invocations) - Dimensions: Name-value pairs for filtering (e.g.,
FunctionName=MyFunc) - Statistics: Aggregations (Sum, Average, Min, Max, SampleCount, pN)
Logs
Log data from AWS services and applications:
- Log groups: Collections of log streams
- Log streams: Sequences of log events from same source
- Log events: Individual log entries with timestamp and message
Alarms
Automated actions based on metric thresholds:
- States: OK, ALARM, INSUFFICIENT_DATA
- Actions: SNS notifications, Auto Scaling, EC2 actions
Common Patterns
Create a Metric Alarm
AWS CLI:
# CPU utilization alarm for EC2 aws cloudwatch put-metric-alarm \ --alarm-name "HighCPU-i-1234567890abcdef0" \ --metric-name CPUUtilization \ --namespace AWS/EC2 \ --statistic Average \ --period 300 \ --threshold 80 \ --comparison-operator GreaterThanThreshold \ --evaluation-periods 2 \ --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \ --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts \ --ok-actions arn:aws:sns:us-east-1:123456789012:alerts
boto3:
import boto3 cloudwatch = boto3.client('cloudwatch') cloudwatch.put_metric_alarm( AlarmName='HighCPU-i-1234567890abcdef0', MetricName='CPUUtilization', Namespace='AWS/EC2', Statistic='Average', Period=300, Threshold=80.0, ComparisonOperator='GreaterThanThreshold', EvaluationPeriods=2, Dimensions=[ {'Name': 'InstanceId', 'Value': 'i-1234567890abcdef0'} ], AlarmActions=['arn:aws:sns:us-east-1:123456789012:alerts'], OKActions=['arn:aws:sns:us-east-1:123456789012:alerts'] )
Lambda Error Rate Alarm
aws cloudwatch put-metric-alarm \ --alarm-name "LambdaErrorRate-MyFunction" \ --metrics '[ { "Id": "errors", "MetricStat": { "Metric": { "Namespace": "AWS/Lambda", "MetricName": "Errors", "Dimensions": [{"Name": "FunctionName", "Value": "MyFunction"}] }, "Period": 60, "Stat": "Sum" }, "ReturnData": false }, { "Id": "invocations", "MetricStat": { "Metric": { "Namespace": "AWS/Lambda", "MetricName": "Invocations", "Dimensions": [{"Name": "FunctionName", "Value": "MyFunction"}] }, "Period": 60, "Stat": "Sum" }, "ReturnData": false }, { "Id": "errorRate", "Expression": "errors/invocations*100", "Label": "Error Rate", "ReturnData": true } ]' \ --threshold 5 \ --comparison-operator GreaterThanThreshold \ --evaluation-periods 3 \ --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
Query Logs with Insights
# Find errors in Lambda logs aws logs start-query \ --log-group-name /aws/lambda/MyFunction \ --start-time $(date -d '1 hour ago' +%s) \ --end-time $(date +%s) \ --query-string ' fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50 ' # Get query results aws logs get-query-results --query-id <query-id>
boto3:
import boto3 import time logs = boto3.client('logs') # Start query response = logs.start_query( logGroupName='/aws/lambda/MyFunction', startTime=int(time.time()) - 3600, endTime=int(time.time()), queryString=''' fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50 ''' ) query_id = response['queryId'] # Wait for results while True: result = logs.get_query_results(queryId=query_id) if result['status'] == 'Complete': break time.sleep(1) for row in result['results']: print(row)
Create Metric Filter
Extract metrics from log patterns:
# Create metric filter for error count aws logs put-metric-filter \ --log-group-name /aws/lambda/MyFunction \ --filter-name ErrorCount \ --filter-pattern "ERROR" \ --metric-transformations \ metricName=ErrorCount,metricNamespace=MyApp,metricValue=1,defaultValue=0
Publish Custom Metrics
import boto3 cloudwatch = boto3.client('cloudwatch') cloudwatch.put_metric_data( Namespace='MyApp', MetricData=[ { 'MetricName': 'OrdersProcessed', 'Value': 1, 'Unit': 'Count', 'Dimensions': [ {'Name': 'Environment', 'Value': 'Production'}, {'Name': 'OrderType', 'Value': 'Standard'} ] } ] )
Create Dashboard
cat > dashboard.json << 'EOF' { "widgets": [ { "type": "metric", "x": 0, "y": 0, "width": 12, "height": 6, "properties": { "title": "Lambda Invocations", "metrics": [ ["AWS/Lambda", "Invocations", "FunctionName", "MyFunction"] ], "period": 60, "stat": "Sum", "region": "us-east-1" } }, { "type": "log", "x": 12, "y": 0, "width": 12, "height": 6, "properties": { "title": "Recent Errors", "query": "SOURCE '/aws/lambda/MyFunction' | filter @message like /ERROR/ | limit 20", "region": "us-east-1" } } ] } EOF aws cloudwatch put-dashboard \ --dashboard-name MyAppDashboard \ --dashboard-body file://dashboard.json
CLI Reference
Metrics Commands
| Command | Description |
|---------|-------------|
| aws cloudwatch put-metric-data | Publish custom metrics |
| aws cloudwatch get-metric-data | Retrieve metric values |
| aws cloudwatch get-metric-statistics | Get aggregated statistics |
| aws cloudwatch list-metrics | List available metrics |
Alarms Commands
| Command | Description |
|---------|-------------|
| aws cloudwatch put-metric-alarm | Create or update alarm |
| aws cloudwatch describe-alarms | List alarms |
| aws cloudwatch set-alarm-state | Manually set alarm state |
| aws cloudwatch delete-alarms | Delete alarms |
Logs Commands
| Command | Description |
|---------|-------------|
| aws logs create-log-group | Create log group |
| aws logs put-log-events | Write log events |
| aws logs filter-log-events | Search log events |
| aws logs start-query | Start Insights query |
| aws logs put-metric-filter | Create metric filter |
| aws logs put-retention-policy | Set log retention |
Best Practices
Metrics
- Use dimensions wisely — too many creates metric explosion
- Aggregate before publishing — batch custom metrics
- Use high-resolution metrics (1-second) only when needed
- Set meaningful units for custom metrics
Alarms
- Use composite alarms for complex conditions
- Set appropriate evaluation periods to avoid flapping
- Include OK actions to track recovery
- Use anomaly detection for dynamic thresholds
Logs
- Set retention policies — don't keep logs forever
- Use structured logging (JSON) for better querying
- Create metric filters for key events
- Use Contributor Insights for top-N analysis
Cost Optimization
- Delete unused dashboards
- Reduce log retention for non-critical logs
- Avoid high-resolution metrics unless necessary
- Use log subscription filters instead of polling
Troubleshooting
Missing Metrics
Causes:
- Service not publishing yet (wait 1-5 minutes)
- Wrong namespace/dimensions
- Detailed monitoring not enabled (EC2)
Debug:
# List metrics for a namespace aws cloudwatch list-metrics \ --namespace AWS/Lambda \ --dimensions Name=FunctionName,Value=MyFunction
Alarm Stuck in INSUFFICIENT_DATA
Causes:
- Metric not being published
- Dimensions mismatch
- Evaluation period too short
Debug:
# Check if metric has data aws cloudwatch get-metric-statistics \ --namespace AWS/Lambda \ --metric-name Invocations \ --dimensions Name=FunctionName,Value=MyFunction \ --start-time $(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%SZ) \ --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \ --period 60 \ --statistics Sum
Log Events Not Appearing
Causes:
- IAM permissions missing
- CloudWatch Logs agent not running
- Log group doesn't exist
Debug:
# Check log streams aws logs describe-log-streams \ --log-group-name /aws/lambda/MyFunction \ --order-by LastEventTime \ --descending \ --limit 5
High CloudWatch Costs
Check usage:
# Get PutLogEvents usage aws cloudwatch get-metric-statistics \ --namespace AWS/Logs \ --metric-name IncomingBytes \ --dimensions Name=LogGroupName,Value=/aws/lambda/MyFunction \ --start-time $(date -d '7 days ago' -u +%Y-%m-%dT%H:%M:%SZ) \ --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \ --period 86400 \ --statistics Sum
References
- CloudWatch User Guide
- [CloudWa
优点
- 涵盖CloudWatch功能的广泛范围并提供实用示例。
- 提供清晰的CLI和boto3代码片段,便于立即实施。
- 包含针对实际使用场景的故障排除指南和最佳实践。
缺点
- 本质上是一个文档包装器,缺乏新颖的自动化功能。
- 假设用户已预先设置好AWS CLI/boto3并具备IAM知识。
- 可以从更高级的自动化或集成模式中受益。
相关技能
免责声明:本内容来源于 GitHub 开源项目,仅供展示和评分分析使用。
版权归原作者所有 itsmostafa.
