Key Metrics to Monitor
Failure Rate
The percentage of jobs that fail processing.
Why it matters:
- Indicates bugs, external service issues, or bad data
- Sudden spikes often signal deployments or dependency problems
- Gradual increases may indicate data quality degradation
Target thresholds:
- Critical queues: < 0.5%
- Standard queues: < 1%
- Best-effort queues: < 5%
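As a minimal sketch of how a failure rate might be checked against the tiers above, the snippet below computes the rate from completed and failed counts. The names `TARGET_FAILURE_RATE_PCT`, `failure_rate_pct`, and `within_target` are hypothetical, and defining the rate as failed / (completed + failed) is an assumption; adapt it to however your queue system reports counts.

```python
# Hypothetical targets (percent), copied from the tiers listed above.
TARGET_FAILURE_RATE_PCT = {
    "critical": 0.5,
    "standard": 1.0,
    "best_effort": 5.0,
}


def failure_rate_pct(completed: int, failed: int) -> float:
    """Return failed jobs as a percentage of all finished jobs."""
    total = completed + failed
    if total == 0:
        return 0.0  # no finished jobs in the window, nothing to report
    return 100.0 * failed / total


def within_target(tier: str, completed: int, failed: int) -> bool:
    """True when the observed failure rate is strictly below the tier's target."""
    return failure_rate_pct(completed, failed) < TARGET_FAILURE_RATE_PCT[tier]
```

For example, 10 failures out of 1,000 finished jobs is a 1% failure rate, which would pass a best-effort target but not a critical or standard one.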
Throughput
Jobs processed per unit time.
Why it matters:
- Shows system capacity and utilization
- Drops indicate worker issues or upstream problems
- Spikes may indicate traffic surges or catch-up processing
What to look for:
- Consistent patterns during normal operation
- Expected increases during peak hours
- Unexpected drops that don’t recover
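One way to tell a drop that does not recover from a brief dip is to require several consecutive low samples. The sketch below assumes throughput is sampled at a fixed interval; `sustained_drop`, `drop_ratio`, and `min_intervals` are hypothetical names, and the default values are illustrative rather than recommended.

```python
from typing import Sequence


def sustained_drop(
    throughput: Sequence[float],
    baseline: float,
    drop_ratio: float = 0.5,
    min_intervals: int = 3,
) -> bool:
    """True when the last `min_intervals` samples all sit below
    `drop_ratio * baseline`, i.e. the drop has not recovered yet."""
    if len(throughput) < min_intervals:
        return False  # not enough data to call the drop sustained
    floor = drop_ratio * baseline
    return all(sample < floor for sample in throughput[-min_intervals:])
```

A single low sample followed by a return to normal does not trigger this check, which keeps short dips from being reported as incidents.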
Processing Time
How long jobs take to complete.
Why it matters:
- Affects user experience for synchronous operations
- Indicates code efficiency and external dependency performance
- p95/p99 percentiles catch outliers that averages miss
What to measure:
- Average processing time
- p95 processing time (95% of jobs complete within this time)
- p99 processing time (catches edge cases)
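To illustrate why percentiles catch what averages miss, here is a small sketch that summarizes a list of job durations. It uses the nearest-rank definition of a percentile, which is one of several common definitions; monitoring tools may interpolate differently. The function names are hypothetical.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    `pct` percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(durations_ms: Sequence[float]) -> Dict[str, float]:
    """Return the three measurements listed above for one time window."""
    return {
        "avg": mean(durations_ms),
        "p95": percentile(durations_ms, 95),
        "p99": percentile(durations_ms, 99),
    }
```

With 98 jobs at 100 ms plus one at 2,000 ms and one at 9,000 ms, the average is 208 ms and p95 is still 100 ms, while p99 is 2,000 ms and exposes the slow tail.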
Queue Depth (Backlog)
Number of jobs waiting to be processed.
Why it matters:
- Growing backlog = workers can’t keep up
- Consistent high depth = need more workers
- Sudden spikes may indicate upstream issues
Healthy indicators:
- Backlog clears within acceptable timeframes
- No consistent growth over time
- Spikes recover after peak periods
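Whether a backlog "clears within acceptable timeframes" can be estimated from the current depth and the gap between arrival and processing rates. The sketch below is a rough back-of-the-envelope calculation that assumes both rates stay constant; `minutes_to_clear` and its parameters are hypothetical names.

```python
import math


def minutes_to_clear(
    depth: int, arrivals_per_min: float, processed_per_min: float
) -> float:
    """Estimate minutes until the backlog is empty at the current rates.

    Returns math.inf when workers are not outpacing arrivals, which is
    the growing-backlog case described above.
    """
    if depth <= 0:
        return 0.0
    net_drain = processed_per_min - arrivals_per_min
    if net_drain <= 0:
        return math.inf
    return depth / net_drain
```

For example, 600 waiting jobs with 40 arrivals and 60 completions per minute drain at a net 20 jobs per minute, so the backlog clears in roughly 30 minutes.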
Monitoring Strategy
Layered Alerting
Create alerts at different severity levels.
Dashboard Review Cadence
| Frequency | What to Check |
|---|---|
| Real-time | Active incidents, triggered alerts |
| Daily | Throughput trends, failure spikes, slow jobs |
| Weekly | Processing time trends, capacity utilization |
| Monthly | Long-term trends, threshold adjustments |
Baseline Documentation
Document your normal operating parameters.
Queue-Specific Strategies
High-Priority Queues
For queues handling user-facing operations (e.g., order processing):
Tight Thresholds
- Failure rate: Alert at 1%
- Processing time: Alert at 2x baseline
- Backlog: Alert at 10x normal
Fast Response
- Short cooldown periods (5 min)
- Multiple notification channels
- Immediate escalation paths
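The thresholds and cooldown above could be combined along these lines. This is a sketch under stated assumptions, not a reference implementation: `HighPriorityAlerter` is a hypothetical class, comparing p95 processing time against the baseline is an assumption (the list does not say which measurement to use), and sending notifications is left to the caller.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HighPriorityAlerter:
    baseline_ms: float               # normal processing time for this queue
    normal_backlog: int              # normal queue depth for this queue
    cooldown_s: float = 300.0        # 5 min cooldown, as listed above
    last_fired_at: Optional[float] = None

    def breaches(self, failure_pct: float, p95_ms: float, backlog: int) -> List[str]:
        """Return the names of every threshold currently exceeded."""
        found = []
        if failure_pct >= 1.0:                      # failure rate: 1%
            found.append("failure_rate")
        if p95_ms >= 2 * self.baseline_ms:          # processing time: 2x baseline
            found.append("processing_time")
        if backlog >= 10 * self.normal_backlog:     # backlog: 10x normal
            found.append("backlog")
        return found

    def check(
        self, now_s: float, failure_pct: float, p95_ms: float, backlog: int
    ) -> List[str]:
        """Return breaches to notify about, or [] while cooling down."""
        found = self.breaches(failure_pct, p95_ms, backlog)
        if not found:
            return []
        cooling = (
            self.last_fired_at is not None
            and now_s - self.last_fired_at < self.cooldown_s
        )
        if cooling:
            return []  # suppress repeats inside the cooldown window
        self.last_fired_at = now_s
        return found
```

The cooldown keeps a persistent breach from paging repeatedly, while the short five-minute window still lets a new or continuing problem resurface quickly.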
Background Job Queues
For queues handling non-critical operations (e.g., analytics, cleanup):
Relaxed Thresholds
- Failure rate: Alert at 5%
- Processing time: Alert at 5x baseline
- Longer time windows
Batched Notifications
- Longer cooldown (1-4 hours)
- Daily summary emails
- Lower priority channels
Data Pipeline Queues
For queues handling data processing:
Accuracy Focus
- Very low failure tolerance (< 0.1%)
- Monitoring for data quality issues
- Tracking completion rates
Performance Tracking
- Processing time percentiles
- Throughput vs expected volume
- End-to-end pipeline timing
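A pipeline run can be checked against both the failure tolerance and the expected volume in a few lines. In this sketch the 0.1% failure limit comes from the list above, while the 99% minimum completion rate is an illustrative assumption, as are the function names.

```python
def completion_rate_pct(expected: int, completed: int) -> float:
    """Completed jobs as a percentage of the volume expected for the run."""
    if expected == 0:
        return 100.0  # nothing was expected, so nothing is missing
    return 100.0 * completed / expected


def pipeline_healthy(
    expected: int,
    completed: int,
    failed: int,
    max_failure_pct: float = 0.1,     # tolerance listed above
    min_completion_pct: float = 99.0,  # assumed value, tune per pipeline
) -> bool:
    """True when failures are within tolerance and volume matches expectations."""
    finished = completed + failed
    failure_pct = 100.0 * failed / finished if finished else 0.0
    return (
        failure_pct < max_failure_pct
        and completion_rate_pct(expected, completed) >= min_completion_pct
    )
```

Checking volume alongside failures matters because a pipeline that silently processes fewer records than expected can report a zero failure rate and still be wrong.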
Common Patterns and Solutions
Pattern: Gradual Backlog Growth
Symptoms:
- Queue depth slowly increases over days/weeks
- No sudden spikes
Possible causes:
- Insufficient workers for growing load
- Processing time slowly increasing
- Worker efficiency decreasing
Solutions:
- Add more workers
- Profile and optimize slow operations
- Review recent code changes
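Gradual growth is easy to miss on a dashboard, so it can help to fit a trend line to periodic depth samples. The sketch below computes an ordinary least-squares slope over evenly spaced samples (for example, one per day) and also requires that no single sample jumps sharply. The function names and default thresholds are hypothetical.

```python
from typing import Sequence


def depth_slope(depths: Sequence[float]) -> float:
    """Least-squares slope of queue depth per sample interval."""
    n = len(depths)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(depths) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(depths))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def is_gradual_growth(
    depths: Sequence[float], min_slope: float = 1.0, max_jump_ratio: float = 2.0
) -> bool:
    """Positive trend where no sample exceeds `max_jump_ratio` times the previous one."""
    no_spikes = all(
        later <= max_jump_ratio * earlier
        for earlier, later in zip(depths, depths[1:])
        if earlier > 0
    )
    return depth_slope(depths) >= min_slope and no_spikes
```

Daily depths of 100, 110, 120, 130, 140 give a slope of 10 jobs per day with no spikes, which matches this pattern; a one-off surge that returns to normal does not.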
Pattern: Periodic Failure Spikes
Symptoms:
- Failures spike at regular intervals
- Spikes correlate with specific times
Possible causes:
- Scheduled jobs overwhelming resources
- External service maintenance windows
- Traffic patterns
Solutions:
- Stagger scheduled jobs
- Coordinate with external service schedules
- Auto-scale workers during peak times
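Staggering can be as simple as giving each scheduled job a different start offset within a window rather than starting them all on the hour. This is a minimal sketch; `staggered_offsets` is a hypothetical helper, and applying the offsets depends on the scheduler in use.

```python
from typing import Dict, Sequence


def staggered_offsets(job_names: Sequence[str], window_s: int) -> Dict[str, int]:
    """Spread job start times evenly across `window_s` seconds
    instead of starting every job at offset 0."""
    if not job_names:
        return {}
    step = window_s // len(job_names)
    return {name: index * step for index, name in enumerate(job_names)}
```

Three jobs spread over a 15-minute window would start at 0, 300, and 600 seconds, so they no longer compete for the same workers at the same moment.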
Pattern: Sudden Processing Time Increase
Symptoms:
- Processing time jumps suddenly
- No corresponding code changes
Possible causes:
- External service degradation
- Database performance issues
- Network problems
Solutions:
- Check external service status
- Review database queries and indexes
- Check network connectivity
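A sudden jump can be detected by comparing the most recent samples with the ones before them. The sketch below uses medians so that a single outlier on either side does not trigger or mask the jump; `sudden_jump`, `recent`, and `factor` are hypothetical names with illustrative defaults.

```python
from statistics import median
from typing import Sequence


def sudden_jump(
    durations_ms: Sequence[float], recent: int = 5, factor: float = 2.0
) -> bool:
    """True when the median of the last `recent` samples is at least
    `factor` times the median of all earlier samples."""
    if len(durations_ms) <= recent:
        return False  # no earlier samples to compare against
    baseline = median(durations_ms[:-recent])
    current = median(durations_ms[-recent:])
    return baseline > 0 and current >= factor * baseline
```

If this fires with no recent deployment, the causes listed above (external services, database, network) are the natural places to look first.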
Pattern: Intermittent Worker Disconnects
Symptoms:
- Workers frequently disconnect/reconnect
- “Missing workers” alerts that auto-resolve
Possible causes:
- Memory issues causing worker crashes
- Container orchestration issues
- Network instability
Solutions:
- Review worker memory usage
- Check container health
- Stabilize network connections
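Because these alerts auto-resolve, the individual events are easy to dismiss; counting disconnects over a window makes the pattern visible. The sketch below assumes disconnect timestamps (in seconds) are already being recorded somewhere; `is_flapping` and its default window and limit are hypothetical.

```python
from typing import Sequence


def is_flapping(
    disconnect_times_s: Sequence[float],
    now_s: float,
    window_s: float = 600.0,
    max_disconnects: int = 3,
) -> bool:
    """True when at least `max_disconnects` disconnects occurred
    within the last `window_s` seconds."""
    recent = [t for t in disconnect_times_s if 0 <= now_s - t <= window_s]
    return len(recent) >= max_disconnects
```

A worker that disconnects three times in ten minutes would be flagged even if every individual "missing workers" alert resolved on its own.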
Observability Checklist
Use this checklist to ensure comprehensive monitoring:
Metrics Collection
- Throughput (completed/failed per hour)
- Processing time (avg, p95, p99)
- Queue depth (waiting, active, delayed)
- Worker count and status
- Failure rate percentage
Alerting
- Failure rate alert
- Backlog alert
- Processing time alert
- Missing workers alert
- Alert recipients configured
- Escalation paths defined
Documentation
- Baseline metrics documented
- Runbooks for common issues
- On-call rotation defined
- Incident response process
Regular Reviews
- Daily metric check
- Weekly trend review
- Monthly threshold adjustment
- Quarterly capacity planning