Effective monitoring helps you catch issues before they impact users, understand system performance, and make data-driven decisions. This guide covers best practices for monitoring BullMQ queues with bullstudio.

Key Metrics to Monitor

Failure Rate

The percentage of jobs that fail processing.
Why it matters:
  • Indicates bugs, external service issues, or bad data
  • Sudden spikes often follow a deployment or a dependency outage
  • Gradual increases may indicate data quality degradation
Healthy targets:
  • Critical queues: < 0.5%
  • Standard queues: < 1%
  • Best-effort queues: < 5%
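As a rough illustration of how the targets above could be checked, the sketch below computes a failure rate from completed and failed counts over some window and compares it with a per-tier target. The tier names and the `isWithinTarget` helper are hypothetical and not part of bullstudio or BullMQ; only the percentages come from the list above.

```typescript
// Hypothetical tier names; the numbers mirror the "Healthy targets" list.
type QueueTier = "critical" | "standard" | "bestEffort";

const FAILURE_RATE_TARGETS: Record<QueueTier, number> = {
  critical: 0.5,
  standard: 1,
  bestEffort: 5,
};

// Failure rate as a percentage of all jobs that finished in the window.
function failureRatePercent(completed: number, failed: number): number {
  const total = completed + failed;
  // An empty window has nothing to report, so treat it as 0%.
  return total === 0 ? 0 : (failed / total) * 100;
}

// True when the observed rate is strictly below the tier's target.
function isWithinTarget(tier: QueueTier, completed: number, failed: number): boolean {
  return failureRatePercent(completed, failed) < FAILURE_RATE_TARGETS[tier];
}
```

The counts should cover a fixed window (for example, the last hour) so that old failures do not mask a recent spike.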

Throughput

Jobs processed per unit time.
Why it matters:
  • Shows system capacity and utilization
  • Drops indicate worker issues or upstream problems
  • Spikes may indicate traffic surges or catch-up processing
What to watch:
  • Consistent patterns during normal operation
  • Expected increases during peak hours
  • Unexpected drops that don’t recover
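One way to spot a drop that does not recover is to bucket job completion timestamps by hour and check whether the most recent buckets all sit well below the baseline. The sketch below assumes you have millisecond completion timestamps (BullMQ jobs expose one as `finishedOn`); the function names, the 50% drop ratio, and the two-hour window are illustrative assumptions.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Count finished jobs per hour, starting at windowStartMs.
function countJobsPerHour(finishedOnMs: number[], windowStartMs: number, hours: number): number[] {
  const buckets = new Array<number>(hours).fill(0);
  for (const ts of finishedOnMs) {
    const index = Math.floor((ts - windowStartMs) / HOUR_MS);
    // Ignore timestamps outside the requested window.
    if (index >= 0 && index < hours) buckets[index] += 1;
  }
  return buckets;
}

// True when each of the last `minHours` buckets is below baseline * dropRatio.
function hasSustainedDrop(
  buckets: number[],
  baselinePerHour: number,
  dropRatio = 0.5,
  minHours = 2,
): boolean {
  if (buckets.length < minHours) return false;
  const floor = baselinePerHour * dropRatio;
  return buckets.slice(-minHours).every((count) => count < floor);
}
```

Requiring several consecutive low buckets keeps a single quiet hour from being reported as a drop.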

Processing Time

How long jobs take to complete.
Why it matters:
  • Affects user experience when users are waiting on the result
  • Indicates code efficiency and external dependency performance
  • p95/p99 percentiles catch outliers that averages miss
Metrics to track:
  • Average processing time
  • p95 processing time (95% of jobs complete within this time)
  • p99 processing time (catches edge cases)
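To make the difference between averages and percentiles concrete, here is a minimal sketch using the nearest-rank percentile method. It assumes you already have per-job durations in milliseconds (for BullMQ jobs, `finishedOn - processedOn` is one way to derive them); `percentile` and `summarizeProcessingTimes` are hypothetical helper names.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(durationsMs: number[], p: number): number {
  if (durationsMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...durationsMs].sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point drift in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

interface ProcessingTimeSummary {
  averageMs: number;
  p95Ms: number;
  p99Ms: number;
}

function summarizeProcessingTimes(durationsMs: number[]): ProcessingTimeSummary {
  const sum = durationsMs.reduce((acc, d) => acc + d, 0);
  return {
    averageMs: sum / durationsMs.length,
    p95Ms: percentile(durationsMs, 95),
    p99Ms: percentile(durationsMs, 99),
  };
}
```

With nine jobs at 100 ms and one at 5000 ms, the average is 590 ms while p95 and p99 are both 5000 ms, which is the outlier the average hides.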

Queue Depth (Backlog)

Number of jobs waiting to be processed.
Why it matters:
  • A growing backlog means workers can’t keep up with incoming jobs
  • A consistently high depth suggests you need more workers
  • Sudden spikes may indicate upstream issues
Healthy signs:
  • Backlog clears within acceptable timeframes
  • No consistent growth over time
  • Spikes recover after peak periods
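"No consistent growth over time" can be checked numerically by fitting a line through periodic depth samples and looking at its slope. The sketch below uses an ordinary least-squares fit; the `DepthSample` shape and the default threshold of one job per minute are assumptions for illustration, and the samples would come from whatever polls your queue counts.

```typescript
interface DepthSample {
  atMs: number;    // when the sample was taken (epoch milliseconds)
  waiting: number; // jobs waiting at that moment
}

// Least-squares slope of queue depth, in jobs per minute.
function backlogSlopePerMinute(samples: DepthSample[]): number {
  if (samples.length < 2) return 0;
  const xs = samples.map((s) => s.atMs / 60000);
  const ys = samples.map((s) => s.waiting);
  const meanX = xs.reduce((a, b) => a + b, 0) / xs.length;
  const meanY = ys.reduce((a, b) => a + b, 0) / ys.length;
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < xs.length; i++) {
    sxy += (xs[i] - meanX) * (ys[i] - meanY);
    sxx += (xs[i] - meanX) * (xs[i] - meanX);
  }
  // All samples at the same instant: no trend can be computed.
  return sxx === 0 ? 0 : sxy / sxx;
}

function isBacklogGrowing(samples: DepthSample[], minSlopePerMinute = 1): boolean {
  return backlogSlopePerMinute(samples) > minSlopePerMinute;
}
```

A positive slope over a short window may just be a peak period; the same check over days is closer to the "consistent growth" the list describes.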

Monitoring Strategy

Layered Alerting

Create alerts at different severity levels:
Layer 1: Warning (Slack/email)
├── Failure rate > 2%
├── Backlog > 500 jobs
└── Processing time > 2x baseline

Layer 2: Critical (PagerDuty/SMS)
├── Failure rate > 10%
├── Backlog > 2000 jobs
├── Missing workers > 5 minutes
└── Processing time > 5x baseline

Layer 3: Emergency
├── All workers down > 15 minutes
├── Backlog growing uncontrollably
└── Complete queue failure
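The layers above can be expressed as a small classification function. This is only a sketch of the logic, not how bullstudio evaluates alerts: the `QueueSnapshot` fields are hypothetical, and the qualitative Layer 3 conditions ("backlog growing uncontrollably", "complete queue failure") are left out because the layers give no numeric threshold for them.

```typescript
type AlertSeverity = "ok" | "warning" | "critical" | "emergency";

// Hypothetical snapshot of one queue at evaluation time.
interface QueueSnapshot {
  failureRatePercent: number;
  backlog: number;
  processingTimeMs: number;
  baselineProcessingTimeMs: number;
  missingWorkersMinutes: number; // time with fewer workers than expected
  allWorkersDownMinutes: number; // time with no workers at all
}

function classifySeverity(s: QueueSnapshot): AlertSeverity {
  const slowdown = s.processingTimeMs / s.baselineProcessingTimeMs;

  // Check the most severe layer first so it wins when several apply.
  if (s.allWorkersDownMinutes > 15) return "emergency";

  if (
    s.failureRatePercent > 10 ||
    s.backlog > 2000 ||
    s.missingWorkersMinutes > 5 ||
    slowdown > 5
  ) {
    return "critical";
  }

  if (s.failureRatePercent > 2 || s.backlog > 500 || slowdown > 2) {
    return "warning";
  }

  return "ok";
}
```

Each severity would then map to its channel: Slack or email for warnings, PagerDuty or SMS for critical alerts.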

Dashboard Review Cadence

Frequency | What to Check
----------|--------------
Real-time | Active incidents, triggered alerts
Daily     | Throughput trends, failure spikes, slow jobs
Weekly    | Processing time trends, capacity utilization
Monthly   | Long-term trends, threshold adjustments

Baseline Documentation

Document your normal operating parameters:
## Email Queue Baseline
- Normal throughput: 500-800 jobs/hour
- Peak throughput: 1500 jobs/hour (9am-10am)
- Normal failure rate: 0.3%
- Normal processing time: 150-300ms
- Normal queue depth: 5-20 jobs
- Workers: 3 instances
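A baseline like the one above is also useful in machine-readable form, so that observed values can be compared against it automatically. The sketch below encodes the email queue numbers as a typed object and lists which metrics fall outside them; the type names and `findDeviations` are hypothetical, and treating the peak throughput as the upper bound is an assumption.

```typescript
interface NumericRange {
  min: number;
  max: number;
}

interface QueueBaseline {
  throughputPerHour: NumericRange;
  peakThroughputPerHour: number;
  failureRatePercent: number;
  processingTimeMs: NumericRange;
  queueDepth: NumericRange;
  workers: number;
}

// Values copied from the documented email queue baseline.
const emailQueueBaseline: QueueBaseline = {
  throughputPerHour: { min: 500, max: 800 },
  peakThroughputPerHour: 1500,
  failureRatePercent: 0.3,
  processingTimeMs: { min: 150, max: 300 },
  queueDepth: { min: 5, max: 20 },
  workers: 3,
};

interface ObservedMetrics {
  throughputPerHour: number;
  processingTimeMs: number;
  queueDepth: number;
  workers: number;
}

// Names of the metrics that fall outside the documented baseline.
function findDeviations(baseline: QueueBaseline, observed: ObservedMetrics): string[] {
  const deviations: string[] = [];
  if (
    observed.throughputPerHour < baseline.throughputPerHour.min ||
    observed.throughputPerHour > baseline.peakThroughputPerHour
  ) {
    deviations.push("throughput");
  }
  if (observed.processingTimeMs > baseline.processingTimeMs.max) deviations.push("processingTime");
  if (observed.queueDepth > baseline.queueDepth.max) deviations.push("queueDepth");
  if (observed.workers < baseline.workers) deviations.push("workers");
  return deviations;
}
```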

Queue-Specific Strategies

High-Priority Queues

For queues handling user-facing operations (e.g., order processing):

Tight Thresholds

  • Failure rate: Alert at 1%
  • Processing time: Alert at 2x baseline
  • Backlog: Alert at 10x normal

Fast Response

  • Short cooldown periods (5 min)
  • Multiple notification channels
  • Immediate escalation paths

Background Job Queues

For queues handling non-critical operations (e.g., analytics, cleanup):

Relaxed Thresholds

  • Failure rate: Alert at 5%
  • Processing time: Alert at 5x baseline
  • Longer time windows

Batched Notifications

  • Longer cooldown (1-4 hours)
  • Daily summary emails
  • Lower priority channels
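Cooldown periods, short for high-priority queues and long for background queues, amount to suppressing repeat notifications for the same alert until enough time has passed. The class below is a minimal in-memory sketch of that idea; `AlertCooldown` and the alert key format are hypothetical, and a real setup would likely persist this state.

```typescript
// Suppresses repeat notifications for the same alert key within a cooldown.
class AlertCooldown {
  private readonly cooldownMs: number;
  private readonly lastSentMs = new Map<string, number>();

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true (and records the send) when a notification should go out.
  shouldNotify(alertKey: string, nowMs: number): boolean {
    const last = this.lastSentMs.get(alertKey);
    if (last !== undefined && nowMs - last < this.cooldownMs) {
      return false; // still cooling down
    }
    this.lastSentMs.set(alertKey, nowMs);
    return true;
  }
}

const MINUTE_MS = 60 * 1000;

// 5 minutes for a high-priority queue, 1 hour for a background queue.
const highPriorityCooldown = new AlertCooldown(5 * MINUTE_MS);
const backgroundCooldown = new AlertCooldown(60 * MINUTE_MS);
```

Keying by queue and alert type (for example, "orders:failure-rate") lets one alert cool down without silencing the others.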

Data Pipeline Queues

For queues handling data processing:

Accuracy Focus

  • Very low failure tolerance (< 0.1%)
  • Monitoring for data quality issues
  • Tracking completion rates

Performance Tracking

  • Processing time percentiles
  • Throughput vs expected volume
  • End-to-end pipeline timing

Common Patterns and Solutions

Pattern: Gradual Backlog Growth

Symptoms:
  • Queue depth slowly increases over days/weeks
  • No sudden spikes
Likely causes:
  • Insufficient workers for growing load
  • Processing time slowly increasing
  • Worker efficiency decreasing (for example, memory leaks or resource contention)
Solutions:
  • Add more workers
  • Profile and optimize slow operations
  • Review recent code changes
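Before adding workers, it can help to estimate how many the load actually requires. The sketch below uses a standard capacity estimate: arrival rate multiplied by average processing time gives the number of busy processing slots, which is then divided by per-worker concurrency and a target utilization. The function name and the 70% default utilization are assumptions, not bullstudio recommendations.

```typescript
// Rough worker count needed to keep up with a given load.
function estimateRequiredWorkers(
  jobsPerHour: number,
  avgProcessingMs: number,
  concurrencyPerWorker = 1,
  targetUtilization = 0.7, // leave headroom for spikes and retries
): number {
  const arrivalsPerSecond = jobsPerHour / 3600;
  // Average number of jobs being processed at any moment.
  const busySlots = arrivalsPerSecond * (avgProcessingMs / 1000);
  const workers = Math.ceil(busySlots / (concurrencyPerWorker * targetUtilization));
  return Math.max(1, workers);
}
```

If the estimate is well below your current worker count and the backlog still grows, the cause is more likely rising processing time than insufficient workers.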

Pattern: Periodic Failure Spikes

Symptoms:
  • Failures spike at regular intervals
  • Spikes correlate with specific times
Likely causes:
  • Scheduled jobs overwhelming resources
  • External service maintenance windows
  • Traffic patterns
Solutions:
  • Stagger scheduled jobs
  • Coordinate with external service schedules
  • Auto-scale workers during peak times
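To confirm that failures correlate with specific times, one option is to build a histogram of failures by hour of day and flag hours that stand well above the average. The sketch below assumes millisecond failure timestamps (BullMQ sets `finishedOn` on failed jobs as well); the function names and the factor of 3 are illustrative assumptions.

```typescript
// Count failures per hour of day (UTC), index 0..23.
function failuresByHourUtc(failedAtMs: number[]): number[] {
  const histogram = new Array<number>(24).fill(0);
  for (const ts of failedAtMs) {
    histogram[new Date(ts).getUTCHours()] += 1;
  }
  return histogram;
}

// Hours whose failure count exceeds `factor` times the hourly mean.
function findSpikeHours(histogram: number[], factor = 3): number[] {
  const total = histogram.reduce((a, b) => a + b, 0);
  if (total === 0) return [];
  const mean = total / histogram.length;
  const spikes: number[] = [];
  histogram.forEach((count, hour) => {
    if (count > mean * factor) spikes.push(hour);
  });
  return spikes;
}
```

A spike at the same hour across several days points toward scheduled jobs or a maintenance window rather than random failures.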

Pattern: Sudden Processing Time Increase

Symptoms:
  • Processing time jumps suddenly
  • No corresponding code changes
Likely causes:
  • External service degradation
  • Database performance issues
  • Network problems
Solutions:
  • Check external service status
  • Review database queries and indexes
  • Check network connectivity

Pattern: Intermittent Worker Disconnects

Symptoms:
  • Workers frequently disconnect/reconnect
  • “Missing workers” alerts that auto-resolve
Likely causes:
  • Memory issues causing worker crashes
  • Container orchestration issues
  • Network instability
Solutions:
  • Review worker memory usage
  • Check container health
  • Check the stability of the network connection between workers and Redis

Observability Checklist

Use this checklist to ensure comprehensive monitoring:
  • Throughput (completed/failed per hour)
  • Processing time (avg, p95, p99)
  • Queue depth (waiting, active, delayed)
  • Worker count and status
  • Failure rate percentage
  • Failure rate alert
  • Backlog alert
  • Processing time alert
  • Missing workers alert
  • Alert recipients configured
  • Escalation paths defined
  • Baseline metrics documented
  • Runbooks for common issues
  • On-call rotation defined
  • Incident response process
  • Daily metric check
  • Weekly trend review
  • Monthly threshold adjustment
  • Quarterly capacity planning

Next Steps