Monitoring Best Practices


Effective monitoring helps you catch issues before they impact users, understand system performance, and make data-driven decisions. This guide covers best practices for monitoring BullMQ queues with bullstudio.

Key Metrics to Monitor

Failure Rate

The percentage of jobs that fail processing.

Why it matters:

Indicates bugs, external service issues, or bad data
Sudden spikes often signal deployments or dependency problems
Gradual increases may indicate data quality degradation

Healthy targets:

Critical queues: < 0.5%
Standard queues: < 1%
Best-effort queues: < 5%
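As a minimal sketch of how these targets could be checked, the snippet below computes a failure rate from completed and failed counts and compares it to the target for a queue tier. The tier names and function names are hypothetical and not part of bullstudio; with BullMQ, the counts could plausibly come from `Queue.getJobCounts()`, sampled over a fixed window.

```typescript
// Hypothetical tier names mirroring the targets listed above.
type QueueTier = "critical" | "standard" | "best-effort";

// Healthy failure-rate targets, in percent, taken from the list above.
const FAILURE_RATE_TARGETS: Record<QueueTier, number> = {
  critical: 0.5,
  standard: 1,
  "best-effort": 5,
};

// Failure rate as a percentage of all finished jobs in a window.
function failureRatePercent(completed: number, failed: number): number {
  const total = completed + failed;
  if (total === 0) return 0; // no finished jobs: report 0% rather than NaN
  return (failed / total) * 100;
}

// True when the observed rate is strictly below the tier's target.
function isWithinTarget(tier: QueueTier, ratePercent: number): boolean {
  return ratePercent < FAILURE_RATE_TARGETS[tier];
}

// Example: 3 failures out of 1000 finished jobs is roughly 0.3%.
const rate = failureRatePercent(997, 3);
const criticalOk = isWithinTarget("critical", rate); // within the 0.5% target
const bestEffortOk = isWithinTarget("best-effort", 7.5); // above the 5% target
```

Computing the rate over a window of finished jobs, rather than over all-time totals, keeps old history from masking a recent spike.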

Throughput

Jobs processed per unit time.

Why it matters:

Shows system capacity and utilization
Drops indicate worker issues or upstream problems
Spikes may indicate traffic surges or catch-up processing

What to watch:

Consistent patterns during normal operation
Expected increases during peak hours
Unexpected drops that don’t recover
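One way to spot an unexpected drop is to bucket completion timestamps into fixed windows and compare the latest window against the earlier ones. The sketch below assumes you already have completion timestamps in milliseconds (in BullMQ these could be read from a job's `finishedOn` field); the function names and the 50% drop ratio are illustrative assumptions, not bullstudio behavior.

```typescript
const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;

// Count completions per fixed-size bucket (default: one hour).
function throughputPerBucket(
  finishedAtMs: number[],
  startMs: number,
  bucketCount: number,
  bucketMs: number = HOUR_MS,
): number[] {
  const counts = new Array<number>(bucketCount).fill(0);
  for (const t of finishedAtMs) {
    const i = Math.floor((t - startMs) / bucketMs);
    if (i >= 0 && i < bucketCount) counts[i] += 1; // ignore out-of-range samples
  }
  return counts;
}

// Flag a drop when the latest bucket falls below `ratio` times the mean
// of the earlier buckets.
function hasThroughputDrop(counts: number[], ratio: number = 0.5): boolean {
  if (counts.length < 2) return false;
  const earlier = counts.slice(0, -1);
  const mean = earlier.reduce((acc, c) => acc + c, 0) / earlier.length;
  return counts[counts.length - 1] < mean * ratio;
}

// Completion times in minutes from the start of the window:
// three jobs in hour 0, four in hour 1, one in hour 2.
const finishedAt = [5, 30, 55, 70, 85, 95, 110, 150].map((m) => m * MINUTE_MS);
const hourly = throughputPerBucket(finishedAt, 0, 3);
const dropped = hasThroughputDrop(hourly);
```

A fixed ratio is a blunt instrument; queues with strong daily patterns may need to compare against the same hour on previous days instead.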

Processing Time

How long jobs take to complete.

Why it matters:

Affects user experience for synchronous operations
Indicates code efficiency and external dependency performance
p95/p99 percentiles catch outliers that averages miss

Metrics to track:

Average processing time
p95 processing time (95% of jobs complete within this time)
p99 processing time (catches edge cases)
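To illustrate why percentiles catch outliers that averages miss, here is a small sketch that computes the three metrics from a list of job durations. It uses the nearest-rank percentile definition, which is one of several common conventions and not necessarily the one bullstudio uses; the function names are hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it. Expects p in (0, 100].
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Average, p95, and p99 for a window of job durations in milliseconds.
function summarize(durationsMs: number[]): { avg: number; p95: number; p99: number } {
  const sum = durationsMs.reduce((acc, d) => acc + d, 0);
  return {
    avg: sum / durationsMs.length,
    p95: percentile(durationsMs, 95),
    p99: percentile(durationsMs, 99),
  };
}

// 94 fast jobs, 4 slower jobs, and 2 very slow outliers.
const durations = [
  ...new Array<number>(94).fill(200),
  ...new Array<number>(4).fill(400),
  ...new Array<number>(2).fill(5000),
];
const stats = summarize(durations);
// stats.avg is 304, stats.p95 is 400, stats.p99 is 5000
```

In this sample the average (304 ms) looks close to normal, while p99 exposes the 5-second outliers.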

Queue Depth (Backlog)

Number of jobs waiting to be processed.

Why it matters:

Growing backlog = workers can’t keep up
Consistently high depth = need more workers
Sudden spikes may indicate upstream issues

Healthy signs:

Backlog clears within acceptable timeframes
No consistent growth over time
Spikes recover after peak periods
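Whether a backlog "clears within acceptable timeframes" can be estimated from the current depth and the net processing rate. The following is a rough sketch under the assumption that arrival and completion rates stay constant over the estimate; the function name is hypothetical.

```typescript
// Estimated minutes until the backlog is empty. Returns Infinity when
// jobs arrive at least as fast as workers complete them.
function minutesToDrain(
  waiting: number,
  completedPerMin: number,
  arrivingPerMin: number,
): number {
  if (waiting <= 0) return 0;
  const net = completedPerMin - arrivingPerMin;
  if (net <= 0) return Infinity; // the backlog is not shrinking
  return waiting / net;
}

// 600 waiting jobs, 50 completed per minute, 30 arriving per minute.
const eta = minutesToDrain(600, 50, 30); // 600 / (50 - 30) = 30 minutes
const stuck = minutesToDrain(600, 30, 30); // Infinity: workers only keep pace
```

An estimate of `Infinity` corresponds to the "growing backlog" case above: adding jobs at least as fast as they are processed means the queue never clears.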

Monitoring Strategy

Layered Alerting

Create alerts at different severity levels:

Layer 1: Warning (Slack/email)
├── Failure rate > 2%
├── Backlog > 500 jobs
└── Processing time > 2x baseline

Layer 2: Critical (PagerDuty/SMS)
├── Failure rate > 10%
├── Backlog > 2000 jobs
├── Missing workers > 5 minutes
└── Processing time > 5x baseline

Layer 3: Emergency
├── All workers down > 15 minutes
├── Backlog growing uncontrollably
└── Complete queue failure
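The layers above can be sketched as a single classification function that returns the highest matching severity. This is an illustrative sketch, not bullstudio's alert engine: the `QueueSnapshot` shape is hypothetical, worker loss is simplified to "minutes with no connected workers", and the Emergency conditions "backlog growing uncontrollably" and "complete queue failure" are left out because they need trend data rather than a single snapshot.

```typescript
type Severity = "ok" | "warning" | "critical" | "emergency";

// Hypothetical point-in-time view of a queue.
interface QueueSnapshot {
  failureRatePercent: number;
  backlog: number;
  processingTimeMs: number;
  baselineProcessingTimeMs: number;
  minutesWithoutWorkers: number; // 0 while at least one worker is connected
}

// Check the most severe layer first so the highest match wins.
function classify(s: QueueSnapshot): Severity {
  const slowdown = s.processingTimeMs / s.baselineProcessingTimeMs;

  if (s.minutesWithoutWorkers > 15) return "emergency";

  if (
    s.failureRatePercent > 10 ||
    s.backlog > 2000 ||
    s.minutesWithoutWorkers > 5 ||
    slowdown > 5
  ) {
    return "critical";
  }

  if (s.failureRatePercent > 2 || s.backlog > 500 || slowdown > 2) {
    return "warning";
  }

  return "ok";
}

const healthy: QueueSnapshot = {
  failureRatePercent: 0.4,
  backlog: 12,
  processingTimeMs: 200,
  baselineProcessingTimeMs: 200,
  minutesWithoutWorkers: 0,
};

const level = classify({ ...healthy, backlog: 2500 }); // "critical"
```

Evaluating layers from most to least severe avoids sending a warning and a page for the same condition.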

Dashboard Review Cadence

Frequency	What to Check
Real-time	Active incidents, triggered alerts
Daily	Throughput trends, failure spikes, slow jobs
Weekly	Processing time trends, capacity utilization
Monthly	Long-term trends, threshold adjustments

Baseline Documentation

Document your normal operating parameters:

## Email Queue Baseline
- Normal throughput: 500-800 jobs/hour
- Peak throughput: 1500 jobs/hour (9am-10am)
- Normal failure rate: 0.3%
- Normal processing time: 150-300ms
- Normal queue depth: 5-20 jobs
- Workers: 3 instances
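If you also want the baseline in a machine-readable form, one option is to keep it as a typed object next to your alerting code and compare observations against it. The sketch below encodes the email queue baseline above; the type names, field names, and `deviations` helper are hypothetical, not a bullstudio format.

```typescript
interface NumericRange {
  min: number;
  max: number;
}

interface QueueBaseline {
  throughputPerHour: NumericRange;
  peakThroughputPerHour: number;
  failureRatePercent: number;
  processingTimeMs: NumericRange;
  queueDepth: NumericRange;
  workers: number;
}

// Values copied from the email queue baseline above.
const emailQueueBaseline: QueueBaseline = {
  throughputPerHour: { min: 500, max: 800 },
  peakThroughputPerHour: 1500,
  failureRatePercent: 0.3,
  processingTimeMs: { min: 150, max: 300 },
  queueDepth: { min: 5, max: 20 },
  workers: 3,
};

interface Observation {
  throughputPerHour: number;
  processingTimeMs: number;
  queueDepth: number;
  workers: number;
}

// List which observed values fall outside the documented normal range.
function deviations(b: QueueBaseline, o: Observation): string[] {
  const out: string[] = [];
  if (o.throughputPerHour < b.throughputPerHour.min) out.push("throughput below normal");
  if (o.throughputPerHour > b.peakThroughputPerHour) out.push("throughput above peak");
  if (o.processingTimeMs > b.processingTimeMs.max) out.push("processing time above normal");
  if (o.queueDepth > b.queueDepth.max) out.push("queue depth above normal");
  if (o.workers < b.workers) out.push("fewer workers than expected");
  return out;
}

const issues = deviations(emailQueueBaseline, {
  throughputPerHour: 320,
  processingTimeMs: 450,
  queueDepth: 18,
  workers: 2,
});
// issues: throughput below normal, processing time above normal,
// fewer workers than expected
```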

Queue-Specific Strategies

High-Priority Queues

For queues handling user-facing operations (e.g., order processing):

Tight Thresholds

Failure rate: Alert at 1%
Processing time: Alert at 2x baseline
Backlog: Alert at 10x normal

Fast Response

Short cooldown periods (5 min)
Multiple notification channels
Immediate escalation paths

Background Job Queues

For queues handling non-critical operations (e.g., analytics, cleanup):

Relaxed Thresholds

Failure rate: Alert at 5%
Processing time: Alert at 5x baseline
Longer time windows

Batched Notifications

Longer cooldown (1-4 hours)
Daily summary emails
Lower priority channels
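The cooldown periods mentioned for high-priority and background queues can be thought of as a gate in front of the notifier: once an alert fires, repeats for the same alert are suppressed until the cooldown has elapsed. Below is a minimal sketch of that idea; the `CooldownGate` class and the alert key format are hypothetical and do not describe how bullstudio implements cooldowns.

```typescript
const MINUTE = 60 * 1000;

// Suppresses repeat notifications for the same key during a cooldown.
class CooldownGate {
  private readonly cooldownMs: number;
  private readonly lastSentAt = new Map<string, number>();

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  // Returns true when a notification for `key` may be sent at `nowMs`
  // and records the send time; returns false while still cooling down.
  shouldNotify(key: string, nowMs: number): boolean {
    const last = this.lastSentAt.get(key);
    if (last !== undefined && nowMs - last < this.cooldownMs) return false;
    this.lastSentAt.set(key, nowMs);
    return true;
  }
}

// High-priority queue: short 5-minute cooldown.
const gate = new CooldownGate(5 * MINUTE);
const first = gate.shouldNotify("orders:failure-rate", 0); // sent
const second = gate.shouldNotify("orders:failure-rate", 2 * MINUTE); // suppressed
const third = gate.shouldNotify("orders:failure-rate", 6 * MINUTE); // sent again

// Background queue: a longer cooldown, here 1 hour.
const backgroundGate = new CooldownGate(60 * MINUTE);
```

Keying the gate by queue and alert type lets a failure-rate alert stay quiet without hiding an unrelated backlog alert on the same queue.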

Data Pipeline Queues

For queues handling data processing:

Accuracy Focus

Very low failure tolerance (< 0.1%)
Monitoring for data quality issues
Tracking completion rates

Performance Tracking

Processing time percentiles
Throughput vs expected volume
End-to-end pipeline timing
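For a data pipeline, completion rate and throughput versus expected volume can be checked per run. The sketch below assumes the upstream stage reports how many records it expected to hand off; the `PipelineRun` shape and `assessRun` helper are hypothetical, and the 0.1% default mirrors the failure tolerance above.

```typescript
interface PipelineRun {
  expectedRecords: number; // volume reported by the upstream stage (assumed available)
  completed: number;
  failed: number;
}

interface PipelineHealth {
  completionRatePercent: number;
  failureRatePercent: number;
  withinFailureTolerance: boolean;
  missingRecords: number; // expected but neither completed nor failed
}

function assessRun(run: PipelineRun, maxFailurePercent: number = 0.1): PipelineHealth {
  const finished = run.completed + run.failed;
  const failureRatePercent = finished === 0 ? 0 : (run.failed / finished) * 100;
  const completionRatePercent =
    run.expectedRecords === 0 ? 100 : (run.completed / run.expectedRecords) * 100;
  return {
    completionRatePercent,
    failureRatePercent,
    withinFailureTolerance: failureRatePercent < maxFailurePercent,
    missingRecords: run.expectedRecords - finished,
  };
}

// 10,000 expected records: 9,990 completed, 5 failed, 5 never processed.
const health = assessRun({ expectedRecords: 10000, completed: 9990, failed: 5 });
```

Tracking `missingRecords` separately matters because records that never reached the queue do not show up in the failure rate at all.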

Common Patterns and Solutions

Pattern: Gradual Backlog Growth

Symptoms:

Queue depth slowly increases over days/weeks
No sudden spikes

Likely causes:

Insufficient workers for growing load
Processing time slowly increasing
Worker efficiency decreasing

Solutions:

Add more workers
Profile and optimize slow operations
Review recent code changes
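Gradual growth is easy to miss on a dashboard because no single day looks unusual. One way to make it visible is to fit a straight line through daily queue-depth samples and look at the slope. The sketch below uses an ordinary least-squares fit; the function name and sample shape are hypothetical.

```typescript
interface DepthSample {
  day: number; // days since the start of the observation period
  depth: number; // queue depth sampled at the same time each day
}

// Least-squares slope of queue depth over time, in jobs per day.
function backlogSlopePerDay(samples: DepthSample[]): number {
  const n = samples.length;
  if (n < 2) return 0;
  const meanX = samples.reduce((acc, s) => acc + s.day, 0) / n;
  const meanY = samples.reduce((acc, s) => acc + s.depth, 0) / n;
  let num = 0;
  let den = 0;
  for (const s of samples) {
    num += (s.day - meanX) * (s.depth - meanY);
    den += (s.day - meanX) ** 2;
  }
  return den === 0 ? 0 : num / den;
}

// Depth creeps up with some noise but no sudden spike.
const slope = backlogSlopePerDay([
  { day: 0, depth: 20 },
  { day: 1, depth: 32 },
  { day: 2, depth: 38 },
  { day: 3, depth: 52 },
  { day: 4, depth: 58 },
]);
// slope is about 9.6 jobs per day
```

A consistently positive slope over several weeks is a stronger signal than any single high reading.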

Pattern: Periodic Failure Spikes

Symptoms:

Failures spike at regular intervals
Spikes correlate with specific times

Likely causes:

Scheduled jobs overwhelming resources
External service maintenance windows
Traffic patterns

Solutions:

Stagger scheduled jobs
Coordinate with external service schedules
Auto-scale workers during peak times
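To confirm that spikes correlate with specific times, it can help to group failure timestamps by hour of day across several days. The following sketch does this in UTC; the function names are hypothetical, and in BullMQ the timestamps could be taken from the `finishedOn` field of failed jobs.

```typescript
// Count failures per UTC hour of day (index 0 to 23).
function failuresByHourOfDay(failedAtMs: number[]): number[] {
  const counts = new Array<number>(24).fill(0);
  for (const t of failedAtMs) {
    counts[new Date(t).getUTCHours()] += 1;
  }
  return counts;
}

// Hour of day with the most failures (first one on ties).
function peakHour(counts: number[]): number {
  return counts.indexOf(Math.max(...counts));
}

// Failures shortly after 02:00 UTC on three consecutive days,
// plus one unrelated failure in the afternoon.
const failedAt = [
  Date.UTC(2024, 0, 1, 2, 5),
  Date.UTC(2024, 0, 2, 2, 7),
  Date.UTC(2024, 0, 3, 2, 4),
  Date.UTC(2024, 0, 3, 14, 30),
];
const byHour = failuresByHourOfDay(failedAt);
const worstHour = peakHour(byHour); // 2
```

If one hour dominates, compare it against cron schedules and the maintenance windows of external services.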

Pattern: Sudden Processing Time Increase

Symptoms:

Processing time jumps suddenly
No corresponding code changes

Likely causes:

External service degradation
Database performance issues
Network problems

Solutions:

Check external service status
Review database queries and indexes
Check network connectivity
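A sudden jump, as opposed to gradual drift, can be detected by comparing the median processing time of the most recent jobs with the median of the jobs just before them. The sketch below is one simple way to do that; the window size and function names are illustrative assumptions.

```typescript
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Ratio of the median of the last `windowSize` durations to the median
// of the `windowSize` durations before them. Values well above 1
// suggest a step change rather than noise.
function stepChangeRatio(durationsMs: number[], windowSize: number): number {
  if (windowSize < 1 || durationsMs.length < windowSize * 2) {
    throw new Error("need at least two full windows of samples");
  }
  const recent = durationsMs.slice(-windowSize);
  const previous = durationsMs.slice(-2 * windowSize, -windowSize);
  return median(recent) / median(previous);
}

// Five jobs at about 200 ms followed by five at about 950 ms.
const ratio = stepChangeRatio(
  [200, 210, 190, 205, 195, 900, 950, 1000, 920, 980],
  5,
);
// ratio is 4.75 (median 950 ms versus median 200 ms)
```

Medians are used instead of averages so that a single slow job does not register as a step change.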

Pattern: Intermittent Worker Disconnects

Symptoms:

Workers frequently disconnect/reconnect
“Missing workers” alerts that auto-resolve

Likely causes:

Memory issues causing worker crashes
Container orchestration issues
Network instability

Solutions:

Review worker memory usage
Check container health
Investigate and stabilize the network connection between workers and Redis
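Alerts that fire and auto-resolve are easier to reason about if you distinguish flapping from a real outage. As a rough sketch, assuming you sample the number of connected workers at a regular interval (with BullMQ, `Queue.getWorkers()` could be one source), flapping shows up as several separate drops that each recover. The function names and the threshold of three drops are hypothetical.

```typescript
// Number of times the connected-worker count fell between consecutive samples.
function countDrops(workerCounts: number[]): number {
  let drops = 0;
  for (let i = 1; i < workerCounts.length; i++) {
    if (workerCounts[i] < workerCounts[i - 1]) drops += 1;
  }
  return drops;
}

// Flapping: several separate drops within the window, with the worker
// count back at or above its starting level by the end of the window.
function isFlapping(workerCounts: number[], minDrops: number = 3): boolean {
  if (workerCounts.length === 0) return false;
  const recovered = workerCounts[workerCounts.length - 1] >= workerCounts[0];
  return recovered && countDrops(workerCounts) >= minDrops;
}

// Workers repeatedly drop out and come back.
const flapping = isFlapping([3, 3, 2, 3, 1, 3, 3, 2, 3]); // true
// Workers go down once and stay down: an outage, not flapping.
const outage = isFlapping([3, 0, 0, 0]); // false
```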

Observability Checklist

Use this checklist to ensure comprehensive monitoring:

Metrics Collection

Throughput (completed/failed per hour)
Processing time (avg, p95, p99)
Queue depth (waiting, active, delayed)
Worker count and status
Failure rate percentage

Alerting

Documentation

Baseline metrics documented
Runbooks for common issues
On-call rotation defined
Incident response process

Regular Reviews

Daily metric check
Weekly trend review
Monthly threshold adjustment
Quarterly capacity planning

Next Steps

Setting Up Alerts

Configure alerts for your queues.

Dashboard

Explore the dashboard metrics.

Troubleshooting

Solve common queue issues.

Job Management

Investigate individual jobs.
