Dead Letter Queue Pattern with Broadway and RabbitMQ
Building resilient message processing systems requires handling failures gracefully. When Broadway processes messages through its pipelines, transient failures are inevitable: network issues, temporary service outages, and data inconsistencies all happen in distributed systems.
The Dead Letter Queue (DLQ) pattern provides a robust solution for handling failed messages at the infrastructure level. By leveraging RabbitMQ's native capabilities, you can implement automatic retries, exponential backoff, and failure isolation without cluttering your application code.
Understanding the Pattern
Broadway excels at processing messages through concurrent pipelines, but it doesn't provide built-in retry scheduling with delays or backoff. The DLQ pattern fills this gap by implementing retry logic at the RabbitMQ infrastructure level using exchange and queue configurations.
The architecture separates concerns cleanly:
- Broadway handles message acknowledgment and processing
- RabbitMQ manages retry timing and routing
- Your Pipeline implements business logic and retry decisions
Queue Architecture
The system creates three specialized queues per processing step:
1. Worker Queue
The primary queue where Broadway consumers receive messages for processing. Normal message flow starts here. Successful processing completes the cycle, while failures trigger the retry mechanism.
2. Retry Queue
Failed messages land here with a TTL (Time To Live) configuration. RabbitMQ automatically moves expired messages back to the worker queue using Dead Letter Exchange (DLX) routing. This creates time-delayed retries without custom timers or scheduled jobs.
3. Dead Queue
Permanently failed messages accumulate here after exhausting retry attempts. This queue enables manual inspection, debugging, and alternative handling strategies without blocking the main processing pipeline.
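For example, a step named "my_step" would own three queues following this convention (the names are illustrative and match the implementation guide below):

# Illustrative queue names for a processing step called "my_step"
worker_queue = "my_step.worker"  # primary queue Broadway consumes from
retry_queue  = "my_step.retry"   # holds failures until their TTL expires
dead_queue   = "my_step.dead"    # terminal parking for exhausted retries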
How It Works
Queue Setup
When a producer starts, it declares the complete queue topology:
# Declare retry queue with dead-letter exchange configuration
AMQP.Queue.declare(channel, "my_queue.retry",
  durable: true,
  arguments: [
    {"x-dead-letter-exchange", :longstr, "my_exchange.worker"},
    {"x-dead-letter-routing-key", :longstr, "my_queue.worker"},
    {"x-queue-version", :long, 2}
  ]
)
The retry queue's x-dead-letter-exchange argument tells RabbitMQ where to route expired messages. When a message's TTL expires, RabbitMQ automatically publishes it back to the worker queue.
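Publishing into the retry queue with a per-message expiration then produces the delay. A minimal sketch using the amqp library, publishing through the default exchange so no extra binding is required (the payload is a placeholder, and the expiration must be given as a string of milliseconds):

# Illustrative: after 30 seconds RabbitMQ dead-letters this message back to
# the worker queue via the DLX configured on "my_queue.retry" above.
payload = Jason.encode!(%{"id" => 123, "retry_count" => 1})

AMQP.Basic.publish(channel, "", "my_queue.retry", payload,
  persistent: true,
  expiration: "30000"
)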
Normal Processing Flow
Messages flow through the happy path:
Message → Worker Queue → Broadway Pipeline → Success → ACK
Broadway acknowledges successful messages, and they're removed from the queue permanently.
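In Broadway terms, the happy path is a handle_message/3 callback that returns the message unmarked; Broadway then acknowledges it. A minimal sketch, assuming JSON payloads and a placeholder process/1 function (decoding the data up front is what lets handle_failed/2 in Step 2 read the retry counter later):

# Successful processing: returning the message lets Broadway ACK it.
# Marking it failed routes it to handle_failed/2 instead.
def handle_message(_processor, message, _context) do
  decoded = Jason.decode!(message.data)
  message = Broadway.Message.put_data(message, decoded)

  case process(decoded) do
    :ok -> message
    {:error, reason} -> Broadway.Message.failed(message, reason)
  end
end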
Failure and Retry Flow
When processing fails, Broadway's handle_failed/2 callback routes messages to the retry queue:
Message → Worker Queue → Broadway Pipeline → Failure
↓
handle_failed/2 callback publishes to Retry Queue (with expiration)
↓
After TTL expires → RabbitMQ DLX routes back to Worker Queue
↓
Retry processing attempt
The retry queue acts as a holding area. RabbitMQ's per-message TTL provides the time delay automatically, and increasing the TTL on each attempt yields exponential backoff without complex scheduling logic.
Permanent Failure
After exhausting retries, messages move to the dead queue. This prevents infinite retry loops and provides visibility into systemic issues requiring manual intervention.
Implementation Guide
Step 1: Configure Producer with Queue Declaration
The producer declares all three queues with proper configuration:
defmodule AnswerProcessing.MyStep.Producer do
  use Broadway.Producer

  def start_link(opts) do
    connection = Keyword.fetch!(opts, :connection)
    queue_name = "my_step"

    # Open a channel on the shared connection for topology setup
    {:ok, channel} = AMQP.Channel.open(connection)

    # Declare exchanges
    AMQP.Exchange.declare(channel, "#{queue_name}.worker", :direct, durable: true)
    AMQP.Exchange.declare(channel, "#{queue_name}.failed", :direct, durable: true)

    # Declare and bind worker queue
    AMQP.Queue.declare(channel, "#{queue_name}.worker", durable: true)
    AMQP.Queue.bind(channel, "#{queue_name}.worker", "#{queue_name}.worker", routing_key: "#{queue_name}.worker")

    # Declare retry queue with DLX pointing back at the worker exchange
    AMQP.Queue.declare(channel, "#{queue_name}.retry",
      durable: true,
      arguments: [
        {"x-dead-letter-exchange", :longstr, "#{queue_name}.worker"},
        {"x-dead-letter-routing-key", :longstr, "#{queue_name}.worker"}
      ]
    )

    # Declare dead queue
    AMQP.Queue.declare(channel, "#{queue_name}.dead", durable: true)

    # Bind retry and dead queues to the failed exchange
    # (assumes failure publishes are routed through it)
    AMQP.Queue.bind(channel, "#{queue_name}.retry", "#{queue_name}.failed", routing_key: "#{queue_name}.retry")
    AMQP.Queue.bind(channel, "#{queue_name}.dead", "#{queue_name}.failed", routing_key: "#{queue_name}.dead")

    # ... producer implementation
  end

  # The message comes first so callers can pipe into publish/3
  def publish(message, queue_type, opts \\ []) do
    case queue_type do
      :worker -> publish_to_worker(message, opts)
      :retry -> publish_to_retry(message, opts)
      :dead -> publish_to_dead(message, opts)
    end
  end
end
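The publish_to_worker/2, publish_to_retry/2, and publish_to_dead/2 helpers are elided above. One possible shape of publish_to_retry/2, assuming the producer keeps its channel somewhere retrievable (the fetch_channel/0 helper below is hypothetical):

# Hypothetical: publish through the failed exchange with a per-message TTL so
# RabbitMQ dead-letters the message back to the worker queue when it expires.
defp publish_to_retry(payload, opts) do
  channel = fetch_channel()
  expiration = Keyword.fetch!(opts, :expiration)

  AMQP.Basic.publish(channel, "my_step.failed", "my_step.retry", payload,
    persistent: true,
    priority: Keyword.get(opts, :priority) || 0,
    expiration: Integer.to_string(expiration)
  )
end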
Step 2: Implement Retry Logic with Exponential Backoff
The pipeline's handle_failed/2 callback implements the retry strategy:
defmodule AnswerProcessing.MyStep.Pipeline do
  use Broadway

  alias AnswerProcessing.MyStep

  @max_retries 3

  def handle_failed(messages, _context) do
    Enum.map(messages, fn message ->
      retry_count = get_retry_count(message)

      if retry_count < @max_retries do
        # Calculate exponential backoff: 30s, 60s, 120s (in milliseconds)
        expiration = :timer.seconds(30 * Integer.pow(2, retry_count))

        # Publish a copy to the retry queue
        message.data
        |> increment_retry_count()
        |> Jason.encode!()
        |> MyStep.Producer.publish(:retry,
          expiration: expiration,
          priority: message.metadata.priority
        )
      else
        # Max retries exhausted, send to dead queue
        message.data
        |> Jason.encode!()
        |> MyStep.Producer.publish(:dead, priority: 0)
      end

      # Return the message so Broadway acknowledges the original delivery
      message
    end)
  end

  defp get_retry_count(%{data: %{"retry_count" => count}}), do: count
  defp get_retry_count(_message), do: 0

  defp increment_retry_count(%{"retry_count" => count} = data),
    do: %{data | "retry_count" => count + 1}

  defp increment_retry_count(data),
    do: Map.put(data, "retry_count", 1)
end
This implementation provides:
- Exponential backoff - Each retry waits progressively longer (30s, 60s, 120s)
- Retry limits - Prevents infinite loops with max retry count
- Metadata tracking - Embeds retry count in message payload
- Priority preservation - Maintains message priority through retries
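One more detail matters for correctness: the original delivery must be acknowledged even when processing fails, because handle_failed/2 has already republished a copy to the retry or dead queue. With BroadwayRabbitMQ this is a producer option. A sketch of how the pipeline's start_link/1 might be configured (connection handling and concurrency values are illustrative):

def start_link(opts) do
  Broadway.start_link(__MODULE__,
    name: __MODULE__,
    producer: [
      module: {BroadwayRabbitMQ.Producer,
        queue: "my_step.worker",
        connection: Keyword.fetch!(opts, :connection),
        metadata: [:priority],  # exposes priority to handle_failed/2
        on_failure: :ack        # ACK failures; retries flow through the retry queue
      },
      concurrency: 2
    ],
    processors: [default: [concurrency: 10]]
  )
end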
Step 3: Configure Complete Step
Wire everything together in your supervision tree:
defmodule AnswerProcessing.MyStep do
  use AnswerProcessing.Step,
    implementation: AnswerProcessing.MyStepImpl,
    producer: [queue_name: "my_step"],
    pipeline: [queue: "my_step.worker"]
end

# In supervisor
children = [
  {AnswerProcessing.MyStep.Producer, connection: rabbitmq_connection},
  {AnswerProcessing.MyStep.Pipeline,
   connection: rabbitmq_connection,
   next_step: AnswerProcessing.NextStep}
]

Supervisor.start_link(children, strategy: :one_for_one)
Key Benefits
Automatic Retry Without Custom Timers
RabbitMQ's TTL + DLX combination provides time-delayed reprocessing automatically. No GenServer timers, no external schedulers, no manual tracking. The message broker handles everything.
Failure Isolation
Dead queue separates permanent failures from transient ones. Your monitoring system can alert on dead queue growth, while retry queue fluctuations are normal and expected.
No Message Loss
Every message exists in exactly one queue at all times. Worker, retry, or dead. Nothing falls through the cracks. Durable queues survive broker restarts.
Independent Queue Monitoring
Each queue provides distinct metrics. Worker queue depth indicates processing capacity. Retry queue shows transient failure rate. Dead queue reveals systemic problems requiring attention.
Stateless Pipeline Logic
Broadway callbacks remain stateless. No retry state tracking, no timer management, no complex error handling logic. The infrastructure handles retry mechanics.
Broadway's Role
Broadway focuses on what it does best:
- Message acknowledgment - Acknowledges both successful and failed messages
- Stateless processing - Runs messages through stateless handle_message/3 and handle_failed/2 callbacks
- Immediate acknowledgment of failures - Prevents RabbitMQ redelivery by acknowledging failed messages once they've been routed to the retry or dead queue
Broadway does NOT handle:
- Retry logic - Delegated to producer/pipeline routing decisions
- TTL management - Handled by RabbitMQ queue configuration
- DLQ routing - Implemented at infrastructure level
This separation of concerns keeps the architecture clean and maintainable.
Production Considerations
Monitoring and Alerting
Monitor queue depths independently:
- Worker queue - Alert on sustained high depth (processing bottleneck)
- Retry queue - Track for transient failure patterns
- Dead queue - Alert immediately on growth (systemic issues)
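A quick way to sample these depths is a passive queue declaration, which reports the current message count. A minimal sketch using the amqp library (queue names follow the convention above; shipping the numbers to your metrics system is left out):

# Poll the depth of each queue for a step. AMQP.Queue.status/2 performs a
# passive declare and returns message and consumer counts.
def queue_depths(channel, step \\ "my_step") do
  for suffix <- ~w(worker retry dead), into: %{} do
    queue = "#{step}.#{suffix}"
    {:ok, %{message_count: depth}} = AMQP.Queue.status(channel, queue)
    {queue, depth}
  end
end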
Message Expiration Tuning
Adjust exponential backoff parameters based on your failure characteristics. Network issues might need shorter delays, while rate-limited APIs benefit from longer waits.
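One way to keep these parameters tunable is to read the base delay and cap from application config instead of hard-coding them (the config keys here are illustrative):

# Hypothetical configurable backoff: the base delay doubles per attempt, capped.
# config :my_app, retry_base_ms: 30_000, retry_max_ms: 600_000
def backoff_ms(retry_count) do
  base = Application.get_env(:my_app, :retry_base_ms, 30_000)
  cap = Application.get_env(:my_app, :retry_max_ms, 600_000)

  min(base * Integer.pow(2, retry_count), cap)
end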
Dead Queue Processing
Implement tooling to inspect and reprocess dead queue messages. Consider administrative endpoints for manual retry or batch reprocessing after fixing underlying issues.
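Such tooling can be as simple as draining the dead queue with basic.get and republishing each payload to the worker queue. A hedged sketch (error handling and batching omitted; publish/3 is the producer helper from Step 1):

# Drain the dead queue and push every message back through the worker queue.
def reprocess_dead_queue(channel, queue \\ "my_step.dead") do
  case AMQP.Basic.get(channel, queue) do
    {:ok, payload, meta} ->
      # Reset the retry counter so the message gets a fresh set of attempts
      payload
      |> Jason.decode!()
      |> Map.put("retry_count", 0)
      |> Jason.encode!()
      |> AnswerProcessing.MyStep.Producer.publish(:worker)

      AMQP.Basic.ack(channel, meta.delivery_tag)
      reprocess_dead_queue(channel, queue)

    {:empty, _meta} ->
      :ok
  end
end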
Retry Count Persistence
Store retry count in message payload, not headers. This ensures retry tracking survives message transformations and provides debugging visibility.
Key Takeaways
- Infrastructure-level retries using RabbitMQ TTL and Dead Letter Exchange
- Three-queue architecture separating worker, retry, and dead messages
- Exponential backoff implemented through increasing TTL values
- Stateless Broadway pipelines with routing decisions in handle_failed/2
- No message loss with durable queues and explicit acknowledgment
- Independent monitoring for worker, retry, and dead queue health
- Clean separation between Broadway processing and retry infrastructure
The Dead Letter Queue pattern with Broadway and RabbitMQ provides production-grade message processing resilience. By leveraging infrastructure capabilities, you build robust systems without complex application logic.