
Implementing Async Batching for High-Throughput Pipelines

This page answers one precise question: once you have a memory-bounded producer/consumer batching engine working, how do you harden it so it survives a multi-hour, billion-row reconciliation run against live infrastructure? The base pattern — a bounded queue between an async reader and a thread-pool digest loop — is documented in the parent async batching for large datasets reference, and this guide assumes you already have it in place. What it adds is the operational scaffolding a naive engine lacks: explicit connection-pool lifecycle management, a circuit breaker around the downstream validation sink, dead-letter routing for poison batches, distributed-tracing instrumentation, and dynamic batch resizing that shrinks work units when resident memory climbs. These are the pieces that separate a demo that runs on a laptop from a pipeline that runs unattended overnight.

Problem Framing

You are migrating 4.2 billion rows from a PostgreSQL fleet into a document store, and a reconciliation job must hash every source row and compare its digest against the target during a 6-hour cutover window. The reader is fast, the column-level checksum generation loop is CPU-bound, and the digests flow into a validation service that occasionally stalls under its own load. Three things will break a naive engine over a run that long. First, the connection pool leaks or saturates because acquisition is not scoped to a lifecycle, so hour four dies on TooManyConnectionsError. Second, the downstream validation sink degrades, every batch blocks on it, the queue backs up, and resident memory climbs until the OOM killer intervenes. Third, a single malformed partition — a row that violates the contract set by data equivalence modeling and raises inside the digest function — kills the whole job instead of being quarantined. The implementation below closes all three gaps.
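
To make the cutover window concrete, the arithmetic below works out the sustained throughput those figures imply. The batch size of 5,000 rows is an assumption borrowed from the default used later on this page; the row count and window come from the scenario above. Treat it as a back-of-envelope sketch, not a capacity plan.

```python
# Back-of-envelope throughput for the scenario above.
TOTAL_ROWS = 4_200_000_000      # rows to reconcile
WINDOW_SECONDS = 6 * 60 * 60    # 6-hour cutover window
BATCH_ROWS = 5_000              # assumed: the default batch size used later

rows_per_second = TOTAL_ROWS / WINDOW_SECONDS
batches_per_second = rows_per_second / BATCH_ROWS
total_batches = TOTAL_ROWS // BATCH_ROWS

print(f"{rows_per_second:,.0f} rows/s")        # 194,444 rows/s
print(f"{batches_per_second:.1f} batches/s")   # 38.9 batches/s
print(f"{total_batches:,} batches in total")   # 840,000 batches in total
```

Any stall in the sink therefore has to be absorbed or quarantined quickly: at roughly 39 batches per second, a one-minute outage represents more than two thousand batches of work.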

Implementation

The engine below extends the base producer/consumer with a frozen config, structured logging, a lightweight circuit breaker, a memory-pressure sampler that resizes batches, dead-letter routing, and a tracing hook on every batch. It targets asyncpg, but the flow-control machinery is driver-agnostic. Every batch carries a batch_id and partition_key so downstream UPSERT/ON CONFLICT DO NOTHING writes stay idempotent across retries. The ids are only reproducible across restarts if the query has a stable ORDER BY and the batch boundaries do not move, which dynamic resizing can cause.

python

from __future__ import annotations

import asyncio
import hashlib
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from contextlib import asynccontextmanager
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, AsyncIterator, Callable

import asyncpg  # pip install asyncpg
import resource  # POSIX RSS sampling

logger = logging.getLogger("recon.async_batch")


@dataclass(frozen=True)
class BatchConfig:
    """All flow-control knobs in one immutable place."""
    query: str
    partition_key: str
    dsn: str
    pool_min: int = 4
    pool_max: int = 16
    batch_rows: int = 5_000
    min_batch_rows: int = 500          # floor when resizing under pressure
    max_queue_size: int = 8            # resident rows ≤ max_queue_size × batch_rows
    worker_concurrency: int = 8
    hash_algo: str = "sha256"
    max_retries: int = 3
    backoff_base: float = 0.5          # seconds; doubled each attempt
    breaker_threshold: int = 5         # consecutive sink failures → open
    breaker_reset_after: float = 60.0  # seconds the circuit stays open
    rss_soft_limit_mb: int = 1_200     # shrink batches above this


class BatchStatus(str, Enum):
    PENDING = "pending"
    COMMITTED = "committed"
    DEAD_LETTERED = "dead_lettered"


@dataclass
class ReconciliationBatch:
    batch_id: int
    partition_key: str
    rows: list[dict[str, Any]]
    checksums: list[str] = field(default_factory=list)
    status: BatchStatus = BatchStatus.PENDING


class CircuitOpenError(RuntimeError):
    """Raised when the sink circuit is open and calls are short-circuited."""


class CircuitBreaker:
    """Trips open after N consecutive failures; half-opens after a cooldown."""

    def __init__(self, threshold: int, reset_after: float) -> None:
        self._threshold = threshold
        self._reset_after = reset_after
        self._failures = 0
        self._opened_at: float | None = None

    def _half_open_ready(self) -> bool:
        return (
            self._opened_at is not None
            and (time.monotonic() - self._opened_at) >= self._reset_after
        )

    def before_call(self) -> None:
        if self._opened_at is not None and not self._half_open_ready():
            raise CircuitOpenError("validation sink circuit is open")

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = time.monotonic()
            logger.error("circuit opened after %d consecutive failures", self._failures)


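def _current_rss_mb() -> float:
    # ru_maxrss is the *peak* RSS (a high-water mark), in kilobytes on Linux and
    # bytes on macOS; assume Linux here. Because the peak never falls, once the
    # soft limit is crossed every later check also reports pressure.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024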


def compute_checksums(rows: list[dict[str, Any]], algo: str) -> list[str]:
    """Deterministic per-row digest. Runs in a worker thread (hashlib frees the GIL)."""
    digests: list[str] = []
    for row in rows:
        canonical = "\x1f".join(f"{k}={row[k]!r}" for k in sorted(row))
        digests.append(hashlib.new(algo, canonical.encode("utf-8")).hexdigest())
    return digests


@asynccontextmanager
async def managed_pool(cfg: BatchConfig) -> AsyncIterator[asyncpg.Pool]:
    """Lifecycle-scoped pool: guaranteed close even on cancellation."""
    pool = await asyncpg.create_pool(
        cfg.dsn, min_size=cfg.pool_min, max_size=cfg.pool_max
    )
    try:
        logger.info("pool up (min=%d max=%d)", cfg.pool_min, cfg.pool_max)
        yield pool
    finally:
        await pool.close()
        logger.info("pool closed")


async def producer(
    cfg: BatchConfig, pool: asyncpg.Pool, queue: asyncio.Queue
) -> None:
    """Stream rows, resize batches under memory pressure, enqueue with backpressure."""
    batch_id = 0
    buf: list[dict[str, Any]] = []
    target = cfg.batch_rows
    async with pool.acquire() as conn:
        async with conn.transaction():
            async for record in conn.cursor(cfg.query, prefetch=cfg.batch_rows):
                buf.append(dict(record))
                if len(buf) >= target:
                    if _current_rss_mb() > cfg.rss_soft_limit_mb:
                        target = max(cfg.min_batch_rows, target // 2)
                        logger.warning("rss pressure: batch_rows → %d", target)
                    await queue.put(
                        ReconciliationBatch(batch_id, str(record[cfg.partition_key]), buf)
                    )  # blocks when queue full → backpressure
                    batch_id += 1
                    buf = []
    if buf:
        await queue.put(ReconciliationBatch(batch_id, "tail", buf))
    for _ in range(cfg.worker_concurrency):
        await queue.put(None)  # one poison pill per consumer


async def commit_batch(batch: ReconciliationBatch, breaker: CircuitBreaker) -> None:
    """Guarded write to the downstream validation sink."""
    breaker.before_call()
    try:
        # Replace with a real idempotent UPSERT keyed on (batch_id, partition_key).
        await asyncio.sleep(0)  # placeholder for network I/O
        breaker.record_success()
    except Exception:
        breaker.record_failure()
        raise


async def consumer(
    cfg: BatchConfig,
    queue: asyncio.Queue,
    executor: ThreadPoolExecutor,
    breaker: CircuitBreaker,
    dead_letter: Callable[[ReconciliationBatch], Any],
) -> None:
    loop = asyncio.get_running_loop()
    while True:
        batch = await queue.get()
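        try:
            if batch is None:
                return
            span = f"batch:{batch.batch_id}:{batch.partition_key}"  # tracing hook
            started = loop.time()
            try:
                batch.checksums = await loop.run_in_executor(
                    executor, compute_checksums, batch.rows, cfg.hash_algo
                )
            except Exception as exc:  # noqa: BLE001
                # A digest failure is a data problem: quarantine the batch and
                # keep the worker alive. It never touches the circuit breaker.
                batch.status = BatchStatus.DEAD_LETTERED
                dead_letter(batch)
                logger.error("%s digest failed → dead-letter: %s", span, exc)
                continue
            await _commit_with_retry(cfg, batch, breaker, dead_letter)
            logger.info("%s done rows=%d in %.3fs status=%s",
                        span, len(batch.rows), loop.time() - started, batch.status)
        finally:
            queue.task_done()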


async def _commit_with_retry(
    cfg: BatchConfig,
    batch: ReconciliationBatch,
    breaker: CircuitBreaker,
    dead_letter: Callable[[ReconciliationBatch], Any],
) -> None:
    for attempt in range(1, cfg.max_retries + 1):
        try:
            await commit_batch(batch, breaker)
            batch.status = BatchStatus.COMMITTED
            return
        except CircuitOpenError:
            await asyncio.sleep(cfg.backoff_base * (2 ** attempt))
        except Exception as exc:  # noqa: BLE001 — bounded, logged, then dead-lettered
            logger.warning("batch %d attempt %d failed: %s", batch.batch_id, attempt, exc)
            await asyncio.sleep(cfg.backoff_base * (2 ** attempt))
    batch.status = BatchStatus.DEAD_LETTERED
    dead_letter(batch)
    logger.error("batch %d exhausted retries → dead-letter", batch.batch_id)


async def run_pipeline(cfg: BatchConfig, dead_letter: Callable[[ReconciliationBatch], Any]) -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=cfg.max_queue_size)
    breaker = CircuitBreaker(cfg.breaker_threshold, cfg.breaker_reset_after)
    executor = ThreadPoolExecutor(max_workers=cfg.worker_concurrency)
    async with managed_pool(cfg) as pool:
        prod = asyncio.create_task(producer(cfg, pool, queue))
        workers = [
            asyncio.create_task(consumer(cfg, queue, executor, breaker, dead_letter))
            for _ in range(cfg.worker_concurrency)
        ]
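        try:
            # Await producer and consumers together: if a consumer dies, the
            # error surfaces at once instead of the producer blocking forever
            # on a full queue that nobody drains.
            await asyncio.gather(prod, *workers)
        finally:
            for task in (prod, *workers):
                task.cancel()  # no-op for tasks that already finished
            await asyncio.gather(prod, *workers, return_exceptions=True)
            executor.shutdown(wait=True)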
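
The placeholder in commit_batch stands in for an idempotent write keyed on (batch_id, partition_key). A minimal sketch of that property follows, using a hypothetical in-memory sink (InMemorySink and its upsert method are illustrative names, not a real client) so that replaying a batch after a timeout can be seen to leave the stored state unchanged.

```python
from typing import Dict, List, Tuple

BatchKey = Tuple[int, str]  # (batch_id, partition_key)


class InMemorySink:
    """Hypothetical stand-in for the validation sink; not a real client."""

    def __init__(self) -> None:
        self._store: Dict[BatchKey, List[str]] = {}

    def upsert(self, batch_id: int, partition_key: str, checksums: List[str]) -> bool:
        """Insert once per key; a replay is a no-op, like ON CONFLICT DO NOTHING."""
        key = (batch_id, partition_key)
        if key in self._store:
            return False          # already committed: ignore the retry
        self._store[key] = list(checksums)
        return True

    def __len__(self) -> int:
        return len(self._store)


sink = InMemorySink()
first = sink.upsert(0, "1042", ["ab12", "cd34"])    # initial commit
replay = sink.upsert(0, "1042", ["ab12", "cd34"])   # retry after a timeout
```

Against a real database the same effect would come from a unique constraint on the key pair, so a retried batch cannot be double-counted.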

Key Implementation Notes

Connection-pool lifecycle via asynccontextmanager. Wrapping create_pool/close in managed_pool guarantees the pool is torn down even when the run is cancelled mid-batch. This is what prevents the slow leak that surfaces as TooManyConnectionsError hours into a run. Acquisition stays inside async with pool.acquire(), so the producer's connection goes back to the pool on every exit path. That one connection is held for the life of the cursor, including while queue.put blocks on a full queue, so count it against pool_max.
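Circuit breaker before backpressure, not instead of it. The breaker protects the pipeline from a degraded sink; the bounded queue protects it from a fast producer. They solve different failures. When the sink starts failing, the breaker opens and retries short-circuit immediately instead of piling latency onto the event loop. Critically, the consumer keeps draining the queue rather than letting it back up into an OOM. The cost is that, with the defaults, the three backoff sleeps total about 7 seconds against a 60-second breaker_reset_after, so most batches that arrive while the circuit is open exhaust their retries and are dead-lettered.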
Dead-letter routing is mandatory, not optional. A batch that exhausts max_retries is marked DEAD_LETTERED and handed to a caller-supplied sink. Silently dropping it corrupts the reconciliation result: every source row must end in exactly one terminal state. This mirrors the recovery discipline in the fallback chain implementation reference.
Dynamic batch resizing samples RSS, not the queue. Halving batch_rows when resident memory crosses rss_soft_limit_mb gives the engine a second memory ceiling below the hard max_queue_size × batch_rows bound, useful when serialized row width varies wildly across partitions. The floor at min_batch_rows stops it from collapsing to pathologically tiny work units.
Deterministic serialization drives the digest. sorted(row) plus a non-printable \x1f field separator make the canonical byte string stable regardless of column order or embedded delimiters — the same determinism the comparators in the downstream structural diffing and sync engines depend on when they walk the two digest streams.
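Threads over processes for the digest loop. hashlib releases the GIL while it hashes buffers larger than 2,047 bytes, so ThreadPoolExecutor can overlap the hash work for wide rows without pickling rows across a process boundary. The canonical-string building in compute_checksums is plain Python and still holds the GIL, as does the hashing of short rows. Reach for a process pool only if profiling shows that the Python-level serialization, rather than the hashing, has become the bottleneck.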
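
One caveat on the resizing note: ru_maxrss reports the peak resident set size of the process, so it cannot fall again once the soft limit has been crossed. The sketch below samples the current value instead. It assumes a Linux host with /proc mounted; the function name current_rss_mb is illustrative, and the fallback path still returns the peak value in Linux units.

```python
import os
import resource


def current_rss_mb() -> float:
    """Current resident set size in MiB on Linux, falling back to peak RSS."""
    try:
        # /proc/self/statm: the second field is the number of resident pages.
        with open("/proc/self/statm", encoding="ascii") as fh:
            resident_pages = int(fh.read().split()[1])
        return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 * 1024)
    except (OSError, IndexError, ValueError):
        # /proc unavailable: use the peak RSS (KiB on Linux) as an upper bound.
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


print(f"resident now: {current_rss_mb():.1f} MiB")
```

With a current-value sampler, a batch target that was halved under pressure could in principle be allowed to recover once memory drops, although the engine above only ever shrinks it.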

Verification

Assert the two invariants that matter: every source row lands in exactly one terminal batch, and digests are reproducible. Run against a fixture whose row count you know.

python

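import asyncio
from typing import Callable

# Assumes the engine above is saved as pipeline.py, the module name the
# smoke check below also uses.
import pipeline
from pipeline import BatchConfig, ReconciliationBatch, compute_checksums, run_pipeline

DeadLetterSink = Callable[[ReconciliationBatch], None]

def make_dead_letter_sink() -> tuple[list[ReconciliationBatch], DeadLetterSink]:
    captured: list[ReconciliationBatch] = []
    return captured, captured.append

async def test_no_row_loss() -> None:
    cfg = BatchConfig(
        query="SELECT id, payload FROM fixture ORDER BY id",
        partition_key="id",
        dsn="postgresql://localhost/recon_test",
        batch_rows=1_000,
    )
    dlq, sink = make_dead_letter_sink()
    committed: list[ReconciliationBatch] = []

    async def fake_commit(batch: ReconciliationBatch, breaker: pipeline.CircuitBreaker) -> None:
        committed.append(batch)  # record the batch instead of hitting a real sink

    # Patch commit_batch for the duration of the run, then restore it.
    original = pipeline.commit_batch
    pipeline.commit_batch = fake_commit
    try:
        await run_pipeline(cfg, sink)
    finally:
        pipeline.commit_batch = original

    total = sum(len(b.rows) for b in committed) + sum(len(b.rows) for b in dlq)
    expected = 250_000  # known fixture cardinality
    assert total == expected, f"row loss: {total} != {expected}"

# Reproducibility: identical rows must yield identical digests.
def test_digest_determinism() -> None:
    row = {"id": 7, "payload": "abc", "ts": "2026-07-04"}
    a = compute_checksums([row], "sha256")
    b = compute_checksums([dict(reversed(list(row.items())))], "sha256")
    assert a == b, "digest is not order-invariant"

if __name__ == "__main__":
    test_digest_determinism()
    asyncio.run(test_no_row_loss())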

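A quick CLI smoke check that the poison-pill shutdown drains cleanly, with no orphaned tasks left after the run. It assumes the engine is saved as pipeline.py and exposes a demo_cfg() helper that returns a BatchConfig; that helper is not shown above: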

bash

python -X dev -c "import asyncio, pipeline; asyncio.run(pipeline.run_pipeline(pipeline.demo_cfg(), print))" \
  && echo "clean shutdown"

Operational Considerations

Size the digest thread pool to physical cores, not to worker_concurrency. The async consumers can outnumber CPUs because most of their wall-clock time is spent awaiting the sink, but the digest executor should not oversubscribe cores, or context switching erodes hash throughput. (run_pipeline above passes worker_concurrency as max_workers for brevity; split the two settings if they need to differ.) Expose four telemetry signals: queue.qsize() (backpressure depth), the circuit breaker’s open/closed state, committed-versus-dead-lettered batch counts, and per-batch digest latency emitted from the tracing span. Alert when the dead-letter rate exceeds 0.5% of batches or the breaker stays open longer than one breaker_reset_after window; either usually means the downstream is degraded, not the pipeline. On storage footprint, the digest stream is the only durable output per source row, so keep it to a fixed-width hex digest and let the raw rows stay transient in the bounded queue; that is what holds resident memory flat across a 4-billion-row run. When the sink you are validating against is a document store, pair this engine with the parity checks in SQL to NoSQL sync validation so a divergence in the digest streams is attributed to the right layer.
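
The two alert conditions can be expressed as a small pure function over a telemetry snapshot. This is a sketch under assumptions: the Telemetry dataclass, its field names, and the alert labels are hypothetical, and how the snapshot is collected depends on your metrics stack. Only the 0.5% threshold and the breaker_reset_after window come from the text above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Telemetry:
    """Hypothetical snapshot of the four signals; field names are illustrative."""
    queue_depth: int
    breaker_open_seconds: float   # 0.0 while the circuit is closed
    committed_batches: int
    dead_lettered_batches: int


def alerts(t: Telemetry, breaker_reset_after: float = 60.0,
           max_dead_letter_rate: float = 0.005) -> list[str]:
    """Return the names of the alert conditions that currently hold."""
    fired: list[str] = []
    finished = t.committed_batches + t.dead_lettered_batches
    if finished and t.dead_lettered_batches / finished > max_dead_letter_rate:
        fired.append("dead_letter_rate")
    if t.breaker_open_seconds > breaker_reset_after:
        fired.append("breaker_stuck_open")
    return fired


healthy = Telemetry(queue_depth=3, breaker_open_seconds=0.0,
                    committed_batches=10_000, dead_lettered_batches=20)
degraded = Telemetry(queue_depth=8, breaker_open_seconds=95.0,
                     committed_batches=10_000, dead_lettered_batches=80)
```

Here healthy fires nothing (20 of 10,020 batches is about 0.2%), while degraded fires both conditions.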

Related

Async Batching for Large Datasets: the parent stage and the base bounded-queue engine this guide hardens for production.
Parallel row extraction techniques: the partitioned, lock-light reads that feed this engine’s source cursor.
Schema validation pre-checks: the gate that stops structural drift from poisoning the digest loop mid-run.
Fallback chain implementation: the broader degradation and recovery protocols this engine’s dead-letter path plugs into.
Threshold tuning for tolerance: how the comparators decide when a digest divergence is signal versus noise.

Frequently Asked Questions

Where should the circuit breaker sit — around the sink call or the whole batch?

Wrap only the downstream sink call, as commit_batch does. The breaker exists to protect the pipeline from a degraded validation service, so it must trip on sink failures specifically, not on digest errors, which are a data problem and belong on the dead-letter path. Tripping on both conflates a slow API with a poison batch and hides which one is actually failing.
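
The breaker's state transitions are easier to follow with a controllable clock. The sketch below is a trimmed copy of the breaker logic with one addition that is an assumption of this example, not part of the engine: an injectable clock parameter, so the cooldown can be simulated without sleeping.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, threshold: int, reset_after: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = threshold
        self._reset_after = reset_after
        self._clock = clock            # injectable for tests (assumption)
        self._failures = 0
        self._opened_at: Optional[float] = None

    def before_call(self) -> None:
        # Reject only while open and still inside the cooldown window.
        if (self._opened_at is not None
                and self._clock() - self._opened_at < self._reset_after):
            raise CircuitOpenError("validation sink circuit is open")

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()


now = [0.0]                            # fake clock, advanced by hand
breaker = CircuitBreaker(threshold=2, reset_after=60.0, clock=lambda: now[0])

breaker.record_failure()
breaker.record_failure()               # second consecutive sink failure opens it

try:
    breaker.before_call()
    short_circuited = False
except CircuitOpenError:
    short_circuited = True             # rejected without touching the sink

now[0] = 61.0                          # cooldown elapsed: half-open
breaker.before_call()                  # the trial call is allowed through
breaker.record_success()               # a success closes the circuit again
```

Only sink outcomes call record_failure or record_success here, which is the separation the answer above argues for: a digest error never moves the breaker.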

How does dynamic batch resizing interact with the bounded queue's memory guarantee?

The queue gives you a hard ceiling of max_queue_size × batch_rows queued rows; batches already pulled by consumers and the producer's partially filled buffer sit on top of that, so budget for them too. Resizing shrinks the producer's batch target at runtime, which lowers that ceiling further when RSS climbs. It is a softer, earlier safeguard layered under the hard bound. It never raises the batch size beyond the configured batch_rows, so the worst-case memory envelope you sized for is never exceeded.
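
Worked through with the default configuration, the numbers look like this. The in-flight term is an assumption of this sketch: it counts one batch held by each consumer plus the producer's partially filled buffer, which live outside the queue itself, and it ignores the driver's own prefetch buffer.

```python
MAX_QUEUE_SIZE = 8
BATCH_ROWS = 5_000
MIN_BATCH_ROWS = 500
WORKER_CONCURRENCY = 8

# Rows that can sit in the bounded queue at once.
queued_ceiling = MAX_QUEUE_SIZE * BATCH_ROWS

# Assumption: one batch per consumer plus the producer's partial buffer.
in_flight = (WORKER_CONCURRENCY + 1) * BATCH_ROWS
worst_case = queued_ceiling + in_flight


def resize_steps(start: int, floor: int) -> list:
    """Successive batch targets if every check sees memory pressure."""
    steps = [start]
    while steps[-1] > floor:
        steps.append(max(floor, steps[-1] // 2))
    return steps


steps = resize_steps(BATCH_ROWS, MIN_BATCH_ROWS)
floor_ceiling = MAX_QUEUE_SIZE * steps[-1]

print(queued_ceiling)   # 40000
print(worst_case)       # 85000
print(steps)            # [5000, 2500, 1250, 625, 500]
print(floor_ceiling)    # 4000
```

Four halvings take the queued ceiling from 40,000 rows down to 4,000, which is the headroom resizing buys when row width spikes.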

Why one poison pill per consumer instead of a single sentinel?

Each consumer coroutine returns as soon as it reads a None, so with multiple workers a single sentinel would stop just one of them and leave the rest awaiting an empty queue forever. Enqueuing exactly worker_concurrency pills guarantees every consumer receives its own stop signal and the asyncio.gather over the workers completes cleanly.
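
The difference is easy to reproduce in isolation. The following sketch uses a toy worker rather than the engine's consumer, and a short timeout (an arbitrary value chosen for the demonstration) to detect workers that never exit.

```python
import asyncio


async def worker(queue: asyncio.Queue, seen: list) -> None:
    while True:
        item = await queue.get()
        try:
            if item is None:          # poison pill: stops this worker only
                return
            seen.append(item)
        finally:
            queue.task_done()


async def run(workers: int, pills: int, timeout: float = 0.2) -> bool:
    """Return True if every worker exits before the timeout."""
    queue: asyncio.Queue = asyncio.Queue()
    seen: list = []
    tasks = [asyncio.create_task(worker(queue, seen)) for _ in range(workers)]
    for item in range(10):
        queue.put_nowait(item)
    for _ in range(pills):
        queue.put_nowait(None)
    try:
        await asyncio.wait_for(asyncio.gather(*tasks), timeout)
        return True
    except asyncio.TimeoutError:
        # wait_for cancels the gather, which cancels the stuck workers.
        return False


clean = asyncio.run(run(workers=4, pills=4))   # True
stuck = asyncio.run(run(workers=4, pills=1))   # False: three workers never stop
```

A worker that has read its pill has already returned, so it cannot consume a second one; that is why the pill count must match the worker count exactly.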

Is the tracing hook enough, or do I need a real tracing backend?

The span string and per-batch latency log are the minimum needed to attribute slow batches to a partition. For a production run, replace the string with a real span from your tracing library and attach batch_id, partition_key, row count, and digest duration as attributes. Keep the span scoped to the batch so a stalled sink shows up as a long span rather than a gap in the trace.
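
As a shape to aim for, the sketch below uses a hand-rolled context manager in place of a tracing library. batch_span and FINISHED_SPANS are hypothetical names invented for this example and do not correspond to any real tracer API; with a real library you would set the same attributes on its span object.

```python
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List

FINISHED_SPANS: List[Dict[str, Any]] = []   # hypothetical in-memory exporter


@contextmanager
def batch_span(batch_id: int, partition_key: str,
               row_count: int) -> Iterator[Dict[str, Any]]:
    """Hand-rolled stand-in for a tracing-library span; not a real tracer API."""
    span: Dict[str, Any] = {
        "name": "reconcile_batch",
        "batch_id": batch_id,
        "partition_key": partition_key,
        "row_count": row_count,
    }
    started = time.perf_counter()
    try:
        yield span
    finally:
        # Recorded even if the body raises, so a failed batch still has a span.
        span["duration_s"] = time.perf_counter() - started
        FINISHED_SPANS.append(span)


with batch_span(batch_id=17, partition_key="1042", row_count=5_000) as span:
    digest_started = time.perf_counter()
    time.sleep(0.01)   # placeholder for the digest step
    span["digest_duration_s"] = time.perf_counter() - digest_started
    time.sleep(0.01)   # placeholder for the sink write, inside the same span
```

Because the sink write happens inside the span, a stalled sink lengthens duration_s while digest_duration_s stays small, which makes the slow layer visible from the span alone.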
