Why quantize decimals before hashing instead of comparing with an epsilon afterwards?

Hashing collapses a value to a fixed digest, destroying magnitude. Two amounts that agree below tolerance must therefore map to identical bytes before hashing, so the decimal is quantized to the tolerance grid first. An epsilon comparison after hashing is impossible because the digests already differ.

How does the model keep a NULL, a missing key, and an empty string distinct?

It uses a tri-state resolver: NULL canonicalizes to a dedicated sentinel byte string, an empty string canonicalizes to the string type tag, and a missing key never contributes a field digest. Flattening these three states produces false positives indistinguishable from real corruption.

Why per-field digests instead of a single row hash?

Per-field digests keep a divergence attributable to a specific column, which is what a discrepancy manifest and tolerance tuning need. A single row hash tells you only that the row changed, not which value drifted.

Is BLAKE2b or SHA-256 the right digest for the row signature?

Both are collision-resistant and defensible in an audit trail. BLAKE2b is typically faster in Python and supports a configurable digest size; SHA-256 is a drop-in where a NIST-standardized algorithm is mandated. MD5 is disqualified by its collision weakness.

Building Equivalence Models for Heterogeneous Databases

This page answers one precise question: how do you build the model that decides whether a row read from PostgreSQL and a document read from MongoDB (or an item from DynamoDB) represent the same logical record, when they share almost no bytes? It is the concrete implementation layer of data equivalence modeling — the stage that sits after rows are extracted and before any diff is computed. The prerequisite context is assumed: both engines’ schemas have already passed the schema validation pre-checks gate, a stable reconciliation key exists on both sides, and rows arrive sorted by that key. What remains is to turn engine-shaped values into one canonical byte representation so that the invariant “equal bytes ⇒ equal record” holds across storage engines.

Problem Framing: Silent Truncation Across a Migration

Concretely: you are migrating 4 billion ledger rows from PostgreSQL to MongoDB, with a DynamoDB read replica fronting a mobile API. The source amount column is NUMERIC(38,18); MongoDB stores it as Decimal128, which holds at most 34 significant digits; DynamoDB stores it as Number, whose precision ceiling is 38 significant digits. A naive comparison fails in three ways at once. Row counts match, so a count check reports success while values silently disagree. A whole-table checksum reports that something diverged but never says which record or column. And a raw byte comparison flags every row as different, because Decimal('1.500000000000000000'), the BSON encoding of the same number, and the DynamoDB string "1.5" are all distinct on the wire despite being the same business value.

An equivalence model resolves all three. It normalizes every value through an explicit type-coercion matrix, folds per-field digests into a per-row signature so divergence is attributable to a specific key and column, and applies a tolerance ladder so that representation-only differences never register as drift. The same model is later reused during SQL to NoSQL sync validation and by the structural mismatch detection stage that consumes its divergence stream.

The Type-Coercion Matrix

The model begins with a strict coercion matrix that maps each engine type onto an engine-agnostic canonical form. The type-translation contracts themselves are governed by the cross-platform schema mapping reference; the matrix below is the value-level rule set the equivalence model enforces at runtime.

Source type	Target types	Canonical form	Coercion rule
`NUMERIC(p,s)`	`Decimal128`, `Number`	`decimal.Decimal`	Fixed precision, quantize to scale `s`, `ROUND_HALF_EVEN`
`TIMESTAMPTZ`	`Date`, `ISODate`	`datetime` (UTC)	Convert to UTC, ISO-8601 with explicit `%f` precision
`VARCHAR/TEXT`	`String`	`str` (NFC)	Unicode NFC normalize, strip surrounding whitespace
`JSONB`	`Document`	`dict` (sorted keys)	Recursive key sort, deterministic array ordering
`NULL`	absent key / `""`	`__NULL__` sentinel	Tri-state resolver: `IS_NULL` vs `IS_EMPTY` vs `HAS_VALUE`
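
As an illustration of the `TIMESTAMPTZ` row, the sketch below renders two datetimes that denote the same instant into one canonical string. The helper name `canon_timestamp` is hypothetical, and treating a naive datetime as UTC is an assumption made for this example, not a rule the matrix states.

```python
from datetime import datetime, timedelta, timezone


def canon_timestamp(value: datetime) -> str:
    """Render a datetime as UTC ISO-8601 with fixed microsecond precision.

    Assumption: a naive datetime is already expressed in UTC.
    """
    aware = value if value.tzinfo else value.replace(tzinfo=timezone.utc)
    # timespec="microseconds" always emits six fractional digits, so a
    # whole-second value and a fractional one share the same layout.
    return aware.astimezone(timezone.utc).isoformat(timespec="microseconds")


# The same instant, once at +02:00 and once at UTC.
pg_value = datetime(2024, 3, 1, 14, 0, 0, tzinfo=timezone(timedelta(hours=2)))
mongo_value = datetime(2024, 3, 1, 12, 0, 0, tzinfo=timezone.utc)

print(canon_timestamp(pg_value))     # 2024-03-01T12:00:00.000000+00:00
print(canon_timestamp(mongo_value))  # 2024-03-01T12:00:00.000000+00:00
```

Pinning the fractional width matters because a plain `isoformat()` call omits the fraction entirely when the microsecond field is zero.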

Two rules dominate correctness. First, decimals must be quantized before hashing rather than compared with an epsilon afterwards. A digest preserves no magnitude, so two digests are either equal or not; representation-only differences such as trailing zeros must be removed by mapping both values onto the same fixed-scale grid before hashing. Values that differ by less than the tolerance can still land in different grid cells, which is why the classifier re-checks differing columns against the raw values rather than the digests. Second, nulls must stay tri-state: a relational NULL, a NoSQL missing key, and an empty string are three distinct states, and flattening them produces false positives that are indistinguishable from real corruption.
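
The quantize-before-hashing rule can be seen in a few lines. This is a minimal sketch, not the model's own code: the helper names `canon_decimal` and `digest` are hypothetical, and the scale of 18 and precision of 38 are taken from the NUMERIC(38,18) example.

```python
import hashlib
from decimal import Decimal, ROUND_HALF_EVEN, localcontext


def canon_decimal(value: Decimal, scale: int = 18, prec: int = 38) -> bytes:
    """Quantize to a fixed scale under a pinned context, then serialize."""
    with localcontext() as ctx:
        ctx.prec = prec
        ctx.rounding = ROUND_HALF_EVEN
        quantized = value.quantize(Decimal(10) ** -scale)
    return b"n" + str(quantized).encode("ascii")


def digest(data: bytes) -> str:
    return hashlib.blake2b(data, digest_size=16).hexdigest()


a = Decimal("1.500000000000000000")  # as read from NUMERIC(38,18)
b = Decimal("1.5")                   # same value, shorter representation

# Hashing the raw text sees two different byte strings.
raw_equal = digest(str(a).encode("ascii")) == digest(str(b).encode("ascii"))
# Hashing after quantization sees one.
canon_equal = digest(canon_decimal(a)) == digest(canon_decimal(b))

print(raw_equal, canon_equal)  # False True
```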

Implementation

The model is expressed as a frozen configuration plus three pure functions: canonicalize a single value to bytes, fold a row’s field digests into an order-independent signature, and classify a source/target pair. Every determinism-critical parameter lives in one place so the byte output is reproducible across hosts. The digest uses BLAKE2b for speed with a configurable size; SHA-256 is a drop-in where a NIST-standardized algorithm is mandated. The rationale for that choice is developed in the MD5 vs SHA-256 checksum comparison.

python

from __future__ import annotations

import hashlib
import logging
import unicodedata
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal, localcontext, ROUND_HALF_EVEN
from enum import Enum
from typing import Any, Iterable, Iterator

logger = logging.getLogger("equivalence.model")

NULL_SENTINEL = b"\x00__NULL__\x00"


class Verdict(str, Enum):
    MATCH = "MATCH"                        # identical canonical signatures
    TOLERANCE_MATCH = "TOLERANCE_MATCH"    # equal within numeric/temporal tolerance
    MISSING_IN_SOURCE = "MISSING_IN_SOURCE"
    MISSING_IN_TARGET = "MISSING_IN_TARGET"
    DRIFT = "DRIFT"                        # unresolved divergence


@dataclass(frozen=True)
class EquivalenceConfig:
    """Every determinism-critical parameter in one frozen place."""
    decimal_prec: int = 38
    decimal_scale: int = 18
    float_epsilon: Decimal = Decimal("1e-9")
    temporal_skew_seconds: int = 1
    digest_size: int = 16                  # 128-bit BLAKE2b digest
    key_columns: tuple[str, ...] = ("id",)


def _canon_value(value: Any, cfg: EquivalenceConfig) -> bytes:
    """Serialize one value to deterministic, type-tagged bytes.

    Type tags prevent cross-type collisions (the string "1" must not hash
    equal to the integer 1). Decimals are quantized to the tolerance grid so
    representation-only differences vanish before hashing.
    """
    if value is None:
        return NULL_SENTINEL
    if isinstance(value, str):
        return b"s" + unicodedata.normalize("NFC", value.strip()).encode("utf-8")
    if isinstance(value, bool):
        return b"b1" if value else b"b0"
    if isinstance(value, (int, Decimal, float)):
        with localcontext() as ctx:
            ctx.prec = cfg.decimal_prec
            ctx.rounding = ROUND_HALF_EVEN
            quantum = Decimal(10) ** -cfg.decimal_scale
            d = Decimal(str(value)).quantize(quantum)
        return b"n" + str(d).encode("ascii")
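    if isinstance(value, datetime):
        # Naive datetimes are assumed to already be in UTC.
        aware = value if value.tzinfo else value.replace(tzinfo=timezone.utc)
        # Fixed microsecond precision, matching the matrix's explicit %f rule.
        stamp = aware.astimezone(timezone.utc).isoformat(timespec="microseconds")
        return b"t" + stamp.encode("ascii")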
    if isinstance(value, dict):
        parts = [
            _canon_value(k, cfg) + b"=" + _canon_value(v, cfg)
            for k, v in sorted(value.items())
        ]
        return b"d[" + b",".join(parts) + b"]"
    if isinstance(value, (list, tuple)):
        # Order-normalize so document arrays and relational orderings agree.
        parts = sorted(_canon_value(v, cfg) for v in value)
        return b"a[" + b",".join(parts) + b"]"
    raise TypeError(f"no canonical rule for {type(value)!r}")


def field_digests(record: dict, cfg: EquivalenceConfig) -> dict[str, bytes]:
    """Per-field digests: divergence stays attributable to a column."""
    return {
        col: hashlib.blake2b(_canon_value(val, cfg), digest_size=cfg.digest_size).digest()
        for col, val in record.items()
    }


def row_signature(digests: dict[str, bytes], cfg: EquivalenceConfig) -> bytes:
    """Fold field digests into one row signature over sorted column names."""
    h = hashlib.blake2b(digest_size=cfg.digest_size)
    for col in sorted(digests):
        h.update(col.encode("utf-8"))
        h.update(digests[col])
    return h.digest()


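def _within_tolerance(src: dict, tgt: dict, cols: Iterable[str], cfg: EquivalenceConfig) -> bool:
    """Second-chance check: are the differing columns equal within tolerance?"""
    numeric = (int, float, Decimal)
    for col in cols:
        s, t = src.get(col), tgt.get(col)
        # bool is a subclass of int, but Decimal(str(True)) raises, and a
        # flipped flag is never a tolerance match.
        if isinstance(s, bool) or isinstance(t, bool):
            return False
        if isinstance(s, numeric) and isinstance(t, numeric):
            if abs(Decimal(str(s)) - Decimal(str(t))) <= cfg.float_epsilon:
                continue
        if isinstance(s, datetime) and isinstance(t, datetime):
            # Mirror _canon_value: treat naive datetimes as UTC so a mixed
            # naive/aware pair can be subtracted without a TypeError.
            s_utc = s if s.tzinfo else s.replace(tzinfo=timezone.utc)
            t_utc = t if t.tzinfo else t.replace(tzinfo=timezone.utc)
            if abs((s_utc - t_utc).total_seconds()) <= cfg.temporal_skew_seconds:
                continue
        return False
    return True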


def classify(src: dict | None, tgt: dict | None, cfg: EquivalenceConfig) -> Verdict:
    """Classify one source/target pair; None means the row is absent on that side."""
    if src is None:
        return Verdict.MISSING_IN_SOURCE
    if tgt is None:
        return Verdict.MISSING_IN_TARGET
    s_fields, t_fields = field_digests(src, cfg), field_digests(tgt, cfg)
    if row_signature(s_fields, cfg) == row_signature(t_fields, cfg):
        return Verdict.MATCH
    differing = {c for c in set(s_fields) | set(t_fields)
                 if s_fields.get(c) != t_fields.get(c)}
    if _within_tolerance(src, tgt, differing, cfg):
        logger.info("tolerance match on columns=%s", sorted(differing))
        return Verdict.TOLERANCE_MATCH
    logger.warning("drift on columns=%s", sorted(differing))
    return Verdict.DRIFT
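
To show why the classifier works from per-field digests, here is a deliberately simplified, self-contained sketch. It assumes values are already canonical strings, and the helper names `column_digests` and `differing_columns` are hypothetical; the point is only that comparing digests column by column yields the names of the columns that diverged, including one that is missing on one side.

```python
import hashlib


def column_digests(record: dict[str, str]) -> dict[str, bytes]:
    """Digest each column separately (values are pre-canonicalized strings here)."""
    return {
        col: hashlib.blake2b(val.encode("utf-8"), digest_size=16).digest()
        for col, val in record.items()
    }


def differing_columns(src: dict[str, str], tgt: dict[str, str]) -> list[str]:
    """Return the sorted names of columns whose digests disagree or are absent."""
    s, t = column_digests(src), column_digests(tgt)
    return sorted(c for c in s.keys() | t.keys() if s.get(c) != t.get(c))


source = {"id": "1", "amount": "1.50", "note": "paid"}
target = {"id": "1", "amount": "1.51"}  # amount changed, note missing

print(differing_columns(source, target))  # ['amount', 'note']
```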

The classifier is walked over two key-sorted streams with an O(N) merge so comparison memory stays constant regardless of table size. The streams arrive from the parallel row extraction techniques and async batching stages upstream.

python

def reconcile(
    source: Iterator[dict],
    target: Iterator[dict],
    cfg: EquivalenceConfig,
) -> Iterator[tuple[Any, Verdict]]:
    """Linear merge of two streams sorted by the reconciliation key."""
    def key_of(row: dict) -> tuple:
        return tuple(row[c] for c in cfg.key_columns)

    s = next(source, None)
    t = next(target, None)
    while s is not None or t is not None:
        sk = key_of(s) if s is not None else None
        tk = key_of(t) if t is not None else None
        if tk is None or (sk is not None and sk < tk):
            yield sk, classify(s, None, cfg)
            s = next(source, None)
        elif sk is None or tk < sk:
            yield tk, classify(None, t, cfg)
            t = next(target, None)
        else:
            yield sk, classify(s, t, cfg)
            s, t = next(source, None), next(target, None)
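
The merge logic is easier to follow on a toy input. The sketch below strips the rows down to bare integer keys and replaces the classifier with fixed labels, so `merge_keys` and the `PAIRED` label are illustrative stand-ins rather than part of the model.

```python
from __future__ import annotations

from typing import Iterator


def merge_keys(source: Iterator[int], target: Iterator[int]) -> Iterator[tuple[int, str]]:
    """Walk two ascending key streams in lockstep, holding one key per side."""
    s, t = next(source, None), next(target, None)
    while s is not None or t is not None:
        if t is None or (s is not None and s < t):
            yield s, "MISSING_IN_TARGET"   # source key has no partner
            s = next(source, None)
        elif s is None or t < s:
            yield t, "MISSING_IN_SOURCE"   # target key has no partner
            t = next(target, None)
        else:
            yield s, "PAIRED"              # same key on both sides
            s, t = next(source, None), next(target, None)


result = list(merge_keys(iter([1, 2, 4]), iter([2, 3, 4])))
print(result)
# [(1, 'MISSING_IN_TARGET'), (2, 'PAIRED'), (3, 'MISSING_IN_SOURCE'), (4, 'PAIRED')]
```

Both inputs must be sorted by the same ordering, and keys must be comparable across engines; an integer id on one side and its string form on the other would break the `<` comparison.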

Key Implementation Notes

Quantize before hashing, not after. The _canon_value decimal branch pins a local decimal context (prec=38, ROUND_HALF_EVEN) and quantizes to scale 18 so that NUMERIC(38,18), Decimal128, and a DynamoDB Number collapse to one digest. This relies on the extractor handing every numeric to the model as a numeric type: a DynamoDB Number that arrives as the raw string "1.5" takes the string branch and gets the s tag. Comparing digests with an epsilon is impossible because the digest has already destroyed the magnitude. See the Python decimal module for context semantics.
Type tags stop cross-type collisions. Each canonical byte string is prefixed with a type tag (s, b, n, t, d, a). Without the tag, a string that happens to spell a number's canonical form (for example "1.000000000000000000" and the integer 1) hashes equal to it and masks a real type divergence introduced by a lossy migration.
Field digests localize drift. Folding per-column digests into the row signature means a DRIFT verdict names the exact column, which is what the downstream discrepancy manifest and the threshold tuning for tolerance stage need. A monolithic row hash tells you only that the row changed.
Tri-state nulls. NULL_SENTINEL is a distinct byte value; an empty string canonicalizes to b"s". Relational NULL, a NoSQL missing key, and "" therefore stay separable, preventing false positives that look identical to corruption.
Compliance implication. BLAKE2b and SHA-256 are both collision-resistant and defensible in an audit trail; MD5 is disqualified. Where regulated payloads carry PII, the field digests are recorded but the raw values are not, so an audit trail can prove a column matched without ever persisting its plaintext.
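
The tri-state note above can be made explicit with a small resolver. This is a sketch under assumptions: the `IS_NULL`, `IS_EMPTY`, and `HAS_VALUE` names come from the coercion matrix, while the `MISSING` member, the `FieldState` enum, and the `resolve_state` helper are hypothetical additions for illustration.

```python
from enum import Enum
from typing import Any, Mapping


class FieldState(Enum):
    MISSING = "MISSING"      # key absent from the record (hypothetical name)
    IS_NULL = "IS_NULL"      # key present, value is None
    IS_EMPTY = "IS_EMPTY"    # key present, value is the empty string
    HAS_VALUE = "HAS_VALUE"  # key present with any other value


_ABSENT = object()  # private marker so an absent key is not confused with None


def resolve_state(record: Mapping[str, Any], column: str) -> FieldState:
    value = record.get(column, _ABSENT)
    if value is _ABSENT:
        return FieldState.MISSING
    if value is None:
        return FieldState.IS_NULL
    if isinstance(value, str) and value == "":
        return FieldState.IS_EMPTY
    return FieldState.HAS_VALUE


print(resolve_state({}, "note").name)              # MISSING
print(resolve_state({"note": None}, "note").name)  # IS_NULL
print(resolve_state({"note": ""}, "note").name)    # IS_EMPTY
```

Using a private marker object instead of `record.get(column)` is the design choice that keeps an absent key and an explicit NULL apart.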

Verification

Assert the model against known-equivalent and known-divergent pairs before trusting it on production volumes. The first case proves representation-only differences resolve to MATCH; the second proves a genuine value change is caught and attributed.

python

def test_equivalence_model() -> None:
    cfg = EquivalenceConfig(key_columns=("id",))

    # Same business value, two engine representations -> MATCH.
    pg_row = {"id": 1, "amount": Decimal("1.500000000000000000"), "note": "café"}
    mongo_doc = {"id": 1, "amount": Decimal("1.5"), "note": "cafe\u0301"}  # NFD form, spelled out explicitly
    assert classify(pg_row, mongo_doc, cfg) is Verdict.MATCH

    # Divergence beyond tolerance -> DRIFT, attributable to `amount`.
    corrupted = {"id": 1, "amount": Decimal("1.51"), "note": "café"}
    assert classify(pg_row, corrupted, cfg) is Verdict.DRIFT

    # Jitter below float_epsilon (1e-10 < 1e-9) -> TOLERANCE_MATCH, not DRIFT.
    jitter = {"id": 1, "amount": Decimal("1.5000000001"), "note": "café"}
    assert classify(pg_row, jitter, cfg) is Verdict.TOLERANCE_MATCH

    print("equivalence model OK")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    test_equivalence_model()

For a deterministic replay across environments, run the harness under fixed serialization settings so any residual mismatch is provably a data problem and not an environment one:

bash

PYTHONHASHSEED=0 LC_ALL=C.UTF-8 python -m equivalence.model

If the assertions pass here but rows drift in production, the divergence originates at the storage or ingestion layer, not in the model — take those flagged keys into the byte-level diff described in the parent data equivalence modeling reference.

Operational Considerations

Once the model is correct, the constraints become throughput, memory, and containment. The reconcile merge holds at most one row per stream, so peak memory is bounded by row width, not table size — this is what lets a single worker validate a partition of arbitrary length. Partition the reconciliation key range across workers so each owns a discrete [min_key, max_key) slice with independent checkpointing; a re-run over the same slice must emit an identical verdict stream.
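
One way to derive the half-open slices is sketched below. It assumes a dense integer reconciliation key, which the text does not guarantee, and the function name `key_slices` is hypothetical; a composite or string key would need a different splitting strategy.

```python
def key_slices(min_key: int, max_key: int, workers: int) -> list[tuple[int, int]]:
    """Split [min_key, max_key) into at most `workers` contiguous half-open slices."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if max_key < min_key:
        raise ValueError("max_key must not be smaller than min_key")
    base, extra = divmod(max_key - min_key, workers)
    slices: list[tuple[int, int]] = []
    start = min_key
    for i in range(workers):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first slices
        if size == 0:
            break                              # fewer keys than workers
        slices.append((start, start + size))
        start += size
    return slices


print(key_slices(0, 10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Because each slice's upper bound is the next slice's lower bound, no key is covered twice or skipped, and a re-run over the same bounds visits the same keys.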

Under partial failure the classifier feeds an explicit fallback ladder rather than aborting the run. Exact signature matches advance immediately; tolerance matches are logged with deviation metrics; unresolved drift is quarantined to a dead-letter queue, and if the unresolved rate crosses the SLO a circuit breaker pauses the pipeline and pages on-call.
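
The ladder and breaker described above could be wired together roughly as follows. This is an in-memory sketch: `VerdictRouter`, its thresholds, and the list standing in for a dead-letter queue are all hypothetical, and a real deployment would log, enqueue, and page through its own infrastructure.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Hashable


@dataclass
class VerdictRouter:
    """Route verdicts down the ladder and trip a breaker on excess drift."""
    max_drift_rate: float = 0.01   # hypothetical SLO: 1% unresolved drift
    min_sample: int = 100          # hypothetical: do not trip on tiny samples
    processed: int = 0
    drifted: int = 0
    tolerance_matches: int = 0
    dead_letter: list[Hashable] = field(default_factory=list)
    tripped: bool = False

    def route(self, key: Hashable, verdict: str) -> None:
        if self.tripped:
            raise RuntimeError("circuit breaker open: pipeline paused")
        self.processed += 1
        if verdict == "TOLERANCE_MATCH":
            self.tolerance_matches += 1    # would be logged with deviation metrics
        elif verdict == "DRIFT":
            self.drifted += 1
            self.dead_letter.append(key)   # stand-in for a real dead-letter queue
        # MATCH (and any other verdict) simply advances.
        if (self.processed >= self.min_sample
                and self.drifted / self.processed > self.max_drift_rate):
            self.tripped = True            # a real system would also page on-call


router = VerdictRouter(max_drift_rate=0.5, min_sample=4)
stream = [(1, "MATCH"), (2, "DRIFT"), (3, "TOLERANCE_MATCH"), (4, "DRIFT"), (5, "DRIFT")]
for key, verdict in stream:
    router.route(key, verdict)

print(router.tripped, router.dead_letter)  # True [2, 4, 5]
```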

Operational guardrails specific to this task:

Connection pooling. Enforce max_overflow=0 on validation workers so a fan-out reconciliation cannot storm the target engine’s connections.
Rate limiting. Apply token-bucket limiting on target reads to avoid throttling during high-concurrency runs, especially against DynamoDB provisioned capacity.
Storage footprint. Emit only (key, verdict, differing_columns), never raw payloads — the verdict stream stays small and PII-free even at billions of rows.
Metrics to expose. reconciliation_records_processed, reconciliation_drift_rate, and reconciliation_latency_p99, with percentile-based alert thresholds; drift rate is the primary signal for the tolerance threshold tuning feedback loop.
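
For the rate-limiting guardrail, a token bucket can be sketched in a few lines of standard-library Python. The class name `TokenBucket`, its parameters, and the injectable clock are assumptions made for this example; it is single-threaded and would need a lock before being shared across worker threads.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock        # injectable so the behaviour can be tested
        self._tokens = capacity    # start full
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False instead of blocking."""
        now = self._clock()
        refill = (now - self._last) * self.rate
        self._tokens = min(self.capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Usage with a fake clock: two reads allowed at once, then one more per 0.5 s.
fake_now = [0.0]
bucket = TokenBucket(rate=2.0, capacity=2.0, clock=lambda: fake_now[0])
first_three = [bucket.try_acquire() for _ in range(3)]
fake_now[0] = 0.5
after_wait = bucket.try_acquire()

print(first_three, after_wait)  # [True, True, False] True
```

A caller that receives `False` would typically sleep briefly and retry, which keeps target reads under the provisioned ceiling without failing the run.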

By pinning canonicalization, keeping digests field-granular, and routing verdicts through an explicit ladder, the equivalence model holds cross-engine parity at scale while keeping every divergence attributable and every audit input reproducible.

Related

Data Equivalence Modeling — the parent stage this model implements, covering identity, tolerance, and coercion in the round.
Mapping relational schemas to document stores — the structural mapping that decides which fields this model aligns.
How to validate SQL vs NoSQL data parity — the runbook that drives this model across a live migration cutover.
Generating MD5 vs SHA-256 checksums for data rows — the digest-algorithm choice behind the row signature.
Detecting structural mismatches in Parquet files — how the divergence stream this model emits becomes an actionable manifest.
