Schema Validation Pre-Checks for Cross-Engine Reconciliation

Schema validation pre-checks are the gate that decides whether a reconciliation run is allowed to start. Before a single row is fetched or hashed, this stage compares the live structure of the source and target engines (column set, logical types, nullability, precision, and ordering rules) against a pinned expectation, and it halts the pipeline the moment a breaking difference appears. It is the cheapest stage in the data extraction and hashing workflow and the one with the highest leverage: hashing and comparing billions of rows is only worth the compute when the underlying contracts are known to match, and a mismatch caught here costs one catalog query per engine instead of a full run’s worth of compute and a fog of false positives. This reference is written for data engineers, migration specialists, Python pipeline builders, and platform operations teams who need a pre-check that is fast, deterministic, and defensible in an audit.

The problem the pre-check solves is that “the same table” rarely means the same structure across two engines. Snowflake reports NUMBER(38,0) where BigQuery reports INT64; a VARCHAR on one side is STRING on the other; a timestamp is timezone-aware here and naive there; a column that was nullable at design time silently became non-nullable after a backfill. Left undetected, every one of those differences either aborts the job halfway through or, worse, produces digests that disagree for rows that are actually equivalent. This page defines where the pre-check begins and ends, walks a production-grade implementation step by step, weighs the validation strategies against each other, and gives a diagnostic runbook for the drift failures that recur across heterogeneous migrations.

Architectural Boundaries: What This Stage Consumes and Produces

The pre-check operates strictly at the metadata layer, isolated from payload processing. It consumes a read-only catalog connection to each engine and a canonical schema contract, which is a pinned, versioned description of the columns the reconciliation expects. It produces a structured diff report, a single gate decision (PASS, WARN, or FAIL), and per-run telemetry that an orchestrator can route on. It begins the instant a run is scheduled and ends before the parallel row extraction stage opens its first cursor.

The defining constraint is that this stage must never scan data. It queries catalog endpoints (INFORMATION_SCHEMA on BigQuery and PostgreSQL, ACCOUNT_USAGE or INFORMATION_SCHEMA on Snowflake, the Glue Data Catalog, or the Hive Metastore) rather than issuing SELECT * or reading files. That boundary keeps latency independent of dataset size, O(1) in row count, and keeps the stage stateless and idempotent: a re-run over the same contract and the same catalog snapshot yields an identical report. The definition of a logical column and its cross-engine alignment lives upstream, in the cross-platform schema mapping reference; the pre-check consumes that mapping as its contract and enforces it, but does not author it.

Downstream, the gate decision has direct consequences. On PASS or an approved WARN, the validated column list becomes the projection list and partition bound for parallel row extraction, and the type-alignment guarantees flow into column-level checksum generation so that identical logical rows serialize to identical bytes. On FAIL, the pipeline stops here, with no extraction, no hashing, and no comparison, because a structural mismatch fed forward would surface later in structural mismatch detection as noise that no downstream algorithm can distinguish from real corruption. What “equivalent” means for each column (whether trailing whitespace matters, how NULL collapses) belongs to data equivalence modeling: the pre-check validates structure, the equivalence model validates meaning.

Prerequisites

Confirm every item below before wiring the pre-check into a reconciliation run. Each unchecked box is a class of silent drift waiting to surface after cutover.

Canonical contract pinned and versioned. A single authoritative schema contract exists in a registry (Git, a schema registry, or a contracts/ table), tagged with a version, so both engines are validated against the same expectation.
Read-only catalog access on both engines. The reconciliation identity holds SELECT on INFORMATION_SCHEMA / catalog views only — never table data grants, and never write grants.
Cross-engine type map agreed. The normalization rules that collapse VARCHAR/STRING/TEXT and NUMBER/DECIMAL/NUMERIC to canonical types are captured from the cross-platform schema mapping reference and pinned alongside the contract.
Tolerance policy defined. The team has explicitly decided which deviations are acceptable (nullable widening, precision increase, reordering) and which are breaking (type downgrade, missing key, timezone stripping).
Dependency libraries pinned. Python 3.11+, the source and target DB-API drivers, and any registry client are pinned so introspection behaviour is reproducible across hosts.
Telemetry sink wired. A structured-log or metrics destination is available so every gate decision (contract version, drift delta, timestamp) is recorded for the audit trail.
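
As a rough illustration of the first and last checklist items, a pinned contract can be stored as a small JSON document and fingerprinted, so telemetry can record exactly which contract content gated a run. The JSON layout, the `load_contract` and `contract_fingerprint` names, and the choice of SHA-256 are assumptions made for this sketch, not part of any particular registry's API.

```python
import hashlib
import json
from typing import Any, Dict

# Hypothetical contract document; in practice it would be read from Git or a registry.
CONTRACT_JSON = """
{
  "version": "v3",
  "strict_order": false,
  "columns": [
    {"name": "id", "dtype": "INT64", "nullable": false},
    {"name": "amount", "dtype": "DECIMAL", "nullable": false, "precision": 12, "scale": 2}
  ]
}
"""


def load_contract(text: str) -> Dict[str, Any]:
    """Parse a contract document and check the fields this sketch relies on."""
    doc = json.loads(text)
    if not doc.get("version"):
        raise ValueError("contract must carry a version tag")
    names = [col["name"] for col in doc["columns"]]
    if len(names) != len(set(names)):
        raise ValueError("duplicate column names in contract")
    return doc


def contract_fingerprint(doc: Dict[str, Any]) -> str:
    """Hash a canonical serialization so key order and whitespace do not matter."""
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


contract_doc = load_contract(CONTRACT_JSON)
fingerprint = contract_fingerprint(contract_doc)
print(contract_doc["version"], fingerprint[:12])
```

Logging the fingerprint next to the version tag makes it visible when a tag was reused for different content, which a version string alone would hide.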

Step-by-Step Implementation

The steps below build a stateless, idempotent pre-check: define the contract, introspect live catalog metadata, diff with a tolerance policy, and emit a gate decision. Each step ends with an assertion or observable output so it can be verified before the next is layered on. The design isolates validation logic from I/O — the diff is a pure function over two schema descriptions, which is what makes it deterministic and unit-testable without a live database.

Step 1 — Model the canonical contract and normalize types

Every value type from every engine must fold to a canonical representation before it can be compared. Comparing raw engine type strings for equality reproduces exactly the drift this stage exists to catch, because VARCHAR(255) and STRING never match as text, so normalization is the first thing to get right.

python

import logging
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger("reconcile.schema")


class GateStatus(str, Enum):
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


@dataclass(frozen=True)
class ColumnContract:
    name: str
    dtype: str            # canonical type, e.g. "STRING", "INT64", "DECIMAL"
    nullable: bool
    precision: Optional[int] = None
    scale: Optional[int] = None


@dataclass(frozen=True)
class SchemaContract:
    version: str
    columns: List[ColumnContract]
    strict_order: bool = False


# Cross-engine type aliases → canonical representation (extend per platform).
TYPE_NORMALIZATION_MAP: Dict[str, str] = {
    "VARCHAR": "STRING", "TEXT": "STRING", "NVARCHAR": "STRING", "CHAR": "STRING",
    "INT": "INT32", "INTEGER": "INT32", "SMALLINT": "INT32",
    "BIGINT": "INT64", "LONG": "INT64",
    "FLOAT": "FLOAT64", "DOUBLE": "FLOAT64", "REAL": "FLOAT64", "FLOAT8": "FLOAT64",
    "NUMBER": "DECIMAL", "NUMERIC": "DECIMAL", "DECIMAL": "DECIMAL",
    "BOOLEAN": "BOOL", "BOOL": "BOOL", "DATE": "DATE",
    "TIMESTAMP": "DATETIME", "TIMESTAMP_NTZ": "DATETIME",
    "TIMESTAMP_TZ": "TIMESTAMP_TZ", "TIMESTAMPTZ": "TIMESTAMP_TZ",
}


def normalize_dtype(raw_dtype: str) -> str:
    """Collapse an engine-specific type string to its canonical form.

    Strips the precision/scale suffix (NUMBER(38,0) -> NUMBER) before mapping,
    so precision is compared separately and never as part of the type name.
    """
    base = raw_dtype.strip().upper().split("(")[0].strip()
    return TYPE_NORMALIZATION_MAP.get(base, base)

Verify the normalizer collapses aliases but keeps a genuinely distinct type distinct — the single most common source of cross-engine false positives:

python

assert normalize_dtype("VARCHAR(255)") == normalize_dtype("STRING")
assert normalize_dtype("NUMBER(38,0)") == "DECIMAL"
assert normalize_dtype("TIMESTAMP_TZ") != normalize_dtype("TIMESTAMP")  # tz-stripping is breaking
print("type normalization: OK")
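
One gap worth noting: PostgreSQL's information_schema.columns reports spelled-out SQL names in data_type, such as "character varying" and "timestamp without time zone", which the alias map above would pass through unmapped. The sketch below shows one possible per-engine extension. `POSTGRES_ALIASES` and `normalize_pg_dtype` are hypothetical names, and the canonical targets simply follow the map above.

```python
from typing import Dict

# Hypothetical extension: names as PostgreSQL spells them in
# information_schema.columns.data_type, folded to the canonical types used above.
POSTGRES_ALIASES: Dict[str, str] = {
    "CHARACTER VARYING": "STRING",
    "CHARACTER": "STRING",
    "TEXT": "STRING",
    "SMALLINT": "INT32",
    "INTEGER": "INT32",
    "BIGINT": "INT64",
    "REAL": "FLOAT64",
    "DOUBLE PRECISION": "FLOAT64",
    "NUMERIC": "DECIMAL",
    "BOOLEAN": "BOOL",
    "DATE": "DATE",
    "TIMESTAMP WITHOUT TIME ZONE": "DATETIME",
    "TIMESTAMP WITH TIME ZONE": "TIMESTAMP_TZ",
}


def normalize_pg_dtype(raw_dtype: str) -> str:
    """Fold a PostgreSQL data_type string to a canonical type; unknown names pass through."""
    base = raw_dtype.strip().upper().split("(")[0].strip()
    base = " ".join(base.split())  # collapse repeated internal whitespace
    return POSTGRES_ALIASES.get(base, base)


print(normalize_pg_dtype("character varying"), normalize_pg_dtype("timestamp with time zone"))
```

Keeping the two timestamp spellings on separate canonical types preserves the timezone distinction that the Step 1 assertions check.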

Step 2 — Introspect live catalog metadata (no data scan)

Read the live schema from the catalog, not from the data. The query below targets an INFORMATION_SCHEMA.COLUMNS view; swap the statement for the target engine’s catalog while keeping the same ColumnContract output shape, so the diff stays engine-agnostic.

python

from contextlib import contextmanager
from typing import Any, Iterator

# psycopg2 shown; substitute the target engine's DB-API driver as needed.
import psycopg2
from psycopg2.extras import RealDictCursor


@contextmanager
def read_only_catalog(dsn: str) -> Iterator[Any]:
    conn = psycopg2.connect(dsn, cursor_factory=RealDictCursor)
    conn.set_session(readonly=True)
    try:
        yield conn
    finally:
        conn.close()


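def introspect_schema(dsn: str, table: str, schema: str = "public") -> List[ColumnContract]:
    """Read a table's live column metadata from INFORMATION_SCHEMA (metadata only)."""
    # Filter on table_schema as well: without it, same-named tables in other
    # schemas would be merged into one column list. "public" is PostgreSQL's
    # default schema; pass the real schema name on other engines.
    query = """
        SELECT column_name, data_type, is_nullable,
               numeric_precision, numeric_scale
        FROM information_schema.columns
        WHERE table_schema = %s AND table_name = %s
        ORDER BY ordinal_position
    """
    live: List[ColumnContract] = []
    with read_only_catalog(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(query, (schema, table))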
            for r in cur:
                live.append(
                    ColumnContract(
                        name=r["column_name"],
                        dtype=normalize_dtype(r["data_type"]),
                        nullable=(r["is_nullable"] == "YES"),
                        precision=r["numeric_precision"],
                        scale=r["numeric_scale"],
                    )
                )
    logger.info("Introspected %d columns from '%s'", len(live), table)
    return live

Because the read touches only catalog views, its latency is independent of table size — a 10-row table and a 10-billion-row table cost the same.
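
The row-to-contract mapping can also be pulled out of the connection code so it is testable without a database. The sketch below assumes dict-like rows keyed by the INFORMATION_SCHEMA.COLUMNS names used in the query above. `columns_from_rows` is a hypothetical helper, `ColumnContract` is repeated in reduced form so the example runs on its own, and upper-casing stands in for the full normalizer.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List, Mapping, Optional


@dataclass(frozen=True)
class ColumnContract:  # reduced copy of the Step 1 dataclass, repeated so this runs alone
    name: str
    dtype: str
    nullable: bool
    precision: Optional[int] = None
    scale: Optional[int] = None


def columns_from_rows(rows: Iterable[Mapping[str, Any]]) -> List[ColumnContract]:
    """Map dict-like catalog rows to ColumnContract objects without touching a database."""
    return [
        ColumnContract(
            name=row["column_name"],
            dtype=str(row["data_type"]).strip().upper(),  # stand-in for normalize_dtype
            nullable=(str(row["is_nullable"]).upper() == "YES"),
            precision=row.get("numeric_precision"),
            scale=row.get("numeric_scale"),
        )
        for row in rows
    ]


# Rows shaped like a dict cursor result, written by hand for the example.
fake_rows = [
    {"column_name": "id", "data_type": "bigint", "is_nullable": "NO",
     "numeric_precision": 64, "numeric_scale": 0},
    {"column_name": "note", "data_type": "text", "is_nullable": "YES",
     "numeric_precision": None, "numeric_scale": None},
]
live = columns_from_rows(fake_rows)
print([(c.name, c.dtype, c.nullable) for c in live])
```

With this split, a per-engine adapter only has to produce rows in this shape, and the mapping logic is covered by ordinary unit tests.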

Step 3 — Diff live schema against the contract with a tolerance policy

The diff is a pure function: contract in, live schema in, structured report out. Tolerance rules distinguish drift that can proceed (nullable widening, precision increase) from drift that must halt the run (missing column, type change, tz stripping, precision loss).

python

@dataclass
class SchemaDiffReport:
    status: GateStatus = GateStatus.PASS
    contract_version: str = ""
    missing_columns: List[str] = field(default_factory=list)
    unexpected_columns: List[str] = field(default_factory=list)
    type_mismatches: List[Dict[str, str]] = field(default_factory=list)
    nullability_violations: List[str] = field(default_factory=list)
    precision_drifts: List[str] = field(default_factory=list)
    message: str = ""


@dataclass(frozen=True)
class TolerancePolicy:
    allow_nullable_widening: bool = True     # live nullable, contract required -> WARN
    allow_precision_increase: bool = True    # live scale > contract scale -> WARN
    allow_reordering: bool = True            # not read by diff_schema; ordering is enforced via strict_order


def diff_schema(
    contract: SchemaContract,
    live: List[ColumnContract],
    policy: TolerancePolicy = TolerancePolicy(),
) -> SchemaDiffReport:
    """Compare a live schema against the contract. Pure, deterministic, no I/O."""
    report = SchemaDiffReport(contract_version=contract.version)
    live_by_name = {c.name: c for c in live}
    contract_names = {c.name for c in contract.columns}

    report.missing_columns = sorted(contract_names - live_by_name.keys())
    report.unexpected_columns = sorted(live_by_name.keys() - contract_names)
    if report.missing_columns:
        report.status = GateStatus.FAIL

    if contract.strict_order:
        live_order = [c.name for c in live if c.name in contract_names]
        contract_order = [c.name for c in contract.columns if c.name in live_by_name]
        if live_order != contract_order:
            report.status = GateStatus.FAIL
            report.message = "Column ordering violates strict_order contract."

    for expected in contract.columns:
        actual = live_by_name.get(expected.name)
        if actual is None:
            continue

        if actual.dtype != expected.dtype:
            report.type_mismatches.append(
                {"column": expected.name, "expected": expected.dtype, "actual": actual.dtype}
            )
            report.status = GateStatus.FAIL  # type change is always breaking

        # Nullability: live allowing NULL where the contract requires a value.
        if actual.nullable and not expected.nullable:
            if policy.allow_nullable_widening:
                report.nullability_violations.append(f"{expected.name} (widened to nullable)")
                report.status = max(report.status, GateStatus.WARN, key=_severity)
            else:
                report.nullability_violations.append(f"{expected.name} (nullable not permitted)")
                report.status = GateStatus.FAIL

        # Precision/scale drift on decimals.
        if expected.scale is not None and actual.scale is not None:
            if actual.scale < expected.scale:
                report.precision_drifts.append(f"{expected.name} (scale {actual.scale} < {expected.scale})")
                report.status = GateStatus.FAIL  # precision loss truncates silently
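            elif actual.scale > expected.scale:
                if policy.allow_precision_increase:
                    report.precision_drifts.append(f"{expected.name} (scale widened)")
                    report.status = max(report.status, GateStatus.WARN, key=_severity)
                else:
                    # Mirror the nullability branch: a disallowed widening is a violation,
                    # not something to pass silently.
                    report.precision_drifts.append(f"{expected.name} (scale increase not permitted)")
                    report.status = GateStatus.FAIL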

    if not report.message:
        report.message = {
            GateStatus.PASS: "Schema contract validated.",
            GateStatus.WARN: "Non-breaking drift detected; proceeding with telemetry.",
            GateStatus.FAIL: "Schema contract violated; halting pipeline.",
        }[report.status]
    return report


_SEVERITY = {GateStatus.PASS: 0, GateStatus.WARN: 1, GateStatus.FAIL: 2}


def _severity(s: GateStatus) -> int:
    return _SEVERITY[s]

Assert that a widened-nullable column warns but does not fail, while a type change fails — the two decisions that route the run:

python

contract = SchemaContract(
    version="v3",
    columns=[
        ColumnContract("id", "INT64", nullable=False),
        ColumnContract("amount", "DECIMAL", nullable=False, scale=2),
    ],
)
live_ok = [ColumnContract("id", "INT64", True), ColumnContract("amount", "DECIMAL", False, scale=2)]
live_bad = [ColumnContract("id", "STRING", False), ColumnContract("amount", "DECIMAL", False, scale=2)]

assert diff_schema(contract, live_ok).status is GateStatus.WARN     # nullable widening only
assert diff_schema(contract, live_bad).status is GateStatus.FAIL    # id INT64 -> STRING
print("tolerance-aware diff: OK")
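
Because the report is a plain dataclass, it can be flattened into a single JSON log line for the telemetry sink. This is a minimal sketch under stated assumptions: `MiniReport` is a trimmed-down, hypothetical stand-in for SchemaDiffReport, `report_to_log_line` is an illustrative name, and the timestamp is left to the logging layer so the output stays deterministic.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import List


class GateStatus(str, Enum):
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


@dataclass
class MiniReport:  # hypothetical, reduced stand-in for SchemaDiffReport
    status: GateStatus = GateStatus.PASS
    contract_version: str = ""
    missing_columns: List[str] = field(default_factory=list)
    precision_drifts: List[str] = field(default_factory=list)


def report_to_log_line(report: MiniReport, side: str, table: str) -> str:
    """Serialize one gate decision as a JSON object with stable key order."""
    payload = asdict(report)
    payload["status"] = report.status.value  # store the plain string, not the Enum
    payload.update({"side": side, "table": table})
    return json.dumps(payload, sort_keys=True)


line = report_to_log_line(
    MiniReport(GateStatus.FAIL, "v3", missing_columns=["amount"]), "target", "orders"
)
print(line)
```

Sorted keys keep the line byte-stable for identical reports, which fits the idempotence requirement stated earlier.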

Step 4 — Gate the pipeline and emit telemetry

Wrap introspection and diff behind a single gate function that both engines pass through, log the structured decision for the audit trail, and return a boolean the orchestrator can branch on. Catalog timeouts and permanent contract violations are handled distinctly — the former is retryable, the latter is not.

python

def run_pre_check(
    source_dsn: str, target_dsn: str, table: str, contract: SchemaContract
) -> bool:
    """Gate the run on structural parity. Returns True if extraction may proceed."""
    try:
        source_report = diff_schema(contract, introspect_schema(source_dsn, table))
        target_report = diff_schema(contract, introspect_schema(target_dsn, table))
    except psycopg2.OperationalError as exc:
        # Transient catalog failure — caller should retry with backoff, not abort.
        logger.warning("Catalog introspection failed (retryable): %s", exc)
        raise

    for side, report in (("source", source_report), ("target", target_report)):
        logger.info(
            "gate side=%s version=%s status=%s missing=%d type_mismatch=%d",
            side, report.contract_version, report.status.value,
            len(report.missing_columns), len(report.type_mismatches),
        )

    worst = max((source_report.status, target_report.status), key=_severity)
    if worst is GateStatus.FAIL:
        logger.error("Pre-check FAILED for '%s'; extraction blocked.", table)
        return False
    return True
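
The gate above re-raises transient catalog errors and leaves retrying to the caller. One way a caller might do that is sketched below, assuming a hypothetical `TransientCatalogError` in place of the driver's real exception and an injectable `sleep` so the backoff can be tested without waiting. It covers exponential backoff with jitter only; a circuit breaker would sit on top of it.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientCatalogError(Exception):
    """Hypothetical stand-in for a driver's timeout or throttle exception."""


def retry_transient(
    fn: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...] = (TransientCatalogError,),
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying only the retryable errors with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # retry budget exhausted; surface it as an availability error
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("loop always returns or raises")  # defensive; not reached
```

Any exception outside `retryable`, including whatever a caller raises for a contract violation, propagates on the first attempt, so a permanent failure is never retried.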

Validation Strategy Trade-offs

Three strategies dominate production pre-checks, and the choice drives both how early drift is caught and how much trust the decision carries in an audit. The compliance row matters because the pre-check is frequently the first documented control in a migration’s evidence chain, and a strategy that cannot name the contract it validated against is hard to defend.

Criterion	Pinned contract diff (registry-backed)	Live catalog introspection diff	Runtime sampling inference
Latency	O(1) — one registry read + catalog query	O(1) — two catalog queries, no data	O(sample) — reads real rows, scales with sample size
Boundary cost	Cheap; contract fetched once per run	Cheap; INFORMATION_SCHEMA seek	Expensive; a live SELECT on both engines
Drift detection	Detects any deviation from the pinned expectation	Detects source↔target divergence, not intent	Detects only what the sample happens to expose
Compliance / regulatory	Strongest — the exact contract version is recorded as the control that gated the run	Adequate; records both live schemas but no pinned intent	Weakest — sampled inference is non-deterministic and hard to audit
Scale ceiling	Excellent — independent of table size	Excellent — independent of table size	Poor — sample must grow to stay representative
Best fit	Regulated migrations needing an immutable audit trail	Ad-hoc parity checks where no contract exists yet	Reverse-engineering an undocumented legacy schema

For any regulated or auditable reconciliation, the pinned-contract strategy is the default: it is the only one that records what the run was expected to look like, not merely what it happened to see. Fall back to live catalog introspection when no contract has been authored yet — it is the fastest way to bootstrap one — and reserve runtime sampling for the narrow case of reverse-engineering a schema no one documented, then promote the inferred structure into a pinned contract so subsequent runs get the stronger control. In all three cases the diff itself stays a pure function; only the source of the “expected” schema changes.
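
The bootstrap path can be sketched as a small pure function: promote the live source schema into a first contract version, but only when the target already agrees with it. `bootstrap_contract` is a hypothetical helper, and the dataclasses are reduced copies of the Step 1 definitions so the example runs on its own.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ColumnContract:  # reduced copy of the Step 1 dataclass
    name: str
    dtype: str
    nullable: bool
    precision: Optional[int] = None
    scale: Optional[int] = None


@dataclass(frozen=True)
class SchemaContract:  # reduced copy of the Step 1 dataclass
    version: str
    columns: List[ColumnContract]
    strict_order: bool = False


def bootstrap_contract(
    source: List[ColumnContract], target: List[ColumnContract], version: str
) -> SchemaContract:
    """Promote the source schema to a contract, but only if the target already agrees."""
    source_by_name = {c.name: c for c in source}
    target_by_name = {c.name: c for c in target}
    disagreements = sorted(
        name
        for name in source_by_name.keys() | target_by_name.keys()
        if source_by_name.get(name) != target_by_name.get(name)
    )
    if disagreements:
        raise ValueError(f"cannot bootstrap; engines disagree on: {disagreements}")
    return SchemaContract(version=version, columns=list(source))


src = [ColumnContract("id", "INT64", False), ColumnContract("amount", "DECIMAL", False, 12, 2)]
tgt = [ColumnContract("amount", "DECIMAL", False, 12, 2), ColumnContract("id", "INT64", False)]
print(bootstrap_contract(src, tgt, "v1").version)
```

Refusing to bootstrap on disagreement keeps an existing divergence from being frozen into the contract; a human still has to review the result before it is pinned.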

Scaling and Performance

The pre-check’s cost is fixed with respect to data volume, which is the entire point — but it still has to behave under many concurrent runs and flaky catalog endpoints.

Metadata-only invariant. Never let a pre-check reach for row data to “confirm” a type. The moment it issues a SELECT against the table, its latency couples to dataset size and it stops being a cheap gate. Every fact it needs lives in the catalog; keep it there.

Catalog caching. INFORMATION_SCHEMA reads are cheap but not free, and Snowflake’s ACCOUNT_USAGE views carry latency and quota. Cache the introspected schema for a short TTL keyed by (engine, table, catalog_snapshot) so a fan-out of hundreds of table-level runs does not stampede the metadata service; invalidate on any DDL event.

Concurrency and rate limits. Cloud catalogs rate-limit metadata APIs. Bound the number of concurrent introspection calls and treat 429/throttle responses as transient — exponential backoff with a circuit breaker, not an immediate FAIL. A rate-limited catalog is an availability problem, not a schema violation, and conflating the two aborts healthy runs.

Statelessness for horizontal fan-out. Because the diff is pure and the introspection is read-only, thousands of table-level pre-checks can run in parallel with no shared state. Partition the work by table and let each run own its gate decision; scaling out carries no coordination cost, unlike the parallel row extraction stage it precedes.
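
The catalog caching idea above can be sketched as a small in-process TTL cache keyed by (engine, table, catalog_snapshot). `SchemaCache` and its methods are hypothetical names. The clock is injectable so expiry can be tested without sleeping, and the class is not thread-safe as written, so a shared instance would need a lock.

```python
import time
from typing import Any, Callable, Dict, Tuple

CacheKey = Tuple[str, str, str]  # (engine, table, catalog_snapshot)


class SchemaCache:
    """Hypothetical in-process TTL cache for introspected schemas."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[CacheKey, Tuple[float, Any]] = {}

    def get_or_load(self, key: CacheKey, loader: Callable[[], Any]) -> Any:
        """Return a cached schema that is still fresh, or call loader and cache its result."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = loader()
        self._entries[key] = (now, value)
        return value

    def invalidate_table(self, engine: str, table: str) -> None:
        """Drop every cached snapshot of one table, for example after a DDL event."""
        for key in [k for k in self._entries if k[:2] == (engine, table)]:
            del self._entries[key]
```

A caller would wrap the introspection call in `get_or_load`, so a fan-out of runs against the same table within the TTL issues one catalog read instead of many.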

Failure Modes and Diagnostic Runbook

Each named failure mode below lists its cause, the signal that detects it, and the remediation.

Silent type downgrade. Cause: a column changed engine type across a migration — TIMESTAMP_TZ became TIMESTAMP, or DECIMAL(12,4) became DECIMAL(12,2). Signal: the diff reports a type_mismatch or a negative scale drift; downstream, mismatches would concentrate in one typed column. Remediation: treat as breaking, halt, and reconcile the type at the source before re-running — never loosen normalization to make the mismatch disappear.
Phantom parity from over-eager normalization. Cause: the type map collapsed two genuinely distinct types (e.g. mapping TIMESTAMP_TZ to the same canonical as TIMESTAMP). Signal: the pre-check passes but column-level checksum generation then produces mismatches on temporal columns. Remediation: split the offending alias in TYPE_NORMALIZATION_MAP so timezone-bearing and naive types normalize distinctly.
Catalog timeout mistaken for schema failure. Cause: a rate-limited or slow ACCOUNT_USAGE/INFORMATION_SCHEMA read raises inside the gate. Signal: intermittent FAIL decisions correlated with catalog latency spikes, not with any DDL. Remediation: catch the transient error distinctly (Step 4), retry with exponential backoff behind a circuit breaker, and only FAIL on a genuine contract violation.
Column reordering false alarm. Cause: strict_order=True on a contract where ordinal position is not semantically meaningful. Signal: a FAIL whose only finding is an order difference on an otherwise-identical column set. Remediation: set strict_order=False unless positional access (e.g. COPY by ordinal) actually depends on order.
Stale contract drift. Cause: an intentional, approved schema evolution shipped but the pinned contract was never bumped. Signal: a wave of identical WARN/FAIL reports across every run of one table immediately after a release. Remediation: route approved drift through the flow described in Automating schema drift validation before hashing, which applies dynamic casting and projection, then promote the new structure into a fresh contract version.
Nullability trap after backfill. Cause: a column the contract marks required became nullable (or vice versa) during a data backfill. Signal: a nullability_violation in the report even though every value currently present is non-null. Remediation: decide per policy. Widening is usually a WARN, but a required key going nullable should FAIL and, if the column is an identity or audit field, go to review under the security boundaries for reconciliation reference.

The discipline that keeps the gate trustworthy: a FAIL is a signal to reconcile structure at the source, never a prompt to relax the contract until the run turns green.
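
The phantom-parity failure mode above can be guarded against with a unit test over the alias map itself. The sketch below assumes hand-maintained lists of timezone-aware and naive aliases (`TZ_AWARE`, `TZ_NAIVE`) and a hypothetical `find_tz_collisions` helper. The map is an excerpt of the Step 1 map, repeated so the check runs on its own.

```python
from typing import Dict, List, Tuple

# Excerpt of the alias map from Step 1, repeated so the check runs alone.
TYPE_NORMALIZATION_MAP: Dict[str, str] = {
    "TIMESTAMP": "DATETIME", "TIMESTAMP_NTZ": "DATETIME",
    "TIMESTAMP_TZ": "TIMESTAMP_TZ", "TIMESTAMPTZ": "TIMESTAMP_TZ",
}

# Hypothetical lists of which raw aliases carry a timezone and which do not.
TZ_AWARE = ("TIMESTAMP_TZ", "TIMESTAMPTZ")
TZ_NAIVE = ("TIMESTAMP", "TIMESTAMP_NTZ")


def find_tz_collisions(type_map: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (aware, naive) alias pairs that normalize to the same canonical type."""
    return [
        (aware, naive)
        for aware in TZ_AWARE
        for naive in TZ_NAIVE
        if type_map.get(aware, aware) == type_map.get(naive, naive)
    ]


# The map as written keeps the two families apart.
assert find_tz_collisions(TYPE_NORMALIZATION_MAP) == []
```

Running this check in CI whenever the map changes catches an over-eager alias before it reaches a reconciliation run.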

In This Reference

This pre-check model is developed further in a dedicated companion reference:

The automating schema drift validation before hashing guide turns the manual gate into an automated one: it applies dynamic type casting and column projection to acceptable drift so reconciliation continues without a human in the loop, while still preserving the audit record.

Frequently Asked Questions

Why validate against a pinned contract instead of just comparing the two live schemas?

Comparing source against target only tells you they agree with each other — it cannot tell you they agree with what the reconciliation was designed for. Two engines can both drift in the same direction and still be wrong. A pinned, versioned contract records the expected structure independently of either engine, so the gate decision names the exact control it enforced, which is what an audit needs. Live-schema comparison is a fine bootstrap when no contract exists yet, but promote the result into a contract as soon as you have one.

Should the pre-check ever read table rows to confirm a type?

No. The moment it issues a SELECT against the table, its latency couples to dataset size and it stops being a cheap gate. Every fact it needs — column names, logical types, nullability, precision, ordinal position — lives in the catalog (INFORMATION_SCHEMA, ACCOUNT_USAGE, the Glue/Hive metastore). Keep it metadata-only so a ten-billion-row table costs the same as a ten-row one.

How do I decide which drift is acceptable and which halts the run?

Encode it in an explicit tolerance policy rather than in ad-hoc conditionals. Nullable widening and precision/scale increase are typically non-breaking and warrant a WARN with telemetry; type changes, missing columns, timezone stripping, and precision loss silently corrupt digests and must FAIL. The rule of thumb: any change that could make two logically equal rows serialize to different bytes — or hide a truncation — is breaking.

What happens when the catalog API is rate-limited mid-run?

Treat it as an availability problem, not a schema violation. A throttled or timed-out metadata read must be caught distinctly from a contract violation, retried with exponential backoff behind a circuit breaker, and only escalated after the retry budget is exhausted. Returning FAIL on a transient catalog error aborts healthy runs and pollutes the drift metrics with noise.

Related

Data extraction & hashing workflows: the stage overview whose entry this pre-check gates.
Parallel row extraction techniques: the extraction tier that opens a cursor only once this gate returns PASS or an approved WARN.
Column-level checksum generation: the digesting stage whose determinism depends on the type parity this gate enforces.
Cross-platform schema mapping: where the canonical contract and cross-engine type rules are authored.
Automating schema drift validation before hashing: turning acceptable drift into automated casting and projection.

For the introspection and typing primitives used above, consult Python’s DB-API 2.0 specification (PEP 249) and the dataclasses reference.
