Detecting Structural Mismatches in Parquet Files

This page answers one precise question: given two Parquet datasets that are supposed to hold the same table — a source written by one engine and a target rewritten by another — how do you prove their shapes agree without reading a single row group? It is the format-specific runbook that sits under the structural mismatch detection reference, and it assumes you already hold a canonical type map from that stage plus the physical type translations fixed by the cross-platform schema mapping reference. What follows turns those contracts into a footer-only validator you can run across a partitioned lake, written for the data engineers, migration specialists, Python pipeline builders, and platform operations teams who own cutover sign-off.

Parquet deserves its own runbook because its structural mismatches almost never announce themselves as read failures. They surface three joins downstream as aggregation drift, as a type-coercion exception in a strict SQL engine, or as a compliance-audit failure — long after the migration was declared “done” because the row counts agreed. The only defensible check reads the file footer, normalizes the declared schema into an engine-agnostic form, and diffs the two shapes before any value-comparison work — the JSON and Parquet diffing engine — is allowed to run.

Problem Framing

You have migrated a 4 TB event table written by a nightly PySpark job into files rewritten by a Trino CREATE TABLE AS SELECT, and a downstream DuckDB mart reads both. Row counts match to the record, so the migration ticket is closed. Then a monthly revenue rollup drifts by fractions of a cent, and a compliance export is quarantined for a schema-contract violation. The causes are all structural and all invisible to a count check: a DECIMAL(18,4) written by Spark landed as a DECIMAL(19,4) after Trino’s precision inference, an INT32 was promoted to INT64 during an append, a TIMESTAMP lost its UTC annotation and became naive, and a SELECT * rewrite reordered two columns so a positional diff reports every file as broken.

The rule that makes this tractable: read footers, not payloads, and normalize before you compare. A Parquet footer is kilobytes; a full file is gigabytes. Extracting the logical schema from the footer lets a single worker validate the shape of a multi-gigabyte partition in tens of milliseconds, and normalizing engine-specific type spellings first is what stops a benign INT32 → INT64 promotion from reading as a fatal mismatch.

Only the footer bytes are read: both files' declared schemas are normalized to an engine-agnostic form and diffed by field name into four buckets, then a verdict gate sorts the result from an exact match through a tolerated widening down to an irreconcilable mismatch.
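
The normalization step can be sketched as a small spelling table plus a decimal parser. This is a minimal illustration, not the canonical type map from the parent reference: `CANONICAL_SPELLINGS` and `normalize_type` are hypothetical names, and the entries are assumptions about common Spark, Trino, and DuckDB spellings that you would replace with your own mapping.

```python
import re

# Hypothetical spelling table: engine-specific names on the left, Arrow-style
# canonical names on the right. Illustrative only; extend it from your own
# cross-platform schema mapping.
CANONICAL_SPELLINGS = {
    "int": "int32", "integer": "int32", "int32": "int32",
    "bigint": "int64", "long": "int64", "int64": "int64",
    "real": "float", "float": "float",
    "double": "double",
    "varchar": "string", "string": "string",
}

# Matches decimal(18,4), DECIMAL(18, 4), numeric(18,4), decimal128(18, 4).
_DECIMAL = re.compile(r"^(?:decimal|decimal128|numeric)\(\s*(\d+)\s*,\s*(\d+)\s*\)$")


def normalize_type(spelling: str) -> str:
    """Map one engine-specific type spelling to a canonical, comparable string."""
    text = " ".join(spelling.strip().lower().split())  # fold case and whitespace
    match = _DECIMAL.match(text)
    if match:
        precision, scale = match.groups()
        return f"decimal128({int(precision)}, {int(scale)})"
    # Unknown spellings pass through (lowercased) so they surface as a mismatch
    # instead of being silently coerced to something that looks compatible.
    return CANONICAL_SPELLINGS.get(text, text)
```

Passing unknown spellings through unchanged is a deliberate choice in this sketch: an unmapped type then shows up as an explicit mismatch rather than as a false match.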

Implementation

The validator extracts the logical schema from each footer, normalizes both sides, and computes a structured diff that separates critical drift from benign promotions the tolerance profile permits. Cross-engine confirmation and an explicit fallback chain follow. The normalize-then-compare contract is deliberately the same one used by column-level checksum generation, applied to declared types instead of row values.

Step 1 — Extract the canonical schema from the footer

Read the logical schema without materializing data. PyArrow’s schema_arrow reads only the footer at the tail of the file; no row groups are scanned, so cost scales with schema size, not data volume.

python

import logging
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

import pyarrow as pa
import pyarrow.parquet as pq

logging.basicConfig(
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s", level=logging.INFO
)
logger = logging.getLogger("recon.parquet.structural")


def extract_canonical_schema(file_path: str) -> pa.Schema:
    """Extract the logical Parquet schema from the footer — zero row groups read."""
    try:
        pf = pq.ParquetFile(file_path, memory_map=True)
        # schema_arrow is decoded from the file footer only; no data pages load.
        return pf.schema_arrow
    except FileNotFoundError:
        logger.error("Parquet file not found: %s", file_path)
        raise
    except (pa.ArrowInvalid, OSError) as exc:
        logger.error("Footer read failed for %s: %s", file_path, exc, exc_info=True)
        raise

Step 2 — Compute a tolerance-aware structural diff

A raw schema-equality check fails on every benign promotion. Isolate critical mismatches while applying a configurable tolerance map (the allowed widenings that your tolerance profile declares; see threshold tuning for tolerance) and gate the verdict on the fields your equivalence contract marks critical, decided once in data equivalence modeling.

python

@dataclass(frozen=True)
class SchemaDiffReport:
    missing_fields: List[str]                    # in source, absent in target
    extra_fields: List[str]                      # in target, absent in source
    type_mismatches: List[Tuple[str, str, str]]  # (field, source_type, target_type)
    nullability_shifts: List[str]                # required <-> optional flips
    is_compliant: bool


# Widenings that carry no business meaning and must not read as drift.
DEFAULT_TOLERANCE: Dict[str, List[str]] = {
    "int32": ["int64"],
    "float": ["double"],
    "decimal128(18, 4)": ["decimal128(19, 4)"],
}
# Fields whose shape may never drift, even under tolerance.
CRITICAL_FIELDS = frozenset({"transaction_id", "event_timestamp", "user_id"})


def compute_structural_diff(
    src: pa.Schema,
    tgt: pa.Schema,
    tolerance_map: Optional[Dict[str, List[str]]] = None,
) -> SchemaDiffReport:
    """Structural drift between two Parquet schemas, tolerant of benign promotions."""
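    # Fall back to the defaults only when no map is supplied; an explicit empty
    # map means "tolerate nothing" and must be honored.
    if tolerance_map is None:
        tolerance_map = DEFAULT_TOLERANCE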
    src_fields = {f.name: f for f in src}
    tgt_fields = {f.name: f for f in tgt}

    missing = sorted(n for n in src_fields if n not in tgt_fields)
    extra = sorted(n for n in tgt_fields if n not in src_fields)
    type_mismatches: List[Tuple[str, str, str]] = []
    nullability_shifts: List[str] = []

    for name in sorted(src_fields.keys() & tgt_fields.keys()):
        src_f, tgt_f = src_fields[name], tgt_fields[name]
        src_type, tgt_type = str(src_f.type).lower(), str(tgt_f.type).lower()
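        # Critical fields bypass the tolerance map: any type change on them is drift.
        tolerated = name not in CRITICAL_FIELDS and tgt_type in tolerance_map.get(
            src_type, []
        )
        if src_type != tgt_type and not tolerated: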
            type_mismatches.append((name, src_type, tgt_type))
        if src_f.nullable != tgt_f.nullable:
            nullability_shifts.append(name)

    mismatched_names = {m[0] for m in type_mismatches}
    critical_drift = any(
        f in missing or f in extra or f in mismatched_names for f in CRITICAL_FIELDS
    )
    report = SchemaDiffReport(
        missing_fields=missing,
        extra_fields=extra,
        type_mismatches=type_mismatches,
        nullability_shifts=nullability_shifts,
        is_compliant=not critical_drift and not type_mismatches,
    )
    if not report.is_compliant:
        logger.warning(
            "structural drift: %d missing, %d extra, %d type-mismatch, %d nullability",
            len(missing), len(extra), len(type_mismatches), len(nullability_shifts),
        )
    return report
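
The default tolerance map hard-codes a single decimal widening. Where many precision and scale pairs occur, the direction rule can be derived instead of enumerated. The sketch below is one way to do that under the assumption that type strings follow Arrow's `decimal128(p, s)` spelling; `parse_decimal` and `is_lossless_decimal_widening` are hypothetical helpers, not part of the validator above.

```python
import re
from typing import Optional, Tuple

# Accepts decimal(p, s), decimal128(p, s), decimal256(p, s), case-insensitively.
_DECIMAL = re.compile(r"^decimal\d*\(\s*(\d+)\s*,\s*(\d+)\s*\)$")


def parse_decimal(type_str: str) -> Optional[Tuple[int, int]]:
    """Return (precision, scale) for a decimal type string, else None."""
    match = _DECIMAL.match(type_str.strip().lower())
    return (int(match.group(1)), int(match.group(2))) if match else None


def is_lossless_decimal_widening(src_type: str, tgt_type: str) -> bool:
    """True when every source value fits in the target without truncation.

    The target must keep at least as many fractional digits (scale) and at
    least as many integer digits (precision minus scale) as the source.
    """
    src, tgt = parse_decimal(src_type), parse_decimal(tgt_type)
    if src is None or tgt is None:
        return False  # not a decimal pair; leave the decision to other rules
    (src_p, src_s), (tgt_p, tgt_s) = src, tgt
    return tgt_s >= src_s and (tgt_p - tgt_s) >= (src_p - src_s)
```

Note that raising precision and scale together can still narrow the integer part: `decimal128(18, 4)` to `decimal128(19, 6)` drops from 14 to 13 integer digits, so this check rejects it.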

Step 3 — Confirm across engines and degrade gracefully

A footer read is authoritative only if it agrees with how the consuming engine will actually interpret the file. Cross-check the extracted shape against DuckDB’s Parquet reader — SQL-native introspection that reads metadata, not rows — and wrap the whole path in an explicit fallback chain so a corrupt footer or a network timeout quarantines one file instead of stalling the run. The tiered halt-and-quarantine response itself is owned by the fallback chain implementation reference.

The chain never crashes on an unreadable footer: each failure escalates one tier deeper — PyArrow, then DuckDB metadata, then a row-sample probe — and any tier that resolves the shape returns the report, while total failure quarantines the single file and halts only its partition.

| Priority | Fallback mechanism | Trigger condition | Action and SLA impact |
| --- | --- | --- | --- |
| 1 | PyArrow footer diff | Standard execution | < 2 s per file, zero data read; the primary path. |
| 2 | DuckDB SQL introspection | PyArrow raises `ArrowInvalid` or `OSError` on the footer | Isolated DuckDB process reads metadata via `parquet_scan()`; adds ~500 ms. |
| 3 | Row-sample schema probe | Both metadata readers fail | Reads a 100-row batch and infers shape from the payload; adds ~2–5 s. |
| 4 | Quarantine and alert | All readers time out or disagree | Routes the file to a `quarantine/` prefix, emits a structured alert with the path and last-known-good hash, halts the partition. |
| Compliance / regulatory | Signed audit record | Any tier past 1 | Every degraded verdict is written to append-only audit storage with the tier reached, so a reviewer can prove why a file took the degraded path. |

python

def validate_duckdb_parity(file_path: str, expected: pa.Schema) -> bool:
    """Cross-check the extracted shape against DuckDB's Parquet reader (metadata only)."""
    import duckdb

    con = duckdb.connect(":memory:")
    try:
        rows = con.execute(
            "DESCRIBE SELECT * FROM parquet_scan(?)", [file_path]
        ).fetchall()
    finally:
        con.close()
    duckdb_columns = {r[0] for r in rows}
    missing = [f.name for f in expected if f.name not in duckdb_columns]
    if missing:
        logger.warning("DuckDB reader missing columns %s in %s", missing, file_path)
    return not missing


def resilient_schema_diff(file_path: str, expected: pa.Schema) -> SchemaDiffReport:
    """Explicit fallback chain: PyArrow -> DuckDB -> row sample -> quarantine."""
    try:                                              # Tier 1: PyArrow footer
        return compute_structural_diff(expected, extract_canonical_schema(file_path))
    except (pa.ArrowInvalid, OSError) as exc:
        logger.warning("tier 1 footer read failed for %s: %s", file_path, exc)

    try:                                              # Tier 2: DuckDB introspection
        if validate_duckdb_parity(file_path, expected):
            logger.info("tier 2 DuckDB parity passed for %s", file_path)
            return SchemaDiffReport([], [], [], [], is_compliant=True)
    except Exception as exc:                          # noqa: BLE001 - degrade, never crash
        logger.warning("tier 2 DuckDB introspection failed for %s: %s", file_path, exc)

    try:                                              # Tier 3: row-sample probe
        batch = next(pq.ParquetFile(file_path).iter_batches(batch_size=100))
        return compute_structural_diff(expected, batch.schema)
    except (StopIteration, pa.ArrowInvalid, OSError) as exc:
        logger.error("tier 3 row-sample probe failed for %s: %s", file_path, exc)

    logger.critical("all fallbacks exhausted; quarantining %s", file_path)
    raise RuntimeError(f"structural validation exhausted for {file_path}; quarantined")
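
The chain above raises once every tier is exhausted, so a caller has to turn that exception into the quarantine-and-halt behavior described for tier 4. A minimal sketch of such a driver follows; `PartitionOutcome` and `validate_lake` are hypothetical names, and `validate_file` stands in for a callable such as `resilient_schema_diff` with the expected schema already bound.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PartitionOutcome:
    verified: List[str] = field(default_factory=list)
    quarantined: List[str] = field(default_factory=list)
    halted: bool = False


def validate_lake(
    partitions: Dict[str, List[str]],
    validate_file: Callable[[str], object],
) -> Dict[str, PartitionOutcome]:
    """Validate each partition; an exhausted file halts only its own partition."""
    outcomes: Dict[str, PartitionOutcome] = {}
    for key, files in partitions.items():
        outcome = PartitionOutcome()
        for path in files:
            try:
                validate_file(path)
            except RuntimeError:
                # Every fallback tier failed: quarantine this file and stop
                # this partition, but keep going with the remaining ones.
                outcome.quarantined.append(path)
                outcome.halted = True
                break
            outcome.verified.append(path)
        outcomes[key] = outcome
    return outcomes
```

In a real run the quarantine branch would also move the object and emit the structured alert; both are left out here because their shape depends on your storage and alerting stack.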

Key Implementation Notes

Read footers, never row groups — this is the whole point. pq.ParquetFile(file).schema_arrow and pq.read_metadata() decode only the trailing footer, so a shape check costs kilobytes regardless of file size. The moment an extractor is widened to read data, cost scales with volume and the job that was meant to be cheap becomes the bottleneck it was meant to remove.
Type promotion is benign until it truncates. An INT32 → INT64 widening or a FLOAT → DOUBLE promotion loses no information and belongs in the tolerance map; a DECIMAL(19,4) → DECIMAL(18,4) narrowing or a DOUBLE → FLOAT demotion silently truncates and must always be a critical mismatch. Encode the direction explicitly — a tolerance map that lists only widenings is a safety property, not a convenience.
Field order and nullability are policy, not facts. Parquet preserves declared field order, so a positional diff reports a SELECT * reorder as total divergence; compare by name and let the ordering policy decide whether order carries meaning. Likewise a REQUIRED → OPTIONAL flip on an append-only column is usually harmless and on a primary key never is — surface it, and let the tolerance profile classify it.
Timezone annotations are the classic silent Parquet defect. A TIMESTAMP column’s UTC-vs-naive annotation is optional in the format and inconsistently enforced across readers, so a lost timezone passes a naive shape check yet corrupts every temporal join downstream. Treat the timezone flag as part of the type, not a detail to normalize away, whenever the column feeds a time-based aggregation.
A hash match is a compliance decision. When a structural verdict is recorded in an audit trail, back it with a NIST-approved hash of the normalized schema tree rather than a fast non-cryptographic digest; the throughput win of xxHash is not worth an unauditable fingerprint. The full algorithm trade-off lives in the parent structural mismatch detection reference.
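
The audit fingerprint from the last note can be sketched with the standard library. This assumes the normalized schema has already been reduced to `(name, type, nullable)` tuples; `schema_fingerprint` is a hypothetical helper, and SHA-256 is used as one NIST-approved choice rather than as the algorithm the parent reference prescribes.

```python
import hashlib
import json
from typing import Iterable, Tuple

SchemaField = Tuple[str, str, bool]  # (name, normalized type, nullable)


def schema_fingerprint(fields: Iterable[SchemaField]) -> str:
    """SHA-256 over a canonical, order-insensitive encoding of the schema."""
    canonical = sorted(
        ([name, type_str.strip().lower(), bool(nullable)]
         for name, type_str, nullable in fields),
        key=lambda item: item[0],  # sort by field name so a reorder is neutral
    )
    payload = json.dumps(canonical, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Sorting by name makes a column reorder hash-neutral, which matches a compare-by-name policy. If your ordering policy treats position as meaningful, drop the sort so the fingerprint changes when columns move.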

Verification

Assert the two properties the validator exists to guarantee: a benign promotion the tolerance map permits reconciles as compliant, and a truncating demotion or a dropped critical field is caught. Field order and metadata differences must not manufacture a mismatch.

python

base = pa.schema([
    pa.field("transaction_id", pa.string(), nullable=False),
    pa.field("amount", pa.int32(), nullable=True),
    pa.field("event_timestamp", pa.timestamp("us", tz="UTC"), nullable=False),
])
# same table, columns reordered, amount widened int32 -> int64 (a tolerated promotion)
promoted = pa.schema([
    pa.field("event_timestamp", pa.timestamp("us", tz="UTC"), nullable=False),
    pa.field("amount", pa.int64(), nullable=True),
    pa.field("transaction_id", pa.string(), nullable=False),
])
assert compute_structural_diff(base, promoted).is_compliant      # reorder + widen reconcile

# amount narrowed to a type that truncates -> must fail
truncated = pa.schema([
    pa.field("transaction_id", pa.string(), nullable=False),
    pa.field("amount", pa.int16(), nullable=True),
    pa.field("event_timestamp", pa.timestamp("us", tz="UTC"), nullable=False),
])
report = compute_structural_diff(base, truncated)
assert not report.is_compliant and report.type_mismatches[0][0] == "amount"
logger.info("parquet structural harness passed")

Run it as a pre-cutover gate: with these assertions wrapped in a test function in test_parquet_structural.py, python -m pytest test_parquet_structural.py -q must pass before the migrated dataset is promoted, and a full-lake pass should report zero critical drift across every partition before the target is declared authoritative.

Operational Considerations

At scale this workload is I/O-bound on footer reads and effectively stateless in its diffing, which makes it embarrassingly parallel: the constraint is fan-out, not compute. Distribute extraction across a concurrent.futures.ProcessPoolExecutor or Ray, batch files by partition key to keep footer reads local to a storage node, and for lakes past 100k files run a manifest-driven pass so only newly written partitions are diffed. Cache each footer’s normalized schema hash in a low-latency key-value store keyed by file_path + etag, pre-warmed during off-peak windows, so an incremental load becomes a single point lookup rather than a re-hash. Enable s3fs or gcsfs connection pooling to amortize TLS handshakes, and for files on a local or mounted filesystem keep memory_map=True on the reader so the OS page cache serves repeated footer reads.
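
The manifest-driven, cached pass can be sketched as follows. Two assumptions are worth stating: the cache is a plain dict standing in for the key-value store, and the sketch uses a `ThreadPoolExecutor` rather than the `ProcessPoolExecutor` named above, because footer reads are I/O-bound and threads keep the example portable. `verify_incremental` and `hash_footer` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple

CacheKey = Tuple[str, str]  # (file_path, etag)


def verify_incremental(
    manifest: Iterable[CacheKey],
    cache: Dict[CacheKey, str],
    hash_footer: Callable[[str], str],
    max_workers: int = 16,
) -> Dict[CacheKey, str]:
    """Hash only the footers whose (path, etag) pair is not cached yet."""
    entries = list(manifest)
    # Deduplicate while preserving order, then keep only cache misses.
    pending = [e for e in dict.fromkeys(entries) if e not in cache]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        digests = pool.map(lambda entry: hash_footer(entry[0]), pending)
        for entry, digest in zip(pending, digests):
            cache[entry] = digest
    return {entry: cache[entry] for entry in entries}
```

Keying on the etag as well as the path means a rewritten file misses the cache and is hashed again, while an untouched file costs one lookup.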

Expose files_verified_total, critical_drift_rate, footer_read_latency_ms, and fallback_tier_reached so platform ops can alert on P95 latency and drift rate rather than absolute counts, and persist every SchemaDiffReport to an append-only audit table so drift frequency per upstream service is trendable. Run the diff suite against Spark, Trino, and DuckDB using identical Parquet fixtures on a schedule so engine-specific coercion rules stay documented rather than rediscovered mid-incident. Before the first pass, confirm the contracts are in place with the schema validation pre-checks stage, since an unmapped type spelling produces a partition-wide false mismatch that looks exactly like corruption.
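
A small in-process accumulator shows how the drift rate and P95 latency relate to the raw counters. This is only a sketch: `ReconMetrics` is a hypothetical class, the P95 uses the nearest-rank method, and a production setup would more likely export these through whatever metrics client the platform already runs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReconMetrics:
    files_verified_total: int = 0
    critical_drift_total: int = 0
    footer_read_latency_ms: List[float] = field(default_factory=list)

    def observe(self, latency_ms: float, critical_drift: bool) -> None:
        """Record one verified file."""
        self.files_verified_total += 1
        self.critical_drift_total += int(critical_drift)
        self.footer_read_latency_ms.append(latency_ms)

    @property
    def critical_drift_rate(self) -> float:
        if not self.files_verified_total:
            return 0.0
        return self.critical_drift_total / self.files_verified_total

    def latency_p95_ms(self) -> float:
        """Nearest-rank 95th percentile of observed footer-read latencies."""
        if not self.footer_read_latency_ms:
            return 0.0
        ordered = sorted(self.footer_read_latency_ms)
        rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
        return ordered[rank - 1]
```

Alerting on the rate and the percentile rather than on absolute counts keeps thresholds stable as the number of files per run grows.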

Frequently Asked Questions

Why not just compare row counts after the migration?

A row count proves cardinality and nothing about shape. Every DECIMAL could be widened, every timestamp stripped of its timezone, and two columns reordered, and the count still agrees to the record. Parquet structural mismatches surface as aggregation drift or an audit failure three joins downstream, long after a count check has signed off — which is exactly why you read the footer and diff the declared schema instead.

How do I read a Parquet schema without loading the data?

Read the footer. pq.ParquetFile(file).schema_arrow and pq.read_metadata(file) decode only the metadata block at the tail of the file — no row groups, no data pages. On object storage that is a single ranged GET of a few kilobytes, so validating the shape of a multi-gigabyte partition costs tens of milliseconds and scales with the number of columns, not the number of rows.

An INT32 became INT64 across engines — is that a real mismatch?

Not by itself. INT32 → INT64 and FLOAT → DOUBLE are widenings that lose no information, so they belong in the tolerance map and should reconcile as compliant. The dangerous direction is the reverse: a DECIMAL(19,4) narrowed to DECIMAL(18,4), or a DOUBLE demoted to FLOAT, silently truncates and must always be a critical mismatch. Encode only the safe widenings explicitly so a demotion can never slip through as benign.

Related

Structural mismatch detection — the parent reference that defines the canonical type map, structural hash, and tolerance decision chain this Parquet runbook specializes.
JSON and Parquet diffing algorithms — the value-comparison engine that runs only once these footer shapes match.
Threshold tuning for tolerance — sources the widenings and severity thresholds the normalized diff applies.
Fallback chain implementation — the tiered halt, quarantine, and deferred-reconcile response an unreadable footer triggers.
Schema validation pre-checks — the lightweight extraction-time gate that precedes this authoritative footer diff.
