Demo 02: Medallion Architecture Bronze → Silver → Gold Data Pipeline

Last updated on 01 Mar 2026

Platform: Trino + Iceberg + Delta Lake + MinIO (S3-compatible)
Dataset: NYC FHV High-Volume Trip Data (2026-01)
Goal: Implement a production-style medallion architecture that progressively refines raw data into analytics-ready Gold tables, comparing Iceberg vs. Delta at each layer.

1. Context & Learning Objectives

In Experiment 01 you established the ingestion foundation — raw Parquet files landed in MinIO, then registered as a Hive external table and ingested into both an Iceberg Bronze table and a Delta Bronze table with platform-level metadata columns (_ingest_timestamp, _source_prefix, _ingestion_id).

Experiment 02 builds the transformation pipeline on top of that foundation. By the end you will understand:

Why each medallion layer has a different schema contract and quality guarantee
How to implement idempotent MERGE (upsert) logic in both Iceberg and Delta via Trino
How partition pruning, hidden partitioning (Iceberg) vs. explicit partitioning (Delta) affect query performance
How to implement Silver-layer data quality checks as SQL assertions before promoting data to Gold
How to design a Gold aggregation table that a BI tool or ML feature store could consume directly
The operational difference between Iceberg's snapshot isolation and Delta's transaction log when doing incremental loads

2. Medallion Layer Definitions (this project)

Layer	Location	Guarantee	Schema evolution	Partition
Bronze	`s3a://demo/bronze/`	Raw copy + metadata; no deduplication	Append-only	`day(pickup_datetime)`
Silver	`s3a://demo/silver/`	Deduplicated, validated, typed, enriched	Evolving (add columns OK)	`day(pickup_datetime)`
Gold	`s3a://demo/gold/`	Business aggregates, SLA-grade quality	Stable (breaking changes → new table)	`pickup_date` (DATE)

3. Architecture Diagram

MinIO (s3a://demo/)
│
├── landing/nyc_taxi/2026/02/26/   ← Parquet files (immutable, raw)
│
├── bronze/
│   ├── nyc_fhvhv_iceberg/         ← iceberg.bronze.fhvhv_tripdata_raw
│   └── nyc_fhvhv_delta/           ← delta.bronze.fhvhv_tripdata_raw_delta
│
├── silver/
│   ├── nyc_fhvhv_iceberg/         ← iceberg.silver.fhvhv_tripdata_clean
│   └── nyc_fhvhv_delta/           ← delta.silver.fhvhv_tripdata_clean_delta
│
└── gold/
    ├── nyc_fhvhv_iceberg/         ← iceberg.gold.fhvhv_daily_summary
    └── nyc_fhvhv_delta/           ← delta.gold.fhvhv_daily_summary_delta

4. Experiment Steps

Step 0 — Verify Bronze Layer (Pre-flight)

Before building Silver, confirm the Bronze tables from Experiment 01 are populated and healthy.

-- Check Iceberg Bronze row count and partition distribution
SELECT
    day(pickup_datetime)    AS pickup_day,
    COUNT(*)                AS row_count,
    COUNT(DISTINCT _ingestion_id) AS unique_rows,
    MIN(_ingest_timestamp)  AS first_ingested,
    MAX(_ingest_timestamp)  AS last_ingested
FROM iceberg.bronze.fhvhv_tripdata_raw
GROUP BY day(pickup_datetime)
ORDER BY pickup_day;

-- Quick sanity check: any NULLs in key columns?
SELECT
    COUNT_IF(pickup_datetime IS NULL)    AS null_pickup,
    COUNT_IF(dropoff_datetime IS NULL)   AS null_dropoff,
    COUNT_IF(trip_miles <= 0)            AS non_positive_miles,
    COUNT_IF(base_passenger_fare < 0)    AS negative_fare
FROM iceberg.bronze.fhvhv_tripdata_raw;

Expected outcome: You should see rows distributed across multiple days within the partition. Note down the total row count — you will compare it to Silver after deduplication.

Step 1 — Create Silver Schema and Tables

Silver applies three transformations on top of Bronze: (1) deduplication using _ingestion_id, (2) validation filtering out rows that violate business rules, and (3) enrichment with computed columns like trip_duration_minutes and fare_per_mile.

1a — Iceberg Silver Table

CREATE SCHEMA IF NOT EXISTS iceberg.silver;

DROP TABLE IF EXISTS iceberg.silver.fhvhv_tripdata_clean;

CREATE TABLE iceberg.silver.fhvhv_tripdata_clean (
    -- business key (natural dedup key)
    trip_id                  VARCHAR,          -- SHA256 of (license+pickup_dt+dropoff_dt+pu+do)

    -- core trip fields
    hvfhs_license_num        VARCHAR,
    dispatching_base_num     VARCHAR,
    originating_base_num     VARCHAR,
    request_datetime         TIMESTAMP(6),
    on_scene_datetime        TIMESTAMP(6),
    pickup_datetime          TIMESTAMP(6),
    dropoff_datetime         TIMESTAMP(6),
    PULocationID             INTEGER,
    DOLocationID             INTEGER,

    -- raw metrics
    trip_miles               DOUBLE,
    trip_time                BIGINT,
    base_passenger_fare      DOUBLE,
    tolls                    DOUBLE,
    bcf                      DOUBLE,
    sales_tax                DOUBLE,
    congestion_surcharge     DOUBLE,
    airport_fee              DOUBLE,
    tips                     DOUBLE,
    driver_pay               DOUBLE,
    cbd_congestion_fee       DOUBLE,

    -- boolean flags (converted from VARCHAR 'Y'/'N')
    is_shared_request        BOOLEAN,
    is_shared_match          BOOLEAN,
    is_access_a_ride         BOOLEAN,
    is_wav_request           BOOLEAN,
    is_wav_match             BOOLEAN,

    -- Silver enrichment columns
    trip_duration_minutes    DOUBLE,           -- derived: (dropoff - pickup) in minutes
    fare_per_mile            DOUBLE,           -- derived: base_passenger_fare / trip_miles
    total_amount             DOUBLE,           -- derived: sum of all fare components

    -- data quality metadata
    dq_passed                BOOLEAN,          -- did this row pass all Silver DQ rules?
    dq_flags                 VARCHAR,          -- pipe-delimited list of failed rules (or NULL)

    -- lineage
    _bronze_ingestion_id     VARCHAR,          -- FK back to Bronze _ingestion_id
    _silver_timestamp        TIMESTAMP(6)      -- when Silver processing ran
)
WITH (
    format       = 'PARQUET',
    location     = 's3a://demo/silver/nyc_fhvhv_iceberg/',
    partitioning = ARRAY['day(pickup_datetime)']
);

1b — Delta Silver Table

CREATE SCHEMA IF NOT EXISTS delta.silver
WITH (location = 's3a://demo/silver/_schemas/silver/');

DROP TABLE IF EXISTS delta.silver.fhvhv_tripdata_clean_delta;

CREATE TABLE delta.silver.fhvhv_tripdata_clean_delta (
    trip_id                  VARCHAR,
    hvfhs_license_num        VARCHAR,
    dispatching_base_num     VARCHAR,
    originating_base_num     VARCHAR,
    request_datetime         TIMESTAMP(6),
    on_scene_datetime        TIMESTAMP(6),
    pickup_datetime          TIMESTAMP(6),
    dropoff_datetime         TIMESTAMP(6),
    PULocationID             INTEGER,
    DOLocationID             INTEGER,
    trip_miles               DOUBLE,
    trip_time                BIGINT,
    base_passenger_fare      DOUBLE,
    tolls                    DOUBLE,
    bcf                      DOUBLE,
    sales_tax                DOUBLE,
    congestion_surcharge     DOUBLE,
    airport_fee              DOUBLE,
    tips                     DOUBLE,
    driver_pay               DOUBLE,
    cbd_congestion_fee       DOUBLE,
    is_shared_request        BOOLEAN,
    is_shared_match          BOOLEAN,
    is_access_a_ride         BOOLEAN,
    is_wav_request           BOOLEAN,
    is_wav_match             BOOLEAN,
    trip_duration_minutes    DOUBLE,
    fare_per_mile            DOUBLE,
    total_amount             DOUBLE,
    dq_passed                BOOLEAN,
    dq_flags                 VARCHAR,
    _bronze_ingestion_id     VARCHAR,
    _silver_timestamp        TIMESTAMP(6)
)
WITH (
    location = 's3a://demo/silver/nyc_fhvhv_delta/'
);

Step 2 — Populate Silver via Bronze → Silver Transformation

This INSERT encapsulates the Bronze-to-Silver logic. Notice the CTE pattern: it is intentional. Breaking the transformation into named CTEs makes the logic readable, testable, and straightforward to convert into a dbt model or Airflow SQL task later.

-- ============================================================
-- Bronze → Silver: Iceberg version
-- Run this after Step 1a.
-- ============================================================
INSERT INTO iceberg.silver.fhvhv_tripdata_clean

WITH

-- CTE 1: Deduplicate Bronze on natural business key.
-- If the same trip was ingested twice (e.g. re-run without truncate),
-- keep only the FIRST occurrence by _ingest_timestamp.
deduped AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY hvfhs_license_num,
                            cast(pickup_datetime  AS VARCHAR),
                            cast(dropoff_datetime AS VARCHAR),
                            cast(PULocationID     AS VARCHAR),
                            cast(DOLocationID     AS VARCHAR)
               ORDER BY _ingest_timestamp ASC
           ) AS rn
    FROM iceberg.bronze.fhvhv_tripdata_raw
),

-- CTE 2: Retain only the first copy of each trip.
unique_trips AS (
    SELECT * FROM deduped WHERE rn = 1
),

-- CTE 3: Compute derived columns and convert flags.
enriched AS (
    SELECT
        -- trip_id: deterministic hash as the Silver business key
        to_hex(md5(to_utf8(
            hvfhs_license_num ||
            cast(pickup_datetime  AS VARCHAR) ||
            cast(dropoff_datetime AS VARCHAR) ||
            cast(PULocationID     AS VARCHAR) ||
            cast(DOLocationID     AS VARCHAR)
        )))                                                         AS trip_id,

        hvfhs_license_num,
        dispatching_base_num,
        originating_base_num,
        request_datetime,
        on_scene_datetime,
        pickup_datetime,
        dropoff_datetime,
        PULocationID,
        DOLocationID,
        trip_miles,
        trip_time,
        base_passenger_fare,
        tolls,
        bcf,
        sales_tax,
        congestion_surcharge,
        airport_fee,
        tips,
        driver_pay,
        cbd_congestion_fee,

        -- flag conversion: 'Y' → TRUE, anything else → FALSE
        (shared_request_flag = 'Y') AS is_shared_request,
        (shared_match_flag   = 'Y') AS is_shared_match,
        (access_a_ride_flag  = 'Y') AS is_access_a_ride,
        (wav_request_flag    = 'Y') AS is_wav_request,
        (wav_match_flag      = 'Y') AS is_wav_match,

        -- derived metrics
        date_diff('second', pickup_datetime, dropoff_datetime) / 60.0
                                                                    AS trip_duration_minutes,
        CASE WHEN trip_miles > 0
             THEN base_passenger_fare / trip_miles
             ELSE NULL
        END                                                         AS fare_per_mile,

        COALESCE(base_passenger_fare, 0)
          + COALESCE(tolls, 0)
          + COALESCE(bcf, 0)
          + COALESCE(sales_tax, 0)
          + COALESCE(congestion_surcharge, 0)
          + COALESCE(airport_fee, 0)
          + COALESCE(tips, 0)
          + COALESCE(cbd_congestion_fee, 0)                        AS total_amount,

        _ingestion_id                                               AS _bronze_ingestion_id,
        current_timestamp                                           AS _silver_timestamp

    FROM unique_trips
),

-- CTE 4: Apply data quality rules and annotate each row.
-- This pattern (annotate rather than filter) is deliberate:
-- you keep DQ-failed rows visible in Silver so you can audit them,
-- but Gold will filter to dq_passed = TRUE only.
dq_annotated AS (
    SELECT
        *,

        -- Evaluate all DQ rules; collect failures into a pipe-delimited string
        ARRAY_JOIN(
            FILTER(
                ARRAY[
                    CASE WHEN pickup_datetime IS NULL           THEN 'NULL_PICKUP_DT'      END,
                    CASE WHEN dropoff_datetime IS NULL          THEN 'NULL_DROPOFF_DT'     END,
                    CASE WHEN dropoff_datetime <= pickup_datetime THEN 'NEGATIVE_DURATION' END,
                    CASE WHEN trip_miles <= 0                   THEN 'NON_POSITIVE_MILES'  END,
                    CASE WHEN base_passenger_fare < 0           THEN 'NEGATIVE_FARE'       END,
                    CASE WHEN trip_duration_minutes > 480       THEN 'TRIP_OVER_8H'        END,
                    CASE WHEN trip_miles > 500                  THEN 'TRIP_OVER_500MI'     END
                ],
                x -> x IS NOT NULL
            ),
            '|'
        ) AS dq_flags_raw

    FROM enriched
)

SELECT
    trip_id,
    hvfhs_license_num,
    dispatching_base_num,
    originating_base_num,
    request_datetime,
    on_scene_datetime,
    pickup_datetime,
    dropoff_datetime,
    PULocationID,
    DOLocationID,
    trip_miles,
    trip_time,
    base_passenger_fare,
    tolls,
    bcf,
    sales_tax,
    congestion_surcharge,
    airport_fee,
    tips,
    driver_pay,
    cbd_congestion_fee,
    is_shared_request,
    is_shared_match,
    is_access_a_ride,
    is_wav_request,
    is_wav_match,
    trip_duration_minutes,
    fare_per_mile,
    total_amount,
    (dq_flags_raw = '' OR dq_flags_raw IS NULL) AS dq_passed,
    NULLIF(dq_flags_raw, '')                     AS dq_flags,
    _bronze_ingestion_id,
    _silver_timestamp
FROM dq_annotated;

For the Delta version, run the same SQL but change the target table to delta.silver.fhvhv_tripdata_clean_delta and the source to delta.bronze.fhvhv_tripdata_raw_delta. The transformation logic is identical — this is intentional, because in a real platform the transformation SQL is format-agnostic.

Step 3 — Silver Data Quality Report

After ingestion, run this report to understand the quality profile of the dataset. This is the type of query a data platform engineer would include in an automated pipeline health dashboard.

-- Silver DQ Summary
SELECT
    dq_passed,
    dq_flags,
    COUNT(*)                                AS trip_count,
    ROUND(AVG(trip_miles), 2)               AS avg_miles,
    ROUND(AVG(trip_duration_minutes), 2)    AS avg_duration_min,
    ROUND(AVG(base_passenger_fare), 2)      AS avg_fare
FROM iceberg.silver.fhvhv_tripdata_clean
GROUP BY dq_passed, dq_flags
ORDER BY trip_count DESC;

-- What fraction of trips passed DQ?
SELECT
    COUNT_IF(dq_passed)         AS passed,
    COUNT_IF(NOT dq_passed)     AS failed,
    COUNT(*)                    AS total,
    ROUND(100.0 * COUNT_IF(dq_passed) / COUNT(*), 2) AS pass_rate_pct
FROM iceberg.silver.fhvhv_tripdata_clean;

Record your results here:

Metric	Value
Total Silver rows	(fill in)
Rows passing DQ	(fill in)
DQ pass rate (%)	(fill in)
Most common DQ flag	(fill in)
Bronze rows (from Step 0)	(fill in)
Rows removed by dedup	(fill in)

Step 4 — Create Gold Schema and Tables

Gold contains pre-aggregated business metrics. The schema is intentionally narrow and stable — it models what a finance or ops team would put in a dashboard. Partition on pickup_date (DATE, not TIMESTAMP) because Gold queries are almost always date-range lookups.

4a — Iceberg Gold Table

CREATE SCHEMA IF NOT EXISTS iceberg.gold;

DROP TABLE IF EXISTS iceberg.gold.fhvhv_daily_summary;

CREATE TABLE iceberg.gold.fhvhv_daily_summary (
    -- grain: one row per (pickup_date, hvfhs_license_num)
    pickup_date              DATE,
    hvfhs_license_num        VARCHAR,

    -- volume metrics
    total_trips              BIGINT,
    total_shared_trips       BIGINT,
    total_wav_trips          BIGINT,

    -- distance & time metrics
    total_miles              DOUBLE,
    avg_miles_per_trip       DOUBLE,
    total_trip_minutes       DOUBLE,
    avg_trip_minutes         DOUBLE,

    -- financial metrics
    total_base_fare          DOUBLE,
    total_tolls              DOUBLE,
    total_tips               DOUBLE,
    total_driver_pay         DOUBLE,
    total_amount             DOUBLE,
    avg_fare_per_trip        DOUBLE,
    avg_fare_per_mile        DOUBLE,

    -- metadata
    _gold_timestamp          TIMESTAMP(6),
    _silver_row_count        BIGINT        -- how many Silver rows fed this aggregate
)
WITH (
    format       = 'PARQUET',
    location     = 's3a://demo/gold/nyc_fhvhv_iceberg/',
    partitioning = ARRAY['pickup_date']
);

4b — Delta Gold Table

CREATE SCHEMA IF NOT EXISTS delta.gold
WITH (location = 's3a://demo/gold/_schemas/gold/');

DROP TABLE IF EXISTS delta.gold.fhvhv_daily_summary_delta;

CREATE TABLE delta.gold.fhvhv_daily_summary_delta (
    pickup_date              DATE,
    hvfhs_license_num        VARCHAR,
    total_trips              BIGINT,
    total_shared_trips       BIGINT,
    total_wav_trips          BIGINT,
    total_miles              DOUBLE,
    avg_miles_per_trip       DOUBLE,
    total_trip_minutes       DOUBLE,
    avg_trip_minutes         DOUBLE,
    total_base_fare          DOUBLE,
    total_tolls              DOUBLE,
    total_tips               DOUBLE,
    total_driver_pay         DOUBLE,
    total_amount             DOUBLE,
    avg_fare_per_trip        DOUBLE,
    avg_fare_per_mile        DOUBLE,
    _gold_timestamp          TIMESTAMP(6),
    _silver_row_count        BIGINT
)
WITH (
    location = 's3a://demo/gold/nyc_fhvhv_delta/'
);

Step 5 — Populate Gold (Silver → Gold Aggregation)

-- ============================================================
-- Silver → Gold: Iceberg version
-- Only process DQ-passed rows (dq_passed = TRUE).
-- ============================================================
INSERT INTO iceberg.gold.fhvhv_daily_summary
SELECT
    CAST(pickup_datetime AS DATE)           AS pickup_date,
    hvfhs_license_num,

    COUNT(*)                                AS total_trips,
    COUNT_IF(is_shared_request)             AS total_shared_trips,
    COUNT_IF(is_wav_request)                AS total_wav_trips,

    ROUND(SUM(trip_miles), 4)               AS total_miles,
    ROUND(AVG(trip_miles), 4)               AS avg_miles_per_trip,
    ROUND(SUM(trip_duration_minutes), 2)    AS total_trip_minutes,
    ROUND(AVG(trip_duration_minutes), 2)    AS avg_trip_minutes,

    ROUND(SUM(base_passenger_fare), 2)      AS total_base_fare,
    ROUND(SUM(tolls), 2)                    AS total_tolls,
    ROUND(SUM(tips), 2)                     AS total_tips,
    ROUND(SUM(driver_pay), 2)               AS total_driver_pay,
    ROUND(SUM(total_amount), 2)             AS total_amount,
    ROUND(AVG(base_passenger_fare), 2)      AS avg_fare_per_trip,
    ROUND(AVG(fare_per_mile), 4)            AS avg_fare_per_mile,

    current_timestamp                       AS _gold_timestamp,
    COUNT(*)                                AS _silver_row_count

FROM iceberg.silver.fhvhv_tripdata_clean
WHERE dq_passed = TRUE          -- Gold only contains validated data
GROUP BY
    CAST(pickup_datetime AS DATE),
    hvfhs_license_num;

Apply the same INSERT to the Delta Gold table, sourcing from delta.silver.fhvhv_tripdata_clean_delta.

Step 6 — Gold Validation Queries

These queries prove your Gold table is correct and analytically useful. Run them and record the outputs.

-- 6a: Provider performance summary
SELECT
    hvfhs_license_num,
    SUM(total_trips)                        AS trips,
    ROUND(SUM(total_miles), 0)              AS total_miles,
    ROUND(SUM(total_amount), 2)             AS gross_revenue,
    ROUND(AVG(avg_fare_per_trip), 2)        AS avg_fare,
    ROUND(AVG(avg_trip_minutes), 1)         AS avg_duration_min,
    ROUND(AVG(avg_fare_per_mile), 3)        AS avg_fare_per_mile
FROM iceberg.gold.fhvhv_daily_summary
GROUP BY hvfhs_license_num
ORDER BY trips DESC;

-- 6b: Daily trend (useful for spotting data gaps)
SELECT
    pickup_date,
    SUM(total_trips)                        AS daily_trips,
    ROUND(SUM(total_amount), 2)             AS daily_revenue,
    ROUND(AVG(avg_trip_minutes), 1)         AS avg_duration
FROM iceberg.gold.fhvhv_daily_summary
GROUP BY pickup_date
ORDER BY pickup_date;

-- 6c: Cross-format consistency check (Iceberg vs Delta Gold must agree)
SELECT
    i.pickup_date,
    i.hvfhs_license_num,
    i.total_trips         AS iceberg_trips,
    d.total_trips         AS delta_trips,
    i.total_trips - d.total_trips AS diff
FROM iceberg.gold.fhvhv_daily_summary i
JOIN delta.gold.fhvhv_daily_summary_delta d
  ON i.pickup_date = d.pickup_date
  AND i.hvfhs_license_num = d.hvfhs_license_num
WHERE i.total_trips <> d.total_trips   -- only show discrepancies
ORDER BY ABS(i.total_trips - d.total_trips) DESC;

Expected for 6c: Zero rows. If you see discrepancies, trace back to whether the Bronze source tables were identical and whether both Silver transforms ran against the same input.

Step 7 — Table Format Introspection

This step reinforces your understanding of what Iceberg and Delta actually write to storage.

-- Iceberg: inspect snapshots created by each INSERT
SELECT
    committed_at,
    snapshot_id,
    operation,
    summary['added-records']        AS rows_added,
    summary['total-records']        AS total_records,
    summary['added-files-size']     AS bytes_added
FROM iceberg.silver."fhvhv_tripdata_clean$snapshots"
ORDER BY committed_at;

-- Iceberg: inspect physical files per partition
SELECT
    file_path,
    file_format,
    partition,
    record_count,
    file_size_in_bytes
FROM iceberg.silver."fhvhv_tripdata_clean$files"
ORDER BY partition;

-- Delta: inspect transaction history
SELECT
    version,
    timestamp,
    operation,
    operationParameters
FROM delta.silver."fhvhv_tripdata_clean_delta$history"
ORDER BY version;

Discussion questions to answer in your notes:

How many Parquet files did Iceberg create in Silver, and why that number?
What is the snapshot ID of the INSERT operation in Iceberg Silver? What would happen if you ran the INSERT again without truncating?
In the Delta history, what operation type is recorded for the INSERT? How does this differ from a MERGE?

Step 8 — (Optional Advanced) Incremental Load Simulation

This step simulates what happens in a real pipeline when new data arrives the next day.

-- Simulate a second batch: re-ingest a small subset of rows
-- with slightly different _ingest_timestamp to test deduplication.
INSERT INTO iceberg.bronze.fhvhv_tripdata_raw
SELECT
    hvfhs_license_num,
    dispatching_base_num,
    originating_base_num,
    request_datetime,
    on_scene_datetime,
    pickup_datetime,
    dropoff_datetime,
    PULocationID,
    DOLocationID,
    trip_miles,
    trip_time,
    base_passenger_fare,
    tolls,
    bcf,
    sales_tax,
    congestion_surcharge,
    airport_fee,
    tips,
    driver_pay,
    cbd_congestion_fee,
    shared_request_flag,
    shared_match_flag,
    access_a_ride_flag,
    wav_request_flag,
    wav_match_flag,
    current_timestamp           AS _ingest_timestamp,       -- new timestamp = "second run"
    's3a://demo/landing/nyc_taxi/2026/02/26/'  AS _source_prefix,
    to_hex(md5(to_utf8(
        hvfhs_license_num ||
        cast(pickup_datetime AS VARCHAR) ||
        cast(dropoff_datetime AS VARCHAR) ||
        cast(PULocationID AS VARCHAR) ||
        cast(DOLocationID AS VARCHAR)
    )))                         AS _ingestion_id
FROM iceberg.bronze.fhvhv_tripdata_raw
WHERE CAST(pickup_datetime AS DATE) = DATE '2026-01-15'  -- pick one day as the "new batch"
LIMIT 5000;

-- Now re-run the Silver INSERT (Step 2).
-- The Silver CTE deduplication should absorb the duplicates cleanly.
-- Verify: Silver row count should NOT increase by 5000.
SELECT COUNT(*) FROM iceberg.silver.fhvhv_tripdata_clean;

5. Experiment Results Template

Fill in after completing all steps.

Metric	Iceberg	Delta
Bronze row count
Silver row count (after dedup)
Rows removed by deduplication
DQ pass rate (%)
Gold rows (distinct date × provider)
Silver INSERT duration (approx.)
Gold INSERT duration (approx.)
Number of Parquet files in Silver
Storage size of Silver layer (bytes)
Cross-format Gold discrepancies

6. Key Concepts Reinforced

Why annotate DQ failures rather than filter them at Silver? Because Silver is the audit layer. Finance and compliance teams need to understand what was rejected and why. Filtering at Silver would destroy that audit trail. Gold is where you enforce the hard quality gate (WHERE dq_passed = TRUE).

Why use a CTE chain instead of nested subqueries? CTE chains map directly to dbt models. Each CTE (deduped, unique_trips, enriched, dq_annotated) becomes a potential intermediate model. Writing it this way now costs nothing and makes the future migration to dbt trivial.

Why does Gold use DATE instead of TIMESTAMP for partitioning? Because every Gold query is a date range scan — "show me the last 30 days of revenue". Partitioning on DATE eliminates entire partitions from scan with a simple WHERE pickup_date BETWEEN .... Timestamp partitioning at Gold would fragment the partition layout unnecessarily.

What is the difference between Iceberg snapshots and Delta transaction log? Iceberg writes immutable snapshot metadata files that point to data files. Delta writes a sequential JSON transaction log (_delta_log/). Both enable time travel, but Iceberg's snapshot model is better suited for concurrent multi-engine reads (Trino + Spark simultaneously), while Delta's log compaction (OPTIMIZE + VACUUM) is more mature in the Databricks ecosystem.

7. Next Experiment Preview — Experiment 03

Experiment 03: Incremental Merge & Time Travel

You will replace the naive INSERT-based Silver population with a proper MERGE INTO (upsert) statement, implement time travel queries to inspect historical versions of your Silver table, and measure the performance difference between a full-reload vs. incremental merge strategy on the same dataset.

Document version: 2026-02 | Platform: Trino 428 + Iceberg 1.4 + Delta 3.x + MinIO RELEASE.2024