Skip to main content

Bronze to Silver ETL Pipeline with SQL

This guide demonstrates how to implement a silver pipeline consuming from a bronze table created by Onehouse stream capture using Hudi and SQL. The workflow transforms raw change data (users_pii) into a curated Silver table (users_from_sql) using Hudi's Merge-On-Read (MOR) capabilities and SQL MERGE operations.

Use Case

We want to:

  • Ingest change events from a raw table on-demand.
  • Clean, deduplicate, and apply only the latest changes using a pre-combine key.

Pre-requisites

  • Raw/Bronze table users_pii created by Onehouse stream capture
  • Create a cluster with "Cluster Type" as "SQL"

Step-by-Step Instructions

Step 1: Inspect the Source Table

Read from the raw/bronze Hudi table that stores incoming change data.

SELECT * FROM otcarp.users_pii up;

Step 2: Create the Silver Table

Use Hudi's Merge-On-Read format to enable upserts and snapshot queries. This table holds the cleaned and deduplicated view of your user data.

CREATE TABLE otcarp.users_from_sql (
_change_operation_type STRING,
_upstream_event_processed_ts_ms BIGINT,
db_shard_source_partition STRING,
_event_origin_ts_ms BIGINT,
_event_tx_id BIGINT,
_event_lsn BIGINT,
_event_xmin BIGINT,
id INT,
name STRING,
ssn STRING,
email STRING,
signup_date TIMESTAMP
)
USING hudi
TBLPROPERTIES (
type = 'mor', -- Merge-On-Read storage for incremental updates
primaryKey = 'id', -- Record key for upserts
precombineField = '_event_lsn' -- Deduplicate using latest LSN
)
LOCATION 's3a://lakehouse-albert-us-west-2/demolake/otcarp/users_from_sql';

Step 3: Merge Incoming Changes

Use a SQL MERGE INTO operation to apply inserts and updates from the raw table.

MERGE INTO otcarp.users_from_sql AS target
USING (
SELECT
_change_operation_type,
_upstream_event_processed_ts_ms,
db_shard_source_partition,
_event_origin_ts_ms,
_event_tx_id,
_event_lsn,
_event_xmin,
id,
name,
ssn,
email,
signup_date
FROM otcarp.users_pii
) AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET * -- Update record if ID already exists
WHEN NOT MATCHED THEN INSERT *; -- Insert new records

Once this query succeeds, your table will be available in the Onehouse Data Catalog. target.png

Step 4: Query the Silver Table

From within the SQL Endpoint, you can now query the Silver table to see the silver table data.

SELECT * FROM otcarp.users_from_sql;

From external tools like Athena, you can query the Silver table using the following SQL:

SELECT * FROM "AwsDataCatalog"."otcarp"."users_from_sql_ro" limit 10;

athena.png

Next Steps

If you want to automate this process, you can use an orchestration tools like Dagster and Airflow to schedule the SQL ETL pipeline.