Redshift Spectrum Onehouse Integration

Context

This guide outlines how to integrate Onehouse-managed data with Amazon Redshift Spectrum for seamless querying. By setting up a Stream Capture in Onehouse and cataloging data into AWS Glue, users can expose Hudi-managed datasets directly to Redshift Spectrum through an external schema. This allows Redshift to efficiently query Onehouse-managed lakehouse data without needing to move or duplicate it, enabling powerful analytics on live data.

Steps

Onehouse: Create a stream capture

Follow the steps highlighted in the docs to create a Onehouse Stream Capture
Make sure you choose AWS Glue Catalog as one of the catalog options. Follow Create Catalog docs to setup this up
Once the first batch of ingestion is complete, the Metadata Sync service will catalog the data to Glue Data Catalog

AWS: Create external schema

Upon the first successful Metadata Sync to Glue, the table will be ready to be queried by Redshift Spectrum
A necessary pre-requisute to query the external tables in Redshift Spectrum is to create a schema

CREATE EXTERNAL SCHEMA IF NOT EXISTS <schema_name>
FROM data catalog
DATABASE '<glue_database_name>'
IAM_ROLE 'arn:aws:iam::<accountId>:role/redshift-glue-s3-full-access-role'
REGION '<region>';

Note1: To work as expected, Redshift needs to have access to Glue Catalog and S3. Make sure you add adequate policies to the Redshift Role.

Note2: Redshift Spectrum’s schema has a 1:1 mapping with Glue Database tables

So all the tables in <glue_database_name> database in AWS Glue can be queried from Redshift Spectrum now

AWS: Querying the tables

This includes queries like

SELECT * FROM <schema_name>.users_pii_ro;

redshift-query

Context​

Steps​

Onehouse: Create a stream capture​

AWS: Create external schema​

AWS: Querying the tables​

Context

Steps

Onehouse: Create a stream capture

AWS: Create external schema

AWS: Querying the tables