Connect dbt Core
dbt is a framework that makes it easy to transform data. Official support for Onehouse is documented at: https://docs.getdbt.com/docs/cloud/connect-data-platform/connect-onehouse.
You can use dbt Core within your VPC to run pipelines on the Onehouse SQL Cluster. To set up dbt Cloud, follow this solution guide.
dbt Core instructions:
- On your bastion host (or another machine within your VPC), install dbt Core and the `dbt-spark[PyHive]` adapter.
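For example, a minimal sketch of the install (assumes Python and pip are already available; `dbt-core` is installed automatically as a dependency of the adapter):

```sh
pip install "dbt-spark[PyHive]"
```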
- Clone an existing dbt project, or run `dbt init` to create a new one.
- Create or update the `profiles.yml` file to specify connection details for the target where tables will be written:

```yaml
dbt_project:
  outputs:
    dev:
      host: <SQL-CLUSTER-ENDPOINT>
      method: thrift
      port: 10000
      schema: <DATABASE>
      threads: 1
      type: spark
  target: dev
```
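The top-level key in `profiles.yml` (here `dbt_project`) must match the `profile` value in your project's `dbt_project.yml`. A minimal sketch, with a hypothetical project name:

```yaml
# dbt_project.yml (excerpt)
name: 'onehouse_demo'    # hypothetical project name
profile: 'dbt_project'   # must match the top-level key in profiles.yml
```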
- Create your model under the `models/` directory. Set configs at the top of the model file to specify how the table will be written. Here are some example configs (a complete example model appears after these steps):

```sql
{{
    config(
        materialized='incremental',
        file_format='hudi',
        incremental_strategy='merge',
        unique_key='store_id',
        partition_by='sale_date',
        location_root='s3://bucket/lake/database/',
        options={
            'type': 'mor',
            'primaryKey': 'sale_id',
            'precombineKey': 'event_timestamp'
        }
    )
}}
```
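For a model that is fully rewritten on each run, you can instead use the `table` materialization; a minimal sketch (note that `location_root` is required in this mode, per the note below):

```sql
{{
    config(
        materialized='table',
        file_format='hudi',
        location_root='s3://bucket/lake/database/'
    )
}}
```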
- In your query, you can reference an existing Onehouse table created by a Stream Capture with the following syntax: `from database_name.table_name`.
- Including `location_root` is required for `table` materialization mode, and optional for `incremental` materialization mode.
- Run `dbt run` from the command line to execute the model.
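Putting these steps together, here is a sketch of a complete incremental model, saved as `models/store_sales.sql` (the `store_sales` model name, the column names, and the `sale_amount` field are hypothetical; the table reference uses the syntax described above):

```sql
{{
    config(
        materialized='incremental',
        file_format='hudi',
        incremental_strategy='merge',
        unique_key='store_id',
        partition_by='sale_date',
        location_root='s3://bucket/lake/database/'
    )
}}

-- Reference an existing Onehouse table created by a Stream Capture
select
    store_id,
    sale_date,
    sale_amount,
    event_timestamp
from database_name.table_name

-- On incremental runs, only pick up rows newer than what this model already contains
{% if is_incremental() %}
where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```

On the first run, dbt creates the table; on subsequent runs, the `merge` strategy upserts rows on the `unique_key`. You can build just this model with `dbt run --select store_sales`.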