Connect dbt Core
dbt is a framework that makes it easy to transform data, with official support for Onehouse documented at: https://docs.getdbt.com/docs/cloud/connect-data-platform/connect-onehouse.
You can use dbt Core within your VPC to run pipelines on the Onehouse SQL Cluster. To set up dbt Core, follow this solution guide.
Connect to your VPC
By default, SQL Clusters only accept traffic from within your VPC. To submit queries with SQL clients outside the VPC, such as your local machine, follow these steps to connect through a VPN or bastion host.
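If you go the bastion host route, a minimal sketch of SSH local port forwarding is shown below; the bastion hostname, user, and key path are placeholders, and your networking setup may differ:

```bash
# Forward local port 10000 to the SQL Cluster's Thrift port through a bastion
# host (hostnames, user, and key path below are placeholders).
ssh -i ~/.ssh/bastion-key.pem \
    -N \
    -L 10000:<SQL-CLUSTER-ENDPOINT>:10000 \
    ec2-user@<BASTION-HOST>
# While the tunnel is open, SQL clients on this machine (including dbt) can
# reach the cluster at localhost:10000.
```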
Set up dbt Core
- On the machine connected to your VPC, install dbt Core and the `dbt-spark[PyHive]` adapter (a sketch of the install commands follows this list).
- Clone a dbt project or use the `dbt init` command to create a new dbt project.
- Create or update the `profiles.yml` file to specify connection details for the target where tables will be written (a connection check follows this list).
  - Port: 10000 (or your local forwarded port)
  - Host:
    - If connecting through a VPN, use the SQL endpoint URL from the Onehouse console
    - If connecting through a bastion host, use `localhost`
  - Template:

    ```yaml
    dbt_project:
      outputs:
        dev:
          host: <SQL-CLUSTER-ENDPOINT>
          method: thrift
          port: 10000
          schema: <DATABASE>
          threads: 1
          type: spark
      target: dev
    ```
- Create your model under the `models/` directory. Set configs at the top of the model file to specify how the table will be written. Here are some example configs:

  ```sql
  {{
      config(
          materialized='incremental',
          file_format='hudi',
          incremental_strategy='merge',
          unique_key='store_id',
          partition_by='sale_date',
          location_root='s3://bucket/lake/database/',
          options={
              'type': 'mor',
              'primaryKey': 'sale_id',
              'precombineKey': 'event_timestamp'
          }
      )
  }}
  ```

  - Including `location_root` is required for `table` materialization mode, and optional for `incremental` materialization mode.
- In your query, you can reference an existing Onehouse table created by a Flow with the following syntax: `from database_name.table_name`.
- Run `dbt run` in the command line to execute the model (a run example follows this list).
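A sketch of the install step from the list above, assuming pip and a Python virtual environment; package versions are not pinned here:

```bash
# Install dbt Core and the Spark adapter with the PyHive extra used for
# Thrift connections (a virtual environment is assumed, not required).
python -m venv dbt-env && source dbt-env/bin/activate
pip install dbt-core "dbt-spark[PyHive]"
```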
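To confirm that `profiles.yml` can reach the SQL Cluster before running models, `dbt debug` can be used from the project directory; the profile and target names below match the template above:

```bash
# Validate the profile and test the connection to the SQL Cluster
# (run from the root of the dbt project).
dbt debug --profile dbt_project --target dev
```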
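Finally, a sketch of the run step; `store_sales` is a hypothetical model name used only for illustration:

```bash
# Execute every model in the project against the dev target.
dbt run
# Or run a single model (hypothetical name).
dbt run --select store_sales
```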