Skip to main content

Connect dbt core

dbt is a framework that makes it easy to transform data, with official support for Onehouse documented at: https://docs.getdbt.com/docs/cloud/connect-data-platform/connect-onehouse.

You can use dbt Core within your VPC to run pipelines on the Onehouse SQL Cluster. To set up dbt Cloud, follow this solution guide.

Connect to your VPC

By default, SQL Clusters only accept traffic from within your VPC. To submit queries with SQL clients outside the VPC, such as your local machine, follow these steps to connect through a VPN or bastion host.

Set up dbt Core

  1. On the machine connected to your VPC, install dbt Core and the dbt-spark[PyHive] adapter
  2. Clone a dbt project or use the dbt init command line to create a new dbt project
  3. Create or update the profiles.yml file to specify connection details for the target where tables will be written.
    1. Port: 10000 (or your local forwarded port)
    2. Host:
      1. If connecting through a VPN, use the SQL endpoint URL from the Onehouse console
      2. If connecting through a bastion host, use localhost
    3. Template:
      dbt_project:
      outputs:
      dev:
      host: <SQL-CLUSTER-ENDPOINT>
      method: thrift
      port: 10000
      schema: <DATABASE>
      threads: 1
      type: spark
      target: dev
  4. Create your model under the models/ directory. Set configs at the top of the model file to specify how the table will be written. Here are some example configs:
    {{
    config(
    materialized='incremental',
    file_format='hudi',
    incremental_strategy='merge',
    unique_key='store_id',
    partition_by='sale_date',
    location_root='s3://bucket/lake/database/',
    options={
    'type': 'mor',
    'primaryKey': 'sale_id',
    'precombineKey': 'event_timestamp'
    }
    )
    }}
    1. In your query, you can reference an existing Onehouse table created by a Flow with the following syntax: from database_name.table_name.
    2. Including location_root is required for table materialization mode, and optional for incremental materialization mode.
  5. Run dbt run in command line to execute the model