Connect dbt Core

dbt is a framework that makes it easy to transform data. Official support for Onehouse is documented at https://docs.getdbt.com/docs/cloud/connect-data-platform/connect-onehouse.

You can use dbt Core within your VPC to run pipelines on the Onehouse SQL Cluster. To set up dbt Cloud, follow this solution guide.

dbt Core instructions:

  1. On your bastion host (or another machine within your VPC), install dbt Core and the dbt-spark[PyHive] adapter
  2. Clone an existing dbt project, or use the dbt init command to create a new dbt project
  3. Create or update the profiles.yml file to specify connection details for the target where tables will be written:
    dbt_project:
      outputs:
        dev:
          host: <SQL-CLUSTER-ENDPOINT>
          method: thrift
          port: 10000
          schema: <DATABASE>
          threads: 1
          type: spark
      target: dev
  4. Create your model under the models/ directory. Set configs at the top of the model file to specify how the table will be written. Here are some example configs:
    {{
      config(
        materialized='incremental',
        file_format='hudi',
        incremental_strategy='merge',
        unique_key='store_id',
        partition_by='sale_date',
        location_root='s3://bucket/lake/database/',
        options={
          'type': 'mor',
          'primaryKey': 'sale_id',
          'precombineKey': 'event_timestamp'
        }
      )
    }}
    1. In your query, you can reference an existing Onehouse table created by a Stream Capture with the following syntax: from database_name.table_name (a complete example model is sketched after these steps).
    2. Including location_root is required for the table materialization mode and optional for the incremental materialization mode.
  5. Run dbt run on the command line to execute the model
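
Putting steps 4 and 5 together, here is a minimal sketch of what a model file could look like beneath the config block shown in step 4. The file name daily_store_sales.sql and the column names (sale_id, store_id, sale_date, event_timestamp, amount) are hypothetical placeholders, and the is_incremental() filter is just one common pattern for limiting incremental runs; adapt the query to your own Stream Capture table and schema.

    -- models/daily_store_sales.sql (hypothetical file name)
    -- The config(...) block from step 4 goes here, at the top of the file.

    select
      sale_id,
      store_id,
      sale_date,
      event_timestamp,
      amount
    -- Reference an existing Onehouse table created by a Stream Capture
    from database_name.table_name

    {% if is_incremental() %}
    -- On incremental runs, only process rows newer than those already in this model
    where event_timestamp > (select max(event_timestamp) from {{ this }})
    {% endif %}

With the profile from step 3 pointing at the SQL cluster endpoint, dbt run compiles this model and executes it over the Thrift connection; because incremental_strategy='merge' and a unique_key are set, subsequent runs merge new rows into the existing Hudi table rather than rebuilding it. Before the first run, dbt debug is a convenient way to confirm that the connection details in profiles.yml work.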