Connect dbt Core

dbt is a framework that makes it easy to transform data. Official support for Onehouse is documented at https://docs.getdbt.com/docs/cloud/connect-data-platform/connect-onehouse.

You can use dbt Core within your VPC to run pipelines on the Onehouse SQL Cluster. To set up dbt Cloud, follow this solution guide.

dbt Core instructions:

  1. On your bastion host (or another machine within your VPC), install dbt Core and the dbt-spark[PyHive] adapter
  2. Clone an existing dbt project, or use the dbt init command to create a new dbt project
  3. Create or update the profiles.yml file to specify connection details for the target where tables will be written:
    dbt_project:
      outputs:
        dev:
          host: <SQL-CLUSTER-ENDPOINT>
          method: thrift
          port: 10000
          schema: <DATABASE>
          threads: 1
          type: spark
      target: dev
  4. Create your model under the models/ directory. Set configs at the top of the model file to specify how the table will be written. Here are some example configs:
    {{
      config(
        materialized='incremental',
        file_format='hudi',
        incremental_strategy='merge',
        unique_key='store_id',
        partition_by='sale_date',
        location_root='s3://bucket/lake/database/',
        options={
          'type': 'mor',
          'primaryKey': 'sale_id',
          'precombineKey': 'event_timestamp'
        }
      )
    }}
    1. In your query, you can reference an existing Onehouse table created by a Stream Capture with the following syntax: from database_name.table_name (a complete example model is sketched after these steps).
    2. Including location_root is required for the table materialization mode and optional for the incremental materialization mode.
  5. Run dbt run on the command line to execute the model
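
Putting steps 4 and 5 together, here is a minimal sketch of what a model file could look like beneath the config block shown in step 4. The file name daily_store_sales.sql and the column names (sale_id, store_id, sale_date, event_timestamp, amount) are hypothetical placeholders, and the is_incremental() filter is just one common pattern for limiting incremental runs; adapt the query to your own Stream Capture table and schema.

    -- models/daily_store_sales.sql (hypothetical file name)
    -- The config(...) block from step 4 goes here, at the top of the file.

    select
      sale_id,
      store_id,
      sale_date,
      event_timestamp,
      amount
    -- Reference an existing Onehouse table created by a Stream Capture
    from database_name.table_name

    {% if is_incremental() %}
    -- On incremental runs, only process rows newer than those already in this model
    where event_timestamp > (select max(event_timestamp) from {{ this }})
    {% endif %}

With the profile from step 3 pointing at the SQL cluster endpoint, dbt run compiles this model and executes it over the Thrift connection; because incremental_strategy='merge' and a unique_key are set, subsequent runs merge new rows into the existing Hudi table rather than rebuilding it. Before the first run, dbt debug is a convenient way to confirm that the connection details in profiles.yml work.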