Connect dbt Cloud

This guide outlines the steps to connect dbt Cloud to a Onehouse SQL Cluster and run the Jaffle Shop example project.

Requirements

  • For dbt Cloud, ensure your Onehouse SQL endpoint is reachable via an external DNS name or IP address, and whitelist the dbt Cloud IP addresses.
  • For dbt Core, you can SSH tunnel into the VPC to reach the endpoint without opening any external ports.

What works

  • Core dbt commands such as dbt clean, dbt compile, dbt debug, dbt seed, and dbt run.
  • The "table" and "incremental" materializations.
  • Apache Hudi table types Merge on Read (MoR) and Copy on Write (CoW). MoR is recommended for mutable workloads.

Limitations

  • The "view" materialization is not supported.
  • dbt seed has row / record limits.
  • dbt seed only supports CoW tables.
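
For reference, a seeds block in dbt_project.yml that stays within these limits might look like the following sketch; the project name and storage path are placeholders.

seeds:
  your_project:                                     # placeholder dbt project name
    +file_format: hudi
    +location_root: s3://your-bucket/your-lake/raw  # placeholder DFS path
    +tblproperties:
      hoodie.table.type: cow                        # seeds only support Copy on Write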

dbt connection

Fill in the following fields when creating an Apache Spark warehouse connection using the Thrift connection method:

| Field | Description | Example |
| --- | --- | --- |
| Method | The method for connecting to Spark | Thrift |
| Hostname | The hostname of your Onehouse SQL Cluster endpoint | yourProject.sparkHost.com |
| Port | The port to connect to Spark on | 10000 |
| Cluster | Onehouse does not use this field | |
| Connection Timeout | Number of seconds after which to time out a connection | 10 |
| Connection Retries | Number of times to attempt connecting to the cluster before failing | 0 |
| Organization | Onehouse does not use this field | |
| User | Optional. Not enabled by default. | dbt_cloud_user |
| Auth | Optional; supply if using Kerberos. Not enabled by default. | KERBEROS |
| Kerberos Service Name | Optional; supply if using Kerberos. Not enabled by default. | hive |
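
For dbt Core, the equivalent connection lives in profiles.yml. The sketch below uses the dbt-spark Thrift method with the same values as the table above; the profile name, schema, and hostname are placeholders, and when connecting through an SSH tunnel (see Requirements) the host is typically localhost with the locally forwarded port.

onehouse_spark:          # placeholder profile name; must match profile: in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: yourProject.sparkHost.com   # or localhost when tunneling via SSH
      port: 10000
      schema: dbt_awong                 # placeholder schema/database name
      connect_timeout: 10
      connect_retries: 0
      threads: 4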

dbt project

When using dbt, add the necessary configurations to dbt_project.yml so the dbt connector writes data correctly.

| Field | Description | Required | Default | Recommended |
| --- | --- | --- | --- | --- |
| materialized | Type of table materialization | Yes | | table |
| file_format | Open table format to write | Yes | | hudi |
| location_root | Location of the database in DFS | Yes | | |
| hoodie.table.type | Merge on Read or Copy on Write | No | cow | mor |

dbt_project.yml template

+materialized: table | incremental
+file_format: hudi
+location_root: s3://lakehouse/demolake/dbt_ecomm/
+tblproperties:
  hoodie.table.type: mor | cow

A dbt_project.yml example for the Jaffle Shop project:

models:
  jaffle_shop:
    +file_format: hudi
    +location_root: s3://lakehouse/demolake/dbt_ecomm/
    +tblproperties:
      hoodie.table.type: mor
    staging:
      +materialized: incremental
    marts:
      +materialized: table

Quickstart

Steps

  1. Download Jaffle Shop at https://github.com/dbt-labs/jaffle-shop

  2. Create two Onehouse databases, either in the UI or via the API:

    • Create a "raw" database. This name is hard-coded in the dbt project; to change it, modify dbt_project.yml and models/staging/__sources.yml.
    • Create a "dbt_awong" database (this name is set in the dbt environment).
  3. Create a dbt Cloud connection of type Apache Spark. The Cluster field can contain a dummy entry.

  4. Create a dbt environment and set its connection to the Apache Spark connection created above. The token will be filled in automatically if left blank. The schema is your choice; in this example we picked "dbt_awong".

  5. Modify the code to support Apache Hudi. Add the file_format and location_root configurations to your dbt_project.yml:
seeds:
  jaffle_shop:
    +schema: raw
    +file_format: hudi
    +location_root: s3://lakehouse-albert-us-west-2/demolake/raw
    +tblproperties:
      hoodie.table.type: mor

models:
  jaffle_shop:
    +file_format: hudi
    +location_root: s3://lakehouse-albert-us-west-2/demolake/dbt_awong
    +tblproperties:
      hoodie.table.type: mor
    staging:
      +materialized: table
    marts:
      +materialized: table
  6. Modify macros/cents_to_dollars.sql to use Apache Spark SQL syntax:
{% macro spark__cents_to_dollars(column_name) -%}
CAST({{ column_name }} / 100 AS numeric(16, 2))
{%- endmacro %}
  7. Due to dbt Spark limitations, reduce raw_orders.csv to 7,500 rows.
  8. Run dbt seed and dbt run to deploy the tables, then use your favorite query engine to view the data.