# onehouse_flow
Configures a flow — an ingestion pipeline that reads from an `onehouse_source` and writes to a destination Onehouse table identified by (`lake`, `database`, `table_name`).

This page documents Terraform-specific behavior (HCL syntax, types, mutability, drift, import). For full parameter semantics, valid values, and defaults, see CREATE FLOW, ALTER FLOW, and DELETE FLOW.

The `<source-path> {}` sub-block (one of `s3 {}`, `gcs {}`, `kafka {}`, `onehouse_table {}`, `postgres {}`, `mysql {}`) must match the type of the referenced source. The server validates the pairing at CREATE time.
## Example Usage

### Minimal S3 → Onehouse-table flow
resource "onehouse_flow" "events_s3" {
name = "raw-events-flow"
source = onehouse_source.events_s3.name
lake = onehouse_lake.warehouse.name
database = onehouse_database.events.name
table_name = "raw_events"
write_mode = "MUTABLE"
cluster = onehouse_cluster.ingest.name
record_key_fields = ["id"]
s3 {
folder_uri = "s3://my-raw-events-bucket/2024/"
file_format = "PARQUET"
file_extension = ".parquet"
}
}
### Kafka flow with schema registry
resource "onehouse_flow" "events_kafka" {
name = "kafka-events-flow"
source = onehouse_source.kafka.name
lake = onehouse_lake.warehouse.name
database = onehouse_database.events.name
table_name = "kafka_events"
write_mode = "MUTABLE"
cluster = onehouse_cluster.ingest.name
kafka {
topic_name = "events.v1"
starting_offsets = "latest"
}
schema_registry {
type = "confluent"
confluent {
servers = "https://psrc-xxxxx.us-west-2.aws.confluent.cloud"
subject_name = "events-value"
key = "SR_KEY"
secret = "SR_SECRET"
}
}
}
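Rather than hard-coding registry credentials in the `confluent {}` block, a sketch using standard Terraform input variables (the variable names `sr_key` and `sr_secret` are illustrative, not part of the provider):

```hcl
# Mark credentials sensitive so Terraform redacts them in plan/apply output.
variable "sr_key" {
  type      = string
  sensitive = true
}

variable "sr_secret" {
  type      = string
  sensitive = true
}
```

Then reference them inside the flow: `key = var.sr_key` and `secret = var.sr_secret`.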
### Flow with partitioning, transformations, and validations
resource "onehouse_flow" "events_partitioned" {
name = "partitioned-events"
source = onehouse_source.events_s3.name
lake = onehouse_lake.warehouse.name
database = onehouse_database.events.name
table_name = "events_by_date"
write_mode = "MUTABLE"
cluster = onehouse_cluster.ingest.name
record_key_fields = ["id"]
precombine_key_field = "updated_at"
sorting_key_fields = ["ts"]
performance_profile = "BALANCED"
min_sync_frequency_mins = 5
quarantine_enabled = true
table_type = "MERGE_ON_READ"
catalogs = ["prod-glue"]
transformations = ["mask_pii"]
validations = ["schema_check"]
partition_key_fields = [
{
field = "event_date"
partition_type = "DATE_STRING"
input_format = "yyyy-MM-dd"
output_format = "yyyy-MM-dd"
},
{
field = "tenant_id"
# Non-timestamp partitions: leave optional inner fields unset.
},
]
s3 {
folder_uri = "s3://my-raw-events-bucket/"
file_format = "PARQUET"
file_extension = ".parquet"
}
}
### Pause, resume, and clean-restart a flow

The `state` attribute lets you pause and resume a flow declaratively. The `clean_and_restart_trigger` is a fire-once trigger — any change to its value drops and rebuilds the destination table from scratch.
resource "onehouse_flow" "events" {
# ... (other fields)
state = "PAUSED" # change to "RUNNING" to resume
# Bump this value to fire a CLEAN_AND_RESTART. Use a timestamp, sha256, or
# any unique string. Removing the line is a no-op (won't undo the clean).
clean_and_restart_trigger = "2026-05-11T00:00:00Z"
}
> **Warning:** `CLEAN_AND_RESTART` is destructive — it drops the destination table and rebuilds it from the source. Use only when needed.
## Argument Reference

### Top-level
| Argument | Type | Required | Mutability | Description |
|---|---|---|---|---|
| `name` | string | ✅ | Immutable | Flow name. → details |
| `source` | string | ✅ | Immutable | Name of the `onehouse_source`. → details |
| `lake` | string | ✅ | Immutable | Destination lake name. |
| `database` | string | ✅ | Immutable | Destination database name. |
| `table_name` | string | ✅ | Immutable | Destination table name (created by the flow). |
| `write_mode` | string | ✅ | Immutable | `MUTABLE` or `IMMUTABLE`. → details below |
| `cluster` | string | ✅ | Mutable | Compute cluster name that runs the flow. Changes issue `ALTER FLOW SET CLUSTER`. → details |
| `performance_profile` | string | | Mutable | `BALANCED`, `FASTEST_READ`, or `FASTEST_WRITE`. → details |
| `min_sync_frequency_mins` | number | | Mutable | Minimum minutes between flow triggers. Server default 1. → details |
| `transformations` | list(string) | | Immutable | Names of transformations to apply. |
| `validations` | list(string) | | Immutable | Names of validations to apply. |
| `quarantine_enabled` | boolean | | Mutable | If true, invalid records are quarantined instead of failing the flow. Server default true. → details |
| `state` | string | | Mutable | `RUNNING` or `PAUSED`. Changes issue `ALTER FLOW SET STATE = PAUSE/RESUME`. → details below |
| `clean_and_restart_trigger` | string | | Mutable | Fire-once trigger. Any change fires `ALTER FLOW SET STATE = CLEAN_AND_RESTART`. → details below |
| `catalogs` | list(string) | | Immutable | Names of catalogs to sync the destination table to. |
| `record_key_fields` | list(string) | | Mutable | Record-key columns used for dedup/update/delete. Required for S3 and Onehouse-table sources. Changes issue `ALTER FLOW`. → details |
| `precombine_key_field` | string | | Immutable | When two records share a record key, the larger precombine value wins. → details |
| `sorting_key_fields` | list(string) | | Immutable | Sorting-key columns applied at ingest time. → details |
| `partition_key_fields` | list(object) | | Immutable | Destination-table partition definition. → details below |
| `table_type` | string | | Immutable | `COPY_ON_WRITE` or `MERGE_ON_READ`. Server default `MERGE_ON_READ`. → details |
| `delay_threshold_num_sync_intervals` | number | | Immutable | Multiplied by the sync frequency to determine when the flow is considered delayed. 0 disables. → details |
| `deduplication_policy` | string | | Immutable | `none` (default) or `drop` (for append-only flows). → details |
| `table_configured_base_path` | string | | Immutable | Custom storage location for the destination table. → details |
| `table_partition_style` | string | | Immutable | `default` or `hive`. → details |
### `write_mode` — `MUTABLE` vs `IMMUTABLE`

`MUTABLE` tables support upserts and deletes — record keys identify rows that can be updated. `IMMUTABLE` tables are append-only and don't carry per-record keys. Pick `MUTABLE` for CDC and updatable datasets, `IMMUTABLE` for event/log ingestion where every record is a new fact. Changing this attribute forces destroy + recreate of the flow.
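For contrast with the `MUTABLE` examples above, a minimal sketch of an append-only flow, assuming a Kafka source like the one in the earlier example. `record_key_fields` is omitted entirely; `deduplication_policy = "drop"` is optional (see the argument table above):

```hcl
resource "onehouse_flow" "clickstream" {
  name       = "clickstream-append"
  source     = onehouse_source.kafka.name
  lake       = onehouse_lake.warehouse.name
  database   = onehouse_database.events.name
  table_name = "clickstream"
  write_mode = "IMMUTABLE"
  cluster    = onehouse_cluster.ingest.name

  # Append-only: no record keys. Optionally drop exact duplicates on ingest.
  deduplication_policy = "drop"

  kafka {
    topic_name       = "clicks.v1"
    starting_offsets = "latest"
  }
}
```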
### `partition_key_fields`

Each list entry is an object with four fields. The API contract sends all four on every entry; non-timestamp partitions use empty strings for the optional ones, so leaving them unset in HCL (as in the example above) is equivalent.
| Field | Required | Description |
|---|---|---|
| `field` | ✅ | Column name. |
| `partition_type` | | One of `DATE_STRING`, `EPOCH_MILLIS`, `EPOCH_MICROS`, or empty for non-timestamp partitions. |
| `input_format` | | Source data format (e.g., `yyyy-MM-dd`). Empty for non-timestamp. |
| `output_format` | | Partition output format (e.g., `yyyy-MM-dd`, `yyyyMMddHH`). Empty for non-timestamp. |
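For instance, a hedged sketch of hourly partitioning from a string timestamp column; the `input_format` shown is a hypothetical source layout, and `yyyyMMddHH` is the hourly output format from the table above:

```hcl
partition_key_fields = [
  {
    field          = "created_at"
    partition_type = "DATE_STRING"
    input_format   = "yyyy-MM-dd'T'HH:mm:ss" # hypothetical source timestamp layout
    output_format  = "yyyyMMddHH"            # one partition folder per hour
  },
]
```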
### `state` — drift detection

`state` participates in normal Terraform drift detection. If someone pauses a flow in the Onehouse console while your Terraform config says `state = "RUNNING"`, the next `terraform plan` shows the drift and `terraform apply` reconciles it.
| Step | What happens |
|---|---|
| 1. `terraform apply` with `state = "RUNNING"` | Flow runs. |
| 2. Operator clicks Pause in the Onehouse console | Server state is now `PAUSED`. |
| 3. `terraform plan` | Provider reads `SHOW FLOWS`, sees `PAUSED`, refreshes Terraform state. Plan shows `~ state = "PAUSED" -> "RUNNING"`. |
| 4. `terraform apply` | Provider dispatches `ALTER FLOW SET STATE = RESUME`. Flow runs again. |
To opt out of Terraform-enforced state (so your ops team can pause/resume without Terraform reverting them), omit `state` from your HCL entirely. The field is Optional + Computed, so Terraform reads and surfaces the current value without enforcing it.
| Pattern | What you write | Drift behavior |
|---|---|---|
| Declared | `state = "RUNNING"` | Plan shows drift; apply reconciles. |
| Observed-only | (omit `state`) | Field tracks server value; no enforcement. |
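If you want `state` in the config for the initial CREATE but still want operators to own it afterward, a middle-ground sketch using Terraform's standard `lifecycle.ignore_changes` meta-argument:

```hcl
resource "onehouse_flow" "events" {
  # ... (other fields)

  state = "RUNNING" # used at create time, then left to operators

  lifecycle {
    # Terraform will no longer plan changes when the server-side state drifts.
    ignore_changes = [state]
  }
}
```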
### `clean_and_restart_trigger`

A fire-once trigger. Any change to its value drops the destination table and rebuilds it from the source via `ALTER FLOW SET STATE = CLEAN_AND_RESTART`. Use any unique string — a timestamp, a SHA, an incrementing counter. Removing the line is a no-op (it does not undo the clean).
> **Warning:** `CLEAN_AND_RESTART` is destructive — it drops the destination table and rebuilds from the source. Use only when intentional.
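One pattern for generating the trigger value is hashing a file that defines the table's shape, so a deliberate edit to that file fires exactly one rebuild. A sketch (the `schema/events.json` path is hypothetical):

```hcl
resource "onehouse_flow" "events" {
  # ... (other fields)

  # filesha256 is a built-in Terraform function; any edit to the file
  # changes the hash and fires a single CLEAN_AND_RESTART.
  clean_and_restart_trigger = filesha256("${path.module}/schema/events.json")
}
```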
### Source-path sub-blocks

Exactly one of these must be set, matching the source's type. Each maps to the same source-side keys as the corresponding `onehouse_source` block.
| Sub-block | Required fields | Use when source type is | SQL ref |
|---|---|---|---|
| `s3 {}` | `folder_uri`, `file_format`, `file_extension` | `S3` | S3 source |
| `gcs {}` | `folder_uri`, `file_format`, `file_extension` | `GCS` | GCS source |
| `kafka {}` | `topic_name`, `starting_offsets` | `APACHE_KAFKA` / `MSK_KAFKA` / `CONFLUENT_KAFKA` | Kafka source |
| `onehouse_table {}` | `lake`, `database`, `name` | `ONEHOUSE_TABLE` | Onehouse source |
| `postgres {}` | `table_name`, `schema_name` | `POSTGRES` | Postgres source |
| `mysql {}` | `table_name` | `MY_SQL` | MySQL source |
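For example, a minimal sketch of a Postgres-backed flow, assuming an `onehouse_source` named `pg` of type `POSTGRES` plus the lake/database/cluster resources from the earlier examples:

```hcl
resource "onehouse_flow" "orders_pg" {
  name              = "orders-cdc"
  source            = onehouse_source.pg.name
  lake              = onehouse_lake.warehouse.name
  database          = onehouse_database.events.name
  table_name        = "orders"
  write_mode        = "MUTABLE"
  cluster           = onehouse_cluster.ingest.name
  record_key_fields = ["order_id"] # optional for Postgres sources (see table above)

  postgres {
    table_name  = "orders"
    schema_name = "public"
  }
}
```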
#### `s3 {}` / `gcs {}` block fields
| Argument | Type | Required | Description |
|---|---|---|---|
| `folder_uri` | string | ✅ | Source folder URI (e.g. `s3://bucket/path/`). |
| `file_format` | string | ✅ | One of `AVRO`, `JSON`, `CSV`, `ORC`, `PARQUET`. |
| `file_extension` | string | ✅ | File extension filter (e.g. `.parquet`, `.json`, `.gz`). |
| `source_bootstrap` | string | | `TRUE` or `FALSE` — whether to backfill existing files. |
| `if_infer_fields_from_source_path` | string | | `TRUE` or `FALSE` — extract partition fields from the path. |
| `fields_to_infer` | string | | Comma-separated field names to extract from the source path. |
| `file_path_pattern` | string | | Optional path pattern filter. |
| `csv_header` | string | | `TRUE` or `FALSE` — whether CSV files have a header row. Only applies when `file_format = "CSV"`. |
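A hedged sketch combining the optional fields: backfill existing files and infer partition columns from a hypothetical `region=.../event_date=...` folder layout.

```hcl
s3 {
  folder_uri                       = "s3://my-raw-events-bucket/"
  file_format                      = "CSV"
  file_extension                   = ".csv"
  csv_header                       = "TRUE"
  source_bootstrap                 = "TRUE" # also ingest files already in the bucket
  if_infer_fields_from_source_path = "TRUE"
  fields_to_infer                  = "region,event_date"
}
```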
The optional top-level `schema_registry {}` block (same shape as in `onehouse_source`) applies across all source-path types. → Schema registry
## Attribute Reference
| Attribute | Type | Description |
|---|---|---|
| `id` | string | Flow UUID. |
| `created_at` | string | Creation time in RFC3339 format. |
| `created_by` | string | Identity that created the flow. |
| `state` | string | Runtime state (`RUNNING`, `PAUSED`, or any other state the server reports). |
## Import

```shell
terraform import onehouse_flow.events events-kafka
```
After import, only the top-level summary fields are recovered. Most attributes (`source`, `lake`, `database`, source-path sub-block contents, etc.) must be re-supplied in the `.tf` file to avoid a forced replacement.
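On Terraform 1.5+, the declarative `import` block is an alternative sketch, assuming the flow name doubles as the import ID as in the CLI command above:

```hcl
import {
  to = onehouse_flow.events
  id = "events-kafka"
}
```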
## Data Source

```hcl
data "onehouse_flow" "lookup" {
  name = "events-kafka"
}

output "flow_state" {
  value = data.onehouse_flow.lookup.state
}
```
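One way to put the data source to work is a Terraform 1.5+ `check` block that warns during plan/apply when the flow isn't running; a sketch using standard Terraform syntax, not specific to this provider:

```hcl
check "flow_running" {
  # Scoped data source: read the flow's current state during the check.
  data "onehouse_flow" "events" {
    name = "events-kafka"
  }

  assert {
    condition     = data.onehouse_flow.events.state == "RUNNING"
    error_message = "Flow events-kafka is not RUNNING."
  }
}
```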
## Limitations
- **Source change is destructive.** Changing `source`, `lake`, `database`, `table_name`, or `write_mode` forces destroy + recreate.
- **Advanced configs are immutable in place.** `delay_threshold_num_sync_intervals`, `deduplication_policy`, `table_configured_base_path`, and `table_partition_style` cannot be changed via `ALTER FLOW` yet — `SET ADVANCED_CONFIGS` support is a follow-up.