# onehouse_flow
Configures a flow — an ingestion pipeline that reads from an `onehouse_source` and writes to a destination Onehouse table identified by (`lake`, `database`, `table_name`).

This page documents Terraform-specific behavior (HCL syntax, types, mutability, drift, import). For full parameter semantics, valid values, and defaults, see CREATE FLOW, ALTER FLOW, and DELETE FLOW.

The `<source-path> {}` sub-block (one of `s3 {}`, `gcs {}`, `kafka {}`, `onehouse_table {}`, `postgres {}`, `mysql {}`) must match the type of the referenced source. The server validates the pairing at CREATE time.
## Example Usage

### Minimal S3 → Onehouse-table flow
resource "onehouse_flow" "events_s3" {
name = "raw-events-flow"
source = onehouse_source.events_s3.name
lake = onehouse_lake.warehouse.name
database = onehouse_database.events.name
table_name = "raw_events"
write_mode = "MUTABLE"
cluster = onehouse_cluster.ingest.name
record_key_fields = ["id"]
s3 {
folder_uri = "s3://my-raw-events-bucket/2024/"
file_format = "PARQUET"
file_extension = ".parquet"
}
}
### Kafka flow with schema registry
resource "onehouse_flow" "events_kafka" {
name = "kafka-events-flow"
source = onehouse_source.kafka.name
lake = onehouse_lake.warehouse.name
database = onehouse_database.events.name
table_name = "kafka_events"
write_mode = "MUTABLE"
cluster = onehouse_cluster.ingest.name
kafka {
topic_name = "events.v1"
starting_offsets = "latest"
}
schema_registry {
type = "confluent"
confluent {
servers = "https://psrc-xxxxx.us-west-2.aws.confluent.cloud"
subject_name = "events-value"
key = "SR_KEY"
secret = "SR_SECRET"
}
}
}
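Rather than hard-coding registry credentials in the `confluent {}` block, a sketch using standard Terraform input variables (the variable names `sr_key` and `sr_secret` are illustrative, not part of the provider):

```hcl
# Mark credentials sensitive so Terraform redacts them in plan/apply output.
variable "sr_key" {
  type      = string
  sensitive = true
}

variable "sr_secret" {
  type      = string
  sensitive = true
}
```

Then reference them inside the flow: `key = var.sr_key` and `secret = var.sr_secret`.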
### Flow with partitioning, transformations, and validations
resource "onehouse_flow" "events_partitioned" {
name = "partitioned-events"
source = onehouse_source.events_s3.name
lake = onehouse_lake.warehouse.name
database = onehouse_database.events.name
table_name = "events_by_date"
write_mode = "MUTABLE"
cluster = onehouse_cluster.ingest.name
record_key_fields = ["id"]
precombine_key_field = "updated_at"
sorting_key_fields = ["ts"]
performance_profile = "BALANCED"
min_sync_frequency_mins = 5
quarantine_enabled = true
table_type = "MERGE_ON_READ"
catalogs = ["prod-glue"]
transformations = ["mask_pii"]
validations = ["schema_check"]
partition_key_fields = [
{
field = "event_date"
partition_type = "DATE_STRING"
input_format = "yyyy-MM-dd"
output_format = "yyyy-MM-dd"
},
{
field = "tenant_id"
# Non-timestamp partitions: leave optional inner fields unset.
},
]
s3 {
folder_uri = "s3://my-raw-events-bucket/"
file_format = "PARQUET"
file_extension = ".parquet"
}
}
### Pause, resume, and clean-restart a flow

The `state` attribute lets you pause and resume a flow declaratively. The `clean_and_restart_trigger` is a fire-once trigger — any change to its value drops and rebuilds the destination table from scratch.
resource "onehouse_flow" "events" {
# ... (other fields)
state = "PAUSED" # change to "RUNNING" to resume
# Bump this value to fire a CLEAN_AND_RESTART. Use a timestamp, sha256, or
# any unique string. Removing the line is a no-op (won't undo the clean).
clean_and_restart_trigger = "2026-05-11T00:00:00Z"
}
> **Warning:** `CLEAN_AND_RESTART` is destructive — it drops the destination table and rebuilds it from the source. Use only when needed.
## Argument Reference

### Top-level
| Argument | Type | Required | Mutability | Description |
|---|---|---|---|---|
| `name` | string | ✅ | Immutable | Flow name. → details |
| `source` | string | ✅ | Immutable | Name of the `onehouse_source`. → details |
| `lake` | string | ✅ | Immutable | Destination lake name. |
| `database` | string | ✅ | Immutable | Destination database name. |
| `table_name` | string | ✅ | Immutable | Destination table name (created by the flow). |
| `write_mode` | string | ✅ | Immutable | `MUTABLE` or `IMMUTABLE`. → details below |
| `cluster` | string | ✅ | Mutable | Compute cluster name that runs the flow. Changes issue `ALTER FLOW SET CLUSTER`. → details |
| `performance_profile` | string | | Mutable | `BALANCED`, `FASTEST_READ`, or `FASTEST_WRITE`. → details |
| `min_sync_frequency_mins` | number | | Mutable | Minimum minutes between flow triggers. Server default 1. → details |
| `transformations` | list(string) | | Immutable | Names of transformations to apply. |
| `validations` | list(string) | | Immutable | Names of validations to apply. |
| `quarantine_enabled` | boolean | | Mutable | If true, invalid records are quarantined instead of failing the flow. Server default true. → details |
| `state` | string | | Mutable | `RUNNING` or `PAUSED`. Changes issue `ALTER FLOW SET STATE = PAUSE/RESUME`. → details below |
| `clean_and_restart_trigger` | string | | Mutable | Fire-once trigger. Any change fires `ALTER FLOW SET STATE = CLEAN_AND_RESTART`. → details below |
| `catalogs` | list(string) | | Immutable | Names of catalogs to sync the destination table to. |
| `record_key_fields` | list(string) | | Mutable | Record-key columns used for dedup/update/delete. Required for S3 and Onehouse-table sources. Changes issue `ALTER FLOW`. → details |
| `precombine_key_field` | string | | Immutable | When two records share a record key, the larger precombine value wins. → details |
| `sorting_key_fields` | list(string) | | Immutable | Sorting-key columns applied at ingest time. → details |
| `partition_key_fields` | list(object) | | Immutable | Destination-table partition definition. → details below |
| `table_type` | string | | Immutable | `COPY_ON_WRITE` or `MERGE_ON_READ`. Server default `MERGE_ON_READ`. → details |
| `delay_threshold_num_sync_intervals` | number | | Immutable | Multiplied by the sync frequency to determine when the flow is considered delayed. 0 disables. → details |
| `deduplication_policy` | string | | Immutable | `none` (default) or `drop` (for append-only flows). → details |
| `table_configured_base_path` | string | | Immutable | Custom storage location for the destination table. → details |
| `table_partition_style` | string | | Immutable | `default` or `hive`. → details |
### `write_mode` — `MUTABLE` vs `IMMUTABLE`

`MUTABLE` tables support upserts and deletes — record keys identify rows that can be updated. `IMMUTABLE` tables are append-only and don't carry per-record keys. Pick `MUTABLE` for CDC and updatable datasets, `IMMUTABLE` for event/log ingestion where every record is a new fact. Changing this attribute forces destroy + recreate of the flow.
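For contrast with the `MUTABLE` examples above, a minimal sketch of an append-only flow, assuming a Kafka source like the one in the earlier example. `record_key_fields` is omitted entirely; `deduplication_policy = "drop"` is optional (see the argument table above):

```hcl
resource "onehouse_flow" "clickstream" {
  name       = "clickstream-append"
  source     = onehouse_source.kafka.name
  lake       = onehouse_lake.warehouse.name
  database   = onehouse_database.events.name
  table_name = "clickstream"
  write_mode = "IMMUTABLE"
  cluster    = onehouse_cluster.ingest.name

  # Append-only: no record keys. Optionally drop exact duplicates on ingest.
  deduplication_policy = "drop"

  kafka {
    topic_name       = "clicks.v1"
    starting_offsets = "latest"
  }
}
```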
### `partition_key_fields`

Each list entry is an object with four fields. The API contract sends all four on every entry; non-timestamp partitions use empty strings for the optional ones, so leaving them unset in HCL (as in the example above) is equivalent.
| Field | Required | Description |
|---|---|---|
| `field` | ✅ | Column name. |
| `partition_type` | | One of `DATE_STRING`, `EPOCH_MILLIS`, `EPOCH_MICROS`, or empty for non-timestamp partitions. |
| `input_format` | | Source data format (e.g., `yyyy-MM-dd`). Empty for non-timestamp. |
| `output_format` | | Partition output format (e.g., `yyyy-MM-dd`, `yyyyMMddHH`). Empty for non-timestamp. |
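For instance, a hedged sketch of hourly partitioning from a string timestamp column; the `input_format` shown is a hypothetical source layout, and `yyyyMMddHH` is the hourly output format from the table above:

```hcl
partition_key_fields = [
  {
    field          = "created_at"
    partition_type = "DATE_STRING"
    input_format   = "yyyy-MM-dd'T'HH:mm:ss" # hypothetical source timestamp layout
    output_format  = "yyyyMMddHH"            # one partition folder per hour
  },
]
```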
### `state` — drift detection

`state` participates in normal Terraform drift detection. If someone pauses a flow in the Onehouse console while your Terraform config says `state = "RUNNING"`, the next `terraform plan` shows the drift and `terraform apply` reconciles it.
| Step | What happens |
|---|---|
| 1. `terraform apply` with `state = "RUNNING"` | Flow runs. |
| 2. Operator clicks Pause in the Onehouse console | Server state is now `PAUSED`. |
| 3. `terraform plan` | Provider reads `SHOW FLOWS`, sees `PAUSED`, refreshes Terraform state. Plan shows `~ state = "PAUSED" -> "RUNNING"`. |
| 4. `terraform apply` | Provider dispatches `ALTER FLOW SET STATE = RESUME`. Flow runs again. |
To opt out of Terraform-enforced state (so your ops team can pause/resume without Terraform reverting them), omit `state` from your HCL entirely. The field is Optional + Computed, so Terraform reads and surfaces the current value without enforcing it.
| Pattern | What you write | Drift behavior |
|---|---|---|
| Declared | `state = "RUNNING"` | Plan shows drift; apply reconciles. |
| Observed-only | (omit `state`) | Field tracks server value; no enforcement. |
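If you want `state` in the config for the initial CREATE but still want operators to own it afterward, a middle-ground sketch using Terraform's standard `lifecycle.ignore_changes` meta-argument:

```hcl
resource "onehouse_flow" "events" {
  # ... (other fields)

  state = "RUNNING" # used at create time, then left to operators

  lifecycle {
    # Terraform will no longer plan changes when the server-side state drifts.
    ignore_changes = [state]
  }
}
```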
### `clean_and_restart_trigger`

A fire-once trigger. Any change to its value drops the destination table and rebuilds it from the source via `ALTER FLOW SET STATE = CLEAN_AND_RESTART`. Use any unique string — a timestamp, a SHA, an incrementing counter. Removing the line is a no-op (it does not undo the clean).
> **Warning:** `CLEAN_AND_RESTART` is destructive — it drops the destination table and rebuilds from the source. Use only when intentional.
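One pattern for generating the trigger value is hashing a file that defines the table's shape, so a deliberate edit to that file fires exactly one rebuild. A sketch (the `schema/events.json` path is hypothetical):

```hcl
resource "onehouse_flow" "events" {
  # ... (other fields)

  # filesha256 is a built-in Terraform function; any edit to the file
  # changes the hash and fires a single CLEAN_AND_RESTART.
  clean_and_restart_trigger = filesha256("${path.module}/schema/events.json")
}
```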
### Source-path sub-blocks

Exactly one of these must be set, matching the source's type. Each maps to the same source-side keys as the corresponding `onehouse_source` block.
| Sub-block | Required fields | Use when source type is | SQL ref |
|---|---|---|---|
| `s3 {}` | `folder_uri`, `file_format`, `file_extension` | `S3` | S3 source |
| `gcs {}` | `folder_uri`, `file_format`, `file_extension` | `GCS` | GCS source |
| `kafka {}` | `topic_name`, `starting_offsets` | `APACHE_KAFKA` / `MSK_KAFKA` / `CONFLUENT_KAFKA` | Kafka source |
| `onehouse_table {}` | `lake`, `database`, `name` | `ONEHOUSE_TABLE` | Onehouse source |
| `postgres {}` | `table_name`, `schema_name` | `POSTGRES` | Postgres source |
| `mysql {}` | `table_name` | `MY_SQL` | MySQL source |
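For example, a minimal sketch of a Postgres-backed flow, assuming an `onehouse_source` named `pg` of type `POSTGRES` plus the lake/database/cluster resources from the earlier examples:

```hcl
resource "onehouse_flow" "orders_pg" {
  name              = "orders-cdc"
  source            = onehouse_source.pg.name
  lake              = onehouse_lake.warehouse.name
  database          = onehouse_database.events.name
  table_name        = "orders"
  write_mode        = "MUTABLE"
  cluster           = onehouse_cluster.ingest.name
  record_key_fields = ["order_id"] # optional for Postgres sources (see table above)

  postgres {
    table_name  = "orders"
    schema_name = "public"
  }
}
```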
#### `s3 {}` / `gcs {}` block fields
| Argument | Type | Required | Description |
|---|---|---|---|
| `folder_uri` | string | ✅ | Source folder URI (e.g. `s3://bucket/path/`). |
| `file_format` | string | ✅ | One of `AVRO`, `JSON`, `CSV`, `ORC`, `PARQUET`. |
| `file_extension` | string | ✅ | File extension filter (e.g. `.parquet`, `.json`, `.gz`). |
| `source_bootstrap` | string | | `TRUE` or `FALSE` — whether to backfill existing files. |
| `if_infer_fields_from_source_path` | string | | `TRUE` or `FALSE` — extract partition fields from the path. |
| `fields_to_infer` | string | | Comma-separated field names to extract from the source path. |
| `file_path_pattern` | string | | Optional path pattern filter. |
| `csv_header` | string | | `TRUE` or `FALSE` — whether CSV files have a header row. Only applies when `file_format = "CSV"`. |
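A hedged sketch combining the optional fields: backfill existing files and infer partition columns from a hypothetical `region=.../event_date=...` folder layout.

```hcl
s3 {
  folder_uri                       = "s3://my-raw-events-bucket/"
  file_format                      = "CSV"
  file_extension                   = ".csv"
  csv_header                       = "TRUE"
  source_bootstrap                 = "TRUE" # also ingest files already in the bucket
  if_infer_fields_from_source_path = "TRUE"
  fields_to_infer                  = "region,event_date"
}
```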
The optional top-level `schema_registry {}` block (same shape as in `onehouse_source`) applies across all source-path types. → Schema registry
## Attribute Reference
| Attribute | Type | Description |
|---|---|---|
| `id` | string | Flow UUID. |
| `created_at` | string | Creation time in RFC3339 format. |
| `created_by` | string | Identity that created the flow. |
| `state` | string | Runtime state (`RUNNING`, `PAUSED`, or any other state the server reports). |
## Import

```shell
terraform import onehouse_flow.events events-kafka
```
After import, only the top-level summary fields are recovered. Most attributes (`source`, `lake`, `database`, source-path sub-block contents, etc.) must be re-supplied in the `.tf` file to avoid a forced replacement.
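On Terraform 1.5+, the declarative `import` block is an alternative sketch, assuming the flow name doubles as the import ID as in the CLI command above:

```hcl
import {
  to = onehouse_flow.events
  id = "events-kafka"
}
```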
## Data Source

```hcl
data "onehouse_flow" "lookup" {
  name = "events-kafka"
}

output "flow_state" {
  value = data.onehouse_flow.lookup.state
}
```
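One way to put the data source to work is a Terraform 1.5+ `check` block that warns during plan/apply when the flow isn't running; a sketch using standard Terraform syntax, not specific to this provider:

```hcl
check "flow_running" {
  # Scoped data source: read the flow's current state during the check.
  data "onehouse_flow" "events" {
    name = "events-kafka"
  }

  assert {
    condition     = data.onehouse_flow.events.state == "RUNNING"
    error_message = "Flow events-kafka is not RUNNING."
  }
}
```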
## Limitations
- **Source change is destructive.** Changing `source`, `lake`, `database`, `table_name`, or `write_mode` forces destroy + recreate.
- **Advanced configs are immutable in place.** `delay_threshold_num_sync_intervals`, `deduplication_policy`, `table_configured_base_path`, and `table_partition_style` cannot be changed via `ALTER FLOW` yet — `SET ADVANCED_CONFIGS` support is a follow-up.