Flows

Configure ingestion pipelines that read from sources and write to Onehouse tables.

Helper dataclass

Flows use one helper dataclass for partition definitions. Import it from onehouse_python_sdk.resources.sql.commands:

from onehouse_python_sdk.resources.sql.commands import PartitionKeyField

PartitionKeyField(
    field: str,                # column name
    *,
    partition_type: str = "",  # "" | "DATE_STRING" | "EPOCH_MILLIS" | "EPOCH_MICROS"
    input_format: str = "",    # e.g. "yyyy-mm-dd"
    output_format: str = "",   # e.g. "yyyy-mm-dd"
)

For non-timestamp partition keys, leave partition_type, input_format, and output_format as empty strings.

Methods

Method	Description
`create_flow`	Create a new ingestion flow
`alter_flow`	Change one aspect of an existing flow
`delete_flow`	Delete a flow
`describe_flow`	Show full configuration for a flow
`show_flows`	List all flows in the project

`create_flow`

create_flow(
    name: str,
    *,
    source: str,
    lake: str,
    database: str,
    table_name: str,
    write_mode: str,
    cluster: str,
    performance_profile: str | None = None,
    source_data_schema: str | None = None,
    catalogs: Sequence[str] | None = None,
    transformations: Sequence[str] | None = None,
    record_key_fields: Sequence[str] | None = None,
    partition_key_fields: Sequence[PartitionKeyField] | None = None,
    precombine_key_field: str | None = None,
    sorting_key_fields: Sequence[str] | None = None,
    min_sync_frequency_mins: int | None = None,
    quarantine_enabled: bool | None = None,
    validations: Sequence[str] | None = None,
    table_type: str | None = None,
    options: Mapping[str, str] | None = None,
    unsafe_raw: bool = False,
    timeout: float | None = None,
    poll_interval: float | None = None,
)

Parameter	Required	Type / values
`name`	yes	`str`
`source`	yes	`str` — name of an existing source
`lake`	yes	`str` — target lake
`database`	yes	`str` — target database
`table_name`	yes	`str` — target table
`write_mode`	yes	`"IMMUTABLE"`, `"MUTABLE"`
`cluster`	yes	`str` — compute cluster that runs the flow
`performance_profile`	no	`"BALANCED"`, `"FASTEST_READ"`, `"FASTEST_WRITE"`
`source_data_schema`	no	`str`
`catalogs`	no	`Sequence[str]` — catalog names to sync to
`transformations`	no	`Sequence[str]` — transformation names to apply
`record_key_fields`	no	`Sequence[str]`
`partition_key_fields`	no	`Sequence[PartitionKeyField]`
`precombine_key_field`	no	`str`
`sorting_key_fields`	no	`Sequence[str]`
`min_sync_frequency_mins`	no	`int`
`quarantine_enabled`	no	`bool`
`validations`	no	`Sequence[str]` — validation names to apply
`table_type`	no	`"COPY_ON_WRITE"`, `"MERGE_ON_READ"`
`options`	no	`Mapping[str, str]` — source-specific runtime options
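
Several parameters in the table accept only a fixed set of strings. This page does not say whether the SDK checks them before sending the request, so a small client-side check can catch typos early. The sketch below is one way to do that; `check_create_flow_enums` is a hypothetical helper, not part of the SDK, and the allowed values are copied from the table above.

```python
# Allowed values copied from the parameter table; the helper is hypothetical.
ALLOWED_VALUES = {
    "write_mode": {"IMMUTABLE", "MUTABLE"},
    "performance_profile": {"BALANCED", "FASTEST_READ", "FASTEST_WRITE"},
    "table_type": {"COPY_ON_WRITE", "MERGE_ON_READ"},
}


def check_create_flow_enums(**kwargs):
    """Raise ValueError if an enumerated argument has an unlisted value."""
    for name, allowed in ALLOWED_VALUES.items():
        value = kwargs.get(name)
        if value is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return kwargs


# Arguments that pass the check can be forwarded with **flow_args.
flow_args = check_create_flow_enums(
    write_mode="MUTABLE",
    table_type="MERGE_ON_READ",
)
```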

Example

from onehouse_python_sdk.resources.sql.commands import PartitionKeyField

# Assumes `client` is an already-initialized SDK client.
client.create_flow(
    "events_pipeline",
    source="my_kafka_source",
    lake="analytics",
    database="events",
    table_name="page_views",
    write_mode="MUTABLE",
    cluster="ingest_cluster",
    catalogs=["my_glue_catalog"],
    record_key_fields=["id"],
    partition_key_fields=[
        PartitionKeyField("date", partition_type="DATE_STRING",
                          input_format="yyyy-mm-dd", output_format="yyyy-mm-dd"),
    ],
    min_sync_frequency_mins=5,
    options={"kafka.topic.name": "page_views"},
)

`create_flow` — `WITH` options reference

Pass source-specific options and advanced configs through the `options` argument. All keys and values must be strings, including booleans and numbers (for example `'TRUE'` or `'0'`).
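
Because every value must be a string, it can be convenient to build the mapping from native Python values and convert it once. The sketch below shows one way; `to_flow_options` is a hypothetical helper, not part of the SDK, and it assumes booleans should be spelled `'TRUE'` / `'FALSE'` as in the tables that follow.

```python
from typing import Any, Mapping


def to_flow_options(raw: Mapping[str, Any]) -> dict[str, str]:
    """Convert Python values to the string form used by the option tables."""
    options: dict[str, str] = {}
    for key, value in raw.items():
        if isinstance(value, bool):
            # Booleans are spelled 'TRUE' / 'FALSE' in the option tables.
            options[key] = "TRUE" if value else "FALSE"
        else:
            options[key] = str(value)
    return options


options = to_flow_options({
    "s3.file.format": "CSV",
    "s3.file.csv.header": True,
    "flow.delayThreshold.numSyncIntervals": 3,
})
```

The resulting `options` dict can then be passed as `options=options` to `create_flow`.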

Kafka source

Key	Description
`kafka.topic.name`	Name of the source Kafka topic.
`kafka.startingOffsets`	`'earliest'` (default) or `'latest'`.

S3 source

Key	Description
`s3.folder.uri`	`s3://bucket/prefix/` folder to ingest from.
`s3.file.format`	`AVRO`, `JSON`, `CSV`, `ORC`, `PARQUET`, or `XML`.
`s3.file.extension`	File extension, e.g. `'.parquet'`, `'.jsonl'`.
`s3.source.bootstrap`	`'TRUE'` or `'FALSE'`.
`s3.source.if_infer_fields_from_source_path`	`'TRUE'` to extract field values from path segments like `field=value`.
`s3.source.fields_to_infer`	Comma-separated field names to extract from the path, e.g. `'schema,table,date'`.
`s3.source.file_path_pattern`	Optional glob applied under `s3.folder.uri`.
`s3.file.csv.header`	`'TRUE'` if the first row is a header.
`s3.file.json.multiline`	`'TRUE'` for single-record-per-file (pretty-printed) JSON.
`s3.file.xml.row.tag`	XML element that delimits one record, e.g. `'Product'`.
`s3.file.xml.value.tag`	Synthetic field name for element text content with attributes (default `'_VALUE'`).
`s3.file.xml.null.value`	String token treated as `null`.
`s3.file.xml.ignore.namespace`	`'TRUE'` to strip namespace prefixes from element names.

GCS source

Uses the same keys as the S3 source, with the `gcs.` prefix in place of `s3.` (e.g. `gcs.folder.uri`, `gcs.file.format`).
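
If you already have a set of S3 options, the prefix rule for GCS can be applied mechanically. This is a minimal sketch assuming every S3 key has a direct `gcs.` counterpart, as the note above suggests; `s3_options_to_gcs` is a hypothetical helper, not part of the SDK. It only renames keys, so values such as the folder URI still have to point at the GCS location.

```python
def s3_options_to_gcs(options):
    """Rename keys starting with 's3.' to the 'gcs.' prefix; leave others alone."""
    renamed = {}
    for key, value in options.items():
        if key.startswith("s3."):
            key = "gcs." + key[len("s3."):]
        renamed[key] = value
    return renamed


gcs_options = s3_options_to_gcs({
    "s3.file.format": "PARQUET",
    "s3.file.extension": ".parquet",
    "table.partition.style": "hive",  # cross-source key, left unchanged
})
```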

Onehouse table source

Key	Description
`source.table.name`	Name of the source Onehouse table.
`source.table.database`	Database containing the source table.
`source.table.lake`	Lake containing the source table.

Postgres source

Key	Description
`postgres.table.name`	Postgres table to ingest.
`postgres.schema.name`	Postgres schema name.

MySQL source

Key	Description
`mysql.table.name`	MySQL table to ingest.

Oracle source

Key	Description
`oracle.table.name`	Oracle table to ingest.
`oracle.schema.name`	Oracle schema (owner).

Schema registry (Kafka sources)

Key	Description
`schema.registry.type`	`'glue'`, `'confluent'`, `'jar'`, or `'file'`.
`schema.registry.glue.name`	Glue registry name.
`schema.registry.confluent.servers`	Confluent Schema Registry URL.
`schema.registry.confluent.key`	Confluent API key.
`schema.registry.confluent.secret`	Confluent API secret.
`schema.registry.proto.jar.location`	S3 path to the proto schema JAR.
`schema.registry.file.base.path`	Base path for file-based schema registry.
`schema.registry.file.full.path`	Full path to a specific schema file.

Cross-source advanced configs

Key	Description
`table.configured.base.path`	Custom storage location for the table (disables Clean & Restart).
`table.partition.style`	`'default'` or `'hive'`.
`flow.delayThreshold.numSyncIntervals`	Multiplied by sync frequency to set the delay threshold. `'0'` disables the Delayed state.
`flow.deduplicationPolicy`	`'none'` (default) or `'drop'` — deduplicates append-only flows by record key.
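
As a worked example of the `flow.delayThreshold.numSyncIntervals` arithmetic: with `min_sync_frequency_mins=5` and a value of `'3'`, the delay threshold comes out at 15. The sketch below assumes the threshold is expressed in minutes, since the sync frequency is; `delay_threshold_mins` is a hypothetical helper used only for illustration.

```python
def delay_threshold_mins(min_sync_frequency_mins, num_sync_intervals):
    """Return the delay threshold, or None when the Delayed state is disabled.

    num_sync_intervals is the string value of
    'flow.delayThreshold.numSyncIntervals'.
    """
    intervals = int(num_sync_intervals)
    if intervals == 0:
        return None  # '0' disables the Delayed state
    return intervals * min_sync_frequency_mins


threshold = delay_threshold_mins(5, "3")
disabled = delay_threshold_mins(5, "0")
```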

`alter_flow`

alter_flow(
    name: str,
    *,
    state: str | None = None,
    source: str | None = None,
    cluster: str | None = None,
    performance_profile: str | None = None,
    min_sync_frequency_mins: int | None = None,
    transformations: Sequence[str] | None = None,
    validations: Sequence[str] | None = None,
    quarantine_enabled: bool | None = None,
    advanced_configs: bool = False,
    options: Mapping[str, str] | None = None,
    unsafe_raw: bool = False,
    timeout: float | None = None,
    poll_interval: float | None = None,
)

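Each call changes exactly one aspect of the flow. Pass exactly one of: state, source, cluster, performance_profile, min_sync_frequency_mins, transformations, validations, quarantine_enabled, or advanced_configs=True. When advanced_configs=True, options supplies the key-value pairs to set.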

Parameter	Required	Type / values
`name`	yes	`str`
`state`	no	`"PAUSE"`, `"RESUME"`, `"CLEAN_AND_RESTART"`
`source`	no	`str`
`cluster`	no	`str`
`performance_profile`	no	`"BALANCED"`, `"FASTEST_READ"`, `"FASTEST_WRITE"`
`min_sync_frequency_mins`	no	`int`
`transformations`	no	`Sequence[str]`
`validations`	no	`Sequence[str]`
`quarantine_enabled`	no	`bool`
`advanced_configs`	no	`bool` — when `True`, emits `SET ADVANCED_CONFIGS` with `options`
`options`	no	`Mapping[str, str]`
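
The exactly-one rule for `alter_flow` can be sketched as a small pre-flight check. `pick_alteration` is a hypothetical helper, not part of the SDK; it treats any argument that is not `None` as supplied, so `quarantine_enabled=False` counts as an alteration.

```python
# Names copied from the alter_flow rule; the helper itself is hypothetical.
ALTERATIONS = (
    "state", "source", "cluster", "performance_profile",
    "min_sync_frequency_mins", "transformations", "validations",
    "quarantine_enabled",
)


def pick_alteration(*, advanced_configs=False, **kwargs):
    """Return the single alteration requested, or raise ValueError."""
    chosen = [name for name in ALTERATIONS if kwargs.get(name) is not None]
    if advanced_configs:
        chosen.append("advanced_configs")
    if len(chosen) != 1:
        raise ValueError(f"expected exactly one alteration, got {chosen}")
    return chosen[0]


paused = pick_alteration(state="PAUSE")
advanced = pick_alteration(
    advanced_configs=True,
    options={"hoodie.parquet.compression.codec": "zstd"},
)
```

Changing two things, such as the cluster and the sync frequency, therefore takes two separate `alter_flow` calls.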

Examples

# Pause a flow
client.alter_flow("events_pipeline", state="PAUSE")

# Move to a different cluster
client.alter_flow("events_pipeline", cluster="ingest_cluster_v2")

# Update advanced configs
client.alter_flow(
    "events_pipeline",
    advanced_configs=True,
    options={"hoodie.parquet.compression.codec": "zstd"},
)

`delete_flow`

delete_flow(name: str, *, unsafe_raw=False, timeout=None, poll_interval=None)

Example

client.delete_flow("events_pipeline")

`describe_flow`

describe_flow(name: str, *, unsafe_raw=False, timeout=None, poll_interval=None)

Example

result = client.describe_flow("events_pipeline")

`show_flows`

show_flows(*, timeout=None, poll_interval=None)

Example

result = client.show_flows()
