Create a Cluster
Open the Clusters page in the Onehouse console to create a new Cluster. Below we will cover the configurations to set up the Cluster.
You can create multiple Clusters of the same type to isolate different workloads.
Basic configurations
- Name: The name by which to identify the Cluster.
- Cluster Type: Select from the following Cluster types. This cannot be changed after creation.
- Managed: Run Flows to ingest data and Table Services to optimize your tables.
- SQL: Run SQL workloads on the Quanton engine. Submit queries through a JDBC Endpoint or the Onehouse SQL Editor.
- Spark: Create and run Jobs to execute Apache Spark code in Python, Java, or Scala on the Quanton engine.
- Open Engines: Deploy open source compute engines on Onehouse infrastructure with Open Engines.
- Notebook (beta feature): Deploy a Jupyter notebook on Onehouse infrastructure to run interactive PySpark workloads on the Onehouse Quanton engine.
OCU configurations
Specify OCU limits to constrain the minimum and maximum Onehouse Compute Units (OCU) the Cluster will use per hour. These limits determine how many instances the Cluster can use, based on the hourly OCU cost of your selected instance type(s).
- Max OCU / Hour: Maximum OCU the Cluster will use per hour. Set this to manage your costs.
- Min OCU / Hour: Minimum OCU the Cluster will use per hour. Set this higher if you need to keep the Cluster warm.
Setting Max OCU for your Clusters helps you keep costs within a budget. If your data volumes or workload complexity change and the Cluster's usage reaches its Max OCU, the Cluster will stop scaling up. This may delay data processing, so consider how your workloads grow or fluctuate when choosing the limit.
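As a rough illustration of how Max OCU bounds scaling, the sketch below estimates how many instances fit under a limit. The function name and the per-instance OCU cost are hypothetical; look up the real hourly OCU cost of your instance type, and note that this ignores any details of how Onehouse accounts for drivers versus workers.

```python
import math


def max_instances(max_ocu_per_hour: float, ocu_per_instance_hour: float) -> int:
    """Estimate how many instances fit under a Max OCU / Hour limit.

    Both arguments are illustrative inputs, not values taken from Onehouse.
    """
    if ocu_per_instance_hour <= 0:
        raise ValueError("ocu_per_instance_hour must be positive")
    # Partial instances cannot be provisioned, so round down.
    return math.floor(max_ocu_per_hour / ocu_per_instance_hour)


# Hypothetical numbers: a limit of 10 OCU/hour and an instance costing 3 OCU/hour.
print(max_instances(10, 3))
```

If the estimate is lower than the number of instances your peak workload needs, expect processing delays once the Cluster reaches the limit.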
Instance type configurations
Specify the instance types for the Cluster. The instance types documentation covers the available instance types, custom instance types, and OCU costs.
- Worker Type: Specify the instance type for the Cluster's workers (aka executors).
- Spot Instances: Optionally enable spot instances for the Cluster's workers.
- Not enabled by default.
- Enabling this may help reduce your cloud provider compute costs for workloads that do not require high-availability.
- Enabling this will not affect OCU pricing.
- This configuration only enables spot instances for workers. Drivers will always use on-demand instances.
- Driver Type: Specify the instance type for the Cluster's driver(s).
- When set to Auto (the default option), drivers will use the same instance type as workers.
- When set to
Managed Clusters can write to up to N tables, where N = 75 × (number of driver node cores). This limit applies regardless of how many Flows or Table Services are writing to those tables.
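The table limit above is simple arithmetic. A minimal sketch, assuming you pass the total number of driver node cores (the function and constant names are illustrative, not part of any Onehouse API):

```python
# Multiplier taken from the limit stated above: N = 75 x (driver node cores).
TABLES_PER_DRIVER_CORE = 75


def max_writable_tables(driver_cores: int) -> int:
    """Return the maximum number of tables a Managed Cluster can write to."""
    if driver_cores < 1:
        raise ValueError("driver_cores must be at least 1")
    return TABLES_PER_DRIVER_CORE * driver_cores


# Example: a driver with 4 cores.
print(max_writable_tables(4))  # 300
```

If you expect to write to more tables than this limit allows, choose a driver instance type with more cores.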
Catalog
The catalog configuration is required for SQL, Spark, and Open Engines (Trino and Flink) Clusters. The specified catalog will be used by the Cluster to read and write data.
SQL and Spark Clusters
SQL and Spark Clusters can connect to the following catalogs:
- Onehouse Managed (recommended): Use the built-in catalog that integrates seamlessly with all Onehouse services. This option is recommended, unless you explicitly need to connect to an external catalog.
- External Iceberg REST Catalog (IRC): Connect to an external IRC as your primary catalog. Currently, you can integrate with AWS Glue Iceberg REST Catalog and Snowflake Open Catalog. Note the following limitations with IRC:
- Catalog events (such as table creation) will not be registered to Onehouse.
- Tables created with IRC cannot run table services.
- Additional limitations are called out in the documentation for each specific catalog.
Open Engines Clusters
Open Engines Trino and Apache Flink Clusters must connect to an external catalog as their primary catalog. We plan to add support for the Onehouse Managed catalog in the future.
Attached storage
Currently available in AWS projects only.
Clusters automatically provision attached storage volumes, such as Amazon EBS, to support additional disk spilling for memory-intensive workloads.
The following attached storage volumes will be provisioned based on your Cluster instance types.
| Instance Cores | Storage Size | Volume Configuration | IOPS | Throughput |
|---|---|---|---|---|
| 4 | 250 GB | 2 × 125 GB | 6000 | 250 MiB/s |
| 8 | 500 GB | 4 × 125 GB | 9000 | 500 MiB/s |
| 16 | 1000 GB | 4 × 250 GB | 12000 | 750 MiB/s |
| 32 | 2000 GB | 4 × 500 GB | 12000 | 750 MiB/s |
| 48 | 3000 GB | 3 × 1 TB | 16000 | 1000 MiB/s |
| 64 | 4000 GB | 4 × 1 TB | 16000 | 1000 MiB/s |
| 96 | 6000 GB | 3 × 2 TB | 16000 | 1000 MiB/s |
| 192 | 12000 GB | 4 × 3 TB | 16000 | 1000 MiB/s |
You can also specify custom storage sizes via API commands: CREATE CLUSTER and ALTER CLUSTER.
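For capacity planning, the default storage table above can be encoded as a lookup keyed by instance core count. This is a sketch only: the type and function names are hypothetical, 1 TB is treated as 1000 GB to match the table's totals, and the IOPS and throughput values are stored exactly as listed.

```python
from typing import NamedTuple


class AttachedStorage(NamedTuple):
    total_gb: int
    volume_count: int
    volume_gb: int
    iops: int
    throughput_mib_s: int


# Rows copied from the attached storage table, keyed by instance cores.
ATTACHED_STORAGE = {
    4: AttachedStorage(250, 2, 125, 6000, 250),
    8: AttachedStorage(500, 4, 125, 9000, 500),
    16: AttachedStorage(1000, 4, 250, 12000, 750),
    32: AttachedStorage(2000, 4, 500, 12000, 750),
    48: AttachedStorage(3000, 3, 1000, 16000, 1000),
    64: AttachedStorage(4000, 4, 1000, 16000, 1000),
    96: AttachedStorage(6000, 3, 2000, 16000, 1000),
    192: AttachedStorage(12000, 4, 3000, 16000, 1000),
}


def default_attached_storage(instance_cores: int) -> AttachedStorage:
    """Look up the default attached storage for an instance core count."""
    try:
        return ATTACHED_STORAGE[instance_cores]
    except KeyError:
        raise ValueError(
            f"no default listed for {instance_cores} cores"
        ) from None


storage = default_attached_storage(16)
print(storage.total_gb, storage.volume_count, storage.volume_gb)  # 1000 4 250
```

Core counts that do not appear in the table raise an error rather than guessing a value.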
Cluster usage notifications
You may set an OCU Limit Notification Threshold to receive a notification when the Cluster has scaled to X% of the Maximum OCU. Usage is calculated as the average over the past hour.
The default notification threshold is 80%, but you can set this to any value, or set 0% to disable notifications.
Notifications are received via the Onehouse UI, email, and (if configured) Slack.
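The threshold rule can be sketched as follows. This is an illustration of the described behavior, not Onehouse's implementation: the function name, the representation of usage as a list of OCU samples from the past hour, and the choice of "greater than or equal to" for the comparison are all assumptions.

```python
from typing import Sequence


def should_notify(
    ocu_samples_past_hour: Sequence[float],
    max_ocu: float,
    threshold_pct: float = 80.0,
) -> bool:
    """Return True when average usage reaches the notification threshold."""
    # A threshold of 0% disables notifications.
    if threshold_pct == 0:
        return False
    if not ocu_samples_past_hour:
        return False
    # Usage is the average over the past hour.
    average_ocu = sum(ocu_samples_past_hour) / len(ocu_samples_past_hour)
    return average_ocu >= max_ocu * threshold_pct / 100


# Hypothetical samples against a Max OCU of 10 and the default 80% threshold.
print(should_notify([8, 8, 8], max_ocu=10))  # True
print(should_notify([5, 6], max_ocu=10))     # False
```

Because usage is averaged over the hour, a short spike above the threshold may not trigger a notification on its own.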