Jobs
Jobs allow you to submit Spark workloads in Java, Scala, or Python to run on Onehouse with autoscaling, observability, automatic table management, and more built-in.
Architecture
You can submit Spark workloads through Jobs, which execute code on Onehouse-managed instances (AWS EC2 or Google Compute Engine) within your VPC. These instances are pre-configured to run Spark with Apache Hudi tables.
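For illustration, a Job is typically a standard Spark application that reads or writes Hudi tables. Below is a minimal PySpark sketch of such a workload; the table name, columns, and bucket path are hypothetical placeholders, not Onehouse-specific APIs.

```python
from pyspark.sql import SparkSession

# Minimal sketch of a Spark job that upserts into a Hudi table.
# All names and paths below are hypothetical placeholders.
spark = SparkSession.builder.appName("onehouse-job-example").getOrCreate()

df = spark.createDataFrame(
    [(1, "2024-01-01", 9.99), (2, "2024-01-02", 19.99)],
    ["order_id", "order_date", "amount"],
)

hudi_options = {
    "hoodie.table.name": "orders",                            # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "order_date",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://your-bucket/lakehouse/orders")                # hypothetical path
)

spark.stop()
```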
Pre-Installed Libraries
The following libraries are pre-installed on Onehouse Spark Clusters:
- Apache Spark 3.5.2
- Apache Hudi 0.14.1
Billing
Jobs are billed at the same OCU rate as other instances in the product, as described in usage and billing.
Prerequisite: Set Up a Lock Provider
If you have not already set up a lock provider for your project, follow the steps below.
Onehouse supports Apache Zookeeper and Amazon DynamoDB lock providers to ensure safe operations when a table has multiple writers. This is a prerequisite for running Jobs on Onehouse.
Follow these steps to set up your lock provider:
1. Create an Apache Zookeeper or DynamoDB lock provider.
   - Apache Zookeeper: Create a Zookeeper instance that is accessible from your VPC.
   - DynamoDB: Create a DynamoDB table in the same AWS account as your Onehouse project. Include an attribute named "key" and set it as the table's partition key (see the sketch after this list).
2. In the Onehouse console, navigate to Settings > Integrations > Lock Provider and add the lock provider you created in step 1.
3. All Onehouse writers within the project will now use the lock provider you added. The SQL endpoint can concurrently write to tables alongside other Onehouse writers such as Stream Captures and Table Services.
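For the DynamoDB option, the table only needs a string attribute named "key" as its partition key. A minimal sketch using boto3, where the table name, region, and billing mode are hypothetical choices:

```python
import boto3

# Sketch: create the DynamoDB lock table with "key" as the partition key.
# Table name, region, and billing mode are hypothetical placeholders.
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_table(
    TableName="onehouse-lock-table",
    AttributeDefinitions=[{"AttributeName": "key", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "key", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",  # one option; provisioned capacity also works
)
```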
If you are using DynamoDB as your lock provider, Onehouse requires additional access. In your Onehouse Terraform script or CloudFormation template, ensure that `enableDynamoDB` is set to `true`.
Once the lock provider is configured, it is applied automatically, so you do not need to define it explicitly in your Spark code. Concurrency control is built into Onehouse Jobs, and the default settings are located in s3://onehouse-customer-bucket-XXXXX/hudi_configs/onehouse-spark-code/hudi_defaults/hudi-defaults.conf.
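As a rough illustration, these defaults correspond to Hudi's standard optimistic concurrency settings. The sketch below expresses the equivalent options as a Python dict of Spark writer configs, assuming a DynamoDB lock provider; the table name, lock key, and region are hypothetical placeholders, and the actual values for a given table appear on its table details page.

```python
# Illustrative Hudi concurrency/lock configs (values are hypothetical placeholders).
lock_provider_options = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider": "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table": "onehouse-lock-table",   # hypothetical lock table
    "hoodie.write.lock.dynamodb.partition_key": "orders",        # lock key, typically the Hudi table name
    "hoodie.write.lock.dynamodb.region": "us-east-1",            # hypothetical region
}
```

An external (non-Onehouse) writer would pass the actual values from the table details page as writer options, e.g. `.options(**lock_provider_options)`.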
Additional Tips on the Lock Provider & Concurrency:
- If you are writing to Onehouse tables with an external (non-Onehouse) writer, you should apply the same lock provider configs found on the table details page to avoid corrupting tables.
- When concurrent writers attempt to write to the same file group, one writer will fail gracefully under Apache Hudi's Optimistic Concurrency Control; transactions that touch different file groups are unaffected. If you expect concurrent writes to the same file group, add failure retry logic in your orchestration (see the sketch after this list) and consider temporarily pausing a writer if transactions repeatedly fail.
- After a lock provider is added in the Onehouse console, it cannot be modified or removed.
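As a rough illustration of the retry advice above, the sketch below wraps a write in simple retry-with-backoff logic; `run_with_retries` and the `write_fn` it calls are hypothetical helpers standing in for whatever your job or orchestrator invokes.

```python
import time

def run_with_retries(write_fn, max_attempts=3, backoff_seconds=30):
    """Retry a Hudi write that may fail on an OCC conflict (hypothetical helper)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except Exception:  # in practice, narrow this to your writer's conflict error
            if attempt == max_attempts:
                raise  # repeated failures: consider pausing one of the concurrent writers
            time.sleep(backoff_seconds * attempt)

# Usage: run_with_retries(my_hudi_write), where my_hudi_write performs the Spark write.
```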