Create a Java or Scala (JAR) Job

JAR (Java archive) Jobs in Onehouse allow you to run an Apache Spark application in Java or Scala.

JAR Job Quickstart

Prerequisites

  • Use JDK 17 for the best compatibility.
  • Ensure you have the Project Admin role. This is currently a requirement for creating Jobs.
  • A Cluster with the 'Spark' type to run the Jobs. You need access to an existing Cluster in your project, or you can create one by following these steps:
    1. In the Onehouse console, navigate to the Clusters page.
    2. Click 'Add new Cluster'.
    3. Select 'Spark' as the Cluster type.

Step 1: Build and upload your JAR

Build and upload your JAR containing the executable and dependencies.

  1. Download the demo Java application: SparkJava.zip. A sketch of what its main class might look like appears after these steps.
  2. In your local environment, unzip the downloaded file, then navigate to the SparkJava parent directory.
  3. Build a fat JAR with the Jackson JSON library included:
    ./gradlew shadowJar
  4. The JAR will be created under the SparkJava/build/libs/ directory.
  5. Upload the JAR to an object storage bucket accessible by Onehouse.
    tip

    Onehouse will always have access to the bucket named onehouse-customer-bucket-<request_id_prefix>. You can retrieve the request ID for your project by clicking on the settings popup in the bottom left corner of the Onehouse console.
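
For illustration, here is a minimal sketch of what a Spark DataFrame application along the lines of the demo's org.example.SparkDataFrameExample class might contain. This is an assumption for orientation only; the actual code in SparkJava.zip may differ.

package org.example;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical sketch; the demo's actual class may differ.
public class SparkDataFrameExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SparkDataFrameExample")
                .getOrCreate();

        // Build a small in-memory DataFrame and print it to the driver logs.
        StructType schema = new StructType()
                .add("id", DataTypes.IntegerType)
                .add("name", DataTypes.StringType);
        List<Row> rows = Arrays.asList(
                RowFactory.create(1, "alpha"),
                RowFactory.create(2, "beta"));
        Dataset<Row> df = spark.createDataFrame(rows, schema);
        df.show();

        spark.stop();
    }
}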

Step 2: Create the Job in Onehouse

  1. First, ensure you have the Project Admin role. This is currently a requirement for creating Jobs.
  2. In the Onehouse console, open Jobs and click Create Job.
  3. Configure the Job by filling in the following fields:
    1. Name: Specify a unique name to identify the Job.
    2. Type: Select 'JAR'.
    3. Parameters: Specify parameters for the Job, including the executable filepath and JAR class. Ensure you follow the JAR Job parameters guidelines. You can copy-paste the following parameters for the SparkJava example, replacing the JAR path with your own:
      ["--class", "org.example.SparkDataFrameExample", "s3://onehouse-customer-bucket-12345/demo-jobs/SparkJava-1.0-SNAPSHOT.jar"]
  4. Create the Job. This does not immediately run the Job; you will trigger a run in the next step.

Step 3: Run the Job

After the Job is created, you can trigger a run. Learn more about running Jobs here.

  1. In the Onehouse console, navigate to the Jobs page.
  2. Open the Job you'd like to run, then click Actions > Run. You can also use the RUN JOB API command.
  3. When the Job starts running, you can view the driver logs in the Onehouse console by clicking into the Job run. Read more about monitoring Job runs here.

Configuring JAR Job parameters

Configure the parameters for your Job as a JSON array of strings. Include the required parameters, and add them in the order shown below.

Order | Parameter | Description
1 | --class | The main class to execute (required for JAR jobs)
2 | --conf | Optional Spark configuration properties (can be repeated)
3 | /path/to/jar | The path to your executable JAR in cloud storage (required)
4 | arg1, arg2, … | Arguments passed to your main class

Example:

[
  "--class", "com.example.MySparkApp",
  "--conf", "spark.executor.memory=4g",
  "--conf", "spark.executor.cores=2",
  "/path/to/my-spark-app.jar",
  "arg1", "--flag1"
]
warning

Follow these guidelines to ensure your Jobs run smoothly:

  • Set any spark.extraListeners through the Job parameters. Do not add these in your code directly.
  • If you are setting spark.driver.extraJavaOptions or spark.executor.extraJavaOptions, you must add "-Dlog4j.configuration=file:///opt/spark/log4j.properties" (see the example after this list).
  • Do not set the following Apache Spark configurations:
    • spark.kubernetes.node.selector.*
    • spark.kubernetes.driver.label.*
    • spark.kubernetes.executor.label.*
    • spark.kubernetes.driver.podTemplateFile
    • spark.kubernetes.executor.podTemplateFile
    • spark.metrics.*
    • spark.eventLog.*
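
For example, a parameters array that follows the extraJavaOptions guideline above might look like the following. The class name, JAR path, and the extra -Dmy.custom.flag option are placeholders; only the log4j.configuration option is required by the guideline.

[
  "--class", "com.example.MySparkApp",
  "--conf", "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/log4j.properties -Dmy.custom.flag=true",
  "/path/to/my-spark-app.jar"
]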

Managing Library Dependencies

You may install Java library dependencies from Maven or Gradle for your Jobs. Dependencies are installed by each Job's driver, rather than at the Cluster level, to prevent dependency conflicts between Jobs running on the same Cluster.

Pre-Installed Libraries

The following libraries come pre-installed on every 'Spark' Cluster:

  • Apache Spark 3.5.2
  • Apache Hudi 0.14.1
  • Apache Iceberg 1.10.0 (only if the Cluster uses an Iceberg REST Catalog)
tip

In the build.gradle file for your Job, you should import pre-installed libraries as compileOnly dependencies. This keeps your JAR size small and prevents version conflicts.

Example:

dependencies {
    compileOnly group: "org.apache.hudi", name: "hudi-spark3.5-bundle_2.12", version: "0.14.1"
}
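
Likewise, because Apache Spark 3.5.2 is pre-installed on the Cluster, the Spark dependencies themselves can be declared compileOnly. A sketch, assuming the standard Spark SQL artifact built for Scala 2.12:

dependencies {
    // Pre-installed on the Cluster, so compile against these without bundling them into the fat JAR.
    compileOnly group: "org.apache.spark", name: "spark-sql_2.12", version: "3.5.2"
    compileOnly group: "org.apache.hudi", name: "hudi-spark3.5-bundle_2.12", version: "0.14.1"
}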

Install Libraries for JAR Jobs

You may upload any Java libraries as JAR files to your cloud storage bucket, then reference them in the Job parameters. Below are the steps to do this:

  1. Upload your dependencies as JAR files to a cloud storage bucket that Onehouse can access.
  2. When you create a Job, reference the JAR files in the Job Parameters with "--conf", followed by "spark.jars=<JARS>". Example:
    "--class", "com.example.MyApp", "--conf", "spark.jars=s3a://path/to/dependency1.jar,s3a://path/to/dependency2.jar", "s3a://path/to/myapp.jar"