Migrate from Amazon EMR

This document outlines the process of migrating workloads from Amazon EMR to Onehouse, detailing foundational differences, migration patterns for data and code, best practices, and tooling options.

Code Migration

Onehouse Jobs are fully compatible with Apache Spark. You can reuse any code from EMR that does not rely on AWS-specific features.

Refer here for details on how to create Onehouse Jobs.
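
For example, a job written purely against standard Spark APIs runs on Onehouse without code changes. The sketch below is illustrative only; the S3 paths and column names are placeholders, not part of the Onehouse documentation.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# A typical EMR Spark job using only standard Spark APIs -- nothing here is
# AWS- or EMR-specific, so it can run unchanged as a Onehouse Job.
spark = SparkSession.builder.appName("daily-orders-aggregation").getOrCreate()

orders = spark.read.parquet("s3://my-bucket/raw/orders/")    # placeholder input path

daily_totals = (
    orders
    .groupBy(F.to_date("order_ts").alias("order_date"))      # placeholder column names
    .agg(F.sum("amount").alias("total_amount"))
)

daily_totals.write.mode("overwrite").parquet("s3://my-bucket/curated/daily_orders/")  # placeholder output path

spark.stop()
```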

Glue Catalog & Lake Formation

Our Lake Formation integration supports named data catalog resources and tag-based access control through the AWS Glue Catalog.

Onehouse Jobs can sync data to the Glue Catalog asynchronously or use Glue as your primary catalog. If you need Glue as your primary catalog, contact Onehouse support for configuration.
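
Once a table is synced to Glue, downstream Spark jobs can read it through the catalog. The snippet below is a minimal sketch, assuming the Spark session is already configured to resolve tables through Glue; the database and table names are placeholders.

```python
from pyspark.sql import SparkSession

# Query a table that has been synced to the Glue Catalog. Assumes the cluster's
# metastore is already pointed at Glue (see the note above about configuration);
# "analytics_db.orders" is a placeholder name.
spark = SparkSession.builder.appName("glue-catalog-read").enableHiveSupport().getOrCreate()

recent_orders = spark.sql("SELECT * FROM analytics_db.orders LIMIT 10")
recent_orders.show()
```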

Job Flows

Job Flows in EMR on EC2 let you spin up an EMR cluster, run steps (i.e., Spark jobs), then spin down the cluster when those steps complete. This can be a cost-efficient approach when you don't need a persistent cluster.

Onehouse supports a similar setup via our API and Airflow operators. You can map each 'step' in EMR to a Job in Onehouse.

To replicate this setup with Onehouse APIs, you can do the following:

  1. Create a Cluster with CREATE CLUSTER.
  2. Create Job definitions with CREATE JOB.
  3. Run the Jobs with RUN JOB.
    1. The API will return a Job run ID.
    2. If using the Airflow HTTP operator, this response will be captured.
  4. Check for Job run completion with DESCRIBE JOB_RUN.
    1. Poll repeatedly for completion.
    2. If using the Airflow HTTP operator, you can use the response_check parameter (see the Airflow sketch after this list).
  5. Delete Cluster with DELETE CLUSTER.
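
Below is a minimal Airflow sketch of steps 3 and 4, assuming Airflow 2.x with the apache-airflow-providers-http package and an Airflow connection named onehouse_api. The endpoint paths, request payloads, and response fields (jobRunId, status) are hypothetical placeholders; consult the Onehouse API reference for the actual request formats.

```python
import json
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.providers.http.sensors.http import HttpSensor

with DAG(
    dag_id="onehouse_job_flow",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Step 3: trigger the Job; the response contains a Job run ID, which is
    # pushed to XCom via response_filter.
    run_job = SimpleHttpOperator(
        task_id="run_job",
        http_conn_id="onehouse_api",             # Airflow connection for the Onehouse API (assumed name)
        endpoint="v1/jobs/my-job/runs",          # hypothetical path -- see the Onehouse API reference
        method="POST",
        data=json.dumps({}),                     # request body, if any, per the API reference
        headers={"Content-Type": "application/json"},
        response_filter=lambda response: response.json()["jobRunId"],  # hypothetical field name
    )

    # Step 4: poll DESCRIBE JOB_RUN until the run reports completion,
    # using the response_check parameter mentioned above.
    wait_for_job = HttpSensor(
        task_id="wait_for_job",
        http_conn_id="onehouse_api",
        endpoint="v1/jobs/my-job/runs/{{ ti.xcom_pull(task_ids='run_job') }}",  # hypothetical path
        response_check=lambda response: response.json().get("status") == "COMPLETED",  # hypothetical field/value
        poke_interval=60,     # seconds between polls
        timeout=60 * 60,      # give up after an hour
    )

    run_job >> wait_for_job
```

Cluster creation and deletion (steps 1, 2, and 5) can be wired up the same way with additional HTTP operator tasks against the corresponding endpoints.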