Examine Onehouse quarantine tables with AWS Glue and Athena

Problem: On AWS, Onehouse quarantine tables don’t have Glue metadata so that they cannot be queried using Athena.

Solution: By using AWS Glue crawlers and Amazon Athena, you can discover the schema and partitioning of Hudi tables, and create the metadata.

Overview

Create an AWS Glue Crawler: Configure an AWS Glue crawler to target the S3 path where the Hudi tables are located. You can specify the maximum depth for the crawler to traverse the S3 paths.
Run the AWS Glue Crawler: Execute the crawler to inspect the S3 paths. The crawler will catalog the schema information, including new tables, schema updates, and partition information, in the AWS Glue Data Catalog.
View the Hudi Table Definition: After the crawler run is complete, you can view the Hudi table definition in the AWS Glue console. The crawler will create table definitions for both read-optimized and real-time tables if applicable (e.g., sample_hudi_mor_table_ro for read-optimized and sample_hudi_mor_table_rt for real-time).
Query with Amazon Athena: With the table definitions created in the AWS Glue Data Catalog, you can now open the Amazon Athena console and start querying the Hudi tables using standard SQL queries.

Visual Steps

Go to AWS Glue, and choose 'Crawlers' in the left menu pane. Then, click 'Create crawler'.

Enter the crawler name, ex: Onehouse quarantine data, click 'Next'

Click 'Add a data source'

Add data source – specify Hudi in S3

Add your s3 path, this will be your Onehouse customer bucket, datalake folder and Onehouse database name, like: s3://onehouse-customer-bucket-c61092fc/datalake/acme_retail_bronze/. Then click 'Add a Hudi data source'.

Click 'Next'

If already have a role named '"Implementing Stream Processing on the Data Lakehouse for High Performance, Operational Efficiency and Cost Savings." ' choose it, otherwise, click 'Create new IAM role' (it will be configured automatically).