External Tables (aka Observed Lake)
Overview
Onehouse offers fully managed ELT/ETL pipelines and ingestion services, so you do not have to take on the operational burden of running your own compute engines. If you prefer to manage your own pipelines outside of Onehouse, you can integrate your existing tables as Onehouse External Tables. Registering external tables with Onehouse lets you enable Onehouse observability and table optimization features for them.
Pre-requisites
To register External Tables, Onehouse needs permissions to the storage location where you are storing your data. Permissions to storage are configured within the onboarding steps: AWS link, GCP link.
As of today, you can register Apache Hudi tables as External Tables. Support for registering Apache Iceberg and Delta Lake tables is coming soon; contact us if you are interested in registering these tables as well.
Registering External Tables
Follow these steps to register External Tables in Onehouse:
- Navigate to ‘Data’ and click ‘Create Lake’.
- Provide a ‘Name’ for the lake, select ‘Observed Lake’ for ‘Type’, and provide the path prefix of your Hudi tables as the ‘Root Path’. Click ‘Save’.
Note: the path prefix provided when creating an Observed Lake must be unique across all lakes created in Onehouse.
Onehouse will run a job to discover all tables within the path prefix provided. This discovery process is expected to take a few minutes depending on the size and number of your tables.
Ensure every Hudi table under a single parent path has a unique name. The Onehouse discovery process will only register the first table it finds with a particular name, and any subsequent tables with the same name in different subfolders will be ignored.
For example, if you have two tables named table under the parent path s3://bucket/parent/:
s3://bucket/parent/folder-1/table
s3://bucket/parent/folder-2/table
Onehouse will discover the first table, i.e. folder-1/table, but will skip the second one in folder-2 because the name table has already been found.
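To catch this situation before registering a lake, you could scan your own list of table paths for repeated names. The sketch below is only an illustration: it assumes the table name is the last segment of the path, and the function name find_duplicate_table_names is hypothetical, not part of Onehouse.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_duplicate_table_names(table_paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group table paths by table name and return only the names used more than once.

    Assumption: the table name is the last path segment
    (e.g. 'table' for 's3://bucket/parent/folder-1/table').
    """
    paths_by_name: Dict[str, List[str]] = defaultdict(list)
    for path in table_paths:
        name = path.rstrip("/").rsplit("/", 1)[-1]
        paths_by_name[name].append(path)
    return {name: paths for name, paths in paths_by_name.items() if len(paths) > 1}


duplicates = find_duplicate_table_names([
    "s3://bucket/parent/folder-1/table",
    "s3://bucket/parent/folder-2/table",
])
for name, paths in duplicates.items():
    print(f"Table name '{name}' appears {len(paths)} times: {paths}")
```

Any name reported here would have only one of its tables registered, so rename or relocate the others before creating the lake.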
External Table Synchronization
Metadata for External Tables is synchronized once every 5 minutes.
After the initial discovery job, if you add more tables within the same storage path prefix, these new tables will be discovered by an operation which runs once every hour.
Table Observer for External Tables
As soon as you register your External Tables, you will see them on the 'Data' page, grouped under the 'Lake' name you provided during registration. This unlocks all of the charts, metrics, and observability that come with Onehouse fully managed tables. See the Table Details page for more information.
Table Services for External Tables
If you turn on Onehouse Table Optimizer for your external tables, Onehouse can run advanced optimizations for cleaning, compaction, clustering, etc. In this configuration your external pipelines will continue writing to the tables and Onehouse will also write to the same table with asynchronous operations. To enable this multi-writer scenario there are a few pre-requisites to meet:
Pre-requisites
- Your Apache Hudi tables must be version 0.14.0 or later.
- You must create a lock provider and configure ALL of your existing and future external writers to use this lock provider. You will also integrate this lock provider with Onehouse and enable concurrency control on your External Tables, as documented below.
  - Failure to do so can result in corrupted tables.
- You should turn off all Apache Hudi table services in your existing and future external writers, as documented below, so that the same services do not compete inefficiently under different configurations.
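The version requirement can be checked with a small comparison like the one below. This is a minimal sketch that assumes you already know the Hudi release version your writers use and pass it in as a string; the helper names are hypothetical.

```python
from typing import Tuple

# Minimum Hudi version stated in the pre-requisites above.
MIN_HUDI_VERSION = (0, 14, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse '0.14.1' or '0.15.0-rc1' into a 3-part tuple of ints, ignoring any suffix."""
    core = version.split("-", 1)[0]
    parts = tuple(int(part) for part in core.split("."))
    # Pad short versions such as '0.14' so tuple comparison behaves as expected.
    return parts + (0,) * (3 - len(parts))


def meets_minimum(version: str) -> bool:
    """Return True if the given Hudi version is 0.14.0 or later."""
    return parse_version(version) >= MIN_HUDI_VERSION


for candidate in ["0.13.1", "0.14.0", "0.15.0-rc1"]:
    print(candidate, meets_minimum(candidate))
```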
Step 1: Create a Lock Provider and Register it with Onehouse
Follow the Apache Hudi documentation to create your own lock provider.
Once you create a lock provider, you can integrate it with Onehouse following these steps:
- Navigate to Settings -> Integrations.
- Click 'Configure' on the Lock Provider tile.
- Provide a ‘Name’ and select Zookeeper or DynamoDB for ‘Provider’.
  - For Zookeeper, provide a comma-separated list of host:port pairs for ‘Zookeeper Servers’.
  - For DynamoDB, provide a table name and region.
    - The DynamoDB table must be in the same account as the Onehouse project.
    - Make sure an attribute with the name "key" is present in the DynamoDB lock table. The key attribute should be the partition key; you do not need to specify a sort key.
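For the DynamoDB option, a lock table with the required key schema could be created with boto3 roughly as follows. This is a sketch under assumptions: the string attribute type and on-demand billing mode are illustrative choices not stated above, and the table name and region in the comment are placeholders.

```python
from typing import Any, Dict


def lock_table_request(table_name: str) -> Dict[str, Any]:
    """Build create_table arguments for a lock table whose partition key is 'key'.

    Assumptions: the 'key' attribute is a string ('S') and the table uses
    on-demand billing. No sort key is defined.
    """
    return {
        "TableName": table_name,
        "AttributeDefinitions": [{"AttributeName": "key", "AttributeType": "S"}],
        "KeySchema": [{"AttributeName": "key", "KeyType": "HASH"}],
        "BillingMode": "PAY_PER_REQUEST",
    }


def create_lock_table(table_name: str, region: str) -> None:
    """Create the lock table in the given region (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so lock_table_request works without boto3 installed

    client = boto3.client("dynamodb", region_name=region)
    client.create_table(**lock_table_request(table_name))


# Example (placeholder values): create_lock_table("hudi-locks", "us-west-2")
```

Run this with credentials for the same account as the Onehouse project, then enter the same table name and region in the Lock Provider form.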
A lock provider is now registered and available to use with Onehouse External Tables.
Step 2: Configure your external writers
Mandatory multi-writer configuration: Make sure all existing and future writers to this table are configured in multi-writer mode with the lock provider shared by Onehouse and your external services. Follow the Hudi documentation for the configuration details.
Also ensure that your writers are using UTC by setting hoodie.table.timeline.timezone=UTC.
If you do not make these configuration changes to your existing writer pipelines, the tables are at risk of concurrency failures and corruption.
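As an illustration of what such a writer configuration might look like with a DynamoDB lock provider, the sketch below builds a dictionary of Hudi write options. The option keys are taken from the Apache Hudi 0.14 concurrency control documentation as best understood here; verify them against the Hudi version you run. The function name and argument values are hypothetical.

```python
from typing import Dict


def multi_writer_options(lock_table: str, region: str, partition_key: str) -> Dict[str, str]:
    """Hudi write options for optimistic concurrency control with a DynamoDB lock.

    Assumption: key names follow the Hudi 0.14 docs; newer releases may rename some.
    """
    return {
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        "hoodie.write.lock.provider":
            "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
        "hoodie.write.lock.dynamodb.table": lock_table,
        "hoodie.write.lock.dynamodb.region": region,
        "hoodie.write.lock.dynamodb.partition_key": partition_key,
        # Required by this guide: writers must use UTC for the timeline.
        "hoodie.table.timeline.timezone": "UTC",
    }


options = multi_writer_options("hudi-locks", "us-west-2", "my_table")
for key, value in sorted(options.items()):
    print(f"{key}={value}")
```

In a PySpark writer, a dictionary like this can be passed along with your other Hudi options, for example via df.write.format("hudi").options(**options). The lock table and region must match the lock provider registered in Onehouse.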
Mandatory disabling of external table services: Make sure all table services are disabled in all existing and future writers to this table. Follow the guidance in the Apache Hudi documentation on how to disable all table services. If you do not disable your own table services, significant resources could be wasted when Onehouse and your table services run with competing configurations.
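One way to apply this consistently is to overlay a fixed set of "off" switches on every writer's options. The sketch below assumes the listed Hudi option keys apply to your Hudi version (they are drawn from the Hudi configuration reference as best understood here, so confirm them before relying on this); the helper name is hypothetical.

```python
from typing import Dict

# Assumed Hudi switches for table services; confirm against your Hudi version.
TABLE_SERVICES_OFF: Dict[str, str] = {
    "hoodie.table.services.enabled": "false",  # master switch for table services
    "hoodie.clean.automatic": "false",
    "hoodie.archive.automatic": "false",
    "hoodie.compact.inline": "false",
    "hoodie.clustering.inline": "false",
    "hoodie.clustering.async.enabled": "false",
}


def with_table_services_disabled(writer_options: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the writer options with all table services switched off."""
    merged = dict(writer_options)
    merged.update(TABLE_SERVICES_OFF)
    return merged


original = {"hoodie.table.name": "my_table", "hoodie.clean.automatic": "true"}
safe_options = with_table_services_disabled(original)
print(safe_options["hoodie.clean.automatic"])
```

Applying the overlay last means it wins over any service settings already present in a writer's configuration, while leaving the original dictionary unchanged.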
You must disable all services for the table in Onehouse before you enable or disable the Hudi Metadata Table for that table, or enable or disable any feature of the Hudi Metadata Table such as column-level statistics or the record-level index. Failure to do so may result in data loss or corruption.
Step 3: Enable Concurrency Control on your Tables
Before turning on table services for External Tables, you need to configure concurrency control for the tables.
- Navigate to your desired tables and click ‘Concurrency Control’ from the 3-dot table menu. (This option is unavailable if a lock provider is not configured.)
- Click 'Enable Concurrency'.
Reminder: you can configure these settings in bulk with the Onehouse APIs.
Step 4: Turn on Desired Table Services
Now that you have properly configured your external writers and your Onehouse External Tables, you can turn on Onehouse Table Services. Follow the instructions in the Table Services documentation to configure and monitor your table services.