Table Services
Overview
Onehouse Table Services automatically optimize your tables for read and write performance. This document describes how to configure and monitor the services that run on your tables.
When you create a Stream Capture, cleaning and compaction table services are automatically created for the table Onehouse is managing. You cannot turn off cleaning and compaction for Onehouse managed tables, but you can modify their settings.
If you register External Tables and follow the documented prerequisites, you can also configure Onehouse table services to run on tables that you write with your own external pipelines.
Table Services Overview Page
When you open the Onehouse Console you can navigate to the 'Table Services' page which will show you the status of all table services running across all of your tables in your project. Use the search bar or the filters to quickly find the services you are interested in monitoring.
On this page you can see the current status of the table service:
- Active = The table service is ready to run
- Running = The table service is currently running
- Failed = The table service failed the most recent attempt to execute
- Paused = The table service is paused and not actively running
*Note: these statuses apply only to table services executed standalone. Services that are part of Onehouse Stream Captures will execute and display on the "Stream Captures" page.
To see the full history of the table service execution, visit the 'History Tab' on the Table Details page.
To create, edit, or disable a table service, visit the 'Optimizations Tab' on the Table Details page.
Trigger Mode
For all table services, we offer Automatic and On Demand trigger modes.
- Automatic mode will run the service (Clustering, Compaction and Cleaning) automatically based on the configured commit frequency. For MetaSync, the service will run automatically for each commit in the table.
- On Demand mode will allow you to manually trigger the service using the RUN SERVICE IN TABLE API or the Actions -> Trigger button on the table service page, as sketched below.
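On Demand triggering can also be scripted against the RUN SERVICE IN TABLE API. The sketch below is hypothetical: the base URL, request path, and auth header are placeholders, so consult the Onehouse API reference for the actual contract.

```python
# Hypothetical sketch of scripting an On Demand trigger over HTTPS.
# The base URL, path, and header below are placeholders, not the documented
# RUN SERVICE IN TABLE contract; check the Onehouse API reference.
import requests

API_BASE = "https://api.onehouse.ai"   # placeholder base URL
API_KEY = "<your-api-key>"             # placeholder credential

response = requests.post(
    f"{API_BASE}/v1/tables/<table-id>/services/clustering/run",  # placeholder path
    headers={"Authorization": f"Bearer {API_KEY}"},              # placeholder auth scheme
    timeout=30,
)
response.raise_for_status()
print(response.json())
```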
Details about each table service and their settings are documented below:
Clustering
Clustering the table helps improve performance for reading data. Learn more about Clustering in the Hudi docs.
You can set the following Clustering configurations:
- Keys: Specify fields on which to cluster the data.
- Layout strategy: Select the strategy to use for clustering. By default, we recommend choosing Linear. For details on different layout strategies, you can see the Hudi layout strategies blog.
- Frequency: Specify how frequently to run clustering. By default, we recommend clustering data every 4 commits. Keep in mind that clustering more frequently will use more compute.
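If you run your own Hudi writers on External Tables, these settings map roughly to open-source Apache Hudi write options. The following is a minimal PySpark sketch, not Onehouse's implementation; it assumes the Hudi Spark bundle is on the classpath, and the table name, paths, key fields, and sort columns are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-inline-clustering-sketch").getOrCreate()

hudi_options = {
    "hoodie.table.name": "example_table",                      # illustrative table name
    "hoodie.datasource.write.recordkey.field": "id",           # illustrative record key field
    "hoodie.datasource.write.precombine.field": "ts",          # illustrative ordering field
    "hoodie.clustering.inline": "true",                        # run clustering as part of writes
    "hoodie.clustering.inline.max.commits": "4",               # frequency: cluster every 4 commits
    "hoodie.clustering.plan.strategy.sort.columns": "city,ts", # clustering keys (illustrative)
    "hoodie.layout.optimize.strategy": "linear",               # linear layout strategy
}

(spark.read.parquet("s3://example-bucket/incoming/")           # illustrative source data
    .write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/tables/example_table"))
```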
Usage Notes
- The Clustering service optimizes target file sizes and data sorting across files to ensure better query performance.
- Enabling Clustering on a table with existing data will only cluster new data written after Clustering is enabled.
- Clustering runs incrementally on unclustered data to optimize write performance. The layout optimization (i.e. sorting) is applied to the data within each clustering operation. In each run, clustering rewrites up to a maximum number of bytes to avoid long-running jobs and keeps a backlog of the unclustered files that will be clustered in subsequent rounds.
- When a file group is Clustered, it will no longer be part of the active table, as Clustering will rewrite the data as a new file group. The original (non-active) file group will remain in storage until it is deleted by the Cleaner service.
Compaction
Compaction is available and required for MERGE-ON-READ tables (see table details above). Compaction merges incoming data from the write-optimized log files into the table's base files, which enables efficient writes to the table. Learn more about Compaction in the Hudi docs.
You can set the following Compaction configurations:
- Frequency: Specify how frequently to run compaction. By default, we recommend compacting data every 4 commits. Keep in mind that compacting data more frequently will use more compute.
- Bytes per compaction: Specify how many bytes to compact in each compaction. By default, we recommend using 512 GB (512,000 MB) per compaction.
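For reference, the same knobs appear as open-source Hudi compaction options. This is a hedged sketch only; the values mirror the defaults described above and would be passed via .options(**compaction_options) on a Hudi write, as in the clustering sketch earlier.

```python
# Comparable open-source Hudi compaction options for a MERGE_ON_READ table.
compaction_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # compaction applies to MOR tables
    "hoodie.compact.inline": "true",                        # compact as part of the write path
    "hoodie.compact.inline.max.delta.commits": "4",         # frequency: every 4 delta commits
    "hoodie.compaction.target.io": "512000",                # target IO per compaction, in MB (~512 GB)
}
```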
Cleaning
Cleaning helps you reclaim space and keep your storage costs in check by removing old versions of data committed to the table. Onehouse retains this commit history to support time travel queries.
You can set the following Cleaning configurations:
- Frequency: Specify how frequently to run cleaning. By default, we recommend cleaning data every 4 commits.
- Time travel retention: Specify the number of days to retain commit history. Increasing this number will allow you to time travel further back, but will use more storage.
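Comparable behavior in open-source Hudi is controlled by the cleaner options below. This is an illustrative sketch; the 7-day retention value is an example, not an Onehouse default, and the options would be passed on a Hudi write like the earlier sketches.

```python
# Comparable open-source Hudi cleaner options (illustrative values).
cleaning_options = {
    "hoodie.clean.automatic": "true",                 # run the cleaner alongside writes
    "hoodie.clean.max.commits": "4",                  # frequency: trigger cleaning every 4 commits
    "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",  # retain history by time for time travel
    "hoodie.cleaner.hours.retained": "168",           # keep 7 days (168 hours) of commit history
}
```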
Metadata Sync
Metadata Sync allows you to sync your table to one or more catalogs connected to Onehouse. This feature can be used to:
- Register a table to any of the listed catalogs
- Translate an Apache Hudi table to an Apache Iceberg or a Delta Lake table.
- Additional features, such as adding a custom database for Iceberg tables in Glue and customizing the table name suffix for Iceberg tables, are available through the ALTER SERVICE IN TABLE API.
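For context, registering a table to a catalog looks like the following in open-source Hudi terms (Hive Metastore sync shown). Onehouse performs the equivalent sync automatically for the catalogs you connect; the database and table names here are illustrative.

```python
# What catalog registration looks like with open-source Hudi meta sync options.
meta_sync_options = {
    "hoodie.datasource.hive_sync.enable": "true",          # sync table metadata on each commit
    "hoodie.datasource.hive_sync.mode": "hms",             # talk to the Hive Metastore directly
    "hoodie.datasource.hive_sync.database": "analytics",   # target database (illustrative)
    "hoodie.datasource.hive_sync.table": "example_table",  # registered table name (illustrative)
}
```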