Table Services
Overview
Onehouse Table Services automatically optimize your tables for read and write performance. This document describes how to configure and monitor the services that run on your tables.
When you create a flow, cleaning and compaction table services are automatically created for the table that Onehouse manages. You cannot turn off cleaning or compaction for Onehouse-managed tables, but you can modify their settings.
If you register External Tables and follow the documented prerequisites, you can also configure Onehouse table services to run on tables that you write with your own external pipelines.
The table services described on this page refer to operations on Apache Hudi tables. If you also use Iceberg or Delta Lake, the optimizations applied to the tables benefit all formats: optimizing file sizes and partitions, implementing sort keys, cleaning up versioned files, and so on.
Table Services Overview Page
When you open the Onehouse Console, you can navigate to the 'Table Services' page, which shows the status of all table services running across all tables in your project. Use the search bar or the filters to quickly find the services you are interested in monitoring.
On this page you can see the current status of the table service:
- Active = The table service is ready to run
- Running = The table service is currently running
- Failed = The table service failed the most recent attempt to execute
- Paused = The table service is paused and not actively running
Click on any table service to load the Table Services Details Page.
Table Services Details Page
On the table services details page you can see fine-grained detail about the service. The Status tab shows basic statistics and a chart showing the table service performance over time.
The Setup tab shows you the configurations set for this table service. You can modify the settings by clicking the "Edit" button on the far right.
The History tab shows you the history of all actions performed by this table service. You can drill into more details with the Details link on the far right.
Trigger Mode
For all the table services, we offer Automatic and On Demand trigger modes.
- Automatic mode will run the service (Clustering, Compaction and Cleaning) automatically based on the configured commit frequency. For MetaSync, the service will run automatically for each commit in the table.
- On Demand mode allows you to manually trigger the service using the RUN SERVICE IN TABLE API or the Actions -> Trigger button on the table service page.
Table Service Offerings
Details about each table service and their settings are documented below:
Clustering
Clustering helps improve query performance for your table by optimizing file sizes and sorting data. Learn more about clustering in the Apache Hudi docs.
You can set the following clustering configurations:
- Keys: Specify fields on which to cluster the data.
- Layout strategy: Select the strategy to use for clustering. By default, we recommend choosing Linear. For details on different layout strategies, you can see the Hudi layout strategies blog.
- Frequency: Specify how frequently to run clustering. By default, we recommend clustering data every 4 commits. Keep in mind that clustering more frequently will use more compute.
- Sorting: Enable sorting to co-locate data based on values of the table's sort field(s). When sorting is disabled, clustering will only perform file-sizing.
- Note: Sorting cannot be disabled for tables created by Flows in mutable write mode; those tables already have optimally-sized files, so sorting is the only remaining clustering work.
- Bootstrap: Enable bootstrap to consider all existing and new data in the table for clustering. When bootstrap is disabled, only new data will be clustered.
- The bootstrap configuration only applies when the clustering table service is first created — it sets the initial checkpoint for the clustering plan. After creation, this setting has no effect (for both automatic and on-demand modes).
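For External Tables written by your own Apache Hudi pipelines, roughly equivalent behavior can be expressed with Hudi's clustering writer options. Below is a hedged sketch, not the managed service's actual internals: the option names come from the Apache Hudi configuration docs (verify them against your Hudi version), and the column names are placeholders.

```python
# Sketch of Apache Hudi writer options that approximate the clustering
# settings described above, for self-managed pipelines.
# Column names ("city", "ts") are placeholders.
clustering_opts = {
    # Frequency: run clustering inline with the writer, every 4 commits
    # (mirrors the recommended default above).
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    # Keys: fields on which to cluster/sort the data.
    "hoodie.clustering.plan.strategy.sort.columns": "city,ts",
    # Layout strategy: "linear" is the recommended default;
    # "z-order" and "hilbert" are the multi-dimensional alternatives.
    "hoodie.layout.optimize.strategy": "linear",
}

for key, value in sorted(clustering_opts.items()):
    print(f"{key}={value}")
```

Passing a dictionary like this to a Hudi writer enables inline clustering on your own pipeline; the Onehouse-managed service configures the equivalent behavior for you through the UI settings above.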
Usage Notes
- Clustering Batches: To prevent excessively long runs and out-of-memory errors, clustering plans are divided into batches based on data size and number of commits. Each clustering run processes one batch and maintains a backlog of remaining data to be clustered.
- Large datasets may require multiple runs to cluster all batches. Each run processes all or part of the remaining backlog. In Automatic mode, subsequent runs trigger automatically based on your configured frequency. In On-Demand mode, you must manually trigger each run.
- Data is sorted only within each batch, not across the entire table. This means clustering does not guarantee global sort order across your table.
- Accelerating Clustering: To speed up large clustering runs, increase the Max OCU for the Cluster or dedicate a separate Cluster to the table.
- Cleaning Process: Clustering rewrites data into new file groups, making the previous files inactive in the table. These inactive files remain in storage until removed by the cleaning service.
- Editing Sort Configurations: Editing the sort keys or sort strategy with bootstrap enabled will re-bootstrap the table. If bootstrap is not enabled, the new sort configurations will be applied only to new records.
- Re-clustering the Table: To re-cluster the entire table, edit the sort configurations with bootstrap enabled. Alternatively, you can delete and re-create the clustering table service with bootstrap enabled, but this will only re-cluster the full table if the previous clustering commit has been cleaned from the table by the cleaning service; otherwise, it will resume from the previous clustering checkpoint.
Compaction
Compaction is available and required for MERGE-ON-READ tables. Compaction merges incoming change data (delta log files) into the table's base files, which keeps writes to the table efficient. Learn more about Compaction in the Hudi docs.
You can set the following Compaction configurations:
- Frequency: Specify how frequently to run compaction. By default, we recommend compacting data every 4 commits. Keep in mind that compacting data more frequently will use more compute.
- Bytes per compaction: Specify how many bytes to compact in each compaction. By default, we recommend 512 GB (512,000 MB) per compaction.
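For self-managed MERGE-ON-READ pipelines, the two settings above map roughly onto Hudi's compaction options. A hedged sketch (option names from the Apache Hudi configuration docs; verify against your Hudi version):

```python
# Sketch of Apache Hudi options approximating the compaction settings above.
target_gb = 512  # "Bytes per compaction" recommended default from above

compaction_opts = {
    # Frequency: compact after every 4 delta commits.
    "hoodie.compact.inline.max.delta.commits": "4",
    # Amount of I/O (in MB) a single compaction may spend:
    # 512 GB expressed in MB, matching the recommended default.
    "hoodie.compaction.target.io": str(target_gb * 1000),
}

print(compaction_opts["hoodie.compaction.target.io"])  # 512000
```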
Cleaning
Cleaning helps you reclaim space and keep your storage costs in check by removing old data committed to the table. Onehouse retains old data committed to the table to support time travel queries.
You can set the following Cleaning configurations:
- Frequency: Specify how frequently to run cleaning. By default, we recommend cleaning data every 4 commits.
- Time travel retention: Specify the number of days to retain commit history. Increasing this number will allow you to time travel further back, but will use more storage.
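In self-managed Hudi pipelines, a day-based time travel retention window corresponds to the cleaner's hour-based retention policy. A hedged sketch (option names from the Apache Hudi configuration docs; the 7-day window is a hypothetical example, not a documented default):

```python
# Sketch: Hudi cleaner options approximating the cleaning settings above.
time_travel_retention_days = 7  # hypothetical retention window

cleaning_opts = {
    # Retain file versions by wall-clock time so time travel queries
    # work for the configured number of days.
    "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",
    "hoodie.cleaner.hours.retained": str(time_travel_retention_days * 24),
}

print(cleaning_opts["hoodie.cleaner.hours.retained"])  # 168
```

Increasing the retention window extends how far back you can time travel, at the cost of keeping more file versions in storage.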
Metadata Sync
Metadata Sync allows you to sync your table to one or more catalogs connected to Onehouse. This feature can be used to:
- Register a table to any of the listed catalogs that support metadata sync.
- Translate an Apache Hudi table to an Apache Iceberg or a Delta Lake table.
- Additional features like adding custom database for Iceberg tables in Glue and customizing the table name suffix for Iceberg tables are available through the ALTER SERVICE IN TABLE API.
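For pipelines you run yourself, Hudi's built-in catalog sync ("hive sync") plays a similar role to Metadata Sync. A hedged sketch with Hive Metastore-style options (names from the Apache Hudi docs; the database and table names are placeholders):

```python
# Sketch: Hudi catalog sync options that register a table in a catalog,
# analogous to what Metadata Sync does for Onehouse-managed tables.
# Database and table names are placeholders.
meta_sync_opts = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",  # sync via Hive Metastore
    "hoodie.datasource.hive_sync.database": "analytics",
    "hoodie.datasource.hive_sync.table": "orders",
}

for key, value in sorted(meta_sync_opts.items()):
    print(f"{key}={value}")
```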