Manage Tables
Overview
Tables are generated by the Stream Captures you create in Onehouse.
Onehouse stores your tables in the structure: Lake > Database > Table. You can browse your tables from the Data tab in Onehouse. Click into a table to see statistics and configurations.
Table details
At the top of the table page, you will see details about your table.
- DFS path: The path of the table in cloud storage. To learn more, follow the docs on discovering data in storage.
- Partition Key: The fields used for partitioning your table. Partitions help you organize data more efficiently by grouping records together based on a specific partition field.
- Type: Onehouse tables can be type 'MERGE_ON_READ' or 'COPY_ON_WRITE', based on the Apache Hudi table types. Onehouse automatically creates most tables as 'MERGE_ON_READ' for optimal performance. Check out this blog to learn more about the table types.
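Because Onehouse tables are standard Apache Hudi tables, you can also read one directly from its DFS path with an engine such as Spark. A minimal PySpark sketch; the bucket and table path below are placeholders, not real Onehouse paths:

```python
from pyspark.sql import SparkSession

# Requires a Spark session with the Apache Hudi bundle on the classpath.
spark = (
    SparkSession.builder
    .appName("read-onehouse-table")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Hypothetical DFS path, copied from the table details page.
table_path = "s3://my-lake-bucket/my_lake/my_database/my_table"

# Snapshot read of the Hudi table; for MERGE_ON_READ tables this merges
# base files and log files to return the latest value of each record.
df = spark.read.format("hudi").load(table_path)
df.printSchema()
```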
Observability
The Overview tab on the table page shows charts and statistics about your table.
Optimizations
Onehouse automatically optimizes your table performance for reading and writing data. You may further configure these optimizations.
Clustering
Clustering the table helps improve performance for reading data. Learn more about Clustering in the Hudi docs.
You can set the following Clustering configurations:
- Keys: Specify fields on which to cluster the data.
- Layout strategy: Select the strategy to use for clustering. By default, we recommend choosing Linear.
- Frequency: Specify how frequently to run clustering. By default, we recommend clustering data every 4 commits. Keep in mind that clustering more frequently will use more compute.
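For reference, these settings correspond roughly to Apache Hudi's inline clustering configuration; Onehouse manages the actual values for you. A minimal sketch of the open-source equivalents, assuming Spark writes and placeholder column names:

```python
# Hypothetical Hudi writer options mirroring the Clustering settings above.
clustering_opts = {
    # Run clustering in-line with the writer rather than as a separate job.
    "hoodie.clustering.inline": "true",
    # Frequency: trigger clustering every 4 commits.
    "hoodie.clustering.inline.max.commits": "4",
    # Keys: sort/cluster the data on these fields (placeholder column names).
    "hoodie.clustering.plan.strategy.sort.columns": "customer_id,order_date",
}

# These would be passed alongside the usual Hudi write options, e.g.:
# df.write.format("hudi").options(**clustering_opts).mode("append").save(table_path)
```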
Usage Notes
- The Clustering service optimizes target file sizes and data sorting across files to ensure better query performance.
- Enabling Clustering on a table with existing data will only cluster new data written after Clustering is enabled.
- Clustering runs in-line, meaning that it will run serially after a configured number of commits.
- Clustering runs incrementally on unclustered data to optimize write performance. The layout optimization (i.e. sorting) is applied to the data within each clustering operation. In each run, clustering rewrites at most a set number of bytes to avoid long-running jobs, and keeps a backlog of the unclustered files that will be clustered in subsequent rounds.
- When a file group is Clustered, it will no longer be part of the active table, as Clustering will rewrite the data as a new file group. The original (non-active) file group will remain in storage until it is deleted by the Cleaner service.
Compaction
Compaction is available and required for MERGE_ON_READ tables (see table details above). Compaction periodically merges incoming changes (written as log files) into the table's base files, which keeps writes to the table efficient. Learn more about Compaction in the Hudi docs.
You can set the following Compaction configurations:
- Frequency: Specify how frequently to run compaction. By default, we recommend compacting data every 4 commits. Keep in mind that compacting data more frequently will use more compute.
- Bytes per compaction: Specify how many bytes to compact in each compaction. By default, we recommend using 512GB (512,000MB) per compaction.
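These settings correspond closely to Apache Hudi's inline compaction configuration for MERGE_ON_READ tables. A sketch of the equivalent open-source writer options, assuming you were managing the table yourself rather than through Onehouse:

```python
# Hypothetical Hudi writer options mirroring the Compaction settings above.
compaction_opts = {
    # Write as a MERGE_ON_READ table so updates land in log files first.
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # Run compaction in-line with the writer.
    "hoodie.compact.inline": "true",
    # Frequency: compact after every 4 delta commits.
    "hoodie.compact.inline.max.delta.commits": "4",
    # Bytes per compaction: cap the IO spent per compaction run (value in MB).
    "hoodie.compaction.target.io": "512000",
}
```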
Usage Notes
- Compaction runs in-line, meaning that it will run serially after a configured number of commits.
Cleaning
Cleaning helps you reclaim space and keep your storage costs in check by removing old versions of data committed to the table. Onehouse retains these older versions for a period to support time travel queries.
You can set the following Cleaning configurations:
- Frequency: Specify how frequently to run cleaning. By default, we recommend cleaning data every 4 commits.
- Time travel retention: Specify the number of days to retain commit history. Increasing this number will allow you to time travel further back, but will use more storage.
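Under the hood, these settings map to Apache Hudi's cleaner configuration. A sketch of the open-source equivalents, assuming a time-based retention policy (exact config keys and defaults vary by Hudi version, and the values below are placeholders):

```python
# Hypothetical Hudi writer options mirroring the Cleaning settings above.
cleaning_opts = {
    # Frequency: attempt a clean every 4 commits.
    "hoodie.clean.max.commits": "4",
    # Time travel retention: keep older file versions for ~7 days (168 hours)
    # so queries can still time travel within that window.
    "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",
    "hoodie.cleaner.hours.retained": "168",
}
```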
Usage Notes
- Cleaning runs in-line, meaning that it will run serially after a configured number of commits.
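Since the Cleaner bounds how far back you can query, a time travel read only succeeds for instants that have not yet been cleaned. A minimal PySpark sketch of such a read; the path and timestamp are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Time travel read: query the table as of a past instant that is still
# within the retention window. Path and timestamp are placeholders.
historical_df = (
    spark.read.format("hudi")
    .option("as.of.instant", "2024-01-15 00:00:00")
    .load("s3://my-lake-bucket/my_lake/my_database/my_table")
)
historical_df.show(10)
```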
Operations
Savepoint
Savepoints allow you to retain a point-in-time snapshot of the table.
You can trigger Savepoints for a previous commit, or schedule automatic Savepoints on a daily, weekly, or monthly interval.
Set the Savepoint expiry to specify how long each Savepoint is retained. A longer expiry gives you more flexibility to restore the table, but increases storage costs.
Restore
After creating Savepoints, you can restore the table to a previous Savepoint. This operation will delete all data committed after the Savepoint, and cannot be undone.
You should pause all Stream Captures for the table while performing a restore, since writing to the table will fail during the restore process. Reading from the table may also fail if you attempt to read data committed after the Savepoint.
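In Onehouse, Savepoint and Restore are triggered from the table page. For context, open-source Apache Hudi exposes similar operations as Spark SQL call procedures. A rough sketch, assuming the Hudi Spark SQL extensions are enabled (procedure and parameter names vary across Hudi versions, and the table name and instant time below are placeholders):

```python
from pyspark.sql import SparkSession

# The session needs Hudi's Spark SQL extensions for `call` procedures to work.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)

# Create a savepoint at a past commit (placeholder instant time).
spark.sql(
    "call create_savepoint(table => 'my_database.my_table', "
    "commit_time => '20240115093000000')"
)

# Restore the table to that savepoint. As noted above, this removes all data
# committed after the savepoint and cannot be undone.
spark.sql(
    "call rollback_to_savepoint(table => 'my_database.my_table', "
    "instant_time => '20240115093000000')"
)
```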
Data Quality Quarantine
If there is an issue with a row of data (e.g., an incompatible schema), we will write that row to a separate quarantine table so your Stream Capture can continue uninterrupted.
The configurations for Data Quality Quarantine are managed within the Stream Capture that writes to the table.