LakeCache Clusters

LakeCache Clusters provide a low-latency query engine for interactive SQL queries on your tables. LakeCache is optimized for fast read queries, and automatically scales compute resources based on query load.

Beta Feature

This feature is in beta and might not yet be available in your project. AWS projects are currently supported; GCP support is coming soon.

Key Features

  • Low-latency SQL queries: Execute interactive SQL queries in seconds
  • Automatic scaling: Scales compute resources by adding or removing entire engines based on query load
  • Concurrency: Supports concurrent workloads by scaling out to multiple engines
  • OAuth authentication: Secure access using OAuth with support for Okta, Microsoft Entra, Google, and other identity providers
  • Catalog integration: Works with AWS Glue and Onehouse catalogs
  • Access control integration: Integrates with AWS Lake Formation for table-level access control (available only when using AWS Glue catalog)
  • BI tool integration: Connect using Postgres JDBC drivers with tools like DBeaver, Tableau, PowerBI, and Looker

Create a LakeCache Cluster

When creating a LakeCache Cluster, you'll configure the following:

Engine Size

Select an engine size that determines the compute resources for each engine. Engine sizes are available in t-shirt sizes:

  • 2x-small
  • x-small
  • small
  • medium
  • large
  • x-large
  • 2x-large
  • 3x-large
  • 4x-large

Each engine has a fixed structure: one driver and a number of workers determined by the engine size. The engine size also maps to specific instance types for the driver and workers. Neither the instance types nor the number of workers per engine can be modified independently of the engine size.

How Scaling Works

LakeCache scales at the engine level, not at the worker level. When scaling, entire engines (each with a fixed driver and fixed number of workers) are added or removed based on query load.

AWS Instance Configuration

| Engine Size | Drivers | Workers | Driver Instance | Worker Instance | Cache Size per Engine |
|---|---|---|---|---|---|
| 2x-small | 1 | 1 | r8gd.xlarge | r8gd.xlarge | 236 GB |
| x-small | 1 | 1 | r8gd.xlarge | r8gd.2xlarge | 355 GB |
| small | 1 | 1 | r8gd.2xlarge | r8gd.4xlarge | 712 GB |
| medium | 1 | 1 | r8gd.4xlarge | r8gd.8xlarge | 1,425 GB |
| large | 1 | 1 | r8gd.4xlarge | r8gd.16xlarge | 2,375 GB |
| x-large | 1 | 2 | r8gd.8xlarge | r8gd.16xlarge | 4,750 GB |
| 2x-large | 1 | 4 | r8gd.8xlarge | r8gd.16xlarge | 8,550 GB |
| 3x-large | 1 | 8 | r8gd.8xlarge | r8gd.16xlarge | 16,150 GB |
| 4x-large | 1 | 16 | r8gd.8xlarge | r8gd.16xlarge | 31,350 GB |

Cache Size

AWS R8GD instances include local NVMe SSD storage. LakeCache uses half of the total NVMe storage for caching Parquet files. The cache size shown is the total cache available per engine (driver + all workers combined).
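As a sanity check, the "Cache Size per Engine" column can be derived from the instance types. The per-instance NVMe capacities below are assumptions based on comparable AWS memory-optimized "d" instance families; verify them against current AWS specifications:

```python
# Sketch: derive "Cache Size per Engine" from local NVMe capacity.
# Assumption: NVMe per instance (GB); these figures are estimates
# based on comparable AWS instance families, not official specs.
NVME_GB = {
    "r8gd.xlarge": 237,
    "r8gd.2xlarge": 474,
    "r8gd.4xlarge": 950,
    "r8gd.8xlarge": 1900,
    "r8gd.16xlarge": 3800,
}

def cache_per_engine(driver: str, worker: str, workers: int) -> int:
    """LakeCache uses half of the total NVMe (driver + all workers)."""
    total = NVME_GB[driver] + workers * NVME_GB[worker]
    return total // 2

# "medium": 1 driver (r8gd.4xlarge) + 1 worker (r8gd.8xlarge)
print(cache_per_engine("r8gd.4xlarge", "r8gd.8xlarge", 1))   # 1425
# "2x-large": 1 driver (r8gd.8xlarge) + 4 workers (r8gd.16xlarge)
print(cache_per_engine("r8gd.8xlarge", "r8gd.16xlarge", 4))  # 8550
```

Both results match the table above, confirming that the cache is half of the combined driver-plus-worker NVMe storage.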

GCP Instance Configuration (Coming Soon)

GCP support for LakeCache is coming soon. If you'd like early access, please reach out to the Onehouse team.

| Engine Size | Drivers | Workers | Driver Instance | Worker Instance | Cache Size per Engine |
|---|---|---|---|---|---|
| 2x-small | 1 | 1 | n2d-standard-4 | n2d-standard-4 | Coming Soon |
| x-small | 1 | 1 | n2d-standard-4 | n2d-standard-8 | Coming Soon |
| small | 1 | 1 | n2d-standard-8 | n2d-standard-16 | Coming Soon |
| medium | 1 | 1 | n2d-standard-16 | n2d-standard-32 | Coming Soon |
| large | 1 | 1 | n2d-standard-16 | n2d-standard-64 | Coming Soon |
| x-large | 1 | 2 | n2d-standard-32 | n2d-standard-64 | Coming Soon |
| 2x-large | 1 | 4 | n2d-standard-32 | n2d-standard-64 | Coming Soon |
| 3x-large | 1 | 8 | n2d-standard-32 | n2d-standard-64 | Coming Soon |
| 4x-large | 1 | 16 | n2d-standard-32 | n2d-standard-64 | Coming Soon |

Autoscaling Configuration

  • Min Engines: Minimum number of engines that will always be running
  • Max Engines: Maximum number of engines the Cluster can scale to
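Since scaling happens at the engine level, the Cluster always runs a whole number of engines between these two bounds. A hypothetical sketch of how a scaler might pick a target (the demand heuristic is purely illustrative, not LakeCache's actual algorithm):

```python
# Illustrative sketch of engine-level autoscaling: whole engines are
# added or removed, clamped to the configured Min/Max Engines.
# The queued-queries heuristic is a made-up example, not LakeCache's
# actual scaling algorithm.
def target_engines(queued_queries: int, queries_per_engine: int,
                   min_engines: int, max_engines: int) -> int:
    # Whole engines only: round demand up, then clamp to [min, max].
    demand = -(-queued_queries // queries_per_engine)  # ceiling division
    return max(min_engines, min(demand, max_engines))

print(target_engines(0, 10, min_engines=1, max_engines=4))   # 1 (never below min)
print(target_engines(25, 10, min_engines=1, max_engines=4))  # 3 (ceil(25/10))
print(target_engines(90, 10, min_engines=1, max_engines=4))  # 4 (capped at max)
```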

Catalog Configuration

Select the catalog containing your tables:

  • Onehouse Managed: Use the built-in Onehouse catalog (default)
  • AWS Glue: Connect to an AWS Glue catalog

Permissions Enforcement

Configure how access control is enforced:

  • None: No access control enforcement (default)
  • Lake Formation: Use AWS Lake Formation for table-level access control (available only when using the AWS Glue catalog). Row-level and column-level permissions are not yet supported.

Access Control Requirements

Lake Formation access control is only available when using the AWS Glue catalog. When using the Onehouse catalog, access control enforcement is not yet supported.

Connect to a LakeCache Cluster

LakeCache Clusters expose a JDBC endpoint that you can connect to using standard PostgreSQL drivers. The Cluster uses PostgreSQL 18 with OAuth authentication support.

Networking Prerequisites

Connect from within the VPC

By default, LakeCache Clusters only accept traffic from within your VPC. You can submit queries to a LakeCache Cluster endpoint from clients within your VPC.

To submit queries from clients outside the VPC, such as your local machine, follow these steps to connect through a VPN or bastion host.

Connect cloud tools

By default, LakeCache Clusters do not accept traffic from outside your VPC. For cloud tools that support SSH tunneling, follow these steps to connect through a bastion host.

If you must connect through a cloud tool outside the VPC that does not support SSH tunneling, such as dbt Cloud, please contact Onehouse support.

Get Connection Details

  1. In the Onehouse console, open the Clusters page and click into your LakeCache Cluster.
  2. After the Cluster is provisioned (which may take a few minutes), you will see a LakeCache Cluster endpoint in the Onehouse console. Here you will see:
    • Connection host
    • Username
    • Password
      • If the Cluster's catalog is Glue and Lake Formation enforcement is enabled, you will authenticate via your Identity Provider to generate your JDBC connection password. Follow the link from the Onehouse console to authenticate.

Connect with JDBC

Use a standard PostgreSQL JDBC driver to connect. You'll need the following connection parameters from your Cluster's connection details:

| Parameter | Description |
|---|---|
| Host | The connection host from your Cluster details |
| Port | 5432 |
| Database | postgres (or your specific database name) |
| Username | The username from your Cluster details |
| Password | The password from your Cluster details |
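
Putting these parameters together, a minimal helper that assembles both a JDBC URL and a libpq-style connection string (the host and username below are placeholders; substitute the values from your Cluster details):

```python
# Assemble connection strings from the parameters above.
# "your-cluster-host" and "db_user" are placeholders -- use the
# values shown in your Cluster's connection details.
params = {
    "host": "your-cluster-host",
    "port": 5432,
    "dbname": "postgres",
    "user": "db_user",
}

jdbc_url = f"jdbc:postgresql://{params['host']}:{params['port']}/{params['dbname']}"
libpq_dsn = " ".join(f"{k}={v}" for k, v in params.items())

print(jdbc_url)   # jdbc:postgresql://your-cluster-host:5432/postgres
print(libpq_dsn)  # host=your-cluster-host port=5432 dbname=postgres user=db_user
```

The first string is what JDBC-based tools such as DBeaver or Tableau expect; the second is the keyword/value form accepted by psql and other libpq clients.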

Connecting with Lake Formation

When using Lake Formation for access control, you can optionally specify an IAM role to assume for querying. If you don't specify a role, the first IAM role from your authentication token is used by default.

To specify a different IAM role, add it to the connection using the options parameter:

```
jdbc:postgresql://host:5432/postgres?options=-c onehouse.lake_cache_iam_role=arn:aws:iam::account:role/role-name
```

Example with DBeaver:

  1. Open your connection settings
  2. Go to the Driver Properties tab
  3. Add a new property with name options
  4. Set the value to -c onehouse.lake_cache_iam_role=arn:aws:iam::account:role/role-name
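
Because the options value contains a space and characters such as : and =, some clients require the query-string portion of the JDBC URL to be percent-encoded. A sketch using Python's standard library (the role ARN is a placeholder):

```python
from urllib.parse import quote

# Percent-encode the "options" value for use in a JDBC URL query string.
# The IAM role ARN below is a placeholder -- substitute your own role.
role_arn = "arn:aws:iam::account:role/role-name"
options = f"-c onehouse.lake_cache_iam_role={role_arn}"

url = "jdbc:postgresql://host:5432/postgres?options=" + quote(options, safe="")
print(url)  # space becomes %20, "=" becomes %3D, ":" becomes %3A
```

When setting the property through a driver-properties UI (as in the DBeaver steps above), pass the raw, unencoded value instead; the driver handles it for you.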

Connect with psql

Example psql connection:

```shell
psql "host=host port=5432 dbname=postgres options='-c onehouse.lake_cache_iam_role=arn:aws:iam::account:role/role-name' user=db_user oauth_issuer=https://your-idp.com oauth_client_id=your_client_id oauth_scope=openid profile email"
```

When connecting, you'll receive a device token. Visit the activation URL provided and enter the code to complete authentication.

Query Execution

Spark SQL Syntax

While LakeCache Clusters use PostgreSQL JDBC drivers for connectivity, queries are executed using Spark SQL under the hood. You should use Spark SQL syntax when writing queries, even if your client shows warnings about PostgreSQL syntax.

Once connected, you can submit SQL queries directly to the LakeCache Cluster. The Cluster will:

  1. Parse and validate the query
  2. Check permissions using the configured access control provider (if enabled)
  3. Route the query to an available engine
  4. Execute the query with caching optimizations
  5. Return results via the JDBC connection

Query Caching

LakeCache implements multiple layers of caching to improve query performance:

  • File caching: Frequently accessed Parquet files are cached in memory and on SSD
  • Query result caching: Deterministic query results are cached at the driver and worker levels

Query result caching only applies to deterministic queries (queries without non-deterministic functions like rand() or timestamp-related functions) and only when the underlying tables have not been updated since the result was cached.
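
This caching rule can be illustrated with a toy result cache keyed on the query text and the table's last-update version (purely illustrative; this is not LakeCache's internal implementation):

```python
# Toy illustration of the result-caching rule: a cached result is
# reused only for deterministic queries and only while the underlying
# table version is unchanged. Not LakeCache's actual implementation.
cache: dict = {}

def run_query(sql: str, table_version: int, execute) -> list:
    deterministic = "rand()" not in sql  # crude stand-in for real analysis
    key = (sql, table_version)
    if deterministic and key in cache:
        return cache[key]       # hit: same query text, same table version
    result = execute()          # miss: actually run the query
    if deterministic:
        cache[key] = result
    return result

calls = []
exec_fn = lambda: calls.append(1) or [42]  # records each real execution

run_query("SELECT count(*) FROM t", table_version=1, execute=exec_fn)
run_query("SELECT count(*) FROM t", table_version=1, execute=exec_fn)  # cache hit
run_query("SELECT count(*) FROM t", table_version=2, execute=exec_fn)  # table updated: miss
print(len(calls))  # 2 real executions, not 3
```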

Monitoring

You can view OCU Usage for the Cluster in the Onehouse console.

Coming soon

In an upcoming release, you will be able to monitor your LakeCache Cluster through:

  • Query execution logs: View query history and execution details
  • Performance metrics: Track query latency and throughput
  • Cluster metrics: Monitor CPU utilization and autoscaling behavior

Contact us if any of these logs, metrics, or other observability features would be useful for you.

Limitations

  • Currently supports read-only queries. Write operations are not supported.
  • Access control for Onehouse catalog is not yet supported. Use AWS Glue with Lake Formation for access control.
  • Lake Formation access control supports table-level permissions only. Row-level and column-level permissions are not yet supported.
  • Tables must be registered in the selected catalog to be queryable.
  • Auto-stop is not yet supported. LakeCache Clusters must be stopped manually and cannot yet be configured to automatically stop after a period of inactivity.