LakeCache Clusters

LakeCache Clusters provide a low-latency query engine for interactive SQL queries on your tables. LakeCache is optimized for fast read queries, and automatically scales compute resources based on query load.

Beta Feature

This feature is in beta and might not yet be available in your project. AWS projects are currently supported; GCP support is coming soon.

Key Features

  • Low-latency SQL queries: Execute interactive SQL queries in seconds
  • Automatic scaling: Scales compute resources by adding or removing entire engines based on query load
  • Concurrency: Supports concurrent workloads by scaling out to multiple engines
  • OAuth authentication: Secure access using OAuth with support for Okta, Microsoft Entra, Google, and other identity providers
  • Catalog integration: Works with AWS Glue and Onehouse catalogs
  • Access control integration: Integrates with AWS Lake Formation for table-level access control (available only when using AWS Glue catalog)
  • BI tool integration: Connect using Postgres JDBC drivers with tools like DBeaver, Tableau, PowerBI, and Looker

Create a LakeCache Cluster

When creating a LakeCache Cluster, you'll configure the following:

Engine Size

Select an engine size that determines the compute resources for each engine. Engine sizes are available in t-shirt sizes:

  • 2x-small
  • x-small
  • small
  • medium
  • large
  • x-large
  • 2x-large
  • 3x-large
  • 4x-large

Each engine has a fixed structure: one driver and a number of workers determined by the engine size. The engine size also maps to specific instance types for the driver and workers. Neither the instance types nor the number of workers per engine can be modified independently of the engine size.

How Scaling Works

LakeCache scales at the engine level, not at the worker level. When scaling, entire engines (each with a fixed driver and fixed number of workers) are added or removed based on query load.

AWS Instance Configuration

| Engine Size | Drivers | Workers | Driver Instance | Worker Instance | Cache Size per Engine |
|---|---|---|---|---|---|
| 2x-small | 1 | 1 | r8gd.xlarge | r8gd.xlarge | 236 GB |
| x-small | 1 | 1 | r8gd.xlarge | r8gd.2xlarge | 355 GB |
| small | 1 | 1 | r8gd.2xlarge | r8gd.4xlarge | 712 GB |
| medium | 1 | 1 | r8gd.4xlarge | r8gd.8xlarge | 1,425 GB |
| large | 1 | 1 | r8gd.4xlarge | r8gd.16xlarge | 2,375 GB |
| x-large | 1 | 2 | r8gd.8xlarge | r8gd.16xlarge | 4,750 GB |
| 2x-large | 1 | 4 | r8gd.8xlarge | r8gd.16xlarge | 8,550 GB |
| 3x-large | 1 | 8 | r8gd.8xlarge | r8gd.16xlarge | 16,150 GB |
| 4x-large | 1 | 16 | r8gd.8xlarge | r8gd.16xlarge | 31,350 GB |

Cache Size

AWS R8GD instances include local NVMe SSD storage. LakeCache uses half of the total NVMe storage for caching Parquet files. The cache size shown is the total cache available per engine (driver + all workers combined).
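As a sanity check, the "Cache Size per Engine" column can be derived from the instance types. The per-instance NVMe capacities below are assumptions based on comparable AWS memory-optimized "d" instance families; verify them against current AWS specifications:

```python
# Sketch: derive "Cache Size per Engine" from local NVMe capacity.
# Assumption: NVMe per instance (GB); these figures are estimates
# based on comparable AWS instance families, not official specs.
NVME_GB = {
    "r8gd.xlarge": 237,
    "r8gd.2xlarge": 474,
    "r8gd.4xlarge": 950,
    "r8gd.8xlarge": 1900,
    "r8gd.16xlarge": 3800,
}

def cache_per_engine(driver: str, worker: str, workers: int) -> int:
    """LakeCache uses half of the total NVMe (driver + all workers)."""
    total = NVME_GB[driver] + workers * NVME_GB[worker]
    return total // 2

# "medium": 1 driver (r8gd.4xlarge) + 1 worker (r8gd.8xlarge)
print(cache_per_engine("r8gd.4xlarge", "r8gd.8xlarge", 1))   # 1425
# "2x-large": 1 driver (r8gd.8xlarge) + 4 workers (r8gd.16xlarge)
print(cache_per_engine("r8gd.8xlarge", "r8gd.16xlarge", 4))  # 8550
```

Both results match the table above, confirming that the cache is half of the combined driver-plus-worker NVMe storage.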

GCP Instance Configuration (Coming Soon)

GCP support for LakeCache is coming soon. If you'd like early access, please reach out to the Onehouse team.

| Engine Size | Drivers | Workers | Driver Instance | Worker Instance | Cache Size per Engine |
|---|---|---|---|---|---|
| 2x-small | 1 | 1 | n2d-standard-4 | n2d-standard-4 | Coming Soon |
| x-small | 1 | 1 | n2d-standard-4 | n2d-standard-8 | Coming Soon |
| small | 1 | 1 | n2d-standard-8 | n2d-standard-16 | Coming Soon |
| medium | 1 | 1 | n2d-standard-16 | n2d-standard-32 | Coming Soon |
| large | 1 | 1 | n2d-standard-16 | n2d-standard-64 | Coming Soon |
| x-large | 1 | 2 | n2d-standard-32 | n2d-standard-64 | Coming Soon |
| 2x-large | 1 | 4 | n2d-standard-32 | n2d-standard-64 | Coming Soon |
| 3x-large | 1 | 8 | n2d-standard-32 | n2d-standard-64 | Coming Soon |
| 4x-large | 1 | 16 | n2d-standard-32 | n2d-standard-64 | Coming Soon |

Autoscaling Configuration

  • Min Engines: Minimum number of engines that will always be running
  • Max Engines: Maximum number of engines the Cluster can scale to
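Since scaling happens at the engine level, the Cluster always runs a whole number of engines between these two bounds. A hypothetical sketch of how a scaler might pick a target (the demand heuristic is purely illustrative, not LakeCache's actual algorithm):

```python
# Illustrative sketch of engine-level autoscaling: whole engines are
# added or removed, clamped to the configured Min/Max Engines.
# The queued-queries heuristic is a made-up example, not LakeCache's
# actual scaling algorithm.
def target_engines(queued_queries: int, queries_per_engine: int,
                   min_engines: int, max_engines: int) -> int:
    # Whole engines only: round demand up, then clamp to [min, max].
    demand = -(-queued_queries // queries_per_engine)  # ceiling division
    return max(min_engines, min(demand, max_engines))

print(target_engines(0, 10, min_engines=1, max_engines=4))   # 1 (never below min)
print(target_engines(25, 10, min_engines=1, max_engines=4))  # 3 (ceil(25/10))
print(target_engines(90, 10, min_engines=1, max_engines=4))  # 4 (capped at max)
```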

Catalog Configuration

Select the catalog containing your tables:

  • Onehouse Managed: Use the built-in Onehouse catalog (default)
  • AWS Glue: Connect to an AWS Glue catalog

Permissions Enforcement

Configure how access control is enforced:

  • None: No access control enforcement (default)
  • Lake Formation: Use AWS Lake Formation for table-level access control (available only when using the AWS Glue catalog). Row-level and column-level permissions are not yet supported.

Access Control Requirements

Lake Formation access control is only available when using the AWS Glue catalog. When using the Onehouse catalog, access control enforcement is not yet supported.

Connect to a LakeCache Cluster

LakeCache Clusters expose a JDBC endpoint that you can connect to using standard PostgreSQL drivers. The Cluster uses PostgreSQL 18 with OAuth authentication support.

Networking Prerequisites

Connect from within the VPC

By default, LakeCache Clusters only accept traffic from within your VPC. You can submit queries to a LakeCache Cluster endpoint from clients within your VPC.

To submit queries from clients outside the VPC, such as your local machine, follow these steps to connect through a VPN or bastion host.

Connect cloud tools

By default, LakeCache Clusters do not accept traffic from outside your VPC. For cloud tools that support SSH tunneling, follow these steps to connect through a bastion host.

If you must connect through a cloud tool outside the VPC that does not support SSH tunneling, such as dbt Cloud, please contact Onehouse support.

Get Connection Details

  1. In the Onehouse console, open the Clusters page and click into your LakeCache Cluster.
  2. After the Cluster is provisioned (which may take a few minutes), you will see a LakeCache Cluster endpoint in the Onehouse console. Here you will see:
    • Connection host
    • Username
    • Password
      • If the Cluster's catalog is Glue and Lake Formation enforcement is enabled, you will authenticate via your Identity Provider to generate your JDBC connection password. Follow the link from the Onehouse console to authenticate.

Connect with JDBC

Use a standard PostgreSQL JDBC driver to connect. You'll need the following connection parameters from your Cluster's connection details:

| Parameter | Description |
|---|---|
| Host | The connection host from your Cluster details |
| Port | 5432 |
| Database | postgres (or your specific database name) |
| Username | The username from your Cluster details |
| Password | The password from your Cluster details |
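
Putting these parameters together, a minimal helper that assembles both a JDBC URL and a libpq-style connection string (the host and username below are placeholders; substitute the values from your Cluster details):

```python
# Assemble connection strings from the parameters above.
# "your-cluster-host" and "db_user" are placeholders -- use the
# values shown in your Cluster's connection details.
params = {
    "host": "your-cluster-host",
    "port": 5432,
    "dbname": "postgres",
    "user": "db_user",
}

jdbc_url = f"jdbc:postgresql://{params['host']}:{params['port']}/{params['dbname']}"
libpq_dsn = " ".join(f"{k}={v}" for k, v in params.items())

print(jdbc_url)   # jdbc:postgresql://your-cluster-host:5432/postgres
print(libpq_dsn)  # host=your-cluster-host port=5432 dbname=postgres user=db_user
```

The first string is what JDBC-based tools such as DBeaver or Tableau expect; the second is the keyword/value form accepted by psql and other libpq clients.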

Connecting with Lake Formation

When using Lake Formation for access control, you can optionally specify an IAM role to assume for querying. If you don't specify a role, the first IAM role from your authentication token is used by default.

To specify a different IAM role, add it to the connection using the options parameter:

```
jdbc:postgresql://host:5432/postgres?options=-c onehouse.lake_cache_iam_role=arn:aws:iam::account:role/role-name
```

Example with DBeaver:

  1. Open your connection settings
  2. Go to the Driver Properties tab
  3. Add a new property with name options
  4. Set the value to -c onehouse.lake_cache_iam_role=arn:aws:iam::account:role/role-name
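
Because the options value contains a space and characters such as : and =, some clients require the query-string portion of the JDBC URL to be percent-encoded. A sketch using Python's standard library (the role ARN is a placeholder):

```python
from urllib.parse import quote

# Percent-encode the "options" value for use in a JDBC URL query string.
# The IAM role ARN below is a placeholder -- substitute your own role.
role_arn = "arn:aws:iam::account:role/role-name"
options = f"-c onehouse.lake_cache_iam_role={role_arn}"

url = "jdbc:postgresql://host:5432/postgres?options=" + quote(options, safe="")
print(url)  # space becomes %20, "=" becomes %3D, ":" becomes %3A
```

When setting the property through a driver-properties UI (as in the DBeaver steps above), pass the raw, unencoded value instead; the driver handles it for you.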

Connect with psql

Example psql connection:

```shell
psql "host=host port=5432 dbname=postgres options='-c onehouse.lake_cache_iam_role=arn:aws:iam::account:role/role-name' user=db_user oauth_issuer=https://your-idp.com oauth_client_id=your_client_id oauth_scope=openid profile email"
```

When connecting, you'll receive a device token. Visit the activation URL provided and enter the code to complete authentication.

Query Execution

Spark SQL Syntax

While LakeCache Clusters use PostgreSQL JDBC drivers for connectivity, queries are executed using Spark SQL under the hood. You should use Spark SQL syntax when writing queries, even if your client shows warnings about PostgreSQL syntax.

Once connected, you can submit SQL queries directly to the LakeCache Cluster. The Cluster will:

  1. Parse and validate the query
  2. Check permissions using the configured access control provider (if enabled)
  3. Route the query to an available engine
  4. Execute the query with caching optimizations
  5. Return results via the JDBC connection

Query Caching

LakeCache implements multiple layers of caching to improve query performance:

  • File caching: Frequently accessed Parquet files are cached in memory and on SSD
  • Query result caching: Deterministic query results are cached at the driver and worker levels

Query result caching only applies to deterministic queries (queries without non-deterministic functions like rand() or timestamp-related functions) and only when the underlying tables have not been updated since the result was cached.
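
This caching rule can be illustrated with a toy result cache keyed on the query text and the table's last-update version (purely illustrative; this is not LakeCache's internal implementation):

```python
# Toy illustration of the result-caching rule: a cached result is
# reused only for deterministic queries and only while the underlying
# table version is unchanged. Not LakeCache's actual implementation.
cache: dict = {}

def run_query(sql: str, table_version: int, execute) -> list:
    deterministic = "rand()" not in sql  # crude stand-in for real analysis
    key = (sql, table_version)
    if deterministic and key in cache:
        return cache[key]       # hit: same query text, same table version
    result = execute()          # miss: actually run the query
    if deterministic:
        cache[key] = result
    return result

calls = []
exec_fn = lambda: calls.append(1) or [42]  # records each real execution

run_query("SELECT count(*) FROM t", table_version=1, execute=exec_fn)
run_query("SELECT count(*) FROM t", table_version=1, execute=exec_fn)  # cache hit
run_query("SELECT count(*) FROM t", table_version=2, execute=exec_fn)  # table updated: miss
print(len(calls))  # 2 real executions, not 3
```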

Monitoring

You can view OCU Usage for the Cluster in the Onehouse console.

Coming soon

In an upcoming release, you will be able to monitor your LakeCache Cluster through:

  • Query execution logs: View query history and execution details
  • Performance metrics: Track query latency and throughput
  • Cluster metrics: Monitor CPU utilization and autoscaling behavior

Contact us if any of these logs, metrics, or other observability features would be useful for you.

Limitations

  • Currently supports read-only queries. Write operations are not supported.
  • Access control for Onehouse catalog is not yet supported. Use AWS Glue with Lake Formation for access control.
  • Lake Formation access control supports table-level permissions only. Row-level and column-level permissions are not yet supported.
  • Tables must be registered in the selected catalog to be queryable.
  • Auto-stop is not yet supported. LakeCache Clusters must be stopped manually and cannot yet be configured to automatically stop after a period of inactivity.