Review Architecture

Overview

Onehouse has a secure architecture: the control plane runs in Onehouse’s AWS account, and the data-processing systems run within your AWS account and VPC.

Our product architecture is designed to keep you in control of all of your data by leaving it in the cloud buckets owned by your AWS accounts. When you use Onehouse for data management, the compute resources for those activities are also provisioned in your account, with autoscaling and auto-termination features. As a result, you will incur costs for the cloud infrastructure that Onehouse spins up in your AWS accounts.

Onehouse infrastructure deployment and usage are designed to optimize performance and cost. By default, all nodes are deployed in a single availability zone to minimize network costs. For users who prefer maximum availability, we can also set up node pools across multiple availability zones.

Architecture Diagram

EKS Cluster

Onehouse provisions an EKS cluster in the customer’s account. The cluster runs an agent service and several pods that run Spark jobs, Spark infrastructure (Spark Operator), the observability stack (Prometheus), and debugging tools.

Onehouse manages the cluster size and load balancing across the cluster to meet specific SLAs for the ingestion and other data management jobs.

The EKS cluster is created within an existing VPC in the customer’s AWS account. The cluster runs within private subnets, and we use a public subnet with a NAT to give the EKS nodes a secure outbound internet connection.

We create a separate security group for the EKS cluster. We do not open any inbound ports.
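One way to confirm the "no inbound ports" property yourself is to inspect the security group's ingress rules. The sketch below is illustrative only: it assumes you have already fetched the security group as a dictionary in the shape returned by the EC2 DescribeSecurityGroups API (`IpPermissions`, `IpRanges`, `Ipv6Ranges`), and the helper name `publicly_open_ingress` is hypothetical.

```python
def publicly_open_ingress(security_group: dict) -> list[dict]:
    """Return the ingress rules that admit traffic from any address.

    `security_group` is assumed to be one entry of the `SecurityGroups`
    list in an EC2 DescribeSecurityGroups response.
    """
    open_rules = []
    for rule in security_group.get("IpPermissions", []):
        ipv4_cidrs = [r.get("CidrIp") for r in rule.get("IpRanges", [])]
        ipv6_cidrs = [r.get("CidrIpv6") for r in rule.get("Ipv6Ranges", [])]
        # 0.0.0.0/0 and ::/0 mean "anywhere" for IPv4 and IPv6 respectively.
        if "0.0.0.0/0" in ipv4_cidrs or "::/0" in ipv6_cidrs:
            open_rules.append(rule)
    return open_rules
```

An empty result means no ingress rule is open to the internet. Rules that reference other security groups (for example, node-to-node traffic) are intentionally not flagged by this check.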

Cluster endpoint access

The default configuration is public + private endpoint access, with the public endpoint restricted to the Onehouse control plane’s VPC. We have set up a bastion host in our VPC to access the cluster.

Private EKS Cluster

We also support the private endpoint access configuration. To enable this, we will set up a bastion host (EC2 instance) in your VPC and restrict cluster access to this host. Onehouse will access the bastion host using AWS Session Manager.

This setup improves security in two key ways. First, all traffic to your cluster’s API server stays within your VPC, so nothing goes over the public internet. Second, any kubectl commands or API requests must come from inside the VPC.

NOTE: Onehouse does not manage the bastion host.
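To tell the two endpoint configurations apart on an existing cluster, you can look at the cluster's VPC configuration. The following is a minimal sketch, assuming you pass in the `resourcesVpcConfig` dictionary from an EKS DescribeCluster response (fields `endpointPublicAccess`, `endpointPrivateAccess`, `publicAccessCidrs`). The function name and the returned labels are hypothetical and are not part of any Onehouse tooling.

```python
def endpoint_mode(vpc_config: dict) -> str:
    """Classify an EKS cluster's API endpoint access configuration."""
    public = bool(vpc_config.get("endpointPublicAccess", False))
    private = bool(vpc_config.get("endpointPrivateAccess", False))
    cidrs = vpc_config.get("publicAccessCidrs", [])
    # A missing or empty CIDR list is treated as unrestricted, to stay
    # on the cautious side when the response is incomplete.
    restricted = bool(cidrs) and "0.0.0.0/0" not in cidrs

    if private and not public:
        return "private"
    if public and private:
        return "public+private (restricted)" if restricted else "public+private (unrestricted)"
    if public:
        return "public only (restricted)" if restricted else "public only (unrestricted)"
    return "no endpoint access"
```

Under these assumptions, the default setup described above would classify as "public+private (restricted)", and the private EKS cluster option as "private".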

Agent

The agent that runs in the customer’s AWS account maintains a long-running gRPC connection over HTTP/TCP port 9090 with a controller that runs in the Onehouse AWS account.

📘 The agent initiates the connection, so we require internet connectivity for the EKS cluster nodes.

No firewall ports need to be opened for inbound traffic, since the agent connects to the controller and not the other way around.
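Because the connection is outbound only, a quick way to check the network path from inside your VPC is to attempt a TCP connection to the controller on port 9090. The sketch below tests plain TCP reachability only, not the gRPC handshake, and the controller hostname in the usage comment is a placeholder rather than a real Onehouse endpoint.

```python
import socket


def can_reach(host: str, port: int = 9090, timeout: float = 3.0) -> bool:
    """Return True if an outbound TCP connection to host:port succeeds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example usage (hypothetical hostname; substitute the controller address
# provided by Onehouse):
#     can_reach("controller.example.com", 9090)
```

Running this from a node or pod in the private subnets exercises the same route (private subnet, NAT, internet) that the agent relies on.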

All API calls from the UI are routed to the agent via the controller over this long-running gRPC stream.

The API calls from the controller to the agent are used only to manage the data plane in the customer account. We transfer only metadata, such as schemas and table and database names.

The actual data is never transferred from the agent to the controller, with one exception: the table preview feature does transfer data, and you can disable it in the Onehouse dashboard.

Accessing AWS Resources

To ingest data, the Onehouse data plane (the EKS cluster within the customer account) requires connectivity to the customer’s AWS resources, such as Kafka and S3.

For S3 access, we use IAM-based permissions: the IAM role of the EKS nodes is granted access to the required S3 buckets. For Kafka, however, we need network connectivity.
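As an illustration of what IAM-based bucket access looks like, the sketch below builds a policy document that could be attached to a node role. The specific action list is an assumption made for this example; the permissions that Onehouse actually requests are not listed here and may differ. The bucket name and the helper `s3_access_policy` are hypothetical.

```python
import json


def s3_access_policy(bucket_names: list[str]) -> dict:
    """Build an IAM policy document granting object and list access to buckets."""
    bucket_arns = [f"arn:aws:s3:::{name}" for name in bucket_names]
    object_arns = [f"{arn}/*" for arn in bucket_arns]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level action: applies to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arns,
            },
            {
                # Object-level actions (assumed set): apply to keys in the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": object_arns,
            },
        ],
    }


policy_json = json.dumps(s3_access_policy(["my-lake-bucket"]), indent=2)
```

Scoping the `Resource` lists to named buckets, rather than using a wildcard, keeps the node role limited to the buckets you choose to expose.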

Compute and Storage Resources

Onehouse provisions EC2 m8g.xlarge instances within your AWS account to process data. These instances are deployed and managed via the EKS cluster in your AWS account.

For each EC2 instance, Onehouse provisions 256 GB of storage via EBS gp3 volumes. This low-cost storage enables you to process more data per instance for faster processing and reliable handling of workload spikes.
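For capacity planning, the per-node figures can be multiplied out across a cluster. This is a rough sketch: the 256 GB of EBS storage per node comes from the text above, while the 4 vCPUs and 16 GiB of memory per m8g.xlarge are taken from AWS's published instance specifications and should be verified against current AWS documentation. The node count is whatever autoscaling yields at a given moment, so treat the result as a point-in-time estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    vcpus: int = 4        # m8g.xlarge (assumed from AWS instance specs)
    memory_gib: int = 16  # m8g.xlarge (assumed from AWS instance specs)
    ebs_gb: int = 256     # gp3 volume size per instance, as stated above


def cluster_capacity(node_count: int, spec: NodeSpec = NodeSpec()) -> dict:
    """Return the aggregate vCPUs, memory, and EBS storage for a node count."""
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    return {
        "vcpus": node_count * spec.vcpus,
        "memory_gib": node_count * spec.memory_gib,
        "ebs_gb": node_count * spec.ebs_gb,
    }
```

For example, `cluster_capacity(5)` gives 20 vCPUs, 80 GiB of memory, and 1280 GB of EBS storage under these assumptions.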