Postgres CDC
Description
Continuously stream data directly from a Onehouse managed Postgres CDC into your Onehouse managed lakehouse.
Follow the setup guide within the Onehouse console to get started. Click Sources > Add New Source > Postgres CDC.
Note that Postgres CDC source can only be created in AWS projects.
Pre-requisites
- Ensure that you have granted permission for database sources in the Terraform or CloudFormation configurations when you connected your cloud account.
- Your Postgres table must have at least one primary key.
- An AWS Glue Schema Registry that you can use.
Schema Information
The schema of the Postgres table will be used as the source schema. Data types from Postgres will be mapped to their Parquet equivalent.
Usage Notes
- When creating a Stream Capture with a Postgres CDC source, ensure the Write Mode is set to Mutable if you want to update records in the table using CDC logs. If you prefer to land the raw CDC logs in the table without updating records, you may use Append-only Write Mode to improve write performance.
- Onehouse will ingest records in the Debezium Postgres CDC format. To use these events for updating records in the table (rather than ingesting the raw events), apply the Convert CDC Data transformation.
- Onehouse will provision a MSK Kafka cluster in your cloud account when you create your first Stream Capture with a Postgres CDC source. Subsequent Stream Captures using any Postgres CDC source will use the same MSK cluster. The MSK cluster will remain active until you shut it down in your cloud account (Note: Ensure that all Postgres & MySQL sources have been deleted before shutting down the MSK cluster to avoid impacting other sources.).
- When creating a Stream Capture with a Postgres CDC source, Onehouse will initially bootstrap the existing data from Postgres before starting incremental ingestion.
Guide for creating an AWS Glue Schema Registry
If you do not currently have an AWS Glue Schema Registry you can use, you can use these steps to create a schema registry.
- Go to the AWS Glue console
- Select Schema registries under Data catalog in the navigation pane
- Select Add registry
- Enter a name for the registry, such as "MyDemoSchemaReg"
- Enter an optional description for the registry
- Select Add registry
More information about AWS Glue Schema Registry can be found on the AWS Glue Documentation pages
Guide for setting up AWS RDS Postgres
If you are using Amazon Relational Database Service (RDS) for your Postgres database, follow these steps to ensure Onehouse can connect to your database.
Configure a database parameter group
By default, Amazon RDS provides a set of parameter group configurations for basic database operations. These do not enable all the necessary features that are needed for Debezium to perform its change data capture operations, most notably logical replication.
If you have already created a custom parameter group for your version of PostgreSQL, then you can skip this section. If you have not, this will walk you through how to copy and customize the default configuration groups for your needs.
-
Navigate to RDS, this should take you to the RDS Dashboard.
-
Click Parameter Groups in the Resources panel on the dashboard.
-
Click Create parameter group button in the top right.
-
Configure your parameter group:
Parameter group family: postgresXX.YY (where XX.YY is closest to your major version) Group name: custom-postgresXX.YY Description: Custom parameter group for postgres XX.YY
You should be able to click on the name of your parameter group and it should take you to a tabular screen where each parameter is listed, its current configured value, a description of the parameter and other pertinent information.
At the top of this tabular layout is a search box you can to use to quickly locate and edit parameters.
Click the Edit parameters
button in the top right and then click in the search box.
- Search for
rds.logical_replication
and set its value to1
.
Once you've made these changes click Save changes
to persist the parameter changes to your custom parameter group.
Then you need to modify RDS configs to apply the new parameter group:
Make sure you apply changes immediately. You must reboot the database to enable the parameter group changes.
Create a dedicated database user (e.g. cdc_user) for Onehouse.
A user needs access to read the replication slots. Log in as the PostgreSQL admin user to setup the CDC user (cdc_user) for Onehouse.
Grant permissions:
-- global changes
CREATE USER cdc_user WITH ENCRYPTED PASSWORD '<password>';
GRANT rds_superuser TO cdc_user;
-- schema specific changes
GRANT ALL ON SCHEMA public TO cdc_user;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO cdc_user;
GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA public TO cdc_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL PRIVILEGES ON TABLES TO cdc_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL PRIVILEGES ON SEQUENCES TO cdc_user;
The rds_superuser
role has the REPLICATION
attribute, which is required for logical replication.
Other roles with the REPLICATION
privilege can also be used i.e. rds_replication
in place of rds_superuser
.
Create a publication that publishes all changes in all tables:
-- database specific change
CREATE PUBLICATION alltables FOR ALL TABLES;
Heartbeat - Important
Using the Postgres application user (the user that the application teams used to create all the other tables), create the heartbeat table and grant privileges:
-- database specific changes
CREATE TABLE onehouse_heartbeat (
id INTEGER DEFAULT 1 PRIMARY KEY,
updated_at timestamp
);
INSERT INTO onehouse_heartbeat DEFAULT VALUES ON CONFLICT (id) DO UPDATE SET updated_at=NOW();
Create Ingress to PostgreSQL
Onehouse requires read-access to PostgreSQL.
Add ingress for port 5432
to allow connections from the onehouse-eks cluster.
Go to the security group of PostgreSQL, create ingress for 5432 port for the security group that governs the onehouse-eks cluster.
(Optional) Grant privileges for a specific database
Run the following command to grant the CDC user privleges for a specific database:
-- database specific change
GRANT ALL PRIVILEGES ON DATABASE <database_name> TO cdc_user;
Troubleshooting
- If you're facing the error
The connection attempt failed
on Onehouse console, it could mean one of two things:- Credential related:
- Make sure the username and password are correct. Also, ensure that the user has the necessary permissions.
- Make sure the host url is correct.
- Network related:
- Make sure the security group of the Postgres database allows inbound traffic from the Onehouse EKS cluster.
- If the database is in a different VPC, make sure there is a working VPC peering connection between the two VPCs.
- Credential related:
You can test the connection between the database to Onehouse EKS cluster by creating a sample EC2 instance in the same VPC, subnet and with the same security group as the Onehouse EKS cluster. Then follow the document from AWS to connect to the database from the EC2 instance.
- If the connection is successful, please reach out to Onehouse support.
- If the connection is unsuccessful, please check the security group rules and network configurations.
You can learn more about Debezium Postgres connector here.