Skip to main content

MySQL CDC

Description

Continuously stream data directly from a Onehouse managed MySQL CDC into your Onehouse managed lakehouse.

Follow the setup guide within the Onehouse console to get started. Click Sources > Add New Source > MySQL CDC.

Note that MySQL CDC sources can only be created in AWS projects.

Pre-requisites

  1. Ensure that you have granted permission for database sources in the Terraform or CloudFormation configurations when you connected your cloud account.
  2. Your MySQL table must have at least one primary key.

Schema Information

The source schema will match the schema of the tables in your source database. Data types from the source database will be mapped to their Parquet equivalent.

Usage Notes

  • When creating a Stream Capture with a MySQL CDC source, ensure the Write Mode is set to Mutable if you want to update records in the table using CDC logs. If you prefer to land the raw CDC logs in the table without updating records, you may use Append-only Write Mode to improve write performance.
  • Onehouse will ingest records in the Debezium MySQL CDC format. To use these events for updating records in the table (rather than ingesting the raw events), apply the Convert CDC Data transformation.
  • Onehouse will provision a MSK Kafka cluster in your cloud account when you create your first Stream Capture with a MySQL CDC source. Subsequent Stream Captures using any MySQL CDC source will use the same MSK cluster. The MSK cluster will remain active until you shut it down in your cloud account (Note: Ensure that all Postgres & MySQL sources have been deleted before shutting down the MSK cluster to avoid impacting other sources.).
  • When creating a Stream Capture with a MySQL CDC source, Onehouse will initially bootstrap the existing data from MySQL before starting incremental ingestion.

Enabling GTID

Global transaction identifiers (GTIDs) is not a mandatory requirement for ingesting CDC data but enabling it has the following advantages:

  • You do not have to re-bootstrap data if you switch to another replica with identical data. (To switch host please reachout to customer support)
  • Your database can failover to another replica with identical data and continue from where it previously stopped.

Quickstart Guide with Amazon RDS:

This guide shows you how to set up a MySQL database instance with all the prerequisites for Onehouse ingestion. This examples shows Amazon RDS, but you may perform the same steps on any MySQL database. For more detailed information, refer to the Amazon RDS documentation.

1. Create a Security Group

  1. Go to the Amazon VPC Console: Navigate to the Security Groups section.
  2. Click on the “Create security group” button.
  3. Configure Basic Settings:
    • Name: Enter the name (e.g., app_security_group).
    • Description: Provide a brief description.
    • VPC: Select the VPC from the dropdown.
  4. Configure Inbound Rules:
    • Click on “Add Rule”.
    • Rule 1:
      • Type: Custom TCP Rule
      • Protocol: TCP
      • Port Range: 3306
      • Source: Use the VPC IP range where the Onehouse dataplane EKS cluster is hosted (e.g., 10.0.0.0/16 for IPv4, ::/0 for IPv6 if necessary). Avoid using 0.0.0.0/0 as it allows open access from all sources.
  5. Configure Outbound Rules:
    • By default, all outbound traffic is allowed, but you can configure specific rules:
    • Rule 1:
      • Type: Custom TCP Rule
      • Protocol: TCP
      • Port Range: 3306
      • Destination: Use the VPC IP range where the Onehouse dataplane EKS cluster is hosted (e.g., 10.0.0.0/16 for IPv4, ::/0 for IPv6 if necessary). Avoid using 0.0.0.0/0 to limit open access.
  6. Click on “Create security group”.

2. Create a DB Subnet Group

  1. Go to the Amazon RDS Console: Navigate to the Subnet Groups section.
  2. Click on the “Create DB Subnet Group” button.
  3. Configure Basic Settings:
    • Name: Enter the name (e.g., db_subnet_group_name).
    • Description: Provide a brief description.
    • VPC: Select the VPC from the dropdown.
  4. Add Subnets:
    • In the Add Subnets section, select the subnets where the RDS instances will reside.
  5. Click on “Create”.

3. Configure a Database Parameter Group

  1. Go to the Amazon RDS Console: Navigate to the Parameter Groups section.
  2. Click on the “Create parameter group” button.
  3. Configure Basic Settings:
    • Parameter group family: Select the appropriate DB engine family.
    • Group Name: Enter the name (e.g., db_pg_name).
    • Description: Provide a brief description.
  4. Modify Parameters (Optional):
    • Select your newly created parameter group.
    • Click on “Edit parameters”.
    • Modify the following parameters:
      • binlog_format: Set the value to ROW and the apply method to immediate.
      • innodb_lock_wait_timeout [optional]: Set the value to 600 and the apply method to immediate.
        • This parameter sets the amount of time in seconds that a transaction will wait for a row lock before giving up. A higher value can prevent short-lived deadlocks during long-running transactions.
      • gtid-mode: Set the value to ON.
        • This parameter enables Global Transaction Identifiers (GTIDs), which provide a unique identifier for each transaction. This is essential for replication setups to ensure that transactions are consistently applied across servers.
      • enforce_gtid_consistency: Set the value to ON.
        • This parameter ensures that only statements that are safe for GTID-based replication are allowed. It enforces GTID consistency, which is necessary when gtid-mode is enabled.
    • Modify the parameters as needed.
  5. Click on “Save changes”.

4. Create a MySQL Database Instance

  1. Go to the Amazon RDS Console: Navigate to the Databases section.
  2. Click on the “Create database” button.
  3. Select Standard Create.
  4. Engine Options:
    • Select the MySQL engine.
    • Choose the desired engine version.
  5. Templates: Choose the appropriate template based on your needs (e.g., Production, Dev/Test).
  6. Settings:
    • DB instance identifier: Enter a unique identifier for your instance.
    • Master username: Enter the username
    • Master password: Enter the password
  7. DB Instance Size:
    • Select the instance class.
  8. Storage:
    • Allocated storage: Enter the storage size.
  9. Connectivity:
    • Virtual Private Cloud (VPC): Select the appropriate VPC.
    • Subnet group: Select the DB subnet group created earlier (use db_subnet_group_name).
    • Publicly accessible: Set according to your needs.
    • VPC security group: Select the security group created earlier.
  10. Database Options:
    • Database name: Enter the name.
  11. Additional Configuration:
    • Parameter group: Select the parameter group created earlier (use parameter_group_name).
    • Backup, Monitoring, and Maintenance: Configure these settings based on your needs.
      • Backup Retention Period: Set a non-zero value (e.g., 7 days) to enable binary logging (bin log).
  12. Skip Final Snapshot: Enable this if you want to skip taking a final snapshot before deletion.
  13. Click on “Create database”.

Create a dedicated database user (e.g. cdc_user) for Onehouse.

For Onehouse to ingest data, it requires a MySQL user account. This MySQL user must have appropriate permissions on all databases and tables for which Onehouse captures changes.

Log in as the MySQL admin user to setup the new user:

-- Create the user
CREATE USER 'cdc_user'@'%' IDENTIFIED BY 'password';

-- Grant permissions
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT, LOCK TABLES ON *.* TO 'cdc_user'@'%';

-- Finalize the user’s permissions
FLUSH PRIVILEGES;

Create Ingress to MySQL

Onehouse requires read-access to MySQL.

Add ingress for port 3306 to allow connections from the onehouse-eks cluster.

If not already done, go to the security group of MySQL, create ingress for 3306 port for the security group that governs the onehouse-eks cluster.

(Optional) Grant privileges for a specific database

-- Grant permissions on a specific database
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON database_name.* TO 'cdc_user'@'%' IDENTIFIED BY 'password';

-- Finalize the user’s permissions
FLUSH PRIVILEGES;

Notes: Replace password with a strong password of your choice. Replace database_name with the actual name of the specific database.

Heartbeat - Important

Using the MySQL application user (the user that the application teams used to create all the other tables), create the heartbeat table and grant privileges:

-- database specific changes
CREATE TABLE onehouse_heartbeat (
id INT DEFAULT 1 PRIMARY KEY,
updated_at TIMESTAMP
);

GRANT INSERT, UPDATE ON <db-name>.onehouse_heartbeat TO 'cdc_user'@'%';

INSERT INTO onehouse_heartbeat (id) VALUES (1) ON DUPLICATE KEY UPDATE updated_at = CURRENT_TIMESTAMP;