AWS Glue Data Catalog
Description
AWS Glue Data Catalog is a fully managed, serverless data catalog service provided by Amazon Web Services. It lets you discover, search, and manage metadata for your data assets across various data sources. With the Glue Data Catalog, you can integrate and process data from multiple sources using AWS data services such as Amazon Redshift, Amazon Athena, and Amazon EMR.
Setup guide
- Enter a Name to identify the data catalog in Onehouse
- Press Create Glue Catalog
Multi-format Catalog Sync
Onehouse natively supports syncing to the Glue Data Catalog in multiple open table formats (Apache Hudi and Apache Iceberg, using Apache XTable). A single copy of your data is synced to the Glue Catalog with both Hudi and Iceberg metadata. This gives you table format interoperability, so you can use whichever format best fits your use case.
To set this up, select the formats you would like to sync and define the suffix appended to the table name for each format (the Iceberg suffix defaults to _iceberg). Any Iceberg-format table is then registered as tableName_iceberg in the Glue Catalog. The table below shows an example.
Format | Table Name (in catalog) |
---|---|
Hudi | tableName_ro (read-optimized view), tableName_rt (real-time or snapshot view) |
Iceberg | tableName_iceberg |
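The naming convention in the table can be sketched as a small helper. This is only an illustration of the suffix rules described above, not Onehouse code: the function name `catalog_table_names` and its parameters are hypothetical, and the Iceberg suffix is whatever you configured (the default `_iceberg` is assumed here).

```python
from typing import Dict, List

# Suffixes taken from the table above. "_iceberg" is only the default;
# the Iceberg suffix is configurable when the catalog sync is set up.
HUDI_SUFFIXES = ("_ro", "_rt")
DEFAULT_ICEBERG_SUFFIX = "_iceberg"


def catalog_table_names(
    table_name: str,
    formats: List[str],
    iceberg_suffix: str = DEFAULT_ICEBERG_SUFFIX,
) -> Dict[str, List[str]]:
    """Return the names a table is expected to be registered under, per format."""
    names: Dict[str, List[str]] = {}
    for fmt in formats:
        key = fmt.lower()
        if key == "hudi":
            # Read-optimized and real-time (snapshot) views.
            names["hudi"] = [table_name + suffix for suffix in HUDI_SUFFIXES]
        elif key == "iceberg":
            names["iceberg"] = [table_name + iceberg_suffix]
        else:
            raise ValueError(f"unsupported format: {fmt}")
    return names


print(catalog_table_names("orders", ["Hudi", "Iceberg"]))
# {'hudi': ['orders_ro', 'orders_rt'], 'iceberg': ['orders_iceberg']}
```

A helper like this can be handy when generating query templates for engines such as Athena, where each format is addressed by its own table name.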
Onehouse-managed Iceberg tables should not be written to by external writers, as this could corrupt the data in the table.
Glue Data Catalog integrations can only be created in AWS projects.
For Lake Formation integrated projects
If you are using Lake Formation with your Glue Data Catalog, your Onehouse role must have the necessary permissions to access the Glue Data Catalog resources.
High-level steps
The role responsible for accessing the Glue resources is onehouse-customer-eks-node-role-<requestId-prefix>. This role is created when you deploy Onehouse in your account.
- In your Lake Formation console, go to the Data permissions tab.
- Use the node role as the Principal to grant permissions to the Named Data Catalog resources.
- Choose the appropriate Catalog, Databases, and Tables resources to grant permissions to.
- For flexibility, you may provide Super permissions to the role. This will allow Onehouse to access all resources in the Glue Data Catalog for the chosen resources.
- At the very minimum, you will need to provide the following permissions to the role for the resources you want Onehouse to access:
Offering | Resource | Permissions |
---|---|---|
Observed Lake | Database | Create Table, Alter, Describe |
Observed Lake | Table | Select, Insert, Describe, Alter, Drop |
Managed Lake | Database | Create Table, Alter, Describe, Drop |
Managed Lake | Table | Select, Insert, Describe, Alter, Drop, Delete |
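If you prefer to script these grants instead of using the console, the minimum permissions above map onto the Lake Formation `grant_permissions` API. The following is a minimal sketch under stated assumptions, not an official Onehouse procedure: the helper names `build_grants` and `apply_grants`, the account ID, the database name, and the region are all hypothetical, and the table-level grant is assumed to apply to every table in the database through a table wildcard.

```python
from typing import Dict, List

# Minimum permissions from the table above, written with the permission
# names used by the Lake Formation API.
MIN_PERMISSIONS: Dict[str, Dict[str, List[str]]] = {
    "observed": {
        "database": ["CREATE_TABLE", "ALTER", "DESCRIBE"],
        "table": ["SELECT", "INSERT", "DESCRIBE", "ALTER", "DROP"],
    },
    "managed": {
        "database": ["CREATE_TABLE", "ALTER", "DESCRIBE", "DROP"],
        "table": ["SELECT", "INSERT", "DESCRIBE", "ALTER", "DROP", "DELETE"],
    },
}


def build_grants(role_arn: str, database: str, offering: str) -> List[dict]:
    """Build keyword arguments for two grant_permissions calls (database, tables)."""
    perms = MIN_PERMISSIONS[offering]
    principal = {"DataLakePrincipalIdentifier": role_arn}
    return [
        {
            "Principal": principal,
            "Resource": {"Database": {"Name": database}},
            "Permissions": perms["database"],
        },
        {
            "Principal": principal,
            # Assumption: grant on all tables in the database via a wildcard.
            "Resource": {"Table": {"DatabaseName": database, "TableWildcard": {}}},
            "Permissions": perms["table"],
        },
    ]


def apply_grants(grants: List[dict], region: str) -> None:
    """Send the grants to Lake Formation. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so build_grants works without AWS access

    client = boto3.client("lakeformation", region_name=region)
    for grant in grants:
        client.grant_permissions(**grant)


# Hypothetical values; replace the account ID, role suffix, and database name.
node_role = "arn:aws:iam::123456789012:role/onehouse-customer-eks-node-role-<requestId-prefix>"
grants = build_grants(node_role, "analytics_db", "managed")
for grant in grants:
    print(grant["Resource"], grant["Permissions"])
# apply_grants(grants, region="us-west-2")  # uncomment to apply the grants
```

Keeping the payload construction separate from the API call makes it possible to review the exact grants before anything is sent to AWS. Granting Super in the console corresponds to the `ALL` permission in the API.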