DataHub
Description
DataHub is an open-source metadata platform that enables users to discover, understand, and manage metadata across a wide range of data sources. It provides a unified data catalog for searching and browsing datasets, schemas, and other metadata, as well as a metadata ingestion framework for integrating data from various sources. With DataHub, organizations can create a single source of truth for their data assets, improving data discoverability and governance.
Cloud Provider Support
- AWS: ✅ Supported
- GCP: ✅ Supported
Setup guide
- Enter a Name to identify the data catalog in Onehouse.
- Select "DataHub" as the Type.
- Use the DataHub Generalized Metadata Service (GMS) endpoint for the server URL. If you are using DataHub Core (docker quickstart), the value is http://localhost:8080.
- Enter the Data Platform Name. The recommended default value is "hudi".
- Enter the Dataset Env. Currently only "PROD" is valid. Support is planned for all the values listed in https://raw.githubusercontent.com/datahub-project/datahub/master/li-utils/src/main/pegasus/com/linkedin/common/FabricType.pdl.
- Enter the Auth Token. If you are using DataHub Core (docker quickstart), run the following command before docker compose startup to enable generation of personal access tokens:
  export METADATA_SERVICE_AUTH_ENABLED=true
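The fields above can be sketched as a small settings object, which makes it easier to sanity-check values before entering them in Onehouse. This is a minimal illustration, not part of Onehouse or DataHub: the class `DataHubCatalogSettings` and the function `build_config_request` are hypothetical names, and the sketch assumes that GMS exposes a `/config` endpoint and accepts the personal access token as a `Bearer` token in the `Authorization` header.

```python
from dataclasses import dataclass
from urllib.request import Request


@dataclass(frozen=True)
class DataHubCatalogSettings:
    """Hypothetical container mirroring the setup form fields."""
    name: str                    # name identifying the catalog in Onehouse
    server_url: str              # DataHub GMS endpoint, e.g. http://localhost:8080
    auth_token: str              # DataHub personal access token
    platform_name: str = "hudi"  # recommended default
    dataset_env: str = "PROD"    # only "PROD" is currently valid

    def validate(self) -> None:
        # Reject values the setup guide says are not accepted.
        if self.dataset_env != "PROD":
            raise ValueError('Dataset Env must be "PROD"')
        if not self.server_url.startswith(("http://", "https://")):
            raise ValueError("server URL must start with http:// or https://")


def build_config_request(settings: DataHubCatalogSettings) -> Request:
    """Build (but do not send) an authenticated request to GMS.

    Assumption: GMS serves a /config endpoint and expects
    'Authorization: Bearer <token>'.
    """
    settings.validate()
    url = settings.server_url.rstrip("/") + "/config"
    return Request(url, headers={"Authorization": f"Bearer {settings.auth_token}"})
```

With a running GMS instance, passing the returned request to `urllib.request.urlopen` is one way to check that the endpoint is reachable and the token is accepted before saving the catalog.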
Once the Onehouse Flow has run with DataHub as one of the catalogs to sync with, you can see the data in DataHub by logging into the DataHub UI. For DataHub Core (docker quickstart), the URL for the UI is http://localhost:9002.
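Besides the UI, a synced table can be located by its dataset URN. The sketch below shows how the Data Platform Name and Dataset Env map into DataHub's dataset URN format, and builds a lookup URL against GMS. The dataset name `my_db.my_table` is a made-up example (the exact names Onehouse emits are not specified here), the function names are hypothetical, and the `/entities/<url-encoded urn>` path is assumed from DataHub's Rest.li API.

```python
from urllib.parse import quote


def build_dataset_urn(platform_name: str, dataset_name: str, env: str = "PROD") -> str:
    """Compose a DataHub dataset URN from platform, dataset name, and environment."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform_name},{dataset_name},{env})"


def build_entity_url(gms_url: str, urn: str) -> str:
    """Build a GMS lookup URL for an entity.

    Assumption: GMS serves entities at /entities/<url-encoded urn>.
    """
    # safe="" forces ':', '(', ')' and ',' to be percent-encoded as well.
    return gms_url.rstrip("/") + "/entities/" + quote(urn, safe="")


# Hypothetical table name, for illustration only.
urn = build_dataset_urn("hudi", "my_db.my_table")
url = build_entity_url("http://localhost:8080", urn)
```

A GET request to the resulting URL, sent with the same auth token used in the catalog setup, would then be one way to confirm that the dataset exists after the Flow has run.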
Example connection setup for DataHub in Onehouse

Example token setup in DataHub Core

Example data in DataHub Core
