Notebook Clusters
Notebook Clusters deploy a Jupyter notebook on Onehouse infrastructure to run interactive PySpark workloads on the Onehouse Quanton engine.
This feature is in beta, and might not yet be available in your project.
Connect to the notebook
- In the Onehouse console, open the Clusters page and click into your Notebook Cluster.
- At the top of the page, you will see the notebook connection details. Copy the notebook URL.

- The notebook is exposed on an internal load balancer, accessible from within your VPC. Follow these steps to connect to the internal load balancer.
- Note: If public access is enabled, you can skip this step. For projects with public access enabled, the notebook URL will start with https://dp-apps.cloud.onehouse.ai/.
- Enter the notebook URL in your browser to open the Jupyter notebook UI. Sign in with the username and password from the Onehouse console.
- You can now create and run .ipynb notebooks in the Jupyter UI.
You can change to a different Notebook Cluster by signing out in the Jupyter notebook UI (File > Log out) and signing back in with the other Cluster's credentials.
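As a quick sanity check, you can run a first cell to confirm that the kernel and the pre-installed PySpark libraries are available. This is a minimal sketch; the exact versions you see depend on your Cluster:

import sys
import pyspark

# The Python version comes from the notebook kernel; the PySpark version should
# match the Spark build pre-installed on the Cluster (3.5.x at the time of writing).
print(sys.version)
print(pyspark.__version__)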
Managing package dependencies
Notebooks come with several Python packages pre-installed. You may also install more packages with pip install.
Pre-installed packages
The following packages come pre-installed on every Notebook Cluster:
- Apache Spark 3.5.2
- Apache Hudi 0.14.1
- Apache Iceberg 1.10.0
Installing packages for notebooks
You can install packages directly on a notebook Cluster. Follow these steps:
- In your notebook, create a new cell and run pip install:
! pip install <package-name>
- Restart the Jupyter kernel.
- Your newly installed packages will be picked up.
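For example, to install and use a package (requests is used here purely as an illustration, not one of the pre-installed dependencies):

# Cell 1: install the package.
! pip install requests

# Cell 2: after restarting the kernel, the package is importable.
import requests
print(requests.__version__)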
Quickstarts
Query an existing Onehouse table
Follow the steps below or download the sample notebook file.
- In the Onehouse console, open the Clusters page. Click Add New Cluster.
- Create a Cluster of type Notebook with the Onehouse Catalog. This configures the Cluster to work with Onehouse tables.
- Follow these steps to connect to your Jupyter notebook.
- In your notebook, create an Apache Spark session:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.getOrCreate()
print(f"✅ Spark {spark.version} initialized.") - Use the database containing the Onehouse table you want to query:
spark.sql("USE onehouse_db")- You can view your Onehouse databases with:
spark.sql("SHOW DATABASES").show()
- You can view your Onehouse databases with:
- Query the table:
spark.sql("SELECT * FROM my_table").show()- You can view the tables in your database with:
spark.sql("SHOW TABLES").show()
- You can view the tables in your database with:
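Beyond SELECT *, you can run any Spark SQL against the table and keep working with the result as a DataFrame. A small sketch, where the referenced column name is a placeholder for your own schema:

# Run an aggregation and keep the result as a DataFrame for further processing.
df = spark.sql("""
    SELECT name, COUNT(*) AS row_count
    FROM my_table
    GROUP BY name
    ORDER BY row_count DESC
""")
df.show()

# If pandas is installed in the notebook environment, convert for local inspection.
pdf = df.toPandas()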
Create and query a Onehouse table
Follow the steps below or download the sample notebook file.
- In the Onehouse console, open the Clusters page. Click Add New Cluster.
- Create a Cluster of type Notebook with the Onehouse Catalog. This configures the Cluster to work with Onehouse tables.
- Follow these steps to connect to your Jupyter notebook.
- In your notebook, create an Apache Spark session:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.getOrCreate()
print(f"✅ Spark {spark.version} initialized.") -
Set your database:
spark.sql("USE onehouse_db") -
Create a Onehouse table in the Apache Hudi format:
spark.sql("""
CREATE TABLE my_table (
id BIGINT,
name STRING,
email STRING,
created_at TIMESTAMP,
updated_at TIMESTAMP
)
USING hudi
TBLPROPERTIES (
'type' = 'cow',
'primaryKey' = 'id',
'preCombineField' = 'updated_at',
'hoodie.metadata.enable' = 'true',
'hoodie.datasource.write.recordkey.field' = 'id',
'hoodie.datasource.write.precombine.field' = 'updated_at',
'hoodie.datasource.write.table.name' = 'my_table'
)
""")Create the table in the Apache Hudi formatFor tables to appear in the Onehouse console and be compatible with table services, you should create the table in the Apache Hudi format.
You don't need to include
LOCATIONwhen creating the table. This will automatically be set based on the cloud storage path of your database. -
Verify the table creation:
spark.sql("DESCRIBE EXTENDED my_table").show(truncate=False) -
Insert sample data:
spark.sql("""
INSERT INTO my_table VALUES
(1, 'Alice', 'alice@example.com', current_timestamp(), current_timestamp()),
(2, 'Bob', 'bob@example.com', current_timestamp(), current_timestamp()),
(3, 'Charlie', 'charlie@example.com', current_timestamp(), current_timestamp())
""") -
Query the table:
spark.sql("SELECT id, name, email FROM my_table").show()- You can view the tables in your database with:
spark.sql("SHOW TABLES").show()
- You can view the tables in your database with:
- Check the table properties:
spark.sql("SHOW TBLPROPERTIES my_table").show(truncate=False)
- View your table in the Onehouse console:
  - In the Onehouse console, open the Data page and navigate to your Database > Table. Here you will see an overview of the new table.
  - Tables created with the Onehouse catalog will automatically have table services enabled.
  - On the History tab for the table, you can view commits.
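Because the table was created in the Apache Hudi format with a record key and precombine field, you can also modify rows with Spark SQL from the same notebook. A brief sketch, assuming the Cluster's Spark session is already configured with the Hudi SQL extensions (as the CREATE TABLE step above implies):

# Update an existing row; Hudi resolves the change using the 'id' record key
# and the 'updated_at' precombine field defined in the table properties above.
spark.sql("""
UPDATE my_table
SET email = 'alice.new@example.com', updated_at = current_timestamp()
WHERE id = 1
""")

# Re-query to confirm the change.
spark.sql("SELECT id, name, email FROM my_table WHERE id = 1").show()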
Query an external Apache Iceberg table
Follow the steps below or download the sample notebook file.
- In the Onehouse console, open the Clusters page. Click Add New Cluster.
- Create a Cluster of type Notebook and set the catalog to an Apache Iceberg REST Catalog (IRC), such as AWS Glue IRC or Snowflake Open Catalog. This configures the Cluster to work with Apache Iceberg tables.
- Follow these steps to connect to your Jupyter notebook.
- In your notebook, create an Apache Spark session:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.getOrCreate()
print(f"✅ Spark {spark.version} initialized.") - Use the database from your IRC containing the Apache Iceberg table you want to query:
spark.sql("USE my_database")- You can view your the databases in your catalog with:
spark.sql("SHOW DATABASES").show()
- You can view your the databases in your catalog with:
- Query the table:
spark.sql("SELECT * FROM my_iceberg_table").show()- You can view the tables in your database with:
spark.sql("SHOW TABLES").show()
- You can view the tables in your database with:
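Apache Iceberg tables also expose snapshot metadata that you can query from the same session. A small sketch, assuming your IRC catalog supports Iceberg's standard metadata tables and time travel syntax:

# List the table's snapshots via Iceberg's metadata tables.
spark.sql("SELECT snapshot_id, committed_at, operation FROM my_iceberg_table.snapshots").show(truncate=False)

# Time travel to an earlier snapshot (substitute a snapshot_id from the output above).
spark.sql("SELECT * FROM my_iceberg_table VERSION AS OF 1234567890123456789").show()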
Limitations
- By default, Notebook Clusters deploy 10 GB of attached storage volumes to store your notebooks. Contact Onehouse support if you require additional storage.
- You may run into occasional "Spawn failed" errors in the Jupyter UI. If this occurs, you should sign out and sign back in with the credentials from the Cluster page in the Onehouse console.
- The Apache Spark UI is not yet available for Notebook Clusters.