AWS S3

Description

Continuously and incrementally stream data directly from any S3 bucket into your Onehouse-managed lakehouse. Each file is processed exactly once. If an existing object is overwritten at the same key, Onehouse may or may not process the updated content, depending on when the object is consumed. For correctness and data completeness, write a new file rather than modifying the content of an existing object.
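
To follow that recommendation, producers can write each batch to a new, uniquely named key instead of overwriting an existing one. The sketch below uses boto3 under assumed names; the bucket, prefix, and key-naming scheme are placeholders, not anything Onehouse requires.

```python
import json
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def write_new_object(bucket: str, prefix: str, records: list[dict]) -> str:
    """Write records to a new, uniquely named object instead of
    overwriting an existing key, so the file is ingested exactly once."""
    # Hypothetical naming scheme: UTC timestamp plus a UUID keeps keys unique.
    key = (
        f"{prefix}/"
        f"{datetime.now(timezone.utc).strftime('%Y/%m/%d/%H%M%S')}-"
        f"{uuid.uuid4().hex}.jsonl"
    )
    body = "\n".join(json.dumps(r) for r in records)  # JSONL payload
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return key

# Example usage (bucket and prefix are placeholders):
# write_new_object("my-ingest-bucket", "events", [{"id": 1, "value": "a"}])
```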

Follow the setup guide within the Onehouse console to get started. Click Sources > Add New Source > S3.

Prerequisites

  1. Ensure that you have granted permissions on the bucket in the Terraform or CloudFormation configuration used when you connected your cloud account.
  2. Your S3 bucket should not have any event notification rules configured on it (see the check sketched below).
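
If you want to verify the second prerequisite, a minimal check with boto3 might look like the following; the bucket name is a placeholder.

```python
import boto3

def has_notification_rules(bucket: str) -> bool:
    """Return True if the bucket already has any event notification
    configuration, which would conflict with this prerequisite."""
    s3 = boto3.client("s3")
    config = s3.get_bucket_notification_configuration(Bucket=bucket)
    rule_keys = (
        "TopicConfigurations",
        "QueueConfigurations",
        "LambdaFunctionConfigurations",
        "EventBridgeConfiguration",
    )
    return any(config.get(k) for k in rule_keys)

# Example usage (bucket name is a placeholder):
# print(has_notification_rules("my-ingest-bucket"))
```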

Schema Information

By default, Onehouse will infer the source schema by reading a sample of the files to be ingested. Alternatively, you can provide a schema for the incoming records via the configured schema registry.
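
As an illustration, a schema for incoming records might be expressed in Avro (one of the supported formats) and registered in your schema registry. The record and field names below are placeholders, not anything Onehouse requires.

```python
import json

# Illustrative Avro schema for incoming records; names are placeholders.
order_schema = {
    "type": "record",
    "name": "Order",
    "namespace": "com.example.ingest",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "created_at",
         "type": {"type": "long", "logicalType": "timestamp-millis"}},
        {"name": "note", "type": ["null", "string"], "default": None},
    ],
}

# JSON representation of the schema, as it might be registered.
print(json.dumps(order_schema, indent=2))
```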

Supported File Formats

Onehouse supports the following file formats for ingestion from S3:

✅ Parquet

✅ Avro

✅ JSON

✅ JSONL

✅ CSV

✅ ORC

✅ XML

Details on JSON files

The Onehouse S3 source supports single-object JSON files, JSON files with a key whose value is a list, JSONL files, and NDJSON files. JSON files containing lists of objects require the Onehouse Explode Array Transformer (see the examples below).
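
For illustration, the sketch below shows the JSON layouts described above; the payloads and field names are placeholders.

```python
import json

# Single-object JSON file: one JSON object per file.
single_object = '{"id": 1, "status": "active"}'

# JSON file with a key whose value is a list of objects; ingesting the
# nested objects as rows requires the Explode Array Transformer.
key_with_list = '{"records": [{"id": 1}, {"id": 2}]}'

# JSONL / NDJSON file: one JSON object per line.
jsonl = '{"id": 1}\n{"id": 2}\n{"id": 3}'

# Each line of a JSONL/NDJSON file parses independently.
rows = [json.loads(line) for line in jsonl.splitlines()]
print(rows)  # [{'id': 1}, {'id': 2}, {'id': 3}]
```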