Syncs and Delays

Sync Process

Flows process data incrementally through recurring syncs, triggered according to the specified minimum sync frequency. Each sync processes all the data available at the time of the trigger.

Each sync performs the following actions (in order):

  1. The Flow wakes up and checks whether new data is available to process.
  2. The Flow starts processing all new data that was available at the time of the trigger.
  3. If a Source Data Schema is specified, that schema is applied to all incoming records; records that do not match the schema are quarantined or fail the Flow.
    1. Note: For S3 and GCS sources, incoming files that do not match the provided Source Data Schema will cause the Flow to fail (not quarantine data). You may alternatively set the Flow to infer the source data schema and quarantine invalid records by applying a data quality validation for the target schema.
  4. Applies all transformations to the incoming records.
  5. Applies all data quality validations to the transformed records. Records that do not match the validations are quarantined or fail the Flow.
  6. Writes transformed records to the destination table.
  7. Syncs to catalog(s) and performs XTable metadata translation.
  8. Performs any table services scheduled for the sync.
  9. When complete, the Flow waits until the next trigger interval.
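The record-handling part of the steps above (schema check, transformation, data quality validation, write) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Onehouse's implementation: `run_sync`, `matches_schema`, `FlowFailed`, and the `on_invalid` parameter are hypothetical names, records are modeled as plain dictionaries, and catalog syncs and table services are left out.

```python
class FlowFailed(Exception):
    """Hypothetical error raised when an invalid record should fail the Flow."""


def matches_schema(record, schema):
    # A record matches when it has exactly the schema's fields with the expected types.
    return set(record) == set(schema) and all(
        isinstance(record[name], expected) for name, expected in schema.items()
    )


def run_sync(records, schema, transform, validations, on_invalid="quarantine"):
    """Process one batch: schema check, transform, validate, then 'write'."""
    written, quarantined = [], []

    def reject(record, reason):
        # Invalid records are either quarantined or fail the whole sync.
        if on_invalid == "fail":
            raise FlowFailed(reason)
        quarantined.append((record, reason))

    for record in records:
        if schema is not None and not matches_schema(record, schema):
            reject(record, "schema mismatch")
            continue
        transformed = transform(record)
        if not all(check(transformed) for check in validations):
            reject(transformed, "validation failed")
            continue
        written.append(transformed)  # stands in for the write to the destination table
    return written, quarantined


schema = {"id": int, "amount": float}
records = [
    {"id": 1, "amount": 9.5},
    {"id": "x", "amount": 1.0},   # wrong type for "id"
    {"id": 2, "amount": -4.0},    # fails the validation below
]
written, quarantined = run_sync(
    records,
    schema,
    transform=lambda r: {**r, "amount_cents": round(r["amount"] * 100)},
    validations=[lambda r: r["amount"] >= 0],
)
print(written)
print([reason for _, reason in quarantined])
```

Passing `on_invalid="fail"` in this sketch raises `FlowFailed` on the first invalid record instead of quarantining it, mirroring the "quarantined or fail the Flow" choice described in steps 3 and 5.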

Example: Basic Sync

  1. The Flow fires trigger T1 at min0 and starts sync1, ingesting all data available in Kafka at min0.
  2. The Flow may make any number of commits to the table during the sync. Once the sync completes, the Flow waits until T2.
  3. T2 fires at min5 and starts sync2 for all data that arrived in Kafka between min0 and min5.

Example: Sync with Insufficient Resources

Sync1 - delayed

  1. T1 at min0 starts sync1 ingestion for all data it found available in Kafka at min0.
  2. Sync1 does not finish before T2 is due, so T2 fires as soon as sync1 is done. In this example, sync1 completes 1 min late, at min6.

Sync2 - on time

  1. T2 at min6 starts sync2 ingestion for all data found available in Kafka at min6.
  2. If sync2 finishes before T3, the Flow waits until the sync frequency time has elapsed before beginning sync3. In this case, the stream capture has caught up and returned to a stable state.

Sync2 - Alternate scenario - delay

  1. If sync2 did not finish before T3, then T3 would trigger sync3 immediately after sync2 completes. This cycle would continue until the stream capture catches up.
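The trigger timing in these examples can be sketched as a small simulation. It is an illustration rather than Onehouse's scheduler: `simulate_triggers` and `SyncRun` are hypothetical names, and the sketch assumes the next trigger fires at the later of (previous trigger time + minimum sync frequency) and (previous sync completion time). The source does not state which reference point the interval is measured from, so treat that as an assumption.

```python
from dataclasses import dataclass


@dataclass
class SyncRun:
    index: int
    start: float          # minute at which the trigger fired
    end: float            # minute at which the sync completed
    trigger_late: bool    # True if this trigger had to wait for the previous sync


def simulate_triggers(interval_min, sync_durations_min):
    """Return the start and end time of each sync for the given durations."""
    runs = []
    start = 0.0
    trigger_late = False
    for index, duration in enumerate(sync_durations_min, start=1):
        end = start + duration
        runs.append(SyncRun(index, start, end, trigger_late))
        earliest_next = start + interval_min
        # If the sync overran the interval, the next trigger fires as soon as it ends.
        trigger_late = end > earliest_next
        start = max(earliest_next, end)
    return runs


# 5-minute minimum sync frequency; sync1 takes 6 minutes, as in the example above.
runs = simulate_triggers(5, [6, 3, 4])
for run in runs:
    print(run)
```

With these inputs, sync1 runs from min0 to min6, sync2 starts immediately at min6 and finishes early, and sync3 waits for the full interval before starting, which corresponds to the "on time" scenario.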

Duration Definitions

  • Sync Duration: Time a Flow sync takes, measured from the start of the trigger to completion (including transformations, catalog syncs, etc.).
  • Commit Duration: Time taken to write a single commit. When writing data, Onehouse breaks the data into commit batches, so a single sync may write data in many commits.

How Syncs Optimize Resource Usage

  • Compute resources for a Flow are only allocated and used during the sync. During the period between syncs, the Flow does not use any resources. Note that each Cluster will keep one OCU active to check for new data.
  • If the source has no new data to process at the start of the sync, the sync is skipped to preserve resources.

Delays

A Flow is considered delayed when a sync takes longer than the delay threshold.

The delay threshold is disabled by default for new Flows and can be set in the Flow's advanced configurations. The delay threshold is calculated as flow.delayThreshold.numSyncIntervals * Min Sync Frequency.
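As a worked example of the formula, the following Python sketch computes the threshold for illustrative values. The function name `delay_threshold` and the values 3 and 5 minutes are assumptions chosen for the example, not defaults from Onehouse.

```python
from datetime import timedelta


def delay_threshold(num_sync_intervals: int, min_sync_frequency: timedelta) -> timedelta:
    # flow.delayThreshold.numSyncIntervals * Min Sync Frequency
    return num_sync_intervals * min_sync_frequency


threshold = delay_threshold(3, timedelta(minutes=5))
print(threshold)  # 0:15:00
```

In this example, a sync would need to run longer than 15 minutes before the Flow is considered delayed.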

If the active sync surpasses the delay threshold, the Flow will move to the Delayed status. The Delayed status is purely informational and will not change the functionality of the Flow. The Flow will move back to the Running status after another sync completes within the delay threshold.
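The status behavior described above can be modeled as a two-state tracker. This is a minimal sketch, not Onehouse's implementation: `FlowStatusTracker` and its method names are hypothetical, and the 15-minute threshold is an illustrative value.

```python
from datetime import timedelta


class FlowStatusTracker:
    """Tracks the Running/Delayed status of a single Flow (illustrative only)."""

    def __init__(self, threshold: timedelta):
        self.threshold = threshold
        self.status = "Running"

    def check_active_sync(self, elapsed: timedelta) -> str:
        # An in-progress sync that surpasses the threshold marks the Flow as Delayed.
        if elapsed > self.threshold:
            self.status = "Delayed"
        return self.status

    def sync_completed(self, duration: timedelta) -> str:
        # Only a sync that completes within the threshold returns the Flow to Running.
        self.status = "Running" if duration <= self.threshold else "Delayed"
        return self.status


tracker = FlowStatusTracker(timedelta(minutes=15))
history = [
    tracker.check_active_sync(timedelta(minutes=10)),  # still within the threshold
    tracker.check_active_sync(timedelta(minutes=16)),  # threshold surpassed
    tracker.sync_completed(timedelta(minutes=20)),     # slow sync finishes
    tracker.sync_completed(timedelta(minutes=4)),      # next sync is within the threshold
]
print(history)
```

Note that the status stays Delayed when the slow sync finishes and changes back only after a later sync completes within the threshold.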

When a Flow is delayed, you will receive a notification (capped at 3 notifications per Flow per day). You will also receive a notification when the Flow returns to the Running status.

Suggestions for handling delayed Flows:

  1. Review the minimum sync frequency. A lower minimum sync frequency (i.e., more frequent syncs) will generally require more compute resources.
  2. Review the OCU Limit for the Cluster. If your Cluster is hitting the OCU limit, the Flow may not have enough resources to process incoming data within the expected interval.

Onehouse also provides charts and metrics to help you understand Flow delays. The Flow details page exposes charts for the Sync Duration and Data Pending to Write (i.e., data in the source that has not yet been processed). For additional granularity and customizability, Advanced Monitoring provides lag monitoring charts and metrics.