Onehouse Kafka Development Recommendations

When designing messages for Apache Kafka, consider these best practices to ensure efficient, reliable, and scalable data pipelines:

Architectural Decisions

Data Structure within Kafka Messages: Single Object vs. List of Objects

TL;DR: For Onehouse users with low-latency requirements, we recommend a single object per message because it gives simpler processing, better scalability, and easier fault tolerance. Code examples can be found at the end of this document.

The optimal data structure for your Kafka messages largely depends on your specific use case and the nature of your data. However, here's a general guideline to help you make an informed decision:

Single Object per Message:

  • Use Case: When you want to treat each data point as an independent unit.
  • Benefits:
    • Fastest processing times: downstream ETL jobs do not need to unroll arrays or nested objects to generate multiple rows.
    • Simpler processing: Each message can be processed individually without considering the context of other messages.
    • Better scalability: Consumers can process messages independently, allowing for horizontal scaling.
    • Easier to implement fault tolerance: Failed messages can be retried or dead-lettered without affecting other messages.
  • Example: A stream of sensor readings, where each reading is a separate event.

List of Objects per Message:

  • Use Case: When you want to group related data points together and process them as a batch.
  • Benefits:
    • Reduced network overhead: Fewer network round trips are required when related records are sent together in one message.
    • Better data integrity: Related data points are guaranteed to be processed together.
  • Example: A stream of orders, where each order consists of multiple items. Model the input data as a single Order object with {"items": [...]} rather than a bare top-level array [...]; see the sketch after this list.
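
To make the two shapes concrete, here is a minimal producer sketch using the confluent-kafka Python client. The broker address, topic names, and record fields are illustrative assumptions, and JSON serialization is used here only to keep the shape contrast visible (the Avro recommendation below still applies).

import json
from confluent_kafka import Producer

# Placeholder broker address (assumption).
producer = Producer({"bootstrap.servers": "localhost:9092"})

# Option 1 (recommended): one object per message.
# Each reading is an independent record that can be retried, dead-lettered,
# and scaled out across consumers on its own.
readings = [
    {"sensor_id": "s-1", "temperature": 21.4, "ts": 1730136240},
    {"sensor_id": "s-2", "temperature": 19.8, "ts": 1730136241},
]
for reading in readings:
    producer.produce("sensor-readings", value=json.dumps(reading).encode("utf-8"))

# Option 2: related objects grouped under a single parent object.
# Note the payload is an Order object with an "items" array, not a bare top-level array.
order = {
    "orderId": 77777,
    "items": [
        {"sku": "abc", "qty": 2},
        {"sku": "def", "qty": 1},
    ],
}
producer.produce("orders", value=json.dumps(order).encode("utf-8"))

producer.flush()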

Key Considerations:

  • Message Size: Larger messages can impact performance, especially if you have many small messages. Consider batching smaller messages or compressing larger ones.
  • Processing Complexity: If your processing logic is complex and requires multiple data points, batching might be more efficient.
  • Fault Tolerance: If you need strong fault tolerance, consider sending each data point as a separate message to avoid losing entire batches.
  • Schema Evolution: If your data schema is evolving, ensure that your message format can accommodate changes without breaking existing consumers.

Should I use Kafka Schema Registry?

TL;DR: Yes, we recommend it for Onehouse users. If you are using AWS MSK, use AWS Glue Schema Registry; if you are using Confluent Kafka, use Confluent Schema Registry.

Kafka Schema Registry is an essential component of modern data pipelines: it manages and validates schemas (supporting schema evolution), ensures data quality (early error detection, consistent data), and improves performance (efficient serialization and deserialization).
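
As a minimal sketch of registry-backed serialization, assuming Confluent Schema Registry and the confluent-kafka Python package (the registry URL, topic name, and schema below are placeholder assumptions):

from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Placeholder registry URL and schema (assumptions for illustration).
schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})

schema_str = """
{
  "type": "record",
  "name": "SensorReading",
  "fields": [
    {"name": "sensor_id", "type": "string"},
    {"name": "temperature", "type": "double"},
    {"name": "ts", "type": "long"}
  ]
}
"""

# The serializer registers the schema (if new) and embeds its registry ID in every
# encoded message, so consumers can always fetch the exact writer schema.
avro_serializer = AvroSerializer(schema_registry, schema_str)

reading = {"sensor_id": "s-1", "temperature": 21.4, "ts": 1730136240}
encoded = avro_serializer(reading, SerializationContext("sensor-readings", MessageField.VALUE))

With AWS MSK, the equivalent wiring would use the AWS Glue Schema Registry serializers for Kafka instead.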

What Kafka message format should I use for the best performance?

TL;DR: For Onehouse users, we recommend a single object per message in Avro format for its performance. Code examples can be found at the end of this document.

When it comes to data serialization performance, Protobuf generally takes the lead, followed by Avro; JSON typically lags behind due to its textual nature and less efficient parsing. However, Protobuf is usually harder to work with, so Avro offers the best balance of performance and features.

Key factors influencing performance:

  • Serialization/Deserialization Speed:
    • Protobuf: Excels in speed due to its binary format and efficient parsing.
    • Avro: Offers good performance, often faster than JSON, especially for large datasets.
    • JSON: Slower than both Protobuf and Avro, particularly for large datasets.
  • Schema Evolution:
    • Avro: Highly flexible, allowing for schema evolution without breaking backward compatibility.
    • Protobuf: Supports schema evolution, but requires more careful planning.
    • JSON: Less strict about schema evolution, but can lead to compatibility issues if not handled properly.
  • Data Size:
    • Protobuf: Typically produces the smallest serialized data size.
    • Avro: Offers a good balance between size and performance.
    • JSON: Produces larger data sizes compared to binary formats.
  • Language Support:
    • Protobuf: Strong support across various programming languages.
    • Avro: Good support, but may be less mature in some languages.
    • JSON: Universal support across all programming languages.

When to Choose Which:

  • Protobuf: Ideal for high-performance, low-latency applications where data size and speed are critical.
  • Avro [Recommended]: Suitable for large-scale data systems, data pipelines, and analytics use cases where schema evolution and flexibility are important.
  • JSON: Best for human-readable data formats and simple data structures, but may not be the most efficient choice for high-performance applications. You may need this option if the schema from the data producer is not well known or defined.

What if I want to send a list of objects in a Kafka message? What are my options?

TL;DR: This can be done. You can use the input format supported by Onehouse's explode-array transformation, or we can build a custom transformation.

We can use Onehouse's explode-array option. More details can be found at https://docs.onehouse.ai/docs/product/ingest-data/transformations/explode-array. This requires that the JSON payload is an object whose keys map to values and/or lists; the payload cannot be a bare top-level list.

The payload below is a bare list (it starts with [) and will not work without customization.

[
  {
    "points": [
      {
        "measurement": "device_data",
        "tags": {
          "cnmsName": "LAX",
          "modem_id": "430287",
          "satnetId": "325331",
          "Band": "KU",
          "modem": "SC-BDHZ_TEST-KU2"
        },
        "fields": {
          "modcod": "16APSK 2/3 normal",
          "sent_symbols": 4496349218.0,
          "signalled_cni": 11.46
        },
        "time": 1730136240
      }
    ]
  },
  {
    "points": [
      {
        "measurement": "device_data",
        "tags": {
          "cnmsName": "LAX",
          "modem_id": "430287",
          "satnetId": "325331",
          "Band": "KU",
          "modem": "SC-BDHZ_TEST-KU2"
        },
        "fields": {
          "modcod": "16APSK 23/36 normal",
          "sent_symbols": 4496366614.0,
          "signalled_cni": 11.08
        },
        "time": 1730136250
      }
    ]
  }
]

By contrast, the following is an example of a JSON payload with multiple entries nested under a top-level object (it starts with {), which the explode-array transformation can handle.

// Input
// Mode = "Recursive"
{
  "customerId": "123abc",
  "account": {
    "accountId": 999999,
    "preferences": {
      "emailNotification": true
    }
  },
  "deliveryAddress": null,
  "orders": [
    {
      "orderId": 77777,
      "orderDetails": {
        "isGift": true
      }
    },
    {
      "orderId": 444444,
      "orderDetails": {
        "isGift": false
      }
    }
  ]
}


// Output Row 1
{
  "customerId": "123abc",
  "account": {
    "accountId": 999999,
    "preferences": {
      "emailNotification": true
    }
  },
  "deliveryAddress": null,
  "orders": {
    "orderId": 77777,
    "orderDetails": {
      "isGift": true
    }
  }
}


// Output Row 2
{
  "customerId": "123abc",
  "account": {
    "accountId": 999999,
    "preferences": {
      "emailNotification": true
    }
  },
  "deliveryAddress": null,
  "orders": {
    "orderId": 444444,
    "orderDetails": {
      "isGift": false
    }
  }
}

If you’d like Onehouse to support another JSON data format, customization will be required.

What happens if you have a JSON payload with many nested levels?

TL;DR: Use the Onehouse flatten-struct transformation to persist it to the data lakehouse.

We can use Onehouse's flatten-struct option. More details can be found at https://docs.onehouse.ai/docs/product/ingest-data/transformations/flatten-struct/.
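
To illustrate the idea (this is a conceptual sketch, not Onehouse's actual implementation), here is a minimal Python function that flattens nested structs into dot-separated column names; the sample record and the separator are assumptions.

def flatten_struct(record, parent_key="", sep="."):
    """Flatten nested dicts into a single level with dot-separated keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_struct(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

nested = {"account": {"accountId": 999999, "preferences": {"emailNotification": True}}}
print(flatten_struct(nested))
# {'account.accountId': 999999, 'account.preferences.emailNotification': True}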

Code Examples

Python Avro code examples
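
A minimal producer sketch assuming the fastavro and confluent-kafka packages; the broker address, topic, and schema are illustrative assumptions.

import io
from confluent_kafka import Producer
from fastavro import parse_schema, schemaless_writer

# Illustrative Avro schema for a single sensor reading per message (assumption).
schema = parse_schema({
    "type": "record",
    "name": "SensorReading",
    "fields": [
        {"name": "sensor_id", "type": "string"},
        {"name": "temperature", "type": "double"},
        {"name": "ts", "type": "long"},
    ],
})

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder broker

def encode_avro(record):
    # Schemaless encoding writes only the data; the reader must know the schema.
    # A schema registry (see above) removes that coupling.
    buf = io.BytesIO()
    schemaless_writer(buf, schema, record)
    return buf.getvalue()

reading = {"sensor_id": "s-1", "temperature": 21.4, "ts": 1730136240}
producer.produce("sensor-readings", value=encode_avro(reading))
producer.flush()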

Python JSON code examples
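
A minimal JSON producer sketch with a delivery callback for per-message fault handling; the broker address and topic are placeholder assumptions.

import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder broker

def delivery_report(err, msg):
    # Called once per message; failed messages can be retried or dead-lettered
    # without affecting the rest of the stream.
    if err is not None:
        print(f"Delivery failed: {err}")

# One object per message: each reading is serialized and sent on its own.
readings = [
    {"sensor_id": "s-1", "temperature": 21.4, "ts": 1730136240},
    {"sensor_id": "s-2", "temperature": 19.8, "ts": 1730136241},
]
for reading in readings:
    producer.produce(
        "sensor-readings",
        key=reading["sensor_id"].encode("utf-8"),
        value=json.dumps(reading).encode("utf-8"),
        callback=delivery_report,
    )
producer.flush()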

Java Avro and JSON code examples