Generate Vector Embeddings
Description
This transformation generates vector embeddings from data ingested by a Onehouse Stream Capture and stores them as Array values in your Onehouse tables.
Parameters
- Embedding Model: The model used to generate the embeddings. Onehouse currently supports 4 embedding models within the transformation.
- API Key: This is the API key you have registered with the model provider. Onehouse will use this key to generate embeddings.
- Column to Encode: The source column for the embeddings.
- Embedding Column Name: A new column of array type that the transformation creates to store the vector embeddings.
Input Requirements
- You must create an API Key with the appropriate model vendor in their console. Onehouse will use this API key to access the embedding model and generate the vector embeddings.
- Column to Encode must be of type "String" and contain valid text for the embedding model.
Expected Output
The transformation creates a new column of type Array, named according to the "Embedding Column Name" input. This column stores the vector embeddings generated by the embedding model.
Examples
Input Data
input_text |
---|
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi sit amet ipsum in dolor pretium finibus. In tincidunt velit purus, nec accumsan massa ornare quis." |
"Vestibulum felis ligula, faucibus rutrum velit non, luctus efficitur nisl. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos." |
"Maecenas elit nibh, cursus ut lectus at, condimentum tempus metus." |
...... |
"Cras et leo neque. Phasellus quis ligula et mauris posuere imperdiet." |
Output Table
input_text | text_embedding_small (embedding column) |
---|---|
"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi sit amet ipsum in dolor pretium finibus. In tincidunt velit purus, nec accumsan massa ornare quis." | [0.022581004, 0.025847694, -0.012096967, 0.019150978, ........., 0.008345376, -0.020192236, 0.02421435] |
"Vestibulum felis ligula, faucibus rutrum velit non, luctus efficitur nisl. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos." | [-0.006998475, -0.02374707, -0.0076769628, 0.039553322, ..............., 0.037542988, 0.039930258, -0.007764915, -0.039930258 ] |
"Maecenas elit nibh, cursus ut lectus at, condimentum tempus metus." | [0.0038274517, -0.006818836, 0.032096304, -0.022562964, .............. , -0.007866635, -0.00023768883, -0.026732443] |
...... | ...... |
"Cras et leo neque. Phasellus quis ligula et mauris posuere imperdiet." | [ -0.018214809, -0.005399093, -0.029936183, 0.00022593468, ............, 0.039935656, 0.021451153, -0.02116071, 0.020216778 ] |
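
The effect of the transformation on a table like the one above can be sketched in a few lines of Python. This is only an illustration of the input and output shape, not Onehouse's implementation: `add_embedding_column` and `toy_embed` are hypothetical names, rows are modeled as plain dictionaries, and `toy_embed` is a stand-in that returns two made-up numbers rather than a real embedding.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def toy_embed(text: str) -> List[float]:
    # Stand-in for a real embedding model (assumption for illustration only).
    # It returns the character count and the space count, NOT a real embedding.
    return [float(len(text)), float(text.count(" "))]


def add_embedding_column(
    rows: List[Row],
    column_to_encode: str,
    embedding_column_name: str,
    embed: Callable[[str], List[float]],
) -> List[Row]:
    """Return copies of the rows with an extra array column of embeddings."""
    output = []
    for row in rows:
        text = row[column_to_encode]
        # Mirrors the input requirement that the source column is a String.
        if not isinstance(text, str):
            raise TypeError(f"{column_to_encode!r} must contain strings")
        new_row = dict(row)  # leave the input row unchanged
        new_row[embedding_column_name] = embed(text)
        output.append(new_row)
    return output


rows = [{"input_text": "Maecenas elit nibh"}]
result = add_embedding_column(rows, "input_text", "text_embedding_small", toy_embed)
```

Each output row keeps its original `input_text` value and gains a `text_embedding_small` list of floats, which matches the layout of the Output Table.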
Error Handling
The transformation handles the following common embedding model errors as described in the table.
Error | Transformation Behavior |
---|---|
Number of tokens too high for model. | Onehouse will automatically truncate the input to fit the number of tokens the model accepts. To choose a suitable model for the input data you will provide, see the token limit documentation for OpenAI and Voyage. |
API rate limits exceeded. | If the model returns a rate limit error, Onehouse will back off and retry the request 3 times before sending you a notification. After sending the notification, the transformation will fail the Stream Capture. To resume processing, unpause the Stream Capture after the API timeout expires. |
Invalid input or failed embeddings. | When the input data is not valid or the embedding model cannot create embeddings, Onehouse will fail the Stream Capture and send you a notification. We recommend having a medallion architecture where upstream tables, with schema validation enabled, can be used to guarantee data quality. |
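
The truncation and retry behaviors in the table can be sketched as follows. This is a rough illustration under stated assumptions, not Onehouse's actual logic: `RateLimitError`, `truncate_tokens`, and `embed_with_retry` are hypothetical names, whitespace splitting stands in for the model's real tokenizer, and the exponential delay schedule is an assumption (the table only says that Onehouse backs off and retries 3 times).

```python
import time
from typing import Callable, List


class RateLimitError(Exception):
    """Hypothetical error raised when the model provider rate-limits a request."""


def truncate_tokens(text: str, max_tokens: int) -> str:
    # Whitespace splitting is a stand-in for the model's tokenizer (assumption).
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


def embed_with_retry(
    text: str,
    embed: Callable[[str], List[float]],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Call embed, retrying up to max_retries times on rate limit errors."""
    attempt = 0
    while True:
        try:
            return embed(text)
        except RateLimitError:
            if attempt == max_retries:
                # Retries exhausted: the caller would notify and fail the stream.
                raise
            # Assumed exponential backoff: 1s, 2s, 4s with the defaults.
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

With the defaults, a request that keeps hitting the rate limit is attempted four times in total (the initial call plus 3 retries) before the error propagates. Passing a custom `sleep` function makes the backoff easy to exercise without actually waiting.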