Simplifying GenAI Data Pipelines with Qlik™ Talend Cloud

Generative Artificial Intelligence (GenAI) and related applications have exploded onto the tech scene over the last couple of years. While the technology shows great promise, building data pipelines that leverage customers' structured and unstructured data is a challenging, high-effort integration activity.

Qlik™ Talend Cloud (QTC) AI-ready data capabilities enable customers to simplify and accelerate the work needed to get their data flowing to Retrieval Augmented Generation (RAG) based GenAI applications built on Large Language Models (LLMs).

In this article, we introduce you to this exciting new capability that simplifies the use of your data with GenAI applications.

Background – GenAI, LLM, RAG, Vector stores
Before diving into how QTC's AI-ready data capabilities use automation to make enterprise data seamlessly available to RAG-based GenAI applications, let's outline the technologies involved and the complexities of building GenAI applications from scratch.

RAG is a method of implementing GenAI applications that grounds the LLM in the data context it must use when answering a query. It is used in conjunction with LLMs both to avoid the need to train an LLM on customer-specific data and to limit the scope of the data the LLM will use to answer questions posed to it. While LLM-based chat interfaces, such as ChatGPT, are the most readily recognizable element of a GenAI application, several precursor technologies and processes need to be selected and integrated, typically with complex code-based methods.

Anatomy of a RAG-based solution
A typical RAG-based GenAI solution contains the following components and process flow.

For the RAG application or chatbot to service a user query against enterprise data, that data needs to be loaded into a vector store with appropriate LLM embeddings. An LLM embedding is a vector representation of a piece of text (such as a word, sentence, or document) generated by a model like GPT, BERT, or other advanced models. It is a high-dimensional numerical vector that captures the semantic meaning of the text, so that semantically similar pieces of text sit closer together in the vector space. This allows models to compare pieces of text effectively and perform tasks such as similarity search, classification, or language generation more efficiently.
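To make the idea concrete, here is a minimal sketch of generating embeddings and comparing them. The OpenAI Python client and the text-embedding-3-small model are assumptions made purely for illustration; any embedding model offered by your chosen platform could be substituted.

```python
# Minimal sketch: generate embeddings and compare their semantic similarity.
# Assumes the OpenAI Python client and OPENAI_API_KEY in the environment.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    """Return the embedding vector for a piece of text."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Semantically similar texts yield vectors with higher cosine similarity."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

invoice = embed("Invoice 4711 was paid on March 3rd.")
payment = embed("Payment received for invoice 4711.")
weather = embed("It rained heavily in Lisbon yesterday.")

print(cosine_similarity(invoice, payment))  # high: related meaning
print(cosine_similarity(invoice, weather))  # low: unrelated meaning
```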

At query time, an embedding is generated from the user's question and used to retrieve the most similar content from the vector store. The retrieved content is then passed to the LLM along with the text of the user query, serving as the context from which the LLM generates the response back to the user.
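The query-time flow just described can be sketched roughly as follows. This is an illustrative, hand-coded example, not how QTC implements it: the in-memory "vector store", the sample chunks, and the OpenAI client and model names are all assumptions made for the sketch.

```python
# Illustrative query-time RAG flow: embed the query, retrieve the closest
# chunks from an in-memory stand-in for a vector database, and pass them to
# the LLM as grounding context. Assumes the OpenAI Python client.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these chunks were produced from enterprise data during ingestion.
chunks = [
    "Order 1001 shipped to Berlin on 2024-05-02.",
    "Order 1002 is on hold pending credit review.",
    "Our return policy allows refunds within 30 days.",
]
store = [(chunk, embed(chunk)) for chunk in chunks]  # stand-in for a vector DB

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query embedding."""
    query_vector = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

question = "What is the status of order 1002?"
context = "\n".join(retrieve(question))

# The retrieved chunks ground the LLM's answer to the user's question.
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(completion.choices[0].message.content)
```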

RAG-based solution technology components
For this process to work, several technology decisions need to be made and integrations built in advance.

  1. The data source systems that currently house the enterprise data needed to answer questions. There are typically multiple databases and applications whose data need to be integrated to produce coherent answers, including unstructured text data in documents and knowledge bases.
  2. The platform on which all of this data will be integrated; cloud-based platforms such as Snowflake and Databricks are very popular choices.
  3. The vector database in which to store the enterprise data embeddings. Cloud platforms (Snowflake Cortex, Databricks Mosaic) typically provide their own vector DB, and point solutions such as Elasticsearch, Pinecone, and OpenSearch are also popular choices.
  4. The LLM to use for generating the enterprise data embeddings and for completions and chat. There are ample choices here as well, through hyperscaler AI platforms (Azure OpenAI, Amazon Bedrock), cloud data platforms (Snowflake Cortex, Databricks Mosaic), and independent providers (OpenAI, Anthropic).

All of this together paints the following picture of the required integration.

An implementation of this solution requires a large scripting/coding effort and specialized knowledge. As we'll see next, Qlik™ Talend Cloud automates most of the integration, requiring only configuration and selection of the technologies to be used.

Qlik™ Talend Cloud – AI-ready pipelines
Qlik™ Talend Cloud (QTC) is purpose-built to simplify and accelerate the implementation of RAG-based GenAI data integration pipelines using a low/no-code approach. Next, we cover each of the features in detail and how they leverage automation to enable this capability.

Data source connectivity
QTC offers no-code connectivity to hundreds of data sources, including enterprise systems, mainframes, SAP, databases, and SaaS applications. It offers efficient, zero-footprint, minimal-impact movement from source to target using near real-time, log-based Change Data Capture (CDC) or incremental API calls, so data and changes are sent only once, without reloading the same data over and over. The intuitive interface makes this connectivity and movement process easy to implement, as shown below.

Data preparation/transformation
Once the data is in the target cloud platform, the next step is to prepare it for vectorization. This entails creating derived data sets, with the appropriate field and record joins and filters, that feed the relevant data to the LLM. QTC offers a multi-modal transformation design experience, ranging from no-code Transformation Flows to pro-code, GenAI-assisted query crafting.
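As an illustration of the kind of derived data set such a transformation might produce, here is a small hand-written example using hypothetical customer and ticket tables; in QTC this would typically be designed as a no-code Transformation Flow or a SQL query rather than code.

```python
# Hypothetical derived data set: join two source tables and keep only the
# fields and records relevant to the GenAI use case. Table and column names
# are made up for illustration.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2],
    "name": ["Acme Corp", "Globex"],
    "segment": ["Enterprise", "SMB"],
})
tickets = pd.DataFrame({
    "ticket_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "status": ["open", "closed", "open"],
    "summary": ["Login failures after upgrade",
                "Invoice discrepancy resolved",
                "Feature request: SSO support"],
})

# Join records and keep only open tickets with the fields the LLM will need.
derived = (tickets.merge(customers, on="customer_id")
                  .query("status == 'open'")
                  [["name", "segment", "summary"]])
print(derived)
```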

Data modeling
Once the necessary data sets have been generated, we define relationship metadata between them. This allows the subsequent AI-ready data step to recognize the potential building blocks of the documents to be prepared and stored in the vector DB.

AI-ready Data and Vector DB/LLM integration
NOTE: This feature is currently in private preview in preparation for General Availability (GA) in the first quarter of 2025.

The data to be vectorized needs to go through a process of parsing, chunking, embedding, and indexing. Structured data (from tables and columns) needs to be converted to a document format prior to these steps. QTC shines in this area with an intuitive interface for determining the elements to include in the document.
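For context, a hand-rolled version of the structured-data-to-document and chunking steps that QTC automates might look roughly like this; the record layout, chunk size, and overlap are illustrative assumptions.

```python
# Hand-rolled sketch: render structured rows as text documents, then split
# them into overlapping chunks ready for embedding and indexing.
rows = [
    {"order_id": 1001, "customer": "Acme Corp",
     "status": "shipped", "notes": "Delivered to Berlin on 2024-05-02."},
    {"order_id": 1002, "customer": "Globex",
     "status": "on hold", "notes": "Pending credit review."},
]

def row_to_document(row: dict) -> str:
    """Render one structured record as plain text the LLM can consume."""
    return "\n".join(f"{key}: {value}" for key, value in row.items())

def chunk(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split a document into overlapping character chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text), 1), step)]

documents = [row_to_document(row) for row in rows]
chunks = [piece for doc in documents for piece in chunk(doc)]
# Each chunk would then be embedded and written to the chosen vector store.
print(chunks)
```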

1. From a transformation step, we select the option to create AI-Ready data

2. Then specify where to store the vectors

We can store vectors in one of the following:

     a. An external vector database

     b. The data project platform. This depends on the platform of the project this task is part of: either Snowflake Cortex or Databricks Mosaic.
     c. A Qlik Answers™ knowledge base. For information on this option, check this page.

3. Specify the LLM connection. This connection and the specified models will be used both to create the embeddings for storing the document data in the vector DB and to power the completions of the chat interface available to the implementer for testing the LLM. The options here depend on the prior choice of vector DB.

     a. Using an external LLM

     b. Using the data project platform's LLM. Refer to the following for more information on Databricks Mosaic or Snowflake Cortex.
     c. In either case, valid embedding and completion models need to be specified.

4. Create AI-ready documents. In this step, we leverage the data sets and relationships defined in the transformation task to create the documents to be vectorized. We start with a parent data set, on the far right of the model diagram, and select the child elements to become part of the document (a hand-coded sketch of this assembly follows these steps).

5. We're done! The next step is to prepare and run the task and test the data and LLM with the chat-with-your-data function.

Note: This interface is intended for the AI-ready data implementer to test the integration of the data and processing components (LLM, vector DB, etc.). It is not intended to be an end-user chat interface.
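As referenced in step 4, here is a rough, hand-coded sketch of how a parent record and its related child records could be assembled into a single document for vectorization. The data set and field names are hypothetical; in QTC this assembly is driven by the relationships defined during the data modeling step.

```python
# Illustrative document assembly: a parent record ("customer") is combined
# with its related child records ("tickets") into one document to vectorize.
customer = {"name": "Acme Corp", "segment": "Enterprise"}
tickets = [
    {"summary": "Login failures after upgrade", "status": "open"},
    {"summary": "Feature request: SSO support", "status": "open"},
]

document = (
    f"Customer: {customer['name']} ({customer['segment']})\n"
    + "\n".join(f"- Ticket [{t['status']}]: {t['summary']}" for t in tickets)
)
print(document)
```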

The completed pipeline would look like the image below.

Conclusion – Accelerating your GenAI journey
GenAI offers new and exciting capabilities for interacting with data, but building the workflow that combines all the data sources, processing, and technologies typically entails a large effort. QTC accelerates enterprise GenAI implementations and allows for faster time to value at lower effort and cost than would otherwise be possible.

Whether it is the automatic ingestion of data from structured or unstructured sources, transformation into the required data sets, the creation of vector records with appropriate LLM embeddings, or the testing of chat answers, QTC lowers the barrier to entry and adoption for delivering RAG-based GenAI solutions on your data.

AI-ready tasks are currently in private preview in Qlik™ Talend Cloud.

For information about Qlik™, click here: qlik.com.
For specific and specialized solutions from QQinfo, click here: QQsolutions.
To keep up with the latest news in the field, explanations of unique solutions, and our personal perspectives on the world of management, data, and analytics, click here: QQblog!