
Vector stores

As generative AI applications continue to push the boundaries of what’s possible in tech, vector stores have emerged as a crucial component, streamlining and optimizing the search and retrieval of relevant data. In our previous discussions, we’ve delved into the advantages of vector DBs over traditional databases, unraveling the concepts of vectors, embeddings, vector search strategies, approximate nearest neighbors (ANNs), and similarity measures. In this section, we aim to provide an integrative understanding of these concepts within the realm of vector DBs and libraries.

The image illustrates a workflow for transforming different types of data—Audio, Text, and Videos—into vector embeddings.

  • Audio: An audio input is processed through an “Audio Embedding model,” resulting in “Audio vector embeddings.”
  • Text: Textual data undergoes processing in a “Text Embedding model,” leading to “Text vector embeddings.”
  • Videos: Video content is processed using a “Video Embedding model,” generating “Video vector embeddings.”

Once these embeddings are created, they are subsequently utilized (potentially in an enterprise vector database system) to perform “Similarity Search” operations. This implies that the vector embeddings can be compared to find similarities, making them valuable for tasks such as content recommendations, data retrieval, and more.

Figure 4.9 – Multimodal embeddings process in an AI application
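To make the similarity search step concrete, here is a minimal sketch in Python, assuming the embeddings have already been produced by an embedding model; the four-dimensional vectors and document names below are toy stand-ins for real embeddings, which typically have hundreds or thousands of dimensions:

import numpy as np

# Toy stand-ins for vectors produced by an audio/text/video embedding model
catalog = {
    "doc_a": np.array([0.12, 0.88, 0.45, 0.10]),
    "doc_b": np.array([0.90, 0.05, 0.20, 0.76]),
    "doc_c": np.array([0.15, 0.80, 0.50, 0.05]),
}
query = np.array([0.10, 0.85, 0.40, 0.12])

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means identical direction, 0.0 means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank catalog items by similarity to the query embedding (higher = more similar)
ranked = sorted(
    ((cosine_similarity(query, vec), name) for name, vec in catalog.items()),
    reverse=True,
)
print(ranked)  # doc_a and doc_c should rank above doc_b

A vector DB performs the same kind of comparison at scale, using the indexing and approximate search techniques described in the rest of this section.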

What is a vector database?

A vector database (vector DB) is a specialized database designed to handle high-dimensional vectors, primarily generated from embeddings of complex data types such as text, images, or audio. It provides capabilities to store and index unstructured data, as well as search and retrieval capabilities, as a service.

Modern vector databases are brimming with advancements that empower you to architect resilient enterprise solutions. Here, we list 16 key features to consider when choosing a vector DB. Not every feature may be important for your use case, but the list is a good place to start. Keep in mind that this area is changing fast, so more features may emerge in the future:

  • Indexing: As mentioned earlier, indexing refers to the process of organizing high-dimensional vectors in a way that allows for efficient similarity searches and retrievals. A vector DB offers built-in indexing features designed to arrange high-dimensional vectors for swift and effective similarity-based searches and retrievals. Previously, we discussed indexing algorithms such as FAISS and HNSW. Many vector DBs incorporate such features natively. For instance, Azure AI Search integrates the HNSW indexing service directly (a short HNSW sketch follows this list).
  • Search and retrieval: Instead of relying on exact matches, as traditional databases do, vector DBs provide vector search capabilities as a service, such as approximate nearest neighbors (ANNs), to quickly find vectors that are roughly the closest to a given input. To quantify the closeness or similarity between vectors, they utilize similarity measures such as cosine similarity or Euclidean distance, enabling efficient and nuanced similarity-based searches in large datasets.
  • Create, read, update, and delete: A vector DB manages high-dimensional vectors and offers create, read, update, and delete (CRUD) operations tailored to vectorized data. When vectors are created, they’re indexed for efficient retrieval. Reading often means performing similarity searches to retrieve vectors closest to a given query vector, typically using methods such as ANNs. Vectors can be updated, necessitating potential re-indexing, and they can also be deleted, with the database adjusting its internal structures accordingly to maintain efficiency and consistency.
  • Security: Look for compliance with standards such as GDPR, SOC 2 Type II, and HIPAA, easy management of console access, and SSO support. Data should be encrypted at rest and in transit, with granular identity and access management features.
  • Serverless: A high-quality vector database is designed to gracefully autoscale with low management overhead as data volumes soar into millions or billions of entries, distributing seamlessly across several nodes. Optimal vector databases grant users the flexibility to adjust the system in response to shifts in data insertion, query frequencies, and underlying hardware configurations.
  • Hybrid search: Hybrid search combines traditional keyword-based search methods with other search mechanisms, such as semantic or contextual search, to retrieve results from both the exact term matches and by understanding the underlying intent or context of the query, ensuring a more comprehensive and relevant set of results.
  • Semantic re-ranking: This is a secondary ranking step to improve the relevance of search results. It re-ranks the search results that were initially scored by state-of-the-art ranking algorithms such as BM25 and RRF based on language understanding. For instance, Azure AI Search employs secondary ranking that uses multilingual, deep learning models derived from Microsoft Bing to elevate the results that are most relevant in terms of meaning.
  • Auto vectorization/embedding: Auto-embedding in a vector database refers to the automatic process of converting data items into vector representations for efficient similarity searches and retrieval, with access to multiple embedding models.
  • Data replication: This ensures data availability, redundancy, and recovery in case of failures, safeguarding business continuity and reducing data loss risks.
  • Concurrent user access and data isolation: Vector databases support a large number of users concurrently and ensure robust data isolation to ensure updates remain private unless deliberately shared.
  • Auto-chunking: Auto-chunking is the automated process of dividing a larger set of data or content into smaller, manageable pieces or chunks for easier processing or understanding. This process helps preserve the semantic relevance of texts and addresses the token limitations of embedding models. We will learn more about chunking strategies in the upcoming sections in this chapter.
  • Extensive interaction tools: Prominent vector databases, such as Pinecone, offer versatile APIs and SDKs across languages, ensuring adaptability in integration and management.
  • Easy integration: Vector DBs provide seamless integration with LLM orchestration frameworks and SDKs, such as Langchain and Semantic Kernel, and leading cloud providers, such as Azure, GCP, and AWS.
  • User-friendly interface: This ensures an intuitive platform with simple navigation and direct feature access, streamlining the user experience.
  • Flexible pricing models: Offers pricing that can be tailored to user needs to keep costs low.
  • Low downtime and high resiliency: Resiliency in a vector database (or any database) refers to its ability to recover quickly from failures, maintain data integrity, and ensure continuous availability even in the face of adverse conditions, such as hardware malfunctions, software bugs, or other unexpected disruptions.
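To illustrate the indexing and search-and-retrieval features above, the following is a minimal sketch using the open source hnswlib package (our choice for illustration; managed vector DBs such as Azure AI Search expose equivalent HNSW indexing as a service). The dimensions, element counts, and parameters are illustrative assumptions:

import hnswlib
import numpy as np

dim, num_elements = 128, 10_000
vectors = np.float32(np.random.random((num_elements, dim)))  # stand-ins for embeddings
ids = np.arange(num_elements)

# Build an HNSW index that uses cosine similarity as its distance measure
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(vectors, ids)
index.set_ef(50)  # query-time knob trading accuracy for speed

# Approximate nearest-neighbor retrieval for a query vector
query = np.float32(np.random.random((1, dim)))
labels, distances = index.knn_query(query, k=5)
print(labels, distances)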

As of early 2024, a few prominent open source vector databases include Chroma, Milvus, Qdrant, and Weaviate, while Pinecone and Azure AI Search are among the leading proprietary solutions.

Vector DB limitations

  • Accuracy vs. speed trade-off: When dealing with high-dimensional data, vector DBs often face a trade-off between speed and accuracy for similarity searches. The core challenge stems from the computational expense of searching for the exact nearest neighbors in large datasets. To enhance search speed, techniques such as ANNs are employed, which quickly identify “close enough” vectors rather than the exact matches. While ANN methods can dramatically boost query speeds, they may sometimes sacrifice pinpoint accuracy, potentially missing the true nearest vectors. Certain vector index methods, such as product quantization, enhance storage efficiency and accelerate queries by condensing and consolidating data at the expense of accuracy (the FAISS sketch after this list illustrates this trade-off).
  • Quality of embedding: The effectiveness of a vector database is dependent on the quality of the vector embedding used. Poorly designed embeddings can lead to inaccurate search results or missed connections.
  • Complexity: Implementing and managing vector databases can be complex, requiring specialized knowledge of vector search strategies, indexing, and chunking strategies to optimize for specific use cases.
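The accuracy-versus-speed trade-off can be seen directly by comparing an exact index with an approximate, product-quantized one in FAISS. The following is a small sketch under illustrative assumptions (random vectors, arbitrary nlist/m/nprobe values); real workloads would tune these parameters:

import faiss
import numpy as np

d = 64
xb = np.random.random((100_000, d)).astype("float32")  # database vectors
xq = np.random.random((10, d)).astype("float32")       # query vectors

# Exact, exhaustive search: accurate but slower at scale
exact = faiss.IndexFlatL2(d)
exact.add(xb)
D_exact, I_exact = exact.search(xq, 5)

# Approximate search with inverted lists + product quantization: faster, lossy
quantizer = faiss.IndexFlatL2(d)
approx = faiss.IndexIVFPQ(quantizer, d, 256, 8, 8)
approx.train(xb)    # IVF/PQ indexes need a training pass
approx.add(xb)
approx.nprobe = 16  # more probes = slower but more accurate
D_approx, I_approx = approx.search(xq, 5)

# Recall@5: fraction of the exact neighbors the approximate index also returned
recall = np.mean([len(set(a) & set(e)) / 5 for a, e in zip(I_approx, I_exact)])
print(f"recall@5 = {recall:.2f}")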

Vector libraries

Vector databases may not always be necessary. Small-scale applications may not require all the advanced features that vector DBs provide. In those instances, vector libraries become very valuable. Vector libraries are usually sufficient for small, static datasets and provide the ability to store data in memory, index it, and run similarity search strategies against it. However, they may not provide features such as CRUD support, data replication, or the ability to store data on disk, and hence the user has to wait for a full import before they can query. Facebook’s FAISS is a popular example of a vector library.

As a rule of thumb, if you are dealing with millions or billions of records, data that changes frequently, millisecond response-time requirements, and the need for long-term storage on disk, it is recommended to use vector DBs over vector libraries.

Vector DBs vs. traditional databases – Understanding the key differences

As stated earlier, vector databases have become pivotal, especially in the era of generative AI, because they facilitate efficient storage, querying, and retrieval of high-dimensional vectors, which are nothing but numerical representations of words or sentences, often produced by deep learning models. Traditional scalar databases are designed to handle discrete and simple data types, making them ill-suited for the complexities of large-scale vector data. In contrast, vector databases are optimized for similarity searches in the vector space, enabling the rapid identification of vectors that are “close” or “similar” in high-dimensional spaces. Unlike conventional data models such as relational databases, where queries commonly resemble “retrieve the books borrowed by a particular member” or “identify the items currently discounted,” vector queries primarily seek similarities among vectors based on one or more reference vectors. In other words, queries might look like “identify the top 10 images of dogs similar to the dog in this photo” or “locate the best cafes near my current location.” At retrieval time, vector databases are crucial, as they facilitate the swift and precise retrieval of relevant document embeddings to augment the generation process. This technique is also called RAG, and we will learn more about it in the later sections.

Imagine you have a database of fruit images, and each image is represented by a vector (a list of numbers) that describes its features. Now, let’s say you have a photo of an apple, and you want to find similar fruits in your database. Instead of going through each image individually, you convert your apple photo into a vector using the same method you used for the other fruits. With this apple vector in hand, you search the database to find vectors (and therefore images) that are most similar or closest to your apple vector. The result would likely be other apple images or fruits that look like apples based on the vector representation.

Figure 4.10 – Vector representation

Vector DB sample scenario – Music recommendation system using a vector database

Let’s consider a music streaming platform aiming to provide song recommendations based on a user’s current listening. Imagine a user who is listening to “Song X” on the platform.

Behind the scenes, every song in the platform’s library is represented as a high-dimensional vector based on its musical features and content, using embeddings. “Song X” also has its vector representation. When the system aims to recommend songs similar to “Song X,” it doesn’t look for exact matches (as traditional databases might). Instead, it leverages a vector DB to search for songs with vectors closely resembling that of “Song X.” Using an ANN search strategy, the system quickly sifts through millions of song vectors to find those that are approximately nearest to the vector of “Song X.” Once potential song vectors are identified, the system employs similarity measures, such as cosine similarity, to rank these songs based on how close their vectors are to “Song X’s” vector. The top-ranked songs are then recommended to the user.

Within milliseconds, the user gets a list of songs that musically resemble “Song X,” providing a seamless and personalized listening experience. All this rapid, similarity-based recommendation magic is powered by the vector database’s specialized capabilities.

Common vector DB applications

  • Image and video similarity search: In the context of image and video similarity search, a vector DB specializes in efficiently storing and querying high-dimensional embeddings derived from multimedia content. By processing images through deep learning models, they are converted into feature vectors, a.k.a. embeddings, that capture their essential characteristics. When it comes to videos, an additional step may need to be carried out to extract frames and then convert them into vector embeddings. Contrastive language-image pre-training (CLIP) from OpenAI is a very popular choice for embedding videos and images. These vector embeddings are indexed in the vector DB, allowing for rapid and precise retrieval when a user submits a query. This mechanism powers applications such as reverse image and video search, content recommendations, and duplicate detection by comparing and ranking content based on the proximity of their embeddings.
  • Voice recognition: Voice recognition with vectors is akin to video vectorization. Analog audio is digitized into short frames, each representing an audio segment. These frames are processed and stored as feature vectors, with the entire audio sequence representing things such as spoken sentences or songs. For user authentication, a vectorized spoken key phrase might be compared to stored recordings. In conversational agents, these vector sequences can be inputted into neural networks to recognize and classify spoken words in speech and generate responses, similar to ChatGPT.
  • Long-term memory for chatbots: Vector database management systems (VDBMs) can be employed to enhance the long-term memory capabilities of chatbots or generative models. Many generative models can only process a limited amount of preceding text in prompt responses, which results in their inability to recall details from prolonged conversations. As these models don’t have an inherent memory of past interactions and can’t differentiate between factual data and user-specific details, using VDBMs can provide a solution for storing, indexing, and referencing previous interactions to improve consistency and context-awareness in responses (a short Chroma-based sketch follows below).

This is a very important use case and plays a key role in implementing RAG, which we will discuss in the next section.
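As a small illustration of the long-term memory idea, the following sketch stores past conversation turns in Chroma (one of the open source vector DBs mentioned earlier) and retrieves the most relevant turns for a new user message. The collection name and documents are made up, and the example relies on Chroma’s default embedding function:

import chromadb

client = chromadb.Client()                             # in-memory client for this sketch
memory = client.create_collection(name="chat_memory")  # hypothetical collection name

# Store past conversation turns as documents
memory.add(
    ids=["turn-1", "turn-2"],
    documents=[
        "User said their favorite hiking spot is the Lake District.",
        "User mentioned they are allergic to peanuts.",
    ],
)

# Retrieve the most relevant past turns for the current user message
results = memory.query(
    query_texts=["Can you suggest a snack for my next hike?"],
    n_results=2,
)
print(results["documents"])  # retrieved turns can be prepended to the LLM prompt as context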

The role of vector DBs in retrieval-augmented generation (RAG)

To fully understand RAG and the pivotal role of vector DBs within it, we must first acknowledge the inherent constraints of LLMs, which paved the way for the advent of RAG techniques powered by vector DBs. This section sheds light on the specific LLM challenges that RAG aims to overcome and the importance of vector DBs.

First, the big question – Why?

In Chapter 1, we delved into the limitations of LLMs, which include the following:

  • LLMs possess a fixed knowledge base determined by their training data; as of February 2024, ChatGPT’s knowledge is limited to information up until April 2023.
  • LLMs can occasionally produce false narratives, spinning tales or facts that aren’t real.
  • They lack personal memory, relying solely on the input context length. For example, take GPT4-32K; it can only process up to 32K tokens between prompts and completions (we’ll dive deeper into prompts, completions, and tokens in Chapter 5).

To counter these challenges, a promising avenue is enhancing LLM generation with retrieval components. These components can extract pertinent data from external knowledge bases—a process termed RAG, which we’ll explore further in this section.

So, what is RAG, and how does it help LLMs?

Retrieval-augmented generation (RAG) was first introduced in a paper titled Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (https://arxiv.org/pdf/2005.11401.pdf) in 2020 by Facebook AI Research (now Meta). RAG is an approach that combines the generative capabilities of LLMs with retrieval mechanisms to extract relevant information from vast datasets. LLMs, such as the GPT variants, have the ability to generate human-like text based on patterns in their training data but lack the means to perform real-time external lookups or reference specific external knowledge bases post-training. RAG addresses this limitation by using a retrieval model to query a dataset and fetch relevant information, which then serves as the context for the generative model to produce a detailed and informed response. This also helps ground the LLM queries with relevant information, which reduces the chances of hallucinations.

The critical role of vector DBs

A vector DB plays a crucial role in facilitating the efficient retrieval aspect of RAG. In this setup, each piece of information, such as text, video, or audio, in the dataset is represented as a high-dimensional vector and indexed in a vector DB. When a query from a user comes in, it’s also converted into a similar vector representation. The vector DB then rapidly searches for vectors (documents) in the dataset that are closest to the query vector, leveraging techniques such as ANN search. The query is then combined with the relevant retrieved content and sent to the LLM to generate a response. This ensures that the most relevant information is retrieved quickly and efficiently, providing a foundation for the generative model to build upon.
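The retrieval step described above can be summarized in a short, framework-agnostic sketch. Here, embed_fn, call_llm, and vector_index are hypothetical placeholders for your embedding model, your LLM client, and an hnswlib-style index or vector DB client; only the overall flow is the point:

from typing import Callable, List

def retrieve_and_generate(
    question: str,
    embed_fn: Callable[[str], List[float]],  # embedding model client (placeholder)
    vector_index,                            # ANN index or vector DB client (placeholder)
    documents: List[str],                    # texts aligned with the index ids
    call_llm: Callable[[str], str],          # LLM completion client (placeholder)
    k: int = 3,
) -> str:
    # 1. Embed the user query into the same vector space as the documents
    query_vector = embed_fn(question)

    # 2. Approximate nearest-neighbor search for the top-k closest documents
    ids, _distances = vector_index.knn_query([query_vector], k=k)
    context = "\n\n".join(documents[i] for i in ids[0])

    # 3. Ground the LLM with the retrieved context before asking the question
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return call_llm(prompt)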

Example of an RAG workflow

Let’s walk through an example step by step, as shown in the image. Imagine a platform where users can ask about ongoing cricket matches, including recent performances, statistics, and trivia:

  1. Suppose the user asks, “How did Virat Kohli perform in the last match, and what’s an interesting fact from that game?” Since the LLM was trained on data only up until April 2023, it may not have this answer.
  2. The retrieval model will embed the query and send it to a vector DB.
  3. All the latest cricket news is stored in a vector DB, properly indexed using ANN strategies such as HNSW. The vector DB performs a cosine similarity search against the indexed information and returns a few relevant results, or contexts.
  4. The retrieved context is then sent to the LLM along with the query to synthesize the information and provide a relevant answer.
  5. The LLM provides the relevant answer: “Virat Kohli scored 85 runs off 70 balls in the last match. An intriguing detail from that game is that it was the first time in three years that he hit more than seven boundaries in an ODI inning.”

The following image illustrates the preceding points:

Figure 4.11 – Representation of RAG workflow with vector database

Business applications of RAG

In the following list, we have mentioned a few popular business applications of RAG based on what we’ve seen in the industry:

  • Enterprise search engines: One of the most prominent applications of RAG is in the realm of enterprise learning and development, serving as a search engine for employee upskilling. Employees can pose questions about the company, its culture, or specific tools, and RAG swiftly delivers accurate and relevant answers.
  • Legal and compliance: RAG fetches relevant case laws or checks business practices against regulations.
  • Ecommerce: RAG suggests products or summarizes reviews based on user behavior and queries.
  • Customer support: RAG provides precise answers to customer queries by pulling information from the company’s knowledge base and providing solutions in real time.
  • Medical and healthcare: RAG retrieves pertinent medical research or provides preliminary symptom-based suggestions.

Chunking strategies

In our last discussion, we delved into vector DBs and RAG. Before diving into RAG, we need to efficiently house our embedded data. While we touched upon indexing methods to speed up data fetching, there’s another crucial step to take even before that: chunking.

What is chunking?

In the context of building LLM applications with embedding models, chunking involves dividing a long piece of text into smaller, manageable pieces or “chunks” that fit within the model’s token limit. The process involves breaking text into smaller segments before sending these to the embedding models. As shown in the following image, chunking happens before the embedding process. Different documents have different structures, such as free-flowing text, code, or HTML. So, different chunking strategies can be applied to attain optimal results. Tools such as Langchain provide you with functionalities to chunk your data efficiently based on the nature of the text.

The diagram below depicts a data processing workflow, highlighting the chunking step, starting with raw “Data sources” that are converted into “Documents.” Central to this workflow is the “Chunk” stage, where a “TextSplitter” breaks the data into smaller segments. These chunks are then transformed into numerical representations using an “Embedding model” and are subsequently indexed into a “Vector DB” for efficient search and retrieval. The text associated with the retrieved chunks is then sent as context to the LLMs, which then generate a final response:

Figure 4.12 – Chunking process

But why is it needed?

Chunking is vital for two main reasons:

  • Chunking strategically divides document text to enhance its comprehension by embedding models, and it boosts the relevance of the content retrieved from a vector DB. Essentially, it refines the accuracy and context of the results sourced from the database.
  • It tackles the token constraints of embedding models. For instance, Azure OpenAI embedding models like text-embedding-ada-002 can handle up to 8,191 tokens, which is about 6,000 words, given each token averages about four characters. So, for optimal embeddings, it’s crucial that our text stays within this limit (a short token-counting sketch follows this list).
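A quick way to check that a chunk respects this limit is to count its tokens with the tiktoken library; a minimal sketch follows, assuming the cl100k_base encoding used by text-embedding-ada-002:

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
chunk = "Chunking divides a long document into pieces that fit the embedding model."
num_tokens = len(encoding.encode(chunk))
print(num_tokens)  # embed the chunk only if num_tokens <= 8191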

Popular chunking strategies

  • Fixed-size chunking: This is a very common approach that defines a fixed chunk size (for example, 200 words), which is enough to capture the semantic meaning of a paragraph, and incorporates an overlap of about 10–15% as input to the vector embedding generation model. Chunking data with a slight overlap between chunks ensures context preservation. It’s advisable to begin with roughly a 10% overlap. Below is a snippet of code that demonstrates the use of fixed-size chunking with LangChain:

from langchain.text_splitter import TokenTextSplitter

text = "Ladies and Gentlemen, esteemed colleagues, and honored guests. Esteemed leaders and distinguished members of the community. Esteemed judges and advisors. My fellow citizens. Last year, unprecedented challenges divided us. This year, we stand united, ready to move forward together"

text_splitter = TokenTextSplitter(chunk_size=20, chunk_overlap=5)
texts = text_splitter.split_text(text)
print(texts)

The output is the following:

['Ladies and Gentlemen, esteemed colleagues, and honored guests. Esteemed leaders and distinguished members', 'emed leaders and distinguished members of the community. Esteemed judges and advisors. My fellow citizens.', '. My fellow citizens. Last year, unprecedented challenges divided us. This year, we stand united,', ', we stand united, ready to move forward together']

  • Variable-size chunking: Variable-size chunking refers to the dynamic segmentation of data or text into varying-sized components, as opposed to fixed-size divisions. This approach accommodates the diverse structures and characteristics present in different types of data.
  • Sentence splitting: Sentence transformer models are neural architectures optimized for embedding at the sentence level. For example, BERT works best when chunked at the sentence level. Tools such as NLTK and SpaCy provide functions to split the sentences within a text (a short NLTK sketch follows this list).
  • Specialized chunking: Documents, such as research papers, possess a structured organization of sections, and the Markdown language, with its unique syntax, necessitates specialized chunking, resulting in the proper separation between sections/pages to yield contextually relevant chunks.
  • Code chunking: When embedding code into your vector DB, this technique can be invaluable. LangChain supports code chunking for numerous languages. Below is a code snippet to chunk your Python code:

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)

PYTHON_CODE = """
class SimpleCalculator:
    def add(self, a, b):
        return a + b

    def subtract(self, a, b):
        return a - b

# Using the SimpleCalculator
calculator = SimpleCalculator()
sum_result = calculator.add(5, 3)
diff_result = calculator.subtract(5, 3)
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

The output is the following:

[Document(page_content='class SimpleCalculator:\n    def add(self, a, b):'),
 Document(page_content='return a + b'),
 Document(page_content='def subtract(self, a, b):'),
 Document(page_content='return a - b'),
 Document(page_content='# Using the SimpleCalculator'),
 Document(page_content='calculator = SimpleCalculator()'),
 Document(page_content='sum_result = calculator.add(5, 3)'),
 Document(page_content='diff_result = calculator.subtract(5, 3)')]
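For the sentence-splitting strategy mentioned above, here is a small NLTK sketch; it assumes the Punkt tokenizer data has been downloaded, and the sample text is made up:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # one-time download of the Punkt tokenizer data

text = (
    "Sentence-level chunking keeps each thought intact. "
    "It pairs well with sentence-level embedding models. "
    "Each sentence below becomes its own chunk."
)
print(sent_tokenize(text))

Each resulting sentence can then be embedded on its own or grouped into slightly larger chunks.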

Chunking considerations

Chunking strategies vary based on the data type and format and the chosen embedding model. For instance, code requires a distinct chunking approach compared to unstructured text. While models such as text-embedding-ada-002 excel with 256- and 512-token-sized chunks, our understanding of chunking is ever-evolving. Moreover, preprocessing plays a crucial role before chunking, where you can optimize your content by removing unnecessary text, such as stop words and special symbols, that adds noise. For the latest techniques, we suggest regularly checking the text splitters section in the LangChain documentation, ensuring you employ the best strategy for your needs (Split by tokens from LangChain: https://python.langchain.com/docs/modules/data_connection/document_transformers/split_by_token).

Evaluation of RAG using Azure Prompt Flow

Up to this point, we have discussed the development of resilient RAG applications. However, the question arises: How can we determine whether these applications are functioning as anticipated and if the context they retrieve is pertinent? While manual validation—comparing the responses generated by LLMs against ground truth—is possible, this method proves to be labor-intensive, costly, and challenging to execute on a large scale. Consequently, it’s essential to explore methodologies that facilitate automated evaluation on a vast scale. Recent research has delved into the concept of utilizing “LLM as a judge” to assess output, a strategy that Azure Prompt Flow incorporates within its offerings.

Azure Prompt Flow has built-in and structured metaprompt templates with comprehensive guardrails to evaluate your output against ground truth. The following mentions four metrics that can help you evaluate your RAG solution in Prompt Flow:

  • Groundedness: Measures the alignment of the model’s answers with the input source, making sure the model’s generated response is not fabricated. The model must always extract information from the provided “context” while responding to a user’s query.
  • Relevance: Measures the degree to which the model’s generated response is closely connected to the context and user query.
  • Retrieval score: Measures the extent to which the model’s retrieved documents are pertinent and directly related to the given questions.
  • Custom metrics: While the above three are the most important for evaluating RAG applications, Prompt Flow allows you to use custom metrics, too. Bring your own LLM as a judge and define your own metrics by modifying the existing metaprompts (a minimal sketch follows this list). This also allows you to use open source models such as Llama and to build your own metrics from code with Python functions. The above evaluations are more no-code or low-code friendly; however, for a more pro-code-friendly approach, the azureml-metrics SDK, which provides metrics such as ROUGE, BLEU, F1-score, precision, and accuracy, can be utilized as well.
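To give a flavor of the “LLM as a judge” idea behind custom metrics, here is a minimal sketch; the metaprompt wording, the 1–5 scale, and the call_llm placeholder are illustrative assumptions, not Prompt Flow’s built-in templates:

from typing import Callable

GROUNDEDNESS_METAPROMPT = """You are grading an answer for groundedness.
Context:
{context}

Answer:
{answer}

On a scale of 1 (not grounded) to 5 (fully grounded in the context), reply with a single integer."""

def groundedness_score(context: str, answer: str, call_llm: Callable[[str], str]) -> int:
    # Format the metaprompt and ask the judge LLM for a 1-5 rating
    prompt = GROUNDEDNESS_METAPROMPT.format(context=context, answer=answer)
    return int(call_llm(prompt).strip())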

The field is advancing quickly, so we recommend regularly checking Azure ML Prompt Flow’s latest updates on evaluation metrics. Start with the “Manual Evaluation” feature in Prompt Flow to gain a basic understanding of LLM performance. It’s important to use a mix of metrics for a thorough evaluation that captures both semantic and syntactic essence rather than relying on just one metric to compare the responses with the actual ground truth.