
Real-life examples of fine-tuning success

In this section, we’ll explore a real-life example of a fine-tuning approach that OpenAI implemented, which yielded remarkable outcomes.

InstructGPT

OpenAI’s InstructGPT is one of the most successful examples of a fine-tuned model and laid the foundation for ChatGPT. ChatGPT is said to be a sibling model to InstructGPT, and the methods used to fine-tune ChatGPT are similar to those used for InstructGPT. InstructGPT was created by fine-tuning pre-trained GPT-3 models with RLHF. Supervised fine-tuning is the first step in RLHF for generating responses aligned with human preferences.

GPT-3 models weren’t originally designed to adhere to user instructions; their training focused on predicting the next word based on vast amounts of internet text data. Therefore, these models underwent fine-tuning using instructional datasets, along with RLHF, to enhance their ability to generate more useful and relevant responses aligned with human values when prompted with user instructions:

Figure 3.20 – The fine-tuning process with RLHF

This figure depicts a schematic representation of the InstructGPT fine-tuning process: (1) initial supervised fine-tuning, (2) training the reward model, and (3) executing RL through PPO using this established reward model. The blue arrows indicate where this data is used to train the respective models. In step 2, boxes A-D are samples from models that get ranked by labelers.

The following figure provides a comparison of the response quality of fine-tuned models with RLHF, supervised fine-tuned models, and general GPT models. The Y-axis shows quality ratings of model outputs on a 1–7 Likert scale, for various model sizes (X-axis), on prompts submitted to InstructGPT models via the OpenAI API. The results reveal that InstructGPT outputs receive significantly higher scores from labelers than outputs from GPT-3 models (both with and without few-shot prompts), as well as models that underwent supervised fine-tuning. The labelers hired for this work were independent and sourced from Scale AI and Upwork:

Figure 3.21 – Evaluation of InstructGPT (image credits: OpenAI)

InstructGPT can be assessed across dimensions of toxicity, truthfulness, and appropriateness. Higher scores are desirable for TruthfulQA and appropriateness, whereas lower scores are preferred for toxicity and hallucinations. Measurement of hallucinations and appropriateness is conducted based on the distribution of prompts within the OpenAI API. The outcomes are aggregated across various model sizes:

Figure 3.22 – Evaluation of InstructGPT

In this section, we introduced the concept of fine-tuning and discussed a success story of fine-tuning with RLHF that led to the development of InstructGPT.

Summary

Fine-tuning is a powerful technique for customizing models, but it may not always be necessary. As observed, it can be time-consuming and may involve significant upfront costs. It’s advisable to start with easier and faster strategies, such as prompt engineering with few-shot examples, followed by data grounding using RAG. Only if the responses from the LLM remain suboptimal should you consider fine-tuning. We will discuss RAG and prompt engineering in the following chapters.

In this chapter, we delved into critical fine-tuning strategies tailored for specific tasks. Then, we explored an array of evaluation methods and benchmarks to assess your refined model. The RLHF process ensures your models align with human values, making them helpful, honest, and safe. In the upcoming chapter, we’ll tackle RAG methods paired with vector databases – an essential technique to ground your enterprise data and minimize hallucinations in LLM-driven applications.

A deep dive into vector DB essentials (continued)

The following image visually represents the clustering of mammals and birds in a two-dimensional vector embedding space, differentiating between their realistic and cartoonish portrayals. This image depicts a spectrum between “REALISTIC” and “CARTOON” representations, further categorized into “MAMMAL” and “BIRD.” On the realistic side, there’s a depiction of a mammal (elk) and three birds (an owl, an eagle, and a small bird). On the cartoon side, there are stylized and whimsical cartoon versions of mammals and birds, including a comically depicted deer, an owl, and an exaggerated bird character. LLMs use such vector embedding spaces, which are numerical representations of objects in highly dimensional spaces, to understand, process, and generate information. For example, imagine an educational application designed to teach children about wildlife. If a student prompts the chatbot to provide images of birds in a cartoon representation, the LLM will search and generate information from the bottom right quadrant:

Figure 4.3 – Location of animals with similar characteristics in a highly dimensional space, demonstrating “relatedness”

Now, let’s delve into the evolution of embedding models that produce embeddings, a.k.a. numerical representations of objects, within highly dimensional spaces. Embedding models have experienced significant evolution, transitioning from the initial methods that mapped discrete words to dense vectors, such as word-to-vector (Word2Vec), global vectors for word representation (GloVe), and FastText, to more sophisticated contextual embeddings built on deep learning architectures. These newer models, such as embeddings from language models (ELMo), utilize long short-term memory (LSTM)-based structures to offer context-specific representations. The newer transformer architecture-based embedding models, which underpin models such as bidirectional encoder representations from transformers (BERT), generative pre-trained transformer (GPT), and their subsequent iterations, marked a revolutionary leap over predecessor models.

These models capture contextual information in unparalleled depth, enabling embeddings to represent nuances in word meanings based on the surrounding context, thereby setting new standards in various natural language processing tasks.

Important note:

In Jan 2024, OpenAI announced two third-generation embedding models, text-embedding-3-small and text-embedding-3-large. These are the newest models and offer better performance, lower costs, better multilingual retrieval, and a parameter to reduce the overall number of dimensions, compared to the predecessor second-generation model, text-embedding-ada-002. Another key difference is the number of dimensions between the two generations. The third-generation models come in different dimension sizes, and the highest they can go up to is 3,072. As of Jan 2024, we have seen more production workloads using text-embedding-ada-002, which has 1,536 dimensions. OpenAI recommends using the third-generation models going forward for improved performance and reduced costs.
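
As a brief, hedged illustration of the dimensions parameter mentioned above, the following sketch uses the newer openai Python SDK (v1+) with Azure OpenAI, rather than the legacy SDK used elsewhere in this chapter; the endpoint, API version, deployment name, and key are placeholders to replace with your own values, and the reduced dimension size of 256 is purely illustrative:

from openai import AzureOpenAI

# Placeholder values - replace with your own Azure OpenAI resource details
client = AzureOpenAI(
    api_key="YOUR_API_KEY",
    api_version="2024-02-01",  # assumed API version; use the latest available
    azure_endpoint="https://YOUR_RESOURCE_NAME.openai.azure.com",
)

# Request an embedding from a third-generation model, shrinking the output
# vector via the optional dimensions parameter
response = client.embeddings.create(
    model="YOUR_TEXT_EMBEDDING_3_DEPLOYMENT",  # deployment of text-embedding-3-small/large
    input="Your text string goes here",
    dimensions=256,
)

print(len(response.data[0].embedding))  # 256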

While OpenAI’s embedding models are among the most popular choices for text embeddings, you can find a list of leading embedding models on the Hugging Face MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard).

The following snippet of code gives an example of generating embeddings using Azure OpenAI endpoints:

import openai

openai.api_type = "azure"
openai.api_key = YOUR_API_KEY
openai.api_base = "https://YOUR_RESOURCE_NAME.openai.azure.com"
openai.api_version = "YYYY-MM-DD"  # Replace with the latest API version

response = openai.Embedding.create(
    input="Your text string goes here",
    engine="YOUR_DEPLOYMENT_NAME"
)

embeddings = response['data'][0]['embedding']
print(embeddings)

In this section, we highlighted the significance of vector embeddings. However, their true value emerges when used effectively. Hence, we’ll now dive deep into indexing and vector search strategies, which are crucial for optimal data retrieval in the RAG workflow.


A deep dive into vector DB essentials

To fully comprehend RAG, it’s imperative to understand vector DBs, because RAG relies heavily on efficient data retrieval from them for query resolution. A vector DB is a database designed to store and efficiently query highly dimensional vectors and is often used in similarity searches and machine learning tasks. The design and mechanics of vector DBs directly influence the effectiveness and accuracy of RAG answers.

In this section, we will cover the fundamental components of vector DBs (vectors and vector embeddings), and in the next section, we will dive deeper into the important characteristics of vector DBs that enable a RAG-based generative AI solution. We will also explain how it differs from regular databases and then tie it all back to explain RAG.

Vectors and vector embeddings

A vector is a mathematical object that has both magnitude and direction and can be represented by an ordered list of numbers. In a more general sense, especially in computer science and machine learning, a vector can be thought of as an array or list of numbers that represents a point in a certain dimensional space. For the instances depicted in the following image, in 2D space (on the left), a vector might be represented as [x, y], whereas in 3D space (on the right), it might be [x, y, z]:

Figure 4.2 – Representation of vectors in 2D and 3D space

Vector embedding refers to the representation of objects, such as words, sentences, or even entire documents, as vectors in a highly dimensional space. A highly dimensional space denotes a mathematical space with more than three dimensions, frequently used in data analysis and machine learning to represent intricate data structures. Think of it as a room where you can move in more than three directions, facilitating the description and analysis of complex data. The embedding process converts words, sentences, or documents into vector representations, capturing the intricate semantic relationships between them. Hence, words with similar meanings tend to be close to each other in the highly dimensional space. Now, you must be wondering how this plays a role in designing generative AI solutions consisting of LLMs. Vector embeddings provide the foundational representation of data. They are a standardized numerical representation for diverse types of data, which LLMs use to process and generate information. Such an embedding process to convert words and sentences to a numerical representation is initiated by embedding models such as OpenAI’s text-embedding-ada-002. Let’s explain this with an example.


Semantic Kernel

Semantic Kernel, or SK, is a lightweight, open-source software development kit (SDK). It is a modern AI application development framework that enables software developers to orchestrate AI to build agents, write code that interacts with those agents, and work with generative AI tooling and concepts, such as natural language processing (NLP), which we covered in Chapter 2.

“Kernel” is at the core of everything!

Semantic Kernel revolves around the concept of a “kernel,” which is pivotal and is equipped with the necessary services and plugins to execute both native code and AI services, making it a central element for nearly all SDK components.

Every prompt or code executed within the semantic kernel passes through this kernel, granting developers a unified platform for configuring and monitoring their AI applications.

For instance, when a prompt is invoked through the kernel, it undertakes the process of selecting the optimal AI service, constructing the prompt based on a prompt template, dispatching the prompt to the service, and processing the response before delivering it back to the application. Additionally, the kernel allows for the integration of events and middleware at various stages, facilitating tasks such as logging, user updates, and the implementation of responsible AI practices, all from a single, centralized location called “kernel.”
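
To make the kernel’s role concrete, here is a minimal, hedged Python sketch of registering a chat service and invoking a prompt through the kernel. Exact class and method names differ between Semantic Kernel SDK versions, so treat the imports, add_service, and invoke_prompt calls as assumptions to verify against the version you install; the deployment name, endpoint, and key are placeholders:

import asyncio
from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion

async def main():
    # The kernel is the central object that holds AI services and plugins
    kernel = Kernel()

    # Register an Azure OpenAI chat service with the kernel (placeholder values)
    kernel.add_service(AzureChatCompletion(
        deployment_name="YOUR_DEPLOYMENT_NAME",
        endpoint="https://YOUR_RESOURCE_NAME.openai.azure.com",
        api_key="YOUR_API_KEY",
    ))

    # Every prompt passes through the kernel, which selects the registered
    # service, renders the prompt, calls the model, and returns the response
    result = await kernel.invoke_prompt("Summarize what a vector database is in one sentence.")
    print(result)

asyncio.run(main())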

Moreover, SK allows developers to define the syntax and semantics of natural language expressions and use them as variables, functions, or data structures in their code. SK also provides tools for parsing, analyzing, and generating natural language from code and, vice versa, generating code from natural language.

You can build sophisticated and complex agents without having to be an AI expert by using the Semantic Kernel SDK! The fundamental building blocks in Semantic Kernel for building agents are plugins, planners, and personas.

Fundamental components

Let’s dive into each one of them and understand what each one means.

  • Plugins enhance your agent’s functionality by allowing you to incorporate additional code. This enables the integration of new functions into plugins, utilizing native programming languages such as C# or Python. Additionally, plugins can facilitate interaction with LLMs through prompts or connect to external services via REST API calls. As an example, consider a plugin for a virtual assistant for a calendar application that allows it to schedule appointments, remind you of upcoming events, or cancel meetings. If you have used ChatGPT, you may be familiar with the concept of plugins, as they are integrated into it (namely, “Code Interpreter” or “Bing Search Plugin”).
  • Planners: In order to effectively utilize plugins and integrate them with subsequent actions, the system must first design a plan, a process facilitated by planners. Planners are sophisticated instructions that enable an agent to formulate a strategy for accomplishing a given task, often encapsulated in a simple prompt that guides the agent through function calling to achieve the objective. As an example, take the development of a MeetingEventPlanner. This planner would guide the agent through the detailed process of organizing a meeting. It includes steps such as reviewing the availability of attendees’ calendars, sending out confirmation emails, drafting an agenda, and, finally, scheduling the meeting. Each step is carefully outlined to ensure the agent comprehensively addresses all the necessary actions for successful meeting preparation.
  • Personas: Personas are sets of instructions that shape the behavior of agents by imbuing them with distinct personalities. Often referred to as “meta prompts,” these guidelines endow agents with characters that can range from friendly and professional to humorous, and so forth. Additionally, they direct agents on the type of response to generate, which can vary from verbose to concise. We have explored meta prompts in great detail in Chapter 5; this concept is closely related.


Vector search strategies

Vector search strategies are crucial because they determine how efficiently and accurately highly dimensional data (such as embeddings) can be queried and retrieved. Optimal strategies ensure that the most relevant and contextually appropriate results are returned. In vector-based searching, there are primarily two main strategies: exact search and approximate search.

Exact search

The exact search method, as the term suggests, directly matches a query vector with vectors in the database. It uses an exhaustive approach to identify the closest neighbors, allowing minimal to no errors.

This is typically what the traditional KNN method employs. Traditional KNNs utilize brute force methods to find the K-nearest neighbors, which demands a thorough comparison of the input vector with every other vector in the dataset. Although computing the similarity for each vector is typically quick, the process becomes time-consuming and resource-intensive over extensive datasets because of the vast number of required comparisons. For instance, if you had a dataset of one million vectors and wanted to find the nearest neighbors for a single input vector, the traditional KNN would require one million distance computations. This can be thought of as looking up a friend’s phone number in a phone book by checking each entry one by one rather than using a more efficient search strategy that speeds up the process, which we will discuss in the next section.
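
To ground this, here is a minimal sketch of brute-force (exact) nearest-neighbor search using NumPy on made-up data, where the query must be compared against every vector in the dataset:

import numpy as np

rng = np.random.default_rng(42)
dataset = rng.random((100_000, 128))   # 100k toy vectors, 128 dimensions
query = rng.random(128)

# Exhaustive (exact) search: compute the distance to every vector in the dataset
distances = np.linalg.norm(dataset - query, axis=1)   # one distance per stored vector
k = 5
nearest_ids = np.argsort(distances)[:k]               # indices of the 5 closest vectors

print(nearest_ids, distances[nearest_ids])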

Approximate nearest neighbors (ANNs)

In modern vector DBs, the search strategy known as ANN stands out as a powerful technique that quickly finds the near-closest data points in highly dimensional spaces. Unlike KNN, ANN prioritizes search speed at the expense of a slight loss in accuracy. Additionally, for it to function effectively, a vector index must be built beforehand.

The process of vector indexing

The process of vector indexing involves the organization of embeddings in a data structure called an index, which can be traversed quickly for retrieval purposes. Many ANN algorithms aid in forming a vector index, all aiming for rapid querying by creating an efficiently traversable data structure. Typically, they compress the original vector representation to enhance the search process.

There are numerous indexing algorithms, and this is an active research area. ANNs can be broadly classified into tree-based indexes, graph-based indexes, hash-based indexes, and quantization-based indexes. In this section, we will cover the two most popular indexing algorithms. When creating an LLM application, you don’t need to dive deep into the indexing process since many vector databases provide this as a service to you. But it’s important to choose the right type of index for your specific needs to ensure efficient data retrieval:

  • Hierarchical navigable small world (HNSW): This is a method for approximate similarity search in highly dimensional spaces. HNSW is a graph-based index that works by creating a hierarchical graph structure, where each node represents a data point, and the edges connect similar data points. This hierarchical structure allows for efficient search operations, as it narrows down the search space quickly. HNSW is well suited for similarity search use cases, such as content-based recommendation systems and text search.

If you wish to dive deeper into its workings, we recommend checking out this research paper: https://arxiv.org/abs/1603.09320.

The following image is a representation of the HNSW index:

Figure 4.4 – Representation of HNSW index

The image illustrates the HNSW graph structure used for efficient similarity searches. The graph is constructed in layers, with decreasing density from the bottom to the top. Each layer’s characteristic radius reduces as we ascend, creating sparser connections. The depicted search path, using the red dotted lines, showcases the algorithm’s strategy; it starts from the sparsest top layer, quickly navigating vast data regions, and then refines its search in the denser lower layers, minimizing the overall comparisons and enhancing search efficiency.
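
As a hedged illustration (not code from the paper or this book), the following sketch builds and queries an HNSW index with the open-source hnswlib library, assuming it is installed (pip install hnswlib); the data is randomly generated and the index parameters (M, ef_construction, ef) are illustrative values:

import numpy as np
import hnswlib

dim = 128
num_elements = 10_000
data = np.float32(np.random.random((num_elements, dim)))  # toy embeddings

# Build a graph-based HNSW index using cosine distance
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

# ef controls how broadly the graph is explored at query time (speed/accuracy trade-off)
index.set_ef(50)

# Query the 5 approximate nearest neighbors of one vector
labels, distances = index.knn_query(data[0], k=5)
print(labels, distances)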

  • Facebook AI Similarity Search (FAISS): FAISS, developed by Facebook AI Research, is a library designed for the efficient similarity search and clustering of highly dimensional vectors. One of its widely used index types combines an inverted file structure with product quantization to compress data during indexing, accelerating similarity searches in vast datasets. This approach divides the vector space into regions known as Voronoi cells, each symbolized by a centroid. The primary purpose is to minimize storage needs and expedite searches, though it may slightly compromise accuracy. To visualize this, consider the following image. The Voronoi cells denote regions produced by quantization, and the labeled points within these cells are the centroids or representative vectors. When indexing a new vector, it’s assigned to its closest centroid. For searches, FAISS pinpoints the probable Voronoi cell containing the nearest neighbors and then narrows down the search within that cell, significantly cutting down distance calculations:

Figure 4.5 – Representation of FAISS index

FAISS excels in applications such as image and video search, recommendation systems, and any task that involves searching for nearest neighbors in highly dimensional spaces because of its performance optimizations and built-in GPU support.
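
For illustration only, here is a minimal sketch of building an IVF-PQ index with the faiss library on toy random data, assuming faiss-cpu is installed; the parameters (nlist, m, nprobe) are illustrative values, not recommendations:

import numpy as np
import faiss

d = 128                     # dimensionality
nb = 10_000                 # number of database vectors
xb = np.random.random((nb, d)).astype("float32")

nlist = 100                 # number of Voronoi cells
m = 16                      # number of sub-quantizers for product quantization
quantizer = faiss.IndexFlatL2(d)                      # coarse quantizer assigns vectors to cells
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per sub-vector code

index.train(xb)             # learn the cell centroids and PQ codebooks
index.add(xb)               # compress and add the vectors to the index

index.nprobe = 10           # number of cells to visit at query time (speed/accuracy trade-off)
D, I = index.search(xb[:1], 5)   # approximate 5 nearest neighbors of the first vector
print(I, D)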

In this section, we covered indexing and the role of ANNs in index creation. Next, we’ll explore similarity measures, how they differ from indexing, and their impact on improving data retrieval.

When to Use HNSW vs. FAISS (continued)

The image illustrates the Euclidean distance formula in a 2D space. It shows two points: (x1,y1) and (x2,y2). The preceding formula calculates the straight-line distance between the two points in a plane.

  • Distance metrics – Manhattan (L1): Manhattan distance calculates the sum of absolute differences along each dimension. The higher the metric, the less similar the two vectors are. The following image depicts the Manhattan distance (or L1 distance) between two points in a 2D space, where the distance is measured along the axes at right angles, similar to navigating city blocks in a grid-like street layout:

Figure 4.8 – Illustration of Manhattan distance

You might be wondering when to select one metric over another during the development of generative AI applications. The decision on which similarity measure to use hinges on various elements, such as the type of data, the context of the application, and the specific demands of the analysis.

Cosine similarity is preferred over Manhattan and Euclidean distances when the magnitude of the data vectors is less relevant than the direction or orientation of the data. In text analysis, for example, two documents might be represented by highly dimensional vectors of word frequencies. If one document is a longer version of the other, their word frequency vectors will point in the same direction, but the magnitude (length) of one vector will be larger due to the higher word count. Using Euclidean or Manhattan distance would highlight these differences in magnitude, suggesting the documents are different. However, using cosine similarity would capture their similarity in content (the direction of the vectors), de-emphasizing the differences in word count. In this context, cosine similarity is more appropriate, as it focuses on the angle between the vectors, reflecting the content overlap of the documents rather than their length or magnitude.

Euclidean and Manhattan distances are more apt than cosine similarity when the magnitude and absolute differences between data vectors are crucial, such as with consistently scaled numerical data (e.g., age, height, weight, and so on) or in spatial applications such as grid-based pathfinding. While cosine similarity emphasizes the orientation or pattern of data vectors, which is especially useful in highly dimensional, sparse datasets, Euclidean and Manhattan distances capture the actual differences between data points, making them preferable in scenarios where absolute value deviations are significant, such as when comparing the medical test results of patients or finding the distance between geographical coordinates on Earth.
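
To make the contrast concrete, here is a small sketch (assuming NumPy and SciPy are installed) comparing cosine similarity, Euclidean distance, and Manhattan distance for a made-up word-count vector and a doubled-length version of the same document:

import numpy as np
from scipy.spatial.distance import cosine, euclidean, cityblock

doc = np.array([4.0, 2.0, 1.0, 0.0])        # toy word-frequency vector
longer_doc = doc * 2                         # same content, twice the length

print("Cosine similarity:", 1 - cosine(doc, longer_doc))   # 1.0 -> identical direction
print("Euclidean distance:", euclidean(doc, longer_doc))   # grows with the magnitude difference
print("Manhattan distance:", cityblock(doc, longer_doc))   # sum of absolute differences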

The following is a snippet of code that uses Azure OpenAI endpoints to calculate the similarity between two sentences: “What number of countries do you know?” and “How many countries are you familiar with?” by using the embedding model text-embedding-ada-002. It gives a score of 0.95:

import numpy as np
import openai
from sklearn.metrics.pairwise import cosine_similarity

openai.api_type = "azure"
openai.api_base = "https://ak-deployment-3.openai.azure.com/"
openai.api_version = "2023-07-01-preview"  # Replace with the latest API version
openai.api_key = "xxxxxxxxxxxxxxxxxxxxxxx"

def get_embedding(text, model="text-embedding-ada-002"):
    return openai.Embedding.create(engine=model, input=[text])['data'][0]['embedding']

embedding1 = get_embedding("What number of countries do you know?",
                           model="text-embedding-ada-002")
embedding2 = get_embedding("How many countries are you familiar with?",
                           model="text-embedding-ada-002")

embedding1_np = np.array(embedding1)
embedding2_np = np.array(embedding2)

similarity = cosine_similarity([embedding1_np], [embedding2_np])
print(similarity)

# [[0.95523639]]

Now, let’s walk through a scenario where cosine similarity will be preferred over Manhattan distance.


When to Use HNSW vs. FAISS

Use HNSW when:

  • High precision in similarity search is crucial.
  • The dataset size is large but not at the scale where managing it becomes impractical for HNSW.
  • Real-time or near-real-time search performance is required.
  • The dataset is dynamic, with frequent updates or insertions.
  • Apt for use cases involving text, such as article recommendation systems.

Use FAISS when:

  • Managing extremely large datasets (e.g., billions of vectors).
  • Batch processing and GPU optimization can significantly benefit the application.
  • There’s a need for flexible trade-offs between search speed and accuracy.
  • The dataset is relatively static, or batch updates are acceptable.
  • Apt for use cases like image and video search.

Note

Choosing the right indexing strategy hinges on several critical factors, including the nature and structure of the data, the types of queries to be supported (e.g., range queries, nearest neighbors, exact search), and the volume and growth of the data. Additionally, the frequency of data updates (e.g., static vs. dynamic), the dimensionality of the data, performance requirements (real-time vs. batch), and resource constraints play significant roles in the decision-making process.

Similarity measures

Similarity measures dictate how the index is organized and ensure that the retrieved data is highly relevant to the query. For instance, in a system designed to retrieve similar images, the index might be built around the feature vectors of images, and the similarity measure would determine which images are “close” or “far” within that indexed space. The importance of these concepts is two-fold: indexing significantly speeds up data retrieval, and similarity measures ensure that the retrieved data is relevant to the query, together enhancing the efficiency and efficacy of data retrieval systems. Selecting an appropriate distance metric greatly enhances the performance of classification and clustering tasks. The optimal similarity measure is chosen based on the nature of the data input.

In other words, similarity measures define how closely two items or data points are related. They can be broadly classified into distance metrics and similarity metrics. Next, we’ll explore the three most common similarity measures for building AI applications: cosine similarity, Euclidean distance, and Manhattan distance.

  • Similarity metrics – Cosine similarity: Cosine similarity, a type of similarity metric, calculates the cosine value of the angle between two vectors, and OpenAI suggests using it for its models to measure the distance between two embeddings obtained from text-embedding-ada-002. The higher the metric, the more similar they are:

Figure 4.6 – Illustration of relatedness through cosine similarity between two words

The preceding image shows a situation where the cosine similarity is 1 for India and the USA because they are related, as both are countries. In the other image, the similarity is 0 because football is not similar to a lion.

  • Distance metrics – Euclidean (L2): Euclidean distance computes the straight-line distance between two points in Euclidean space. The higher the metric, the less similar the two points are:

Figure 4.7 – Illustration of Euclidean distance


Recommendation System for Articles

Let’s consider a scenario where a news aggregation platform aims to recommend articles similar to what a user is currently reading, enhancing user engagement by suggesting relevant content.

How It Works:

  • Preprocessing and Indexing: Articles in the platform’s database are processed to extract textual features, often converted into high-dimensional vectors using LDA or transformer-based embeddings such as text-embedding-ada-002. These vectors are then indexed using HNSW, an algorithm suitable for high-dimensional spaces due to its hierarchical structure that facilitates efficient navigation and search.
  • Retrieval Time: When a user reads an article, the system generates a feature vector for this article and queries the HNSW index to find vectors (and thus articles) that are close in the high-dimensional space. Cosine similarity can be used to evaluate the similarity between the query article’s vector and those in the index, identifying articles with similar content.
  • Outcome: The system recommends a list of articles ranked by their relevance to the currently viewed article. Thanks to the efficient indexing and similarity search, these recommendations are generated quickly, even from a vast database of articles, providing the user with a seamless experience.

Now, let’s walk through a scenario where Manhattan distance will be preferred over cosine similarity.

Ride-Sharing App Matchmaking

Let’s consider a scenario where a ride-sharing application needs to match passengers with nearby drivers efficiently. The system must quickly find the closest available drivers to a passenger’s location to minimize wait times and optimize routes.

How It Works:

  • Preprocessing and Indexing: Drivers’ current locations are constantly being updated and stored as points in a 2D space representing a map. These points can be indexed using tree-based spatial indexing techniques or data structures optimized for geospatial data, such as R-trees.
  • Retrieval Time: When a passenger requests a ride, the application uses the passenger’s current location as a query point. Manhattan distance (L1 norm) is particularly suitable for urban environments, where movement is constrained by a grid-like structure of streets and avenues, mimicking the actual paths a car would take along city blocks.
  • Outcome: The system quickly identifies the nearest available drivers using the indexed data and Manhattan distance calculations, considering the urban grid’s constraints. This ensures a swift matchmaking process, improving the user experience by reducing wait times.
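
As a simple, hedged sketch (using plain NumPy and made-up coordinates rather than an R-tree or other spatial index), the following shows how Manhattan distance could be used to pick the closest driver on a city grid:

import numpy as np

# Toy driver locations on a city grid as (x, y) block coordinates
drivers = np.array([[2, 9], [5, 1], [8, 4], [3, 3]])
passenger = np.array([4, 2])

# Manhattan (L1) distance: sum of absolute differences along each axis
l1_distances = np.abs(drivers - passenger).sum(axis=1)

closest_driver = int(np.argmin(l1_distances))
print(l1_distances)        # [9 2 6 2]
print(closest_driver)      # index of the nearest driver (first of the ties)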


Vector stores

As generative AI applications continue to push the boundaries of what’s possible in tech, vector stores have emerged as a crucial component, streamlining and optimizing the search and retrieval of relevant data. In our previous discussions, we’ve delved into the advantages of vector DBs over traditional databases, unraveling the concepts of vectors, embeddings, vector search strategies, approximate nearest neighbors (ANNs), and similarity measures. In this section, we aim to provide an integrative understanding of these concepts within the realm of vector DBs and libraries.

The image illustrates a workflow for transforming different types of data—Audio, Text, and Videos— into vector embeddings.

  • Audio: An audio input is processed through an “Audio Embedding model,” resulting in “Audio vector embeddings.”
  • Text: Textual data undergoes processing in a “Text Embedding model,” leading to “Text vector embeddings.”
  • Videos: Video content is processed using a “Video Embedding model,” generating “Video vector embeddings.”

Once these embeddings are created, they are subsequently utilized (potentially in an enterprise vector database system) to perform “Similarity Search” operations. This implies that the vector embeddings can be compared to find similarities, making them valuable for tasks such as content recommendations, data retrieval, and more.

Figure 4.9 – Multimodal embeddings process in an AI application

What is a vector database?

A vector database (vector DB) is a specialized database designed to handle highly dimensional vectors primarily generated from embeddings of complex data types such as text, images, or audio. It provides capabilities to store and index unstructured data, as well as search and retrieval capabilities, as a service.

Modern vector databases that are brimming with advancements empower you to architect resilient enterprise solutions. Here, we list 15 key features to consider when choosing a vector DB. Every feature may not be important for your use case, but it might be a good place to start. Keep in mind that this area is changing fast, so there might be more features emerging in the future:

  • Indexing: As mentioned earlier, indexing refers to the process of organizing highly dimensional vectors in a way that allows for efficient similarity searches and retrievals. A vector DB offers built-in indexing features designed to arrange highly dimensional vectors for swift and effective similarity-based searches and retrievals. Previously, we discussed indexing algorithms such as FAISS and HNSW. Many vector DBs incorporate such features natively. For instance, Azure AI Search integrates the HNSW indexing service directly.
  • Search and retrieval: Instead of relying on exact matches, as traditional databases do, vector DBs provide vector search capabilities as a service, such as approximate nearest neighbors (ANNs), to quickly find vectors that are roughly the closest to a given input. To quantify the closeness or similarity between vectors, they utilize similarity measures such as cosine similarity or Euclidean distance, enabling efficient and nuanced similarity-based searches in large datasets.
  • Create, read, update, and delete: A vector DB manages highly dimensional vectors and offers create, read, update, and delete (CRUD) operations tailored to vectorized data. When vectors are created, they’re indexed for efficient retrieval. Reading often means performing similarity searches to retrieve vectors closest to a given query vector, typically using methods such as ANNs. Vectors can be updated, necessitating potential re-indexing, and they can also be deleted, with the database adjusting its internal structures accordingly to maintain efficiency and consistency.
  • Security: A vector DB should meet compliance requirements such as GDPR, SOC 2 Type II, and HIPAA, make it easy to manage access to the console, and support SSO. Data is encrypted at rest and in transit, and more granular identity and access management features are provided.
  • Serverless: A high-quality vector database is designed to gracefully autoscale with low management overhead as data volumes soar into millions or billions of entries, distributing seamlessly across several nodes. Optimal vector databases grant users the flexibility to adjust the system in response to shifts in data insertion, query frequencies, and underlying hardware configurations.
  • Hybrid search: Hybrid search combines traditional keyword-based search methods with other search mechanisms, such as semantic or contextual search, to retrieve results from both the exact term matches and by understanding the underlying intent or context of the query, ensuring a more comprehensive and relevant set of results.
  • Semantic re-ranking: This is a secondary ranking step to improve the relevance of search results. It re-ranks the search results that were initially scored by state-of-the-art ranking algorithms such as BM25 and RRF based on language understanding. For instance, Azure AI search employs secondary ranking that uses multi-lingual, deep learning models derived from Microsoft Bing to elevate the results that are most relevant in terms of meaning.
  • Auto vectorization/embedding: Auto-embedding in a vector database refers to the automatic process of converting data items into vector representations for efficient similarity searches and retrieval, with access to multiple embedding models.
  • Data replication: This ensures data availability, redundancy, and recovery in case of failures, safeguarding business continuity and reducing data loss risks.
  • Concurrent user access and data isolation: Vector databases support a large number of users concurrently and ensure robust data isolation to ensure updates remain private unless deliberately shared.
  • Auto-chunking: Auto-chunking is the automated process of dividing a larger set of data or content into smaller, manageable pieces or chunks for easier processing or understanding. This process helps preserve the semantic relevance of texts and addresses the token limitations of embedding models. We will learn more about chunking strategies in the upcoming sections in this chapter.
  • Extensive interaction tools: Prominent vector databases, such as Pinecone, offer versatile APIs and SDKs across languages, ensuring adaptability in integration and management.
  • Easy integration: Vector DBs provide seamless integration with LLM orchestration frameworks and SDKs, such as Langchain and Semantic Kernel, and leading cloud providers, such as Azure, GCP, and AWS.
  • User-friendly interface: This ensures an intuitive platform with simple navigation and direct feature access, streamlining the user experience.
  • Flexible pricing models: Pricing adapts to user needs to keep costs low for the user.
  • Low downtime and high resiliency: Resiliency in a vector database (or any database) refers to its ability to recover quickly from failures, maintain data integrity, and ensure continuous availability even in the face of adverse conditions, such as hardware malfunctions, software bugs, or other unexpected disruptions.

As of early 2024, a few prominent open-source vector databases include Chroma, Milvus, Qdrant, and Weaviate, while Pinecone and Azure AI Search are among the leading proprietary solutions.


Vector DB limitations

  • Accuracy vs. speed trade-off: When dealing with highly dimensional data, vector DBs often face a trade-off between speed and accuracy for similarity searches. The core challenge stems from the computational expense of searching for the exact nearest neighbors in large datasets. To enhance search speed, techniques such as ANNs are employed, which quickly identify “close enough” vectors rather than the exact matches. While ANN methods can dramatically boost query speeds, they may sometimes sacrifice pinpoint accuracy, potentially missing the true nearest vectors. Certain vector index methods, such as product quantization, enhance storage efficiency and accelerate queries by condensing and consolidating data at the expense of accuracy.
  • Quality of embedding: The effectiveness of a vector database is dependent on the quality of the vector embedding used. Poorly designed embeddings can lead to inaccurate search results or missed connections.
  • Complexity: Implementing and managing vector databases can be complex, requiring specialized knowledge of vector search strategies, indexing, and chunking strategies to optimize for specific use cases.

Vector libraries

Vector databases may not always be necessary. Small-scale applications may not require all the advanced features that vector DBs provide. In those instances, vector libraries become very valuable. Vector libraries are usually sufficient for small, static data and provide the ability to store embeddings in memory, index them, and use similarity search strategies. However, they may not provide features such as CRUD support, data replication, or the ability to store data on disk; hence, the user has to wait for a full import to finish before they can query. Facebook’s FAISS is a popular example of a vector library.

As a rule of thumb, if you are dealing with millions or billions of records, storing data that changes frequently, requiring millisecond response times, and needing long-term storage capabilities on disk, it is recommended to use vector DBs over vector libraries.
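
As a hedged sketch of the vector-library workflow described above (in-memory, exact search, full import before querying), here is a minimal example using the faiss library with toy data, assuming faiss-cpu is installed:

import numpy as np
import faiss

d = 64
vectors = np.random.random((1_000, d)).astype("float32")   # small, static dataset

# A flat (exact) in-memory index: everything lives in RAM, with no CRUD service or persistence layer
index = faiss.IndexFlatL2(d)
index.add(vectors)          # the full import must finish before querying

query = np.random.random((1, d)).astype("float32")
D, I = index.search(query, 3)   # exact 3 nearest neighbors
print(I, D)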

Vector DBs vs. traditional databases – Understanding the key differences

As stated earlier, vector databases have become pivotal, especially in the era of generative AI, because they facilitate efficient storage, querying, and retrieval of highly dimensional vectors that are nothing but numerical representations of words or sentences, often produced by deep learning models. Traditional scalar databases are designed to handle discrete and simple data types, making them ill-suited for the complexities of large-scale vector data. In contrast, vector databases are optimized for similarity searches in the vector space, enabling the rapid identification of vectors that are “close” or “similar” in highly dimensional spaces. Unlike conventional data models such as relational databases, where queries commonly resemble “retrieve the books borrowed by a particular member” or “identify the items currently discounted,” vector queries primarily seek similarities among vectors based on one or more reference vectors. In other words, queries might look like “identify the top 10 images of dogs similar to the dog in this photo” or “locate the best cafes near my current location.” At retrieval time, vector databases are crucial, as they facilitate the swift and precise retrieval of relevant document embeddings to augment the generation process. This technique is also called RAG, and we will learn more about it in the later sections.

Imagine you have a database of fruit images, and each image is represented by a vector (a list of numbers) that describes its features. Now, let’s say you have a photo of an apple, and you want to find similar fruits in your database. Instead of going through each image individually, you convert your apple photo into a vector using the same method you used for the other fruits. With this apple vector in hand, you search the database to find vectors (and therefore images) that are most similar or closest to your apple vector. The result would likely be other apple images or fruits that look like apples based on the vector representation.

Figure 4.10 – Vector representation