

Real-life examples of fine-tuning success

In this section, we’ll explore a real-life example of a fine-tuning approach that OpenAI implemented, which yielded remarkable outcomes.

InstructGPT

OpenAI’s InstructGPT is one of the most successful examples of a fine-tuned model and laid the foundation for ChatGPT. ChatGPT is said to be a sibling model to InstructGPT, and the methods used to fine-tune ChatGPT are similar to those used for InstructGPT. InstructGPT was created by fine-tuning pre-trained GPT-3 models with RLHF. Supervised fine-tuning is the first step in RLHF for generating responses aligned with human preferences.

GPT-3 models weren’t originally designed to adhere to user instructions; their training focused on predicting the next word based on vast amounts of internet text data. Therefore, these models underwent fine-tuning on instructional datasets, along with RLHF, to enhance their ability to generate more useful and relevant responses aligned with human values when prompted with user instructions:

Figure 3.20 – The fine-tuning process with RLHF

This figure is a schematic representation of the InstructGPT fine-tuning process: (1) initial supervised fine-tuning, (2) training the reward model, and (3) running reinforcement learning through PPO using this established reward model. Blue arrows indicate where this data is used to train the respective models. In step 2, boxes A-D are samples from models that are ranked by labelers.

The following figure compares the response quality of models fine-tuned with RLHF, supervised fine-tuned models, and general GPT models. The Y-axis shows quality ratings of model outputs on a 1-7 Likert scale, for various model sizes (X-axis), on prompts submitted to InstructGPT models via the OpenAI API. The results reveal that labelers rate InstructGPT outputs significantly higher than outputs from GPT-3 models, both with and without few-shot prompts, as well as models fine-tuned with supervised learning. The labelers hired for this work were independent contractors sourced from Scale AI and Upwork:

Figure 3.21 – Evaluation of InstructGPT (image credits: OpenAI)

InstructGPT can be assessed across the dimensions of toxicity, truthfulness, and appropriateness. Higher scores are desirable for TruthfulQA and appropriateness, whereas lower scores are preferred for toxicity and hallucinations. Hallucinations and appropriateness are measured on the distribution of prompts submitted to the OpenAI API. The outcomes are aggregated across various model sizes:

Figure 3.22 – Evaluation of InstructGPT

In this section, we introduced the concept of fine-tuning and discussed a success story of fine-tuning with RLHF that led to the development of InstructGPT.

Summary

Fine-tuning is a powerful technique for customizing models, but it may not always be necessary. As we’ve seen, it can be time-consuming and carries upfront costs. It’s advisable to start with easier, faster strategies, such as prompt engineering with few-shot examples, followed by data grounding using RAG. Only if the LLM’s responses remain suboptimal should you consider fine-tuning. We will discuss RAG and prompt engineering in the following chapters.

In this chapter, we delved into critical fine-tuning strategies tailored for specific tasks. Then, we explored an array of evaluation methods and benchmarks to assess your refined model. The RLHF process ensures your models align with human values, making them helpful, honest, and safe. In the upcoming chapter, we’ll tackle RAG methods paired with vector databases – an essential technique to ground your enterprise data and minimize hallucinations in LLM-driven applications.


The preceding image illustrates the Euclidean distance formula in a 2D space. For two points, (x1, y1) and (x2, y2), it gives the straight-line distance between them in a plane: sqrt((x2 - x1)^2 + (y2 - y1)^2).

  • Distance metrics – Manhattan (L1): Manhattan distance calculates the sum of absolute differences along each dimension. The higher the value, the less similar the two vectors are. The following image depicts the Manhattan distance (or L1 distance) between two points in a 2D space, where the distance is measured along the axes at right angles, similar to navigating city blocks in a grid-like street layout:

Figure 4.8 – Illustration of Manhattan distance

You might be wondering when to select one metric over another during the development of generative AI applications. The decision hinges on several factors, such as the type of data, the context of the application, and the specific requirements of the analysis.

Cosine similarity is preferred over Manhattan and Euclidean distances when the magnitude of the data vectors is less relevant than their direction or orientation. In text analysis, for example, two documents might be represented by high-dimensional vectors of word frequencies. If one document is a longer version of the other, their word frequency vectors will point in the same direction, but the magnitude (length) of one vector will be larger due to the higher word count. Using Euclidean or Manhattan distance would highlight these differences in magnitude, suggesting the documents are different. Using cosine similarity, however, would capture their similarity in content (the direction of the vectors), de-emphasizing the differences in word count. In this context, cosine similarity is more appropriate, as it focuses on the angle between the vectors, reflecting the content overlap of the documents rather than their length or magnitude.

Euclidean and Manhattan distances are more apt than cosine similarity when the magnitude and absolute differences between data vectors are crucial, such as with consistently scaled numerical data (e.g., age, height, weight, and so on) or in spatial applications such as grid-based pathfinding. While cosine similarity emphasizes the orientation or pattern of data vectors, which is especially useful in high-dimensional, sparse datasets, Euclidean and Manhattan distances capture the actual differences between data points, making them preferable in scenarios where absolute value deviations are significant, such as when comparing the medical test results of patients or finding the distance between geographical coordinates on Earth.
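To make this distinction concrete, the following short NumPy sketch (the word-frequency vectors are made up purely for illustration) computes all three measures for a document and a longer copy of it. Euclidean and Manhattan distances grow with the difference in word counts, while cosine similarity stays at 1.0 because the vectors point in the same direction:

import numpy as np

# Hypothetical word-frequency vectors: the second document is a longer
# version of the first, so the counts double but the proportions stay the same.
doc_short = np.array([2.0, 1.0, 3.0])
doc_long = np.array([4.0, 2.0, 6.0])

euclidean = np.linalg.norm(doc_short - doc_long)    # sensitive to magnitude
manhattan = np.sum(np.abs(doc_short - doc_long))    # sensitive to magnitude
cosine = np.dot(doc_short, doc_long) / (np.linalg.norm(doc_short) * np.linalg.norm(doc_long))

print(euclidean)   # ~3.74, suggests the documents differ
print(manhattan)   # 6.0, suggests the documents differ
print(cosine)      # 1.0, identical direction, so identical content profile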

The following is a snippet of code that uses Azure OpenAI endpoints to calculate the similarity between two sentences, “What number of countries do you know?” and “How many countries are you familiar with?”, using the text-embedding-ada-002 embedding model. It gives a score of approximately 0.95:

import os
import openai
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

openai.api_type = "azure"
openai.api_base = "https://ak-deployment-3.openai.azure.com/"
openai.api_version = "2023-07-01-preview"
# replace "2023-07-01-preview" with the latest API version
openai.api_key = "xxxxxxxxxxxxxxxxxxxxxxx"

def get_embedding(text, model="text-embedding-ada-002"):
    return openai.Embedding.create(engine=model, input=[text],
                                   model=model)["data"][0]["embedding"]

embedding1 = get_embedding("What number of countries do you know?",
                           model="text-embedding-ada-002")
embedding2 = get_embedding("How many countries are you familiar with?",
                           model="text-embedding-ada-002")

embedding1_np = np.array(embedding1)
embedding2_np = np.array(embedding2)

similarity = cosine_similarity([embedding1_np], [embedding2_np])
print(similarity)
# [[0.95523639]]

Now, let’s walk through a scenario where cosine similarity is preferred over Manhattan distance.


Vector DB limitations

  • Accuracy vs. speed trade-off: When dealing with high-dimensional data, vector DBs often face a trade-off between speed and accuracy for similarity searches. The core challenge stems from the computational expense of searching for the exact nearest neighbors in large datasets. To enhance search speed, approximate nearest neighbor (ANN) techniques are employed, which quickly identify “close enough” vectors rather than exact matches. While ANN methods can dramatically boost query speeds, they may sometimes sacrifice pinpoint accuracy, potentially missing the true nearest vectors. Certain vector index methods, such as product quantization, enhance storage efficiency and accelerate queries by condensing and consolidating data at the expense of accuracy.
  • Quality of embedding: The effectiveness of a vector database depends on the quality of the vector embeddings used. Poorly designed embeddings can lead to inaccurate search results or missed connections.
  • Complexity: Implementing and managing vector databases can be complex, requiring specialized knowledge of vector search strategies, indexing, and chunking strategies to optimize for specific use cases.

Vector libraries

Vector databases may not always be necessary. Small-scale applications may not require all the advanced features that vector DBs provide. In those instances, vector libraries become very valuable. Vector libraries are usually sufficient for small, static data and provide the ability to store data in memory, index it, and use similarity search strategies. However, they may not provide features such as CRUD support, data replication, or on-disk storage, so the user has to wait for a full import to finish before they can query. Facebook’s FAISS is a popular example of a vector library.
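As a minimal illustration of how a vector library is used (the vectors below are random placeholders, and the dimensionality of 1536 is simply an assumption matching text-embedding-ada-002), FAISS builds an in-memory index that must be fully populated before it can be queried:

import faiss
import numpy as np

dimension = 1536   # e.g., the length of a text-embedding-ada-002 vector
document_vectors = np.random.rand(10000, dimension).astype("float32")   # placeholder embeddings
query_vector = np.random.rand(1, dimension).astype("float32")

index = faiss.IndexFlatL2(dimension)   # exact (brute-force) L2 search, held in memory
index.add(document_vectors)            # the full import must finish before querying
distances, ids = index.search(query_vector, 5)   # distances and IDs of the 5 nearest vectors
print(ids, distances)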

As a rule of thumb, if you are dealing with millions or billions of records, storing data that changes frequently, and require millisecond response times and long-term on-disk storage, it is recommended to use vector DBs over vector libraries.

Vector DBs vs. traditional databases – Understanding the key differences

As stated earlier, vector databases have become pivotal, especially in the era of generative AI, because they facilitate efficient storage, querying, and retrieval of high-dimensional vectors, which are simply numerical representations of words or sentences, often produced by deep learning models. Traditional scalar databases are designed to handle discrete and simple data types, making them ill-suited for the complexities of large-scale vector data. In contrast, vector databases are optimized for similarity searches in the vector space, enabling the rapid identification of vectors that are “close” or “similar” in high-dimensional spaces. Unlike conventional data models such as relational databases, where queries commonly resemble “retrieve the books borrowed by a particular member” or “identify the items currently discounted,” vector queries primarily seek similarities among vectors based on one or more reference vectors. In other words, queries might look like “identify the top 10 images of dogs similar to the dog in this photo” or “locate the best cafes near my current location.” At retrieval time, vector databases are crucial, as they facilitate the swift and precise retrieval of relevant document embeddings to augment the generation process. This technique is called RAG, and we will learn more about it in later sections.

Imagine you have a database of fruit images, and each image is represented by a vector (a list of numbers) that describes its features. Now, let’s say you have a photo of an apple, and you want to find similar fruits in your database. Instead of going through each image individually, you convert your apple photo into a vector using the same method you used for the other fruits. With this apple vector in hand, you search the database to find vectors (and therefore images) that are most similar or closest to your apple vector. The result would likely be other apple images or fruits that look like apples based on the vector representation.

Figure 4.10 – Vector representation


Example of an RAG workflow

Let’s walk through an example step by step, as shown in the image. Imagine a platform where users can ask about ongoing cricket matches, including recent performances, statistics, and trivia:

  1. Suppose the user asks, “How did Virat Kohli perform in the last match, and what’s an interesting fact from that game?” Since the LLM was only trained on data up to April 2023, it may not have this answer.
  2. The retrieval model will embed the query and send it to a vector DB.
  3. All the latest cricket news is stored in a vector DB in a properly indexed format, using ANN strategies such as HNSW. The vector DB performs a cosine similarity search over the indexed information and returns a few relevant results, or context.
  4. The retrieved context is then sent to the LLM along with the query to synthesize the information and provide a relevant answer.
  5. The LLM provides the relevant answer: “Virat Kohli scored 85 runs off 70 balls in the last match. An intriguing detail from that game is that it was the first time in three years that he hit more than seven boundaries in an ODI innings.”

The following image illustrates the preceding points:

Figure 4.11 – Representation of RAG workflow with vector database
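The following minimal sketch mirrors steps 2-5 in code, using the Azure OpenAI setup shown earlier in this chapter; the vector_db client and its similarity_search method are hypothetical stand-ins for whatever vector database you use:

# Assumes `openai` is already configured for Azure OpenAI as shown earlier in this chapter.
query = "How did Virat Kohli perform in the last match, and what's an interesting fact from that game?"

# Step 2: embed the user query
query_embedding = openai.Embedding.create(
    engine="text-embedding-ada-002",
    input=[query])["data"][0]["embedding"]

# Step 3: retrieve the most relevant, recently indexed chunks (hypothetical client)
context_chunks = vector_db.similarity_search(query_embedding, top_k=3)
context = "\n".join(context_chunks)

# Step 4: send the retrieved context along with the query to the LLM
response = openai.ChatCompletion.create(
    engine="gpt-35-turbo",   # assumed Azure deployment name
    messages=[
        {"role": "system", "content": "Answer the question using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}])

# Step 5: the grounded answer synthesized by the LLM
print(response["choices"][0]["message"]["content"])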

Business applications of RAG

In the following list, we have mentioned a few popular business applications of RAG based on what we’ve seen in the industry:

  • Enterprise search engines: One of the most prominent applications of RAG is in the realm of enterprise learning and development, serving as a search engine for employee upskilling. Employees can pose questions about the company, its culture, or specific tools, and RAG swiftly delivers accurate and relevant answers.
  • Legal and compliance: RAG fetches relevant case laws or checks business practices against regulations.
  • Ecommerce: RAG suggests products or summarizes reviews based on user behavior and queries.
  • Customer support: RAG provides precise answers to customer queries by pulling information from the company’s knowledge base and providing solutions in real time.
  • Medical and healthcare: RAG retrieves pertinent medical research or provides preliminary symptom-based suggestions.


Evaluation of RAG using Azure Prompt Flow

Up to this point, we have discussed the development of resilient RAG applications. However, the question arises: How can we determine whether these applications are functioning as anticipated and if the context they retrieve is pertinent? While manual validation—comparing the responses generated by LLMs against ground truth—is possible, this method proves to be labor-intensive, costly, and challenging to execute on a large scale. Consequently, it’s essential to explore methodologies that facilitate automated evaluation on a vast scale. Recent research has delved into the concept of utilizing “LLM as a judge” to assess output, a strategy that Azure Prompt Flow incorporates within its offerings.

Azure Prompt Flow has built-in, structured metaprompt templates with comprehensive guardrails to evaluate your output against ground truth. The following list describes four metrics that can help you evaluate your RAG solution in Prompt Flow:

  • Groundedness: Measures the alignment of the model’s answers with the input source, making sure the model’s generated response is not fabricated. The model must always extract information from the provided “context” when responding to the user’s query.
  • Relevance: Measures the degree to which the model’s generated response is closely connected to the context and user query.
  • Retrieval score: Measures the extent to which the model’s retrieved documents are pertinent and directly related to the given questions.
  • Custom metrics: While the preceding three are the most important for evaluating RAG applications, Prompt Flow allows you to use custom metrics, too. Bring your own LLM as a judge and define your own metrics by modifying the existing metaprompts. This also allows you to use open source models such as Llama and to build your own metrics from code with Python functions (a minimal sketch of the LLM-as-a-judge idea follows this list). The preceding evaluations are more no-code or low-code friendly; for a more pro-code-friendly approach, the azureml-metrics SDK, which provides metrics such as ROUGE, BLEU, F1 score, precision, and accuracy, can be used as well.
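The following is a small, illustrative sketch of the LLM-as-a-judge idea behind these metrics; the metaprompt wording and the 1-5 scale are assumptions for demonstration and do not reproduce Prompt Flow’s built-in templates:

# Assumes `openai` is already configured for Azure OpenAI as shown earlier in this chapter.
def judge_groundedness(answer, context, engine="gpt-35-turbo"):
    # Ask an LLM to rate how well the answer is supported by the retrieved context.
    metaprompt = (
        "You are an impartial evaluator. On a scale of 1 (not grounded) to 5 "
        "(fully grounded), rate how well the ANSWER is supported by the CONTEXT. "
        "Reply with the number only.\n\n"
        f"CONTEXT:\n{context}\n\nANSWER:\n{answer}")
    response = openai.ChatCompletion.create(
        engine=engine,
        temperature=0,
        messages=[{"role": "user", "content": metaprompt}])
    return int(response["choices"][0]["message"]["content"].strip())

# Example usage: score = judge_groundedness(generated_answer, retrieved_context)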

The field is advancing quickly, so we recommend regularly checking Azure ML Prompt Flow’s latest updates on evaluation metrics. Start with the “Manual Evaluation” feature in Prompt Flow to gain a basic understanding of LLM performance. It’s important to use a mix of metrics for a thorough evaluation that captures both semantic and syntactic essence rather than relying on just one metric to compare the responses with the actual ground truth.


Case study – Global chat application deployment by a multinational organization

A global firm recently launched an advanced internal chat application featuring a Q&A support chatbot. This innovative tool, deployed across various Azure regions, integrates several large language models, including the specialized finance model BloombergGPT. To meet specific organizational requirements, bespoke plugins were developed. The application also integrated with ServiceNow, empowering the chatbot to streamline ticket generation and manage incident actions.

In terms of data refinement, the company meticulously preprocessed its knowledge base (KB) information, eliminating duplicates, special symbols, and stop words. The KB consisted of answers to frequently asked questions and general information for various support-related questions. They employed fixed chunking approaches, exploring varied chunk sizes, before embedding the data into Azure AI Search. Their methodology utilized Azure OpenAI’s text-embedding-ada-002 model in tandem with the cosine similarity metric and Azure AI Search’s vector search capabilities.

From their extensive testing, they discerned optimal results with a chunk size of 512 tokens and a 10% overlap. Moreover, they adopted an ANN vector search methodology using cosine similarity. They also incorporated hybrid search, combining keyword and semantic search with the Semantic Reranker. Their RAG workflow, drawing context from Azure AI Search and the GPT-3.5 Turbo 16K model, proficiently generated responses to customer support inquiries. They implemented caching techniques using Azure Cache for Redis and rate-limiting strategies using Azure API Management to optimize costs.
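As an illustration of the fixed chunking described above, the following sketch splits text into 512-token chunks with a 10% overlap using the tiktoken tokenizer; the function and variable names are ours, not the firm’s actual implementation:

import tiktoken

def fixed_chunks(text, chunk_size=512, overlap_ratio=0.1):
    # Split text into fixed-size token chunks, each overlapping the previous one by ~10%.
    encoding = tiktoken.get_encoding("cl100k_base")   # tokenizer used by ada-002 / GPT-3.5
    tokens = encoding.encode(text)
    step = int(chunk_size * (1 - overlap_ratio))      # 512 tokens with 10% overlap -> step of 460
    return [encoding.decode(tokens[start:start + chunk_size])
            for start in range(0, len(tokens), step)]

# Each chunk would then be embedded and written to the search index.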

The integration of the support Q&A chatbot significantly streamlined the multinational firm’s operations, offering around-the-clock, consistent, and immediate responses to queries, thereby enhancing user satisfaction. This not only brought about substantial cost savings by reducing human intervention but also ensured scalability to handle global demands. By automating tasks such as ticket generation, the firm gained deeper insights into user interactions, allowing for continuous improvement and refinement of their services.

Summary

In this chapter, we explored the RAG approach, a powerful method for leveraging your data to craft personalized experiences, reduce hallucinations, and address the training limitations inherent in LLMs. Our journey began with an examination of foundational concepts such as vectors and databases, with a special focus on vector databases. We saw the critical role that vector DBs play in the development of RAG-based applications and highlighted how they can enhance LLM responses through effective chunking strategies. The discussion also covered practical insights on building engaging RAG experiences, evaluating them through Prompt Flow, and included a hands-on lab available on GitHub to apply what we’ve learned.

In the next chapter, we will introduce another popular technique designed to minimize hallucinations and more easily steer the responses of LLMs. We will cover prompt engineering strategies, empowering you to fully harness the capabilities of your LLMs and engage more effectively with AI. This exploration will provide you with the tools and knowledge to enhance your interactions with AI, ensuring more reliable and contextually relevant outputs.


What is prompt engineering?

Prompt engineering is the art of crafting or designing prompts to unlock desired outcomes from large language models or AI systems. The concept of prompt engineering revolves around the fundamental idea that the quality of your response is intricately tied to the quality of the question you pose. By strategically engineering prompts, one can influence the generated outputs and improve the overall performance and usefulness of the system. In this section, we will learn about the necessary elements of effective prompt design, prompt engineering techniques, best practices, bonus tips, and tricks.

Elements of a good prompt design

Designing a good prompt is important because it significantly influences the output of a language model such as GPT. The prompt provides the initial context, sets the task, guides the style and structure of the response, reduces ambiguities and hallucinations, and supports the optimization of resources, thereby reducing costs and energy use. In this section, let’s understand the elements of good prompt design.

The foundational elements of a good prompt include instructions, questions, input data, and examples:

  • Instructions: The instructions in a prompt refer to the specific guidelines or directions given to a language model within the input text to guide the kind of response it should produce.
  • Questions: Questions in a prompt refer to queries or interrogative statements included in the input text. The purpose of these questions is to instruct the language model to provide a response or an answer to the query. To obtain results, either a question or an instruction is mandatory.
  • Input data: The purpose of input data is to provide any additional supporting context when prompting the LLM. It could be used to provide new information the model has not previously been trained on for more personalized experiences.
  • Examples: The purpose of examples in a prompt is to provide specific instances or scenarios that illustrate the desired behavior or response from ChatGPT. You can input a prompt that includes one or more examples, typically in the form of input-output pairs.

The following table shows how to build effective prompts using the aforementioned prompt elements:

Figure 5.5 – Sample Prompt formula consisting of prompt elements with examples
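To make these elements concrete, here is a small, entirely hypothetical prompt that combines an instruction, input data, an example, and a question:

Instruction: You are a travel assistant. Answer in one sentence using only the input data.
Input data: Flight AB123 departs Paris at 09:40 and lands in Rome at 11:45.
Example: Q: When does flight XY987 leave Madrid? A: Flight XY987 departs Madrid at 14:05.
Question: When does flight AB123 arrive in Rome?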


Assistant

While not mandatory, assistant messages can be included to illustrate the preferred conduct or response patterns of the chatbot.

Now, let’s look at an example. If you are developing an application for a fitness AI assistant, a well-rounded system message might look as follows:

openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": """You are a fitness AI assistant. Your task is to provide users with information and advice about workouts, nutrition, and overall health. You will help users by answering their questions to the best of your abilities and providing suggestions based on their input. Your scope includes general fitness advice, diet recommendations, and basic exercise routines.

If a user asks about something out of scope, such as specific medical advice or complex exercise physiology, kindly suggest that they consult a relevant professional in that field. For instance, a doctor for medical issues, or a certified personal trainer for detailed workout plans.

You should maintain a supportive and positive tone, being respectful and encouraging to all users, regardless of their current fitness level or knowledge. Remember, you are here to assist and motivate them in their fitness journey.

Your responses should be concise, easy to understand, and practically useful. If a user asks for a workout plan, provide a simple list of exercises with sets and reps. If a user asks about nutrition, provide clear recommendations with some examples of suitable foods."""
        },
        {
            "role": "user",
            "content": "I'm new to fitness, can you suggest a simple workout plan to get started?"
        },
        {
            "role": "assistant",
            "content": """Absolutely, I'm here to help you get started on your fitness journey. Here's a simple workout plan you can begin with:

1. Warm-up: 5-10 minutes of brisk walking or light jogging
2. Squats: 2 sets of 10 reps
3. Push-ups: 2 sets of 10 reps (you can start with knee push-ups if full push-ups are too challenging)
4. Lunges: 2 sets of 10 reps per leg
5. Plank: Hold for 30 seconds, repeat 2 times
6. Cool down: 5-10 minutes of stretching

Remember, the key to a successful workout routine is consistency. Start with what you can manage and gradually increase the intensity as you get stronger. It's always important to listen to your body and adjust the workouts according to your comfort level. Lastly, don't forget to hydrate and eat a balanced diet to fuel your workouts.

Please consult a fitness professional if you're unsure about the correct form for any of these exercises to avoid injury. Enjoy your workout!"""
        }
    ]
)

The following link provides great examples of “Act As” prompts for you to leverage in your system message:

Awesome ChatGPT Prompts (https://github.com/f/awesome-chatgpt-prompts/).


Techniques for effective prompt engineering

Over the past two years, a wide array of prompt-engineering techniques have been developed. This section focuses on the essential ones, offering key strategies that you might find indispensable for daily interactions with ChatGPT and other LLM-based applications.

N-shot prompting

N-shot prompting is a term used when prompting large language models, particularly for zero-shot or few-shot learning tasks. It is also called in-context learning and refers to the technique of providing the model with example prompts and their corresponding responses directly in the prompt to steer the model’s behavior toward more accurate responses.

The “N” in “N-shot” refers to the number of example prompts provided to the model. For instance, in a one-shot learning scenario, only one example prompt and its response are given to the model. In an N-shot learning scenario, multiple example prompts and responses are provided.

While ChatGPT works great with zero-shot prompting, it may sometimes be useful to provide examples for a more accurate response. Let’s see some examples of zero-shot and few-shot prompting:

Figure 5.8 – N-shot prompting examples
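For instance, a hypothetical sentiment-classification task could be posed zero-shot or with two examples (two-shot) to steer the output format:

Zero-shot prompt:
Classify the sentiment of this review as Positive or Negative: "The battery died after one day."

Two-shot prompt:
Review: "Great screen and fast delivery." Sentiment: Positive
Review: "The box arrived damaged and support never replied." Sentiment: Negative
Review: "The battery died after one day." Sentiment: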

Chain-of-thought (CoT) prompting

Chain-of-thought prompting refers to a sequence of intermediate reasoning steps that significantly boosts the capability of large language models to tackle complex reasoning tasks. By presenting a few chain-of-thought demonstrations as examples in the prompts, the models handle intricate reasoning tasks more proficiently:

Figure 5.9 – Chain-of-Thought Prompting Examples

Figure sourced from https://arxiv.org/pdf/2201.11903.pdf.
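In the spirit of the examples from the cited paper, a chain-of-thought demonstration spells out the intermediate arithmetic instead of jumping straight to the answer:

Standard answer:
Q: A cafeteria had 23 apples. It used 20 to make lunch and bought 6 more. How many apples does it have?
A: The answer is 9.

Chain-of-thought answer:
Q: A cafeteria had 23 apples. It used 20 to make lunch and bought 6 more. How many apples does it have?
A: The cafeteria started with 23 apples and used 20, leaving 23 - 20 = 3. It then bought 6 more, so 3 + 6 = 9. The answer is 9.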

Program-aided language (PAL) models

Program-aided language (PAL) modeling, also called program-of-thought (PoT) prompting, is a technique that incorporates additional task-specific instructions, pseudo-code, rules, or programs alongside the free-form text to guide the behavior of a language model:

Figure 5.10 – Program-aided language prompting examples

Figure sourced from https://arxiv.org/abs/2211.10435.
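In the spirit of the examples from the cited paper, a PAL-style prompt asks the model to reason by emitting a short program whose execution yields the answer; here the model’s reply is Python:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?

# The model responds with runnable code rather than free-form text:
tennis_balls = 5
bought_balls = 2 * 3           # 2 cans of 3 balls each
answer = tennis_balls + bought_balls
print(answer)                  # 11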

In this section, although we have not explored all prompt engineering techniques (only the most important ones), we want to convey to our readers that there are numerous variants of these techniques, as illustrated in the following figure from the research paper A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications (https://arxiv.org/pdf/2402.07927.pdf). This paper provides an extensive inventory of prompt engineering strategies across various application areas, showcasing the evolution and breadth of this field over the last four years:

Figure 5.11 – Taxonomy of prompt engineering techniques across multiple application domains


Bonus tips and tricks

The following list provides some helpful bonus tips and tricks:

  • Use of tags: Tags such as <begin>, <end>, and <|endofprompt|> mark the beginning and end of a prompt and can help separate its different elements. This can help generate high-quality output.
  • Use of languages: Though ChatGPT performs best with English, it can be used to generate responses in several other languages.
  • Obtaining the most accurate, up-to-date information: This can be achieved by using the grounding process with a retrieval augmented generation (RAG) architecture and plugins, as discussed in Chapter 4 already. This helps in addressing the knowledge cutoff limitation of LLMs.

Ethical guidelines for prompt engineering

Prompt engineering is a critical stage where AI behavior is molded, and incorporating ethics at this level helps ensure that AI language models are developed and deployed responsibly. It promotes fairness, transparency, and user trust while avoiding potential risks and negative societal impact.

While Chapter 4 delved further into constructing ethical generative AI solutions, in this section, our focus will be on briefly discussing the integration of ethical approaches at the prompt engineering level:

  • Diverse and representative data
  • When fine-tuning the model with few-shot examples, use training data that represent diverse perspectives and demographics.
  • If the AI language model is intended for healthcare, the training data should cover medical cases from different demographics and regions.
  • For instance, if a user poses a question to the LLM, such as, “Can you describe some global traditional festivals?” the response should offer a comprehensive view that encompasses a multitude of countries rather than focusing on just one. This can be ensured by including diverse few-shot examples in the prompts.
  • Bias detection and mitigation
  • Identify and address biases in the model’s outputs to ensure fairness.
  • Implement debiasing techniques to reduce gender or racial biases.
  • Ensure that generated content related to sensitive topics is neutral and unbiased.
  • For instance, if a user asks the LLM, “What is the gender of a nurse?” improperly trained models might default to “female” due to biases in their training data. To address this, it’s vital to incorporate few-shot examples that emphasize nurses can be of any gender, be it male or female.
  • Reduce misinformation and disinformation
  • As AI language models can inadvertently generate false or misleading information due to model “hallucinations,” implement measures to minimize the spread of misinformation and disinformation through carefully crafted prompts and responses.
  • For example, based on the guidelines from the prompt engineering section and Chapter 3’s grounding techniques, system prompts should clearly state their scope, such as, “Your scope is XYZ.” If a user asks about something outside this, such as ABC, the system should have a set response.