Real-life examples of fine-tuning success – Fine-Tuning – Building Domain-Specific LLM Applications

Real-life examples of fine-tuning success

In this section, we’ll explore a real-life example of a fine-tuning approach that OpenAI implemented, which yielded remarkable outcomes.

InstructGPT

OpenAI’s InstructGPT is one of the most successful stories of fine-tuned models that laid the foundation of ChatGPT. ChatGPT is said to be a sibling model to InstructGPT. The methods that are used to fine-tune ChatGPT are similar to InstructGPT. InstructGPT was created by fine-tuning pre-trained GPT-3 models with RHLF. Supervised fine-tuning is the first step in RLHF for generating responses aligned to human preferences.

In the beginning, GPT-3 models weren’t originally designed to adhere to user instructions. Their training focused on predicting the next word based on vast amounts of internet text data. Therefore, these models underwent fine-tuning using instructional datasets along with RLHF to enhance their ability to generate more useful and relevant responses aligned with human values when prompted with user instructions:

Figure 3.20 – The fine-tuning process with RLHF

This figure depicts a schematic representation showcasing the InstructGPT fine- tuning process: (1) initial supervised fine-tuning, (2) training the reward model, and (3) executing RL through PPO using this established reward model. The utilization of this data to train respective models is indicated by the presence of blue arrows. In step 2, boxes A-D are samples from models that get ranked by labelers.

The following figure provides a comparison of the response quality of fine-tuned models with RLHF, supervised fine-tuned models, and general GPT models. The Y-axis consists of a Likert scale and shows quality ratings of model outputs on a 1–7 scale (Y-axis), for various model sizes (X-axis), on prompts submitted to InstructGPT models via the OpenAI API. The results reveal that InstructGPT outputs receive significantly higher scores by labelers compared to outputs from GPT-3 models with both few-shot prompts and those without, as well as models that underwent supervised learning fine-tuning. The labelers that were hired for this work were independent and were sourced from Scale AI and Upwork:

Figure 3.21 – Evaluation of InstructGPT (image credits: Open AI)

InstructGPT can be assessed across dimensions of toxicity, truthfulness, and appropriateness. Higher scores are desirable for TruthfulQA and appropriateness, whereas lower scores are preferred for toxicity and hallucinations. Measurement of hallucinations and appropriateness is conducted based on the distribution of prompts within our API. The outcomes are aggregated across various model sizes:

Figure 3.22 – Evaluation of InstructGPT

In this section, we introduced the concept of fine-tuning and discussed a success stories of fine-tuning with RLHF that led to the development of InstructGPT.

Summary

Fine-tuning is a powerful technique for customizing models, but it may not always be necessary. As observed, it can be time-consuming and may have initial upfront costs. It’s advisable to start with easier and faster strategies, such as prompt engineering with few- shot examples, followed by data grounding using RAG. Only if the responses from the LLM remain suboptimal should you consider fine-tuning. We will discuss RAG and prompt engineering in the following chapters.

In this chapter, we delved into critical fine-tuning strategies tailored for specific tasks. Then, we explored an array of evaluation methods and benchmarks to assess your refined model. The RLHF process ensures your models align with human values, making them helpful, honest, and safe. In the upcoming chapter, we’ll tackle RAG methods paired with vector databases – an essential technique to ground your enterprise data and minimize hallucinations in LLM-driven applications.

A deep dive into vector DB essentials – RAGs to Riches: Elevating AI with External Data

A deep dive into vector DB essentials

To fully comprehend RAG, it’s imperative to understand vector DBs because RAG relies heavily on its efficient data retrieval for query resolution. A vector DB is a database designed to store and efficiently query highly dimensional vectors and is often used in similarity searches and machine learning tasks. The design and mechanics of vector DBs directly influence the effectiveness and accuracy of RAG answers.

In this section, we will cover the fundamental components of vector DBs (vectors and vector embeddings), and in the next section, we will dive deeper into the important characteristics of vector DBs that enable a RAG-based generative AI solution. We will also explain how it differs from regular databases and then tie it all back to explain RAG.

Vectors and vector embeddings

A vector is a mathematical object that has both magnitude and direction and can be represented by an ordered list of numbers. In a more general sense, especially in computer science and machine learning, a vector can be thought of as an array or list of numbers that represents a point in a certain

dimensional space. For instances depicted in the following image, in 2D space (on the left), a vector might be represented as [x, y], whereas in 3D space (on the right), it might be [x, y, z]:

Figure 4.2 – Representation of vectors in 2D and 3D space

Vector embedding refers to the representation of objects, such as words, sentences, or even entire documents, as vectors in a highly dimensional space. A highly dimensional space denotes a mathematical space with more than three dimensions, frequently used in data analysis and machine learning to represent intricate data structures. Think of it as a room where you can move in more than three directions, facilitating the description and analysis of complex data. The embedding process converts words, sentences, or documents into vector representations, capturing the intricate semantic relationships between them. Hence, words with similar meanings tend to be close to each other in the highly dimensional space. Now, you must be wondering how this plays a role in designing generative AI solutions consisting of LLMs. Vector embeddings provide the foundational representation of data. They are a standardized numerical representation for diverse types of data, which LLMs use to process and generate information. Such an embedding process to convert words and sentences to a numerical representation is initiated by embedding models such as OpenAI’s text-embedding-ada-002. Let’s explain this with an example.

When to Use HNSW vs. FAISS – RAGs to Riches: Elevating AI with External Data

When to Use HNSW vs. FAISS

Use HNSW when:

  • High precision in similarity search is crucial.
  • The dataset size is large but not at the scale where managing it becomes impractical for HNSW.
  • Real-time or near-real-time search performance is required.
  • The dataset is dynamic, with frequent updates or insertions.
  • Apt for use cases involving text like article recommendation systems

Use FAISS when:

  • Managing extremely large datasets (e.g., billions of vectors).
  • Batch processing and GPU optimization can significantly benefit the application.
  • There’s a need for flexible trade-offs between search speed and accuracy.
  • The dataset is relatively static, or batch updates are acceptable.
  • Apt for use cases like image and video search.

Note

Choosing the right indexing strategy hinges on several critical factors, including the nature and structure of the data, the types of queries (e.g. range queries, nearest neighbors, exact search) to be supported, and the volume and growth of the data. Additionally, the frequency of data updates (e.g., static vs dynamic) the dimensionality of the data, performance requirements (real-time, batch), and resource constraints play significant roles in the decision-making process.

Similarity measures

Similarity measures dictate how the index is organized, and this makes sure that the retrieved data are highly relevant to the query. For instance, in a system designed to retrieve similar images, the index might be built around the feature vectors of images, and the similarity measure would determine which images are “close” or “far” within that indexed space. The importance of these concepts is two-fold: indexing significantly speeds up data retrieval, and similarity measures ensure that the retrieved data is relevant to the query, together enhancing the efficiency and efficacy of data retrieval systems. Selecting an appropriate distance metric greatly enhances the performance of classification and clustering tasks. The optimal similarity measure is chosen based on the nature of the data input.

In other words, similarity measures define how closely two items or data points are related. They can be broadly classified into distance metrics and similarity metrics. Next, we’ll explore the three top similarity metrics for building AI applications: cosine similarity and Euclidean and Manhattan distance.

  • Similarity metrics – Cosine similarity: Cosine similarity, a type of similarity metric, calculates the cosine value of the angle between two vectors, and OpenAI suggests using it for its models to measure the distance between two embeddings obtained from text-embedding-ada-002. The higher the metric, the more similar they are:

Figure 4.6 – Illustration of relatedness through cosine similarity between two words

The preceding image shows a situation where the cosine similarity is 1 for India and the USA because they are related, as both are countries. In the other image, the similarity is 0 because football is not similar to a lion.

  • Distance metrics – Euclidean (L2): Euclidean distance computes the straight-line distance between two points in Euclidean space. The higher the metric, the less similar the two points are:

Figure 4.7 – Illustration of Euclidean distance

Vector DB sample scenario – Music recommendation system using a vector database – RAGs to Riches: Elevating AI with External Data

Vector DB sample scenario – Music recommendation system using a vector database

Let’s consider a music streaming platform aiming to provide song recommendations based on a user’s current listening. Imagine a user who is listening to “Song X” on the platform.

Behind the scenes, every song in the platform’s library is represented as a highly dimensional vector based on its musical features and content, using embeddings. “Song X” also has its vector representation. When the system aims to recommend songs similar to “Song X,” it doesn’t look for exact matches (as traditional databases might). Instead, it leverages a vector DB to search for songs with vectors closely resembling that of “Song X.” Using an ANN search strategy, the system quickly sifts through millions of song vectors to find those that are approximately nearest to the vector of “Song X.” Once potential song vectors are identified, the system employs similarity measures, such as cosine similarity, to rank these songs based on how close their vectors are to “Song X’s” vector. The top-ranked songs are then recommended to the user.

Within milliseconds, the user gets a list of songs that musically resemble “Song X,” providing a seamless and personalized listening experience. All this rapid, similarity-based recommendation magic is powered by the vector database’s specialized capabilities.

Common vector DB applications

  • Image and video similarity search: In the context of image and video similarity search, a vector DB specializes in efficiently storing and querying highly dimensional embeddings derived from multimedia content. By processing images through deep learning models, they are converted into feature vectors, a.k.a embeddings, that capture their essential characteristics. When it comes to videos, an additional step may need to be carried out to extract frames and then convert them into vector embeddings. Contrastive language-image pre-training (CLIP) from OpenAI is a very popular choice for embedding videos and images. These vector embeddings are indexed in the vector DB, allowing for rapid and precise retrieval when a user submits a query. This mechanism powers applications such as reverse image and video search, content recommendations, and duplicate detection by comparing and ranking content based on the proximity of their embeddings.
  • Voice recognition: Voice recognition with vectors is akin to video vectorization. Analog audio is digitized into short frames, each representing an audio segment. These frames are processed and stored as feature vectors, with the entire audio sequence representing things such as spoken sentences or songs. For user authentication, a vectorized spoken key phrase might be compared to stored recordings. In conversational agents, these vector sequences can be inputted into neural networks to recognize and classify spoken words in speech and generate responses, similar to ChatGPT.
  • Long-term memory for chatbots: Virtual database management systems (VDBMs) can be employed to enhance the long-term memory capabilities of chatbots or generative models. Many

generative models can only process a limited amount of preceding text in prompt responses, which results in their inability to recall details from prolonged conversations. As these models don’t have inherent memory of past interactions and can’t differentiate between factual data and user-specific details, using VDBMs can provide a solution for storing, indexing, and referencing previous interactions to improve consistency and context-awareness in responses.

This is a very important use case and plays a key role in implementing RAG, which we will discuss in the next section.

The role of vector DBs in retrieval-augmented generation (RAG) – RAGs to Riches: Elevating AI with External Data

The role of vector DBs in retrieval-augmented generation (RAG)

To fully understand RAG and the pivotal role of vector DBs within it, we must first acknowledge the inherent constraints of LLMs, which paved the way for the advent of RAG techniques powered by vector DBs. This section sheds light on the specific LLM challenges that RAG aims to overcome and the importance of vector DBs.

First, the big question – Why?

In Chapter 1, we delved into the limitations of LLMs, which include the following:

  • LLMs possess a fixed knowledge base determined by their training data; as of February 2024, ChatGPT’s knowledge is limited to information up until April 2023.
  • LLMs can occasionally produce false narratives, spinning tales or facts that aren’t real.
  • They lack personal memory, relying solely on the input context length. For example, take GPT4-32K; it can only process up to 32K tokens between prompts and completions (we’ll dive deeper into prompts, completions, and tokens in Chapter 5).

To counter these challenges, a promising avenue is enhancing LLM generation with retrieval components. These components can extract pertinent data from external knowledge bases—a process termed RAG, which we’ll explore further in this section.

So, what is RAG, and how does it help LLMs?

Retrievalaugmented generation (RAG) was first introduced in a paper titled Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (https://arxiv.org/pdf/2005.11401. pdf) in November 2020 by Facebook AI Research (now Meta). RAG is an approach that combines the generative capabilities of LLMs with retrieval mechanisms to extract relevant information from vast datasets. LLMs, such as the GPT variants, have the ability to generate human-like text based on patterns in their training data but lack the means to perform real-time external lookups or reference specific external knowledge bases post-training. RAG addresses this limitation by using a retrieval model to query a dataset and fetch relevant information, which then serves as the context for the generative model to produce a detailed and informed response. This also helps in grounding the LLM queries with relevant information that reduces the chances of hallucinations.

The critical role of vector DBs

A vector DB plays a crucial role in facilitating the efficient retrieval aspect of RAG. In this setup, each piece of information, such as text, video, or audio, in the dataset is represented as a highly dimensional vector and indexed in a vector DB. When a query from a user comes in, it’s also converted into a similar vector representation. The vector DB then rapidly searches for vectors (documents) in the dataset that are closest to the query vector, leveraging techniques such as ANN search. Then, it attaches the query with relevant content and sends it to the LLMs to generate a response. This ensures that the most relevant information is retrieved quickly and efficiently, providing a foundation for the generative model to build upon.

Example of an RAG workflow – RAGs to Riches: Elevating AI with External Data

Example of an RAG workflow

Let’s walk through as an example step by step, as shown in the image. Imagine a platform where users can ask about ongoing cricket matches, including recent performances, statistics, and trivia:

  1. Suppose the user asks, “How did Virat Kohli perform in the last match, and what’s an interesting fact from that game?” Since the LLM was trained until April 2023, the LLM may not have this answer.
  2. The retrieval model will embed the query and send it to a vector DB.
  3. All the latest cricket news is stored in a vector DB in a properly indexed format using ANN strategies such as HNSW. The vector DB performs a cosine similarity with the indexed information and provides a few relevant results or contexts.
  4. The retrieved context is then sent to the LLM along with the query to synthesize the information and provide a relevant answer.
  5. The LLM provides the relevant answer: “Virat Kohli scored 85 runs off 70 balls in the last match. An intriguing detail from that game is that it was the first time in three years that he hit more than seven boundaries in an ODI inning.”

The following image illustrates the preceding points:

Figure 4.11 – Representation of RAG workflow with vector database

Business applications of RAG

In the following list, we have mentioned a few popular business applications of RAG based on what we’ve seen in the industry:

  • Enterprise search engines: One of the most prominent applications of RAG is in the realm of enterprise learning and development, serving as a search engine for employee upskilling. Employees can pose questions about the company, its culture, or specific tools, and RAG swiftly delivers accurate and relevant answers.
  • Legal and compliance: RAG fetches relevant case laws or checks business practices against regulations.
  • Ecommerce: RAG suggests products or summarizes reviews based on user behavior and queries.
  • Customer support: RAG provides precise answers to customer queries by pulling information from the company’s knowledge base and providing solutions in real time.
  • Medical and healthcare: RAG retrieves pertinent medical research or provides preliminary symptom-based suggestions.

The essentials of prompt engineering – Effective Prompt Engineering Techniques: Unlocking Wisdom Through AI

The essentials of prompt engineering

Before discussing prompt engineering, it is important to first understand the foundational components of a prompt. In this section, we’ll delve into the key components of a prompt, such as ChatGPT prompts, completions, and tokens. Additionally, grasping what tokens are is pivotal to understanding the model’s constraints and managing costs.

ChatGPT prompts and completions

A prompt is an input provided to LLMs, whereas completions refer to the output of LLMs. The structure and content of a prompt can vary based on the type of LLM (e.g., the text or image generation model), specific use cases, and the desired output of the language model.

Completions refer to the response generated by ChatGPT prompts; basically, it is an answer to your questions. Check out the following example to understand the difference between prompts and completions when we prompt ChatGPT with, “What is the capital of India?”

Figure 5.2 – An image showing a sample LLM prompt and completion

Based on the use case, we can leverage one of the two ChatGPT API calls, named Completions or ChatCompletions, to interact with the model. However, OpenAI recommends using the ChatCompletions API in the majority of scenarios.

Completions API

The Completions API is designed to generate creative, free-form text. You provide a prompt, and the API generates text that continues from it. This is often used for tasks where you want the model to answer a question or generate creative text, such as for writing an article or a poem.

ChatCompletions API

The ChatCompletions API is designed for multi-turn conversations. You send a series of messages instead of a single prompt, and the model generates a message as a response. The messages sent to the model include a role (which can be a system, user, or assistant) and the content of the message. The system role is used to set the behavior of the assistant, the user role is used to instruct the assistant, and the model’s responses are under the assistant role.

The following is an example of a sample ChatCompletions API call:

import openai

openai.api_key = ‘your-api-key’

response = openai.ChatCompletion.create(

model=”gpt-3.5-turbo”,

messages=[

{“role”: “system”, “content”: “You are a helpful sports \

assistant.”},

{“role”: “user”, “content”: “Who won the cricket world cup \

in 2011?”},

{“role”: “assistant”, “content”: “India won the cricket \

world cup in 2011″},

{“role”: “assistant”, “content”: “Where was it played”}

]

)

print(response[‘choices’][0][‘message’][‘content’])

The main difference between the Completions API and ChatCompletions API is that the Completions API is designed for single-turn tasks, while the ChatCompletions API is designed to handle multiple turns in a conversation, making it more suitable for building conversational agents. However, the ChatCompletions API format can be modified to behave as a Completions API by using a single user message.

Important note

The CompletionsAPI, launched in June 2020, initially offered a freeform text interface for Open AI’s language models. However, experience has shown that structured prompts often yield better outcomes. The chat-based approach, especially through the ChatCompletions API, excels in addressing a wide array of needs, offering enhanced flexibility and specificity and reducing prompt injection risks. Its design supports multi-turn conversations and a variety of tasks, enabling developers to create advanced conversational experiences. Hence, Open AI announced that they would be deprecating some of the older models using Completions API and, in moving forward, they would be investing in the ChatCompletions API to optimize their efforts to use compute capacity. While the Completions API will remain accessible, it shall be labeled as “legacy” in the Open AI developer documentation.

Tokens

Understanding the concepts of tokens is essential, as it helps us better comprehend the restrictions, such as model limitations, and the aspect of cost management when utilizing ChatGPT.

A ChatGPT token is a unit of text that ChatGPT’s language model uses to understand and generate language. In ChatGPT, a token is a sequence of characters that the model uses to generate new sequences

of tokens and form a coherent response to a given prompt. The models use tokens to represent words, phrases, and other language elements. The tokens are not cut where the word starts or ends but can consist of trailing spaces, sub words and punctuations, too.

As stated on the OpenAI website, tokens can be thought of as pieces of words. Before the API processes the prompts, the input is broken down into tokens.

To understand tokens in terms of lengths, the following is used as a rule of thumb:

  • 1 token ~= 4 chars in English
  • 1 token ~= ¾ words
  • 100 tokens ~= 75 words
  • 1–2 sentences ~= 30 tokens
  • 1 paragraph ~= 100 tokens
  • 1,500 words ~= 2048 tokens
  • 1 US page (8 ½” x 11”) ~= 450 tokens (assuming ~1800 characters per page)

For example, this famous quote from Thomas Edison (“Genius is one percent inspiration and ninety-nine percent perspiration.”) has 14 tokens:

Figure 5.3 – Tokenization of sentence

We used the OpenAI Tokenizer tool to calculate the tokens; the tool can be found at https:// platform.openai.com/tokenizer. An alternative way to tokenize text (programmatically) is to use the Tiktoken library on Github; this can be found at https://github.com/openai/

tiktoken.

Token limits in ChatGPT models – Effective Prompt Engineering Techniques: Unlocking Wisdom Through AI

Token limits in ChatGPT models

Depending on the model, the token limits on the model will vary. As of Feb 2024, the token limit for the family of GPT-4 models ranges from 8,192 to 128,000 tokens. This means the sum of prompt and completion tokens for an API call cannot exceed 32,768 tokens for the GPT-4-32K model. If the prompt is 30,000 tokens, the response cannot be more than 2,768 tokens. The GPT4-Turbo 128K is the most recent model as of Feb 2024, with 128,000 tokens, which is close to 300 pages of text in a single prompt and completion. This is a massive context prompt compared to its predecessor models.

Though this can be a technical limitation, there are creative ways to address the problem of limitation, such as using chunking and condensing your prompts. We discussed chunking strategies in Chapter 4, which can help you address token limitations.

The following figure shows various models and token limits:

Model Token Limit

GPT-3.5-turbo4,096
GPT-3.5-turbo-16k16,384
GPT-3.5-turbo-06134,096
GPT-3.5-turbo-16k-061316,384
GPT-48,192
GPT-4-061332,768
GPT-4-32K32,768
GPT-4-32-061332,768
GPT-4-Turbo 128K128,000

Figure 5.4 – Models and associated Token Limits

For the latest updates on model limits for newer versions of models, please check the OpenAI website.

Tokens and cost considerations

The cost of using ChatGPT or similar models via an API is often tied to the number of tokens processed, encompassing both the input prompts and the model’s generated responses.

In terms of pricing, providers typically have a per-token charge, leading to a direct correlation between conversation length and cost; the more tokens processed, the higher the cost. The latest cost updates can be found on the OpenAI website.

From an optimization perspective, understanding this cost-token relationship can guide more efficient API usage. For instance, creating more succinct prompts and configuring the model for brief yet effective responses can help control token count and, consequently, manage expenses.

We hope you now have a good understanding of the key components of a prompt. Now, you are ready to learn about prompt engineering. In the next section, we will explore the details of prompt engineering and effective strategies, enabling you to maximize the potential of your prompt contents through the one-shot and few-shot learning approaches.

What is prompt engineering? – Effective Prompt Engineering Techniques: Unlocking Wisdom Through AI

What is prompt engineering?

Prompt engineering is the art of crafting or designing prompts to unlock desired outcomes from large language models or AI systems. The concept of prompt engineering revolves around the fundamental idea that the quality of your response is intricately tied to the quality of the question you pose. By strategically engineering prompts, one can influence the generated outputs and improve the overall performance and usefulness of the system. In this section, we will learn about the necessary elements of effective prompt design, prompt engineering techniques, best practices, bonus tips, and tricks.

Elements of a good prompt design

Designing a good prompt is important because it significantly influences the output of a language model such as GPT. The prompt provides the initial context, sets the task, guides the style and structure of the response, reduces ambiguities and hallucinations, and supports the optimization of resources, thereby reducing costs and energy use. In this section, let’s understand the elements of good prompt design.

The foundational elements of a good prompt include instructions, questions, input data, and examples:

  • Instructions: The instructions in a prompt refer to the specific guidelines or directions given to a language model within the input text to guide the kind of response it should produce.
  • Questions: Questions in a prompt refer to queries or interrogative statements that are included in the input text. The purpose of these questions is to instruct the language model to provide a response or an answer to the query. In order to obtain the results, either the question or instruction is mandatory.
  • Input data: The purpose of input data is to provide any additional supporting context when prompting the LLM. It could be used to provide new information the model has not previously been trained on for more personalized experiences.
  • Examples: The purpose of examples in a prompt is to provide specific instances or scenarios that illustrate the desired behavior or response from ChatGPT. You can input a prompt that includes one or more examples, typically in the form of input-output pairs.

The following table shows how to build effective prompts using the aforementioned prompt elements:

Figure 5.5 – Sample Prompt formula consisting of prompt elements with examples

Prompt parameters – Effective Prompt Engineering Techniques: Unlocking Wisdom Through AI

Prompt parameters

ChatGPT prompt parameters are variables that you can set in the API calls. They allow users to influence the model’s output, customizing the behavior of the model to better fit specific applications or contexts. The following table shows some of the most important parameters of a ChatGPT API call:

Figure 5.6 – Essential Prompt Parameters

In this section, only the top parameters for building an effective prompt are highlighted. For a full list of parameters, refer to the OpenAI API reference (https://platform.openai.com/docs/ api-reference).

ChatGPT roles

System message

This is the part where youdesign your metaprompts. Metaprompts help to set the initial context, theme, and behavior of the ChatGPT API to guide the model’s interactions with the user, thus setting roles or response styles for the assistant.

Metaprompts are structured instructions or guidelines that dictate how the system should interpret and respond to user requests. These metaprompts are designed to ensure that the system’s outputs adhere to specific policies, ethical guidelines, or operational rules. They’re essentially “prompts about how to handle prompts,” guiding the system in generating responses, handling data, or interacting with users in a way that aligns with predefined standards.

The following table is a metaprompt framework that you can follow to design the ChatGPT system message:

Figure 5.7 – Elements of a Metaprompt

User

The messages from the user serve as prompts or remarks that the assistant is expected to react to or engage with. what is it establishes the anticipated scope of queries that may come from the user.