
Real-life examples of fine-tuning success

In this section, we’ll explore a real-life example of a fine-tuning approach that OpenAI implemented, which yielded remarkable outcomes.

InstructGPT

OpenAI’s InstructGPT is one of the most successful examples of a fine-tuned model and laid the foundation for ChatGPT. ChatGPT is said to be a sibling model to InstructGPT, and the methods used to fine-tune it are similar. InstructGPT was created by fine-tuning pre-trained GPT-3 models with RLHF. Supervised fine-tuning is the first step in RLHF for generating responses aligned with human preferences.

GPT-3 models weren’t originally designed to adhere to user instructions; their training focused on predicting the next word based on vast amounts of internet text data. Therefore, these models underwent fine-tuning using instructional datasets along with RLHF to enhance their ability to generate more useful and relevant responses, aligned with human values, when prompted with user instructions:

Figure 3.20 – The fine-tuning process with RLHF

This figure depicts a schematic representation of the InstructGPT fine-tuning process: (1) initial supervised fine-tuning, (2) training the reward model, and (3) executing RL through PPO using this established reward model. Blue arrows indicate where this data is used to train the respective models. In step 2, boxes A-D are samples from models that get ranked by labelers.

The following figure compares the response quality of models fine-tuned with RLHF, supervised fine-tuned models, and general GPT models. The Y-axis shows quality ratings of model outputs on a 1–7 Likert scale, for various model sizes (X-axis), on prompts submitted to InstructGPT models via the OpenAI API. The results reveal that InstructGPT outputs are scored significantly higher by labelers than outputs from GPT-3 models (both with and without few-shot prompts), as well as models that underwent supervised fine-tuning. The labelers hired for this work were independent and were sourced from Scale AI and Upwork:

Figure 3.21 – Evaluation of InstructGPT (image credits: OpenAI)

InstructGPT can be assessed across the dimensions of toxicity, truthfulness, and appropriateness. Higher scores are desirable for TruthfulQA and appropriateness, whereas lower scores are preferred for toxicity and hallucinations. Hallucinations and appropriateness are measured based on the distribution of prompts submitted to the OpenAI API. The outcomes are aggregated across various model sizes:

Figure 3.22 – Evaluation of InstructGPT

In this section, we introduced the concept of fine-tuning and discussed a success story of fine-tuning with RLHF that led to the development of InstructGPT.

Summary

Fine-tuning is a powerful technique for customizing models, but it may not always be necessary. As observed, it can be time-consuming and may carry upfront costs. It’s advisable to start with easier and faster strategies, such as prompt engineering with few-shot examples, followed by data grounding using RAG. Only if the responses from the LLM remain suboptimal should you consider fine-tuning. We will discuss RAG and prompt engineering in the following chapters.

In this chapter, we delved into critical fine-tuning strategies tailored for specific tasks. Then, we explored an array of evaluation methods and benchmarks to assess your refined model. The RLHF process ensures your models align with human values, making them helpful, honest, and safe. In the upcoming chapter, we’ll tackle RAG methods paired with vector databases – an essential technique to ground your enterprise data and minimize hallucinations in LLM-driven applications.

A deep dive into vector DB essentials

The following image visually represents the clustering of mammals and birds in a two-dimensional vector embedding space, differentiating between their realistic and cartoonish portrayals. This image depicts a spectrum between “REALISTIC” and “CARTOON” representations, further categorized into “MAMMAL” and “BIRD.” On the realistic side, there’s a depiction of a mammal (elk) and three birds (an owl, an eagle, and a small bird). On the cartoon side, there are stylized and whimsical cartoon versions of mammals and birds, including a comically depicted deer, an owl, and an exaggerated bird character. LLMs use such vector embedding spaces, which are numerical representations of objects in highly dimensional spaces, to understand, process, and generate information. For example, imagine an educational application designed to teach children about wildlife. If a student prompts the chatbot to provide images of birds in a cartoon representation, the LLM will search and generate information from the bottom right quadrant:

Figure 4.3 – Location of animals with similar characteristics in a highly dimensional space, demonstrating “relatedness”

Now, let’s delve into the evolution of embedding models, which produce embeddings, a.k.a. numerical representations of objects, within highly dimensional spaces. Embedding models have evolved significantly, transitioning from initial methods that mapped discrete words to dense vectors, such as word-to-vector (Word2Vec), global vectors for word representation (GloVe), and FastText, to more sophisticated contextual embeddings using deep learning architectures. Newer models, such as embeddings from language models (ELMo), utilize long short-term memory (LSTM)-based structures to offer context-specific representations. Transformer architecture-based embedding models, which underpin models such as bidirectional encoder representations from transformers (BERT), generative pre-trained transformer (GPT), and their subsequent iterations, marked a revolutionary leap over predecessor models.

These models capture contextual information in unparalleled depth, enabling embeddings to represent nuances in word meanings based on the surrounding context, thereby setting new standards in various natural language processing tasks.

Important note:

In Jan 2024, OpenAI announced two third-generation embedding models, text-embedding-3-small and text-embedding-3-large. Compared to the predecessor second-generation model, text-embedding-ada-002, the newer models offer better performance, lower costs, better multilingual retrieval, and a parameter to reduce the overall number of dimensions. Another key difference is the number of dimensions between the two generations: the third-generation models come in different dimensions, going up to 3,072, whereas text-embedding-ada-002 has 1,536 dimensions. As of Jan 2024, we have seen more production workloads using text-embedding-ada-002, but OpenAI recommends using the third-generation models going forward for improved performance and reduced costs.

We also wanted you to know that while OpenAI’s embedding model is one of the most popular choices when it comes to text embeddings, you can find the list of leading embedding models on Hugging Face (https://huggingface.co/spaces/mteb/leaderboard).

The following snippet of code gives an example of generating embeddings using an Azure OpenAI endpoint (with the pre-1.0 openai Python library):

import openai

openai.api_type = "azure"
openai.api_key = YOUR_API_KEY
openai.api_base = "https://YOUR_RESOURCE_NAME.openai.azure.com"
openai.api_version = "YYYY-MM-DD"  # Replace with the latest API version

response = openai.Embedding.create(
    input="Your text string goes here",
    engine="YOUR_DEPLOYMENT_NAME"  # Name of your embedding model deployment
)

embeddings = response["data"][0]["embedding"]
print(embeddings)

In this section, we highlighted the significance of vector embeddings. However, their true value emerges when used effectively. Hence, we’ll now dive deep into indexing and vector search strategies, which are crucial for optimal data retrieval in the RAG workflow.


When to Use HNSW vs. FAISS

Use HNSW when:

  • High precision in similarity search is crucial.
  • The dataset size is large but not at the scale where managing it becomes impractical for HNSW.
  • Real-time or near-real-time search performance is required.
  • The dataset is dynamic, with frequent updates or insertions.
  • Apt for use cases involving text, such as article recommendation systems.

Use FAISS when:

  • Managing extremely large datasets (e.g., billions of vectors).
  • Batch processing and GPU optimization can significantly benefit the application.
  • There’s a need for flexible trade-offs between search speed and accuracy.
  • The dataset is relatively static, or batch updates are acceptable.
  • Apt for use cases like image and video search.

Note

Choosing the right indexing strategy hinges on several critical factors, including the nature and structure of the data, the types of queries to be supported (e.g., range queries, nearest neighbors, exact search), and the volume and growth of the data. Additionally, the frequency of data updates (e.g., static vs. dynamic), the dimensionality of the data, performance requirements (real-time vs. batch), and resource constraints play significant roles in the decision-making process.
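To make these trade-offs concrete, the following is a minimal sketch of building and querying an index with each approach; it assumes the hnswlib and faiss-cpu Python packages are installed and uses random vectors as stand-in data:

import numpy as np
import hnswlib   # assumed HNSW implementation
import faiss     # assumed FAISS package (faiss-cpu or faiss-gpu)

dim, num_vectors, k = 128, 10_000, 5
data = np.random.random((num_vectors, dim)).astype("float32")
query = np.random.random((1, dim)).astype("float32")

# HNSW: graph-based index, suited to dynamic data and low-latency queries
hnsw_index = hnswlib.Index(space="cosine", dim=dim)
hnsw_index.init_index(max_elements=num_vectors, ef_construction=200, M=16)
hnsw_index.add_items(data, np.arange(num_vectors))
hnsw_index.set_ef(50)  # higher ef = better recall, slower queries
labels, distances = hnsw_index.knn_query(query, k=k)
print("HNSW neighbours:", labels[0])

# FAISS: flat (exact) L2 index; IVF/PQ indexes trade accuracy for scale
faiss_index = faiss.IndexFlatL2(dim)
faiss_index.add(data)
D, I = faiss_index.search(query, k)
print("FAISS neighbours:", I[0])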

Similarity measures

Similarity measures dictate how the index is organized, and this makes sure that the retrieved data are highly relevant to the query. For instance, in a system designed to retrieve similar images, the index might be built around the feature vectors of images, and the similarity measure would determine which images are “close” or “far” within that indexed space. The importance of these concepts is two-fold: indexing significantly speeds up data retrieval, and similarity measures ensure that the retrieved data is relevant to the query, together enhancing the efficiency and efficacy of data retrieval systems. Selecting an appropriate distance metric greatly enhances the performance of classification and clustering tasks. The optimal similarity measure is chosen based on the nature of the data input.

In other words, similarity measures define how closely two items or data points are related. They can be broadly classified into distance metrics and similarity metrics. Next, we’ll explore three top measures for building AI applications: cosine similarity, Euclidean distance, and Manhattan distance.

  • Similarity metrics – Cosine similarity: Cosine similarity, a type of similarity metric, calculates the cosine value of the angle between two vectors, and OpenAI suggests using it for its models to measure the distance between two embeddings obtained from text-embedding-ada-002. The higher the metric, the more similar they are:

Figure 4.6 – Illustration of relatedness through cosine similarity between two words

The preceding image shows a situation where the cosine similarity is 1 for India and the USA because they are related, as both are countries. In the other image, the similarity is 0 because football is not similar to a lion.

  • Distance metrics – Euclidean (L2): Euclidean distance computes the straight-line distance between two points in Euclidean space. The higher the metric, the less similar the two points are:

Figure 4.7 – Illustration of Euclidean distance
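As a minimal illustration of the measures named above, the following sketch computes cosine similarity, Euclidean distance, and Manhattan distance between two small example vectors; NumPy is assumed here purely for illustration:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Cosine similarity: 1.0 means the vectors point in the same direction
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean (L2) distance: straight-line distance; larger means less similar
euclidean_distance = np.linalg.norm(a - b)

# Manhattan (L1) distance: sum of absolute coordinate differences
manhattan_distance = np.sum(np.abs(a - b))

print(cosine_similarity, euclidean_distance, manhattan_distance)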


Recommendation System for Articles

Let’s consider a scenario where a news aggregation platform aims to recommend articles similar to what a user is currently reading, enhancing user engagement by suggesting relevant content.

How It Works:

  • Preprocessing and Indexing: Articles in the platform’s database are processed to extract textual features, often converted into high-dimensional vectors using LDA or transformer-based embeddings such as text-embedding-ada-002. These vectors are then indexed using HNSW, an algorithm suitable for high-dimensional spaces due to its hierarchical structure, which facilitates efficient navigation and search.
  • Retrieval Time: When a user reads an article, the system generates a feature vector for this article and queries the HNSW index to find vectors (and thus articles) that are close in the high-dimensional space. Cosine similarity can be used to evaluate the similarity between the query article’s vector and those in the index, identifying articles with similar content.
  • Outcome: The system recommends a list of articles ranked by their relevance to the currently viewed article. Thanks to the efficient indexing and similarity search, these recommendations are generated quickly, even from a vast database of articles, providing the user with a seamless experience.
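A minimal sketch of this retrieval flow, assuming the hnswlib package as the HNSW implementation and random vectors standing in for real article embeddings, might look like this:

import numpy as np
import hnswlib  # assumed HNSW implementation

dim, num_articles = 1536, 1000  # 1,536 matches the text-embedding-ada-002 output size
article_vectors = np.random.random((num_articles, dim)).astype("float32")  # stand-in embeddings

index = hnswlib.Index(space="cosine", dim=dim)  # cosine space for text relevance
index.init_index(max_elements=num_articles, ef_construction=200, M=16)
index.add_items(article_vectors, np.arange(num_articles))

# Vector of the article the user is currently reading (stand-in)
current_article = np.random.random((1, dim)).astype("float32")
labels, distances = index.knn_query(current_article, k=5)
print("Recommended article IDs:", labels[0])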

Now, let’s walk through a scenario where Manhattan distance is preferred over cosine similarity.

Ride-Sharing App Matchmaking

Let’s consider a scenario where a ride-sharing application needs to match passengers with nearby drivers efficiently. The system must quickly find the closest available drivers to a passenger’s location to minimize wait times and optimize routes.

How It Works:

  • Preprocessing and Indexing: Drivers’ current locations are constantly updated and stored as points in a 2D space representing a map. These points can be indexed using tree-based spatial indexing techniques or data structures optimized for geospatial data, such as R-trees.
  • Retrieval Time: When a passenger requests a ride, the application uses the passenger’s current location as a query point. Manhattan distance (L1 norm) is particularly suitable for urban environments, where movement is constrained by a grid-like structure of streets and avenues, mimicking the actual paths a car would take along city blocks.
  • Outcome: The system quickly identifies the nearest available drivers using the indexed data and Manhattan distance calculations, considering the urban grid’s constraints. This process ensures a swift matchmaking process, improving the user experience by reducing wait times.
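Setting the spatial index aside, a minimal sketch of the Manhattan distance matchmaking step, using hypothetical grid coordinates, could look like this:

import numpy as np

# Hypothetical (x, y) grid coordinates for available drivers and the passenger
driver_locations = np.array([[2, 3], [5, 1], [0, 7], [4, 4]])
passenger_location = np.array([3, 2])

# Manhattan (L1) distance approximates travel along a grid of city blocks
l1_distances = np.abs(driver_locations - passenger_location).sum(axis=1)
nearest_driver = int(np.argmin(l1_distances))
print("Nearest driver index:", nearest_driver, "at distance", l1_distances[nearest_driver])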


Vector stores

As generative AI applications continue to push the boundaries of what’s possible in tech, vector stores have emerged as a crucial component, streamlining and optimizing the search and retrieval of relevant data. In our previous discussions, we’ve delved into the advantages of vector DBs over traditional databases, unraveling the concepts of vectors, embeddings, vector search strategies, approximate nearest neighbors (ANNs), and similarity measures. In this section, we aim to provide an integrative understanding of these concepts within the realm of vector DBs and libraries.

The following image illustrates a workflow for transforming different types of data (audio, text, and videos) into vector embeddings.

  • Audio: An audio input is processed through an “Audio Embedding model,” resulting in “Audio vector embeddings.”
  • Text: Textual data undergoes processing in a “Text Embedding model,” leading to “Text vector embeddings.”
  • Videos: Video content is processed using a “Video Embedding model,” generating “Video vector embeddings.”

Once these embeddings are created, they are subsequently utilized (potentially in an enterprise vector database system) to perform “Similarity Search” operations. This implies that the vector embeddings can be compared to find similarities, making them valuable for tasks such as content recommendations, data retrieval, and more.

Figure 4.9 – Multimodal embeddings process in an AI application

What is a vector database?

A vector database (vector DB) is a specialized database designed to handle highly dimensional vectors, primarily generated from embeddings of complex data types such as text, images, or audio. It provides capabilities to store and index unstructured data and offers search and retrieval capabilities as a service.

Modern vector databases that are brimming with advancements empower you to architect resilient enterprise solutions. Here, we list 15 key features to consider when choosing a vector DB. Every feature may not be important for your use case, but it might be a good place to start. Keep in mind that this area is changing fast, so there might be more features emerging in the future:

  • Indexing: As mentioned earlier, indexing refers to the process of organizing highly dimensional vectors in a way that allows for efficient similarity searches and retrievals. A vector DB offers built-in indexing features designed to arrange highly dimensional vectors for swift and effective similarity-based searches and retrievals. Previously, we discussed indexing algorithms such as FAISS and HNSW. Many vector DBs incorporate such features natively. For instance, Azure AI Search integrates the HNSW indexing service directly.
  • Search and retrieval: Instead of relying on exact matches, as traditional databases do, vector DBs provide vector search capabilities as a service, such as approximate nearest neighbors (ANNs), to quickly find vectors that are roughly the closest to a given input. To quantify the closeness or similarity between vectors, they utilize similarity measures such as cosine similarity or Euclidean distance, enabling efficient and nuanced similarity-based searches in large datasets.
  • Create, read, update, and delete: A vector DB manages highly dimensional vectors and offers create, read, update, and delete (CRUD) operations tailored to vectorized data. When vectors are created, they’re indexed for efficient retrieval. Reading often means performing similarity searches to retrieve vectors closest to a given query vector, typically using methods such as ANNs. Vectors can be updated, necessitating potential re-indexing, and they can also be deleted, with the database adjusting its internal structures accordingly to maintain efficiency and consistency.
  • Security: Compliance with regulations and standards such as GDPR, SOC 2 Type II, and HIPAA, easy management of access to the console, and support for SSO. Data is encrypted at rest and in transit, with more granular identity and access management features also provided.
  • Serverless: A high-quality vector database is designed to gracefully autoscale with low management overhead as data volumes soar into millions or billions of entries, distributing seamlessly across several nodes. Optimal vector databases grant users the flexibility to adjust the system in response to shifts in data insertion, query frequencies, and underlying hardware configurations.
  • Hybrid search: Hybrid search combines traditional keyword-based search methods with other search mechanisms, such as semantic or contextual search, to retrieve results from both the exact term matches and by understanding the underlying intent or context of the query, ensuring a more comprehensive and relevant set of results.
  • Semantic re-ranking: This is a secondary ranking step to improve the relevance of search results. It re-ranks the search results that were initially scored by state-of-the-art ranking algorithms such as BM25 and RRF, based on language understanding. For instance, Azure AI Search employs secondary ranking that uses multilingual, deep learning models derived from Microsoft Bing to elevate the results that are most relevant in terms of meaning.
  • Auto vectorization/embedding: Auto-embedding in a vector database refers to the automatic process of converting data items into vector representations for efficient similarity searches and retrieval, with access to multiple embedding models.
  • Data replication: This ensures data availability, redundancy, and recovery in case of failures, safeguarding business continuity and reducing data loss risks.
  • Concurrent user access and data isolation: Vector databases support a large number of users concurrently and ensure robust data isolation to ensure updates remain private unless deliberately shared.
  • Auto-chunking: Auto-chunking is the automated process of dividing a larger set of data or content into smaller, manageable pieces or chunks for easier processing or understanding. This process helps preserve the semantic relevance of texts and addresses the token limitations of embedding models. We will learn more about chunking strategies in the upcoming sections in this chapter.
  • Extensive interaction tools: Prominent vector databases, such as Pinecone, offer versatile APIs and SDKs across languages, ensuring adaptability in integration and management.
  • Easy integration: Vector DBs provide seamless integration with LLM orchestration frameworks and SDKs, such as Langchain and Semantic Kernel, and leading cloud providers, such as Azure, GCP, and AWS.
  • User-friendly interface: This ensures an intuitive platform with simple navigation and direct feature access, streamlining the user experience.
  • Flexible pricing models: Pricing that can be adjusted to user needs, keeping costs low for the user.
  • Low downtime and high resiliency: Resiliency in a vector database (or any database) refers to its ability to recover quickly from failures, maintain data integrity, and ensure continuous availability even in the face of adverse conditions, such as hardware malfunctions, software bugs, or other unexpected disruptions.

As of early 2024, a few prominent open source vector databases include Chroma, Milvus, Qdrant, and Weaviate, while Pinecone and Azure AI Search are among the leading proprietary solutions.
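As an illustration of the typical add/query workflow, the following is a minimal sketch using the open source Chroma client; it assumes the chromadb Python package with its default embedding function, and other vector DBs expose broadly similar operations:

import chromadb  # assumes the chromadb package is installed

client = chromadb.Client()  # in-memory client; persistent clients are also available
collection = client.create_collection(name="articles")

# Chroma auto-embeds the documents with its default embedding model
collection.add(
    documents=["Cats are small domesticated felines.",
               "Football is a popular team sport."],
    ids=["doc1", "doc2"],
)

results = collection.query(query_texts=["domestic pets"], n_results=1)
print(results["documents"])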


Vector DB limitations

  • Accuracy vs. speed trade-off: When dealing with highly dimensional data, vector DBs often face a trade-off between speed and accuracy for similarity searches. The core challenge stems from the computational expense of searching for the exact nearest neighbors in large datasets. To enhance search speed, techniques such as ANNs are employed, which quickly identify “close enough” vectors rather than the exact matches. While ANN methods can dramatically boost query speeds, they may sometimes sacrifice pinpoint accuracy, potentially missing the true nearest vectors. Certain vector index methods, such as product quantization, enhance storage efficiency and accelerate queries by condensing and consolidating data at the expense of accuracy.
  • Quality of embedding: The effectiveness of a vector database is dependent on the quality of the vector embedding used. Poorly designed embeddings can lead to inaccurate search results or missed connections.
  • Complexity: Implementing and managing vector databases can be complex, requiring specialized knowledge of vector search strategies, indexing, and chunking strategies to optimize for specific use cases.

Vector libraries

Vector databases may not always be necessary. Small-scale applications may not require all the advanced features that vector DBs provide. In those instances, vector libraries become very valuable. Vector libraries are usually sufficient for small, static data and provide the ability to store in memory, index, and use similarity search strategies. However, they may not provide features such as CRUD support, data replication, and being able to store data on disk, and hence, the user will have to wait for a full import before they can query. Facebook’s FAISS is a popular example of a vector library.

As a rule of thumb, if you are dealing with millions/billions of records and storing data that are changing frequently, require millisecond response times, and more long-term storage capabilities on disk, it is recommended to use vector DBs over vector libraries.

Vector DBs vs. traditional databases – Understanding the key differences

As stated earlier, vector databases have become pivotal, especially in the era of generative AI, because they facilitate efficient storage, querying, and retrieval of highly dimensional vectors, which are nothing but numerical representations of words or sentences often produced by deep learning models. Traditional scalar databases are designed to handle discrete and simple data types, making them ill-suited for the complexities of large-scale vector data. In contrast, vector databases are optimized for similarity searches in the vector space, enabling the rapid identification of vectors that are “close” or “similar” in highly dimensional spaces. Unlike conventional data models such as relational databases, where queries commonly resemble “retrieve the books borrowed by a particular member” or “identify the items currently discounted,” vector queries primarily seek similarities among vectors based on one or more reference vectors. In other words, queries might look like “identify the top 10 images of dogs similar to the dog in this photo” or “locate the best cafes near my current location.” At retrieval time, vector databases are crucial, as they facilitate the swift and precise retrieval of relevant document embeddings to augment the generation process. This technique is also called RAG, and we will learn more about it in the later sections.

Imagine you have a database of fruit images, and each image is represented by a vector (a list of numbers) that describes its features. Now, let’s say you have a photo of an apple, and you want to find similar fruits in your database. Instead of going through each image individually, you convert your apple photo into a vector using the same method you used for the other fruits. With this apple vector in hand, you search the database to find vectors (and therefore images) that are most similar or closest to your apple vector. The result would likely be other apple images or fruits that look like apples based on the vector representation.

Figure 4.10 – Vector representation
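A minimal sketch of that lookup, using made-up feature vectors and cosine similarity, might look like the following:

import numpy as np

# Made-up feature vectors for images already in the database
fruit_vectors = np.array([
    [0.9, 0.1, 0.0],   # apple
    [0.8, 0.2, 0.1],   # red pear
    [0.1, 0.9, 0.3],   # banana
])
query = np.array([0.85, 0.15, 0.05])  # vector of the new apple photo

# Rank stored images by cosine similarity to the query vector
scores = fruit_vectors @ query / (
    np.linalg.norm(fruit_vectors, axis=1) * np.linalg.norm(query))
print("Most similar image index:", int(np.argmax(scores)))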


The essentials of prompt engineering

Before discussing prompt engineering, it is important to first understand the foundational components of a prompt. In this section, we’ll delve into the key components of a prompt, such as ChatGPT prompts, completions, and tokens. Additionally, grasping what tokens are is pivotal to understanding the model’s constraints and managing costs.

ChatGPT prompts and completions

A prompt is an input provided to LLMs, whereas completions refer to the output of LLMs. The structure and content of a prompt can vary based on the type of LLM (e.g., the text or image generation model), specific use cases, and the desired output of the language model.

Completions refer to the response generated by ChatGPT prompts; basically, it is an answer to your questions. Check out the following example to understand the difference between prompts and completions when we prompt ChatGPT with, “What is the capital of India?”

Figure 5.2 – An image showing a sample LLM prompt and completion

Based on the use case, we can leverage one of the two ChatGPT API calls, named Completions or ChatCompletions, to interact with the model. However, OpenAI recommends using the ChatCompletions API in the majority of scenarios.

Completions API

The Completions API is designed to generate creative, free-form text. You provide a prompt, and the API generates text that continues from it. This is often used for tasks where you want the model to answer a question or generate creative text, such as for writing an article or a poem.

ChatCompletions API

The ChatCompletions API is designed for multi-turn conversations. You send a series of messages instead of a single prompt, and the model generates a message as a response. The messages sent to the model include a role (which can be a system, user, or assistant) and the content of the message. The system role is used to set the behavior of the assistant, the user role is used to instruct the assistant, and the model’s responses are under the assistant role.

The following is an example of a sample ChatCompletions API call:

import openai

openai.api_key = "your-api-key"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful sports assistant."},
        {"role": "user", "content": "Who won the cricket world cup in 2011?"},
        {"role": "assistant", "content": "India won the cricket world cup in 2011"},
        {"role": "user", "content": "Where was it played?"}
    ]
)

print(response["choices"][0]["message"]["content"])

The main difference between the Completions API and ChatCompletions API is that the Completions API is designed for single-turn tasks, while the ChatCompletions API is designed to handle multiple turns in a conversation, making it more suitable for building conversational agents. However, the ChatCompletions API format can be modified to behave as a Completions API by using a single user message.
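For instance, a single-turn request through the ChatCompletions API (shown here with the same pre-1.0 openai library style as the earlier snippet) can be sketched as follows:

import openai

openai.api_key = "your-api-key"

# A single user message makes the ChatCompletions API behave like a
# one-shot Completions call
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a one-line haiku about cricket."}],
)
print(response["choices"][0]["message"]["content"])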

Important note

The Completions API, launched in June 2020, initially offered a freeform text interface for OpenAI’s language models. However, experience has shown that structured prompts often yield better outcomes. The chat-based approach, especially through the ChatCompletions API, excels at addressing a wide array of needs, offering enhanced flexibility and specificity and reducing prompt injection risks. Its design supports multi-turn conversations and a variety of tasks, enabling developers to create advanced conversational experiences. Hence, OpenAI announced that it would deprecate some of the older models using the Completions API and, moving forward, would invest in the ChatCompletions API to optimize its use of compute capacity. While the Completions API will remain accessible, it is labeled as “legacy” in the OpenAI developer documentation.

Tokens

Understanding the concepts of tokens is essential, as it helps us better comprehend the restrictions, such as model limitations, and the aspect of cost management when utilizing ChatGPT.

A ChatGPT token is a unit of text that ChatGPT’s language model uses to understand and generate language. In ChatGPT, a token is a sequence of characters that the model uses to generate new sequences of tokens and form a coherent response to a given prompt. The models use tokens to represent words, phrases, and other language elements. Tokens are not necessarily cut where a word starts or ends; they can include trailing spaces, subwords, and punctuation, too.

As stated on the OpenAI website, tokens can be thought of as pieces of words. Before the API processes the prompts, the input is broken down into tokens.

To understand tokens in terms of lengths, the following is used as a rule of thumb:

  • 1 token ~= 4 chars in English
  • 1 token ~= ¾ words
  • 100 tokens ~= 75 words
  • 1–2 sentences ~= 30 tokens
  • 1 paragraph ~= 100 tokens
  • 1,500 words ~= 2048 tokens
  • 1 US page (8 ½” x 11”) ~= 450 tokens (assuming ~1800 characters per page)

For example, this famous quote from Thomas Edison (“Genius is one percent inspiration and ninety-nine percent perspiration.”) has 14 tokens:

Figure 5.3 – Tokenization of sentence

We used the OpenAI Tokenizer tool to calculate the tokens; the tool can be found at https://platform.openai.com/tokenizer. An alternative way to tokenize text (programmatically) is to use the Tiktoken library on GitHub, which can be found at https://github.com/openai/tiktoken.
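As a minimal sketch, counting tokens programmatically with the tiktoken library could look like this:

import tiktoken  # assumes the tiktoken package is installed

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
quote = ("Genius is one percent inspiration and "
         "ninety-nine percent perspiration.")
tokens = encoding.encode(quote)
print(len(tokens))  # prints the number of tokens for this quote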


Token limits in ChatGPT models

Depending on the model, the token limit will vary. As of Feb 2024, the token limit for the family of GPT-4 models ranges from 8,192 to 128,000 tokens. This means, for example, that the sum of prompt and completion tokens for an API call to the GPT-4-32K model cannot exceed 32,768 tokens: if the prompt is 30,000 tokens, the response cannot be more than 2,768 tokens. GPT-4 Turbo 128K is the most recent model as of Feb 2024, with 128,000 tokens, which is close to 300 pages of text in a single prompt and completion. This is a massive context window compared to its predecessor models.

Though this can be a technical limitation, there are creative ways to work around it, such as chunking and condensing your prompts. We discussed chunking strategies in Chapter 4, which can help you address token limitations.
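As a simple illustration of token-based chunking, the following naive helper splits text into fixed-size token chunks; it builds on the tiktoken library, and the chunk size is an arbitrary placeholder:

import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 500, model: str = "gpt-3.5-turbo"):
    """Split text into chunks of at most max_tokens tokens (naive, no overlap)."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return [
        encoding.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

chunks = chunk_by_tokens("A very long document ..." * 200, max_tokens=100)
print(len(chunks), "chunks")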

The following figure shows various models and token limits:

Model                      Token Limit

GPT-3.5-turbo              4,096
GPT-3.5-turbo-16k          16,384
GPT-3.5-turbo-0613         4,096
GPT-3.5-turbo-16k-0613     16,384
GPT-4                      8,192
GPT-4-0613                 32,768
GPT-4-32K                  32,768
GPT-4-32K-0613             32,768
GPT-4-Turbo 128K           128,000

Figure 5.4 – Models and associated Token Limits

For the latest updates on model limits for newer versions of models, please check the OpenAI website.

Tokens and cost considerations

The cost of using ChatGPT or similar models via an API is often tied to the number of tokens processed, encompassing both the input prompts and the model’s generated responses.

In terms of pricing, providers typically have a per-token charge, leading to a direct correlation between conversation length and cost; the more tokens processed, the higher the cost. The latest cost updates can be found on the OpenAI website.

From an optimization perspective, understanding this cost-token relationship can guide more efficient API usage. For instance, creating more succinct prompts and configuring the model for brief yet effective responses can help control token count and, consequently, manage expenses.
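As a rough illustration of the cost-token relationship, the following sketch estimates the cost of a single call; the per-1,000-token prices are placeholders, so substitute the current rates from the OpenAI pricing page:

# Placeholder prices per 1,000 tokens -- replace with current published rates
PROMPT_PRICE_PER_1K = 0.0005
COMPLETION_PRICE_PER_1K = 0.0015

prompt_tokens = 1200       # e.g., measured with tiktoken
completion_tokens = 300    # e.g., taken from the API response's usage field

cost = (prompt_tokens / 1000) * PROMPT_PRICE_PER_1K \
     + (completion_tokens / 1000) * COMPLETION_PRICE_PER_1K
print(f"Estimated cost: ${cost:.6f}")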

We hope you now have a good understanding of the key components of a prompt. Now, you are ready to learn about prompt engineering. In the next section, we will explore the details of prompt engineering and effective strategies, enabling you to maximize the potential of your prompt contents through the one-shot and few-shot learning approaches.


What is prompt engineering?

Prompt engineering is the art of crafting or designing prompts to unlock desired outcomes from large language models or AI systems. The concept of prompt engineering revolves around the fundamental idea that the quality of your response is intricately tied to the quality of the question you pose. By strategically engineering prompts, one can influence the generated outputs and improve the overall performance and usefulness of the system. In this section, we will learn about the necessary elements of effective prompt design, prompt engineering techniques, best practices, bonus tips, and tricks.

Elements of a good prompt design

Designing a good prompt is important because it significantly influences the output of a language model such as GPT. The prompt provides the initial context, sets the task, guides the style and structure of the response, reduces ambiguities and hallucinations, and supports the optimization of resources, thereby reducing costs and energy use. In this section, let’s understand the elements of good prompt design.

The foundational elements of a good prompt include instructions, questions, input data, and examples:

  • Instructions: The instructions in a prompt refer to the specific guidelines or directions given to a language model within the input text to guide the kind of response it should produce.
  • Questions: Questions in a prompt refer to queries or interrogative statements included in the input text. The purpose of these questions is to prompt the language model to provide a response or an answer to the query. To obtain results, either a question or an instruction is mandatory.
  • Input data: The purpose of input data is to provide any additional supporting context when prompting the LLM. It could be used to provide new information the model has not previously been trained on for more personalized experiences.
  • Examples: The purpose of examples in a prompt is to provide specific instances or scenarios that illustrate the desired behavior or response from ChatGPT. You can input a prompt that includes one or more examples, typically in the form of input-output pairs.

The following table shows how to build effective prompts using the aforementioned prompt elements:

Figure 5.5 – Sample Prompt formula consisting of prompt elements with examples
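To tie these elements together, the following is an illustrative sketch of a ChatCompletions message list combining an instruction, an example (input-output pair), input data, and a question; the content itself is made up:

messages = [
    # Instruction: sets the task and the desired style of the response
    {"role": "system", "content": "You are a travel assistant. Answer in two short sentences."},
    # Example: an input-output pair demonstrating the expected behavior
    {"role": "user", "content": "Suggest a dish to try in Italy."},
    {"role": "assistant", "content": "Try a classic Margherita pizza in Naples. It pairs well with a simple espresso."},
    # Input data and question: new context followed by the actual query
    {"role": "user", "content": "I am visiting Hyderabad next week. Suggest a dish to try there."},
]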


Prompt engineering best practices

In the following list, we outline additional best practices to optimize and enhance your experience with prompt creation:

  • Clarity and precision for accurate responses: Ensure that prompts are clear, concise, and specific, avoiding ambiguity or multiple interpretations:

Figure 5.12 – Best practice: clarity and precision

  • Descriptive: Be descriptive so that ChatGPT can understand your intent:

Figure 5.13 – Best practice: be descriptive

  • Format the output: Mention the format of the output, which can be bullet points, paragraphs, sentences, tables, and languages, such as XML, HTML, and JSON. Use examples to articulate the desired output.
  • Adjust the Temperature and Top_p parameters for creativity: As indicated in the parameters section, modifying Temperature and Top_p can significantly influence the variability of the model’s output. In scenarios that call for creativity and imagination, raising the temperature proves beneficial. On the other hand, when dealing with legal applications that demand a reduction in hallucinations, a lower temperature becomes advantageous.
  • Use syntax as separators in prompts: For a more effective output, use """ or ### to separate the instruction and input data, as in the following example:

Example:

Convert the text below to Spanish

Text: """
{text input here}
"""

  • Order of the prompt elements matters: It has been found, in certain instances, that giving an instruction before an example can improve the quality of your outputs. Additionally, the order of examples can affect the output of prompts.
  • Use guiding words: This helps steer the model toward a specific structure, such as the text highlighted in the following:

Example:

# Create a basic Python function that
# 1. Requests the user to enter a temperature in Celsius
# 2. Converts the Celsius temperature to Fahrenheit
def ctf():

  • Instead of saying what not to provide, give alternative recommendations: Provide an alternative path if ChatGPT is unable to perform a task, such as in the following highlighted message:

Example:

System Message: You are an AI nutrition consultant that provides nutrition consultation based on the health and wellness goals of the customer. Please note that any questions or inquiries beyond the scope of nutrition consultation will NOT be answered and will instead receive the response: “Sorry! This question falls outside my domain of expertise!”

Customer: How do I invest in 401K?

Nutrition AI Assistant: “Sorry! This question falls outside my domain of expertise!”

  • Provide example-based prompts: This helps the language model learn from specific instances and patterns. Start with zero-shot, then few-shot, and if neither works, fine-tune the model.
  • Ask ChatGPT to provide citations/sources: When asking ChatGPT to provide information, you can ask it to answer only using reliable sources and to cite the sources:

Figure 5.14 – Best practice: provide citations

  • Break down a complex task into simpler tasks: See the following example:

Figure 5.15 – Best practice: break down a complex task