

Real-life examples of fine-tuning success

In this section, we’ll explore a real-life example of a fine-tuning approach that OpenAI implemented, which yielded remarkable outcomes.

InstructGPT

OpenAI's InstructGPT is one of the most successful examples of a fine-tuned model and laid the foundation for ChatGPT. ChatGPT is said to be a sibling model to InstructGPT, and the methods used to fine-tune ChatGPT are similar to those used for InstructGPT. InstructGPT was created by fine-tuning pre-trained GPT-3 models with RLHF. Supervised fine-tuning is the first step in RLHF for generating responses aligned with human preferences.

GPT-3 models weren't originally designed to adhere to user instructions; their training focused on predicting the next word based on vast amounts of internet text data. Therefore, these models underwent fine-tuning using instructional datasets along with RLHF to enhance their ability to generate more useful and relevant responses aligned with human values when prompted with user instructions:

Figure 3.20 – The fine-tuning process with RLHF

This figure depicts a schematic representation of the InstructGPT fine-tuning process: (1) initial supervised fine-tuning, (2) training the reward model, and (3) executing RL through PPO using this established reward model. Blue arrows indicate where this data is used to train the respective models. In step 2, boxes A–D are samples from models that get ranked by labelers.

The following figure compares the response quality of models fine-tuned with RLHF, supervised fine-tuned models, and general GPT models. The Y-axis shows quality ratings of model outputs on a 1–7 Likert scale, for various model sizes (X-axis), on prompts submitted to InstructGPT models via the OpenAI API. The results reveal that labelers score InstructGPT outputs significantly higher than outputs from GPT-3 models, both with and without few-shot prompts, as well as models that underwent supervised fine-tuning. The labelers hired for this work were independent and were sourced from Scale AI and Upwork:

Figure 3.21 – Evaluation of InstructGPT (image credits: OpenAI)

InstructGPT can be assessed across the dimensions of toxicity, truthfulness, and appropriateness. Higher scores are desirable for TruthfulQA and appropriateness, whereas lower scores are preferred for toxicity and hallucinations. Hallucinations and appropriateness are measured based on the distribution of prompts submitted to the OpenAI API. The outcomes are aggregated across various model sizes:

Figure 3.22 – Evaluation of InstructGPT

In this section, we introduced the concept of fine-tuning and discussed a success story of fine-tuning with RLHF that led to the development of InstructGPT.

Summary

Fine-tuning is a powerful technique for customizing models, but it may not always be necessary. As observed, it can be time-consuming and may have upfront costs. It's advisable to start with easier and faster strategies, such as prompt engineering with few-shot examples, followed by data grounding using RAG. Only if the responses from the LLM remain suboptimal should you consider fine-tuning. We will discuss RAG and prompt engineering in the following chapters.

In this chapter, we delved into critical fine-tuning strategies tailored for specific tasks. Then, we explored an array of evaluation methods and benchmarks to assess your refined model. The RLHF process ensures your models align with human values, making them helpful, honest, and safe. In the upcoming chapter, we’ll tackle RAG methods paired with vector databases – an essential technique to ground your enterprise data and minimize hallucinations in LLM-driven applications.


Vector search strategies

Vector search strategies are crucial because they determine how efficiently and accurately high-dimensional data (such as embeddings) can be queried and retrieved. Optimal strategies ensure that the most relevant and contextually appropriate results are returned. In vector-based searching, there are two main strategies: exact search and approximate search.

Exact search

The exact search method, as the term suggests, directly compares a query vector with the vectors in the database. It uses an exhaustive approach to identify the true closest neighbors, leaving minimal to no room for error.

This is typically what the traditional KNN method employs. A traditional KNN search uses brute force to find the K-nearest neighbors, which demands a thorough comparison of the input vector with every other vector in the dataset. Although computing the similarity for a single pair of vectors is typically quick, the process becomes time-consuming and resource-intensive over extensive datasets because of the vast number of required comparisons. For instance, if you had a dataset of one million vectors and wanted to find the nearest neighbors for a single input vector, traditional KNN would require one million distance computations. This can be thought of as looking up a friend's phone number in a phone book by checking each entry one by one rather than using a more efficient search strategy, which we will discuss in the next section.
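To make the brute-force approach concrete, here is a minimal sketch of an exact nearest-neighbor search in NumPy using cosine similarity; the dataset size, dimensionality, and random vectors are illustrative assumptions rather than real data.

import numpy as np

# Illustrative dataset: 100,000 random vectors of dimension 128
rng = np.random.default_rng(0)
dataset = rng.normal(size=(100_000, 128)).astype("float32")
query = rng.normal(size=128).astype("float32")

# Normalize so that a dot product equals cosine similarity
dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# Exact (brute-force) search: compare the query against every stored vector
similarities = dataset @ query          # one similarity score per vector
k = 5
top_k = np.argsort(-similarities)[:k]   # indices of the K nearest neighbors
print(top_k, similarities[top_k])

Every query pays the full cost of scanning the dataset, which is exactly the scaling problem that approximate methods address.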

Approximate nearest neighbors (ANNs)

In modern vector DBs, the search strategy known as ANN stands out as a powerful technique that quickly finds the near-closest data points in high-dimensional spaces. Unlike KNN, ANN prioritizes search speed at the expense of a slight loss in accuracy. Additionally, for it to function effectively, a vector index must be built beforehand.

The process of vector indexing

The process of vector indexing involves the organization of embeddings in a data structure called an index, which can be traversed quickly for retrieval purposes. Many ANN algorithms aid in forming a vector index, all aiming for rapid querying by creating an efficiently traversable data structure. Typically, they compress the original vector representation to enhance the search process.

There are numerous indexing algorithms, and this is an active research area. ANN indexes can be broadly classified into tree-based, graph-based, hash-based, and quantization-based indexes. In this section, we will cover the two most popular indexing algorithms. When creating an LLM application, you don't need to dive deep into the indexing process since many vector databases provide it as a service. But it's important to choose the right type of index for your specific needs to ensure efficient data retrieval:

  • Hierarchical navigable small world (HNSW): This is a method for approximate similarity search in high-dimensional spaces. HNSW is a graph-based index that works by creating a hierarchical graph structure, where each node represents a data point and edges connect similar data points. This hierarchical structure allows for efficient search operations, as it narrows down the search space quickly. HNSW is well suited for similarity search use cases, such as content-based recommendation systems and text search.

If you wish to dive deeper into its workings, we recommend checking out this research paper: https://arxiv.org/abs/1603.09320.

The following image is a representation of the HNSW index:

Figure 4.4 – Representation of HNSW index

The image illustrates the HNSW graph structure used for efficient similarity searches. The graph is constructed in layers, with decreasing density from the bottom to the top. Each layer’s characteristic radius reduces as we ascend, creating sparser connections. The depicted search path, using the red dotted lines, showcases the algorithm’s strategy; it starts from the sparsest top layer, quickly navigating vast data regions, and then refines its search in the denser lower layers, minimizing the overall comparisons and enhancing search efficiency.

  • Facebook AI Similarity Search (FAISS): FAISS, developed by Facebook AI Research, is a library designed for efficient similarity search and clustering of high-dimensional vectors. It uses product quantization to compress data during indexing, accelerating similarity searches in vast datasets. This method divides the vector space into regions known as Voronoi cells, each represented by a centroid. The primary purpose is to minimize storage needs and expedite searches, though it may slightly compromise accuracy. To visualize this, consider the following image. The Voronoi cells denote the regions produced by quantization, and the labeled points within these cells are the centroids, or representative vectors. When a new vector is indexed, it is assigned to its closest centroid. For searches, FAISS pinpoints the Voronoi cells most likely to contain the nearest neighbors and then narrows down the search within those cells, significantly cutting down the number of distance calculations:

Figure 4.5 – Representation of FAISS index

It excels in applications such as image and video search, recommendation systems, and any task that involves searching for nearest neighbors in high-dimensional spaces, thanks to its performance optimizations and built-in GPU support.
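As an illustration of how such an index is built and queried, here is a minimal sketch using the faiss library with an IVF-PQ index (product quantization over Voronoi cells); the random data and parameter values (number of cells, sub-quantizers, probes) are illustrative assumptions rather than tuned settings.

import faiss
import numpy as np

d = 128                                  # vector dimension
rng = np.random.default_rng(0)
xb = rng.normal(size=(100_000, d)).astype("float32")  # vectors to index
xq = rng.normal(size=(1, d)).astype("float32")        # query vector

nlist, m, nbits = 100, 16, 8             # Voronoi cells, sub-quantizers, bits each
quantizer = faiss.IndexFlatL2(d)         # coarse quantizer that assigns cells
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)                          # learn the cell centroids and codebooks
index.add(xb)                            # compress and store the vectors
index.nprobe = 10                        # how many cells to visit at query time

distances, ids = index.search(xq, 5)     # approximate 5 nearest neighbors
print(ids, distances)

Increasing nprobe visits more Voronoi cells per query, trading speed for recall.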

In this section, we covered indexing and the role of ANNs in index creation. Next, we’ll explore similarity measures, how they differ from indexing, and their impact on improving data retrieval.


Vector stores

As generative AI applications continue to push the boundaries of what’s possible in tech, vector stores have emerged as a crucial component, streamlining and optimizing the search and retrieval of relevant data. In our previous discussions, we’ve delved into the advantages of vector DBs over traditional databases, unraveling the concepts of vectors, embeddings, vector search strategies, approximate nearest neighbors (ANNs), and similarity measures. In this section, we aim to provide an integrative understanding of these concepts within the realm of vector DBs and libraries.

The image illustrates a workflow for transforming different types of data (Audio, Text, and Videos) into vector embeddings.

  • Audio: An audio input is processed through an “Audio Embedding model,” resulting in “Audio vector embeddings.”
  • Text: Textual data undergoes processing in a “Text Embedding model,” leading to “Text vector embeddings.”
  • Videos: Video content is processed using a “Video Embedding model,” generating “Video vector embeddings.”

Once these embeddings are created, they are subsequently utilized (potentially in an enterprise vector database system) to perform “Similarity Search” operations. This implies that the vector embeddings can be compared to find similarities, making them valuable for tasks such as content recommendations, data retrieval, and more.

Figure 4.9 – Multimodal embeddings process in an AI application

What is a vector database?

A vector database (vector DB) is a specialized database designed to handle high-dimensional vectors, primarily generated from embeddings of complex data types such as text, images, or audio. It provides capabilities to store and index unstructured data and offers search and retrieval as a service.
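For a feel of how this looks in practice, the following minimal sketch uses the open source Chroma client to store a few documents and run a similarity query; the collection name, documents, and metadata are made up for illustration, and the calls reflect the Chroma Python API as of early 2024, so treat the exact signatures as assumptions.

import chromadb

client = chromadb.Client()  # in-memory instance; persistent clients are also available

# Create a collection; Chroma embeds the documents with its default embedding model
collection = client.create_collection(name="support_articles")
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "How to reset your corporate laptop password.",
        "Steps to submit an expense report.",
        "Troubleshooting VPN connection issues.",
    ],
    metadatas=[{"team": "IT"}, {"team": "Finance"}, {"team": "IT"}],
)

# Query with natural language; the text is embedded and matched by vector similarity
results = collection.query(query_texts=["I forgot my password"], n_results=2)
print(results["ids"], results["distances"])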

Modern vector databases are brimming with advancements that empower you to architect resilient enterprise solutions. Here, we list 15 key features to consider when choosing a vector DB. Not every feature will matter for your use case, but the list is a good place to start. Keep in mind that this area is changing fast, so more features may emerge in the future:

  • Indexing: As mentioned earlier, indexing refers to the process of organizing high-dimensional vectors in a way that allows for efficient similarity searches and retrievals, and a vector DB offers this as a built-in feature. Previously, we discussed indexing algorithms such as FAISS and HNSW; many vector DBs incorporate them natively. For instance, Azure AI Search integrates the HNSW indexing service directly.
  • Search and retrieval: Instead of relying on exact matches, as traditional databases do, vector DBs provide vector search capabilities as a service, such as approximate nearest neighbors (ANNs), to quickly find vectors that are roughly the closest to a given input. To quantify the closeness or similarity between vectors, they utilize similarity measures such as cosine similarity or Euclidean distance, enabling efficient and nuanced similarity-based searches in large datasets.
  • Create, read, update, and delete: A vector DB manages high-dimensional vectors and offers create, read, update, and delete (CRUD) operations tailored to vectorized data. When vectors are created, they're indexed for efficient retrieval. Reading often means performing similarity searches to retrieve the vectors closest to a given query vector, typically using methods such as ANN. Vectors can be updated, potentially necessitating re-indexing, and they can also be deleted, with the database adjusting its internal structures accordingly to maintain efficiency and consistency.
  • Security: Look for compliance with regulations and standards such as GDPR, SOC 2 Type II, and HIPAA, easy management of console access with SSO, encryption of data at rest and in transit, and granular identity and access management features.
  • Serverless: A high-quality vector database is designed to gracefully autoscale with low management overhead as data volumes soar into millions or billions of entries, distributing seamlessly across several nodes. Optimal vector databases grant users the flexibility to adjust the system in response to shifts in data insertion, query frequencies, and underlying hardware configurations.
  • Hybrid search: Hybrid search combines traditional keyword-based search methods with other search mechanisms, such as semantic or contextual search, to retrieve results from both the exact term matches and by understanding the underlying intent or context of the query, ensuring a more comprehensive and relevant set of results.
  • Semantic re-ranking: This is a secondary ranking step to improve the relevance of search results. It re-ranks results that were initially scored by state-of-the-art ranking algorithms such as BM25 and RRF, based on language understanding. For instance, Azure AI Search employs a secondary ranking step that uses multilingual deep learning models derived from Microsoft Bing to elevate the results that are most relevant in terms of meaning.
  • Auto vectorization/embedding: Auto-embedding in a vector database refers to the automatic process of converting data items into vector representations for efficient similarity searches and retrieval, with access to multiple embedding models.
  • Data replication: This ensures data availability, redundancy, and recovery in case of failures, safeguarding business continuity and reducing data loss risks.
  • Concurrent user access and data isolation: Vector databases support a large number of concurrent users and provide robust data isolation so that updates remain private unless deliberately shared.
  • Auto-chunking: Auto-chunking is the automated process of dividing a larger set of data or content into smaller, manageable pieces or chunks for easier processing or understanding. This process helps preserve the semantic relevance of texts and addresses the token limitations of embedding models. We will learn more about chunking strategies in the upcoming sections in this chapter.
  • Extensive interaction tools: Prominent vector databases, such as Pinecone, offer versatile APIs and SDKs across languages, ensuring adaptability in integration and management.
  • Easy integration: Vector DBs provide seamless integration with LLM orchestration frameworks and SDKs, such as LangChain and Semantic Kernel, and leading cloud providers, such as Azure, GCP, and AWS.
  • User-friendly interface: This ensures an intuitive platform with simple navigation and direct feature access, streamlining the user experience.
  • Flexible pricing models: Pricing that adapts to user needs helps keep costs low.
  • Low downtime and high resiliency: Resiliency in a vector database (or any database) refers to its ability to recover quickly from failures, maintain data integrity, and ensure continuous availability even in the face of adverse conditions, such as hardware malfunctions, software bugs, or other unexpected disruptions.

As of early 2024, a few prominent open source vector databases include Chroma, Milvus, Qdrant, and Weaviate, while Pinecone and Azure AI Search are among the leading proprietary solutions.


The role of vector DBs in retrieval-augmented generation (RAG)

To fully understand RAG and the pivotal role of vector DBs within it, we must first acknowledge the inherent constraints of LLMs, which paved the way for the advent of RAG techniques powered by vector DBs. This section sheds light on the specific LLM challenges that RAG aims to overcome and the importance of vector DBs.

First, the big question – Why?

In Chapter 1, we delved into the limitations of LLMs, which include the following:

  • LLMs possess a fixed knowledge base determined by their training data; as of February 2024, ChatGPT’s knowledge is limited to information up until April 2023.
  • LLMs can occasionally produce false narratives, spinning tales or facts that aren’t real.
  • They lack personal memory, relying solely on the context that fits within the input length. For example, GPT-4-32K can only process up to 32K tokens across prompts and completions (we'll dive deeper into prompts, completions, and tokens in Chapter 5).

To counter these challenges, a promising avenue is enhancing LLM generation with retrieval components. These components can extract pertinent data from external knowledge bases—a process termed RAG, which we’ll explore further in this section.

So, what is RAG, and how does it help LLMs?

Retrieval-augmented generation (RAG) was first introduced in the paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (https://arxiv.org/pdf/2005.11401.pdf), published in 2020 by Facebook AI Research (now Meta). RAG is an approach that combines the generative capabilities of LLMs with retrieval mechanisms to extract relevant information from vast datasets. LLMs, such as the GPT variants, can generate human-like text based on patterns in their training data but lack the means to perform real-time external lookups or reference specific external knowledge bases post-training. RAG addresses this limitation by using a retrieval model to query a dataset and fetch relevant information, which then serves as the context for the generative model to produce a detailed and informed response. This also grounds the LLM queries in relevant information, reducing the chances of hallucinations.

The critical role of vector DBs

A vector DB plays a crucial role in facilitating the efficient retrieval aspect of RAG. In this setup, each piece of information, such as text, video, or audio, in the dataset is represented as a high-dimensional vector and indexed in a vector DB. When a user query comes in, it's also converted into a similar vector representation. The vector DB then rapidly searches for the vectors (documents) in the dataset that are closest to the query vector, leveraging techniques such as ANN search. The retrieved content is then attached to the query and sent to the LLM to generate a response. This ensures that the most relevant information is retrieved quickly and efficiently, providing a foundation for the generative model to build upon.
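Here is a minimal end-to-end sketch of that flow, assuming the OpenAI Python SDK (v1) with an API key configured and FAISS as a stand-in vector store; the model names, sample documents, and question are illustrative.

import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

documents = [
    "Our return policy allows refunds within 30 days of purchase.",
    "Support tickets are answered within 24 hours on business days.",
    "Premium customers receive free shipping on all orders.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

# 1. Index the document embeddings in a vector store (FAISS here)
doc_vectors = embed(documents)
index = faiss.IndexFlatL2(doc_vectors.shape[1])
index.add(doc_vectors)

# 2. Embed the user query and retrieve the closest documents
question = "How long do I have to return a product?"
_, ids = index.search(embed([question]), 2)
context = "\n".join(documents[i] for i in ids[0])

# 3. Ground the LLM with the retrieved context
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(completion.choices[0].message.content)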


Chunking strategies

In our last discussion, we delved into vector DBs and RAG. Before data can be retrieved for RAG, we need to efficiently house our embedded data. While we touched upon indexing methods to speed up data fetching, there's another crucial step that comes even before that: chunking.

What is chunking?

In the context of building LLM applications with embedding models, chunking involves dividing a long piece of text into smaller, manageable pieces, or "chunks," that fit within the model's token limit. The process involves breaking text into smaller segments before sending them to the embedding models. As shown in the following image, chunking happens before the embedding process. Different documents have different structures, such as free-flowing text, code, or HTML, so different chunking strategies can be applied to attain optimal results. Tools such as LangChain provide functionalities to chunk your data efficiently based on the nature of the text.

The diagram below depicts a data processing workflow, highlighting the chunking step, starting with raw “Data sources” that are converted into “Documents.” Central to this workflow is the “Chunk” stage, where a “TextSplitter” breaks the data into smaller segments. These chunks are then transformed into numerical representations using an “Embedding model” and are subsequently indexed into a “Vector DB” for efficient search and retrieval. The text associated with the retrieved chunks is then sent as context to the LLMs, which then generate a final response:

Figure 4.12 – Chunking process

But why is it needed?

Chunking is vital for two main reasons:

  • Chunking strategically divides document text to enhance its comprehension by embedding models, and it boosts the relevance of the content retrieved from a vector DB. Essentially, it refines the accuracy and context of the results sourced from the database.
  • It tackles the token constraints of embedding models. For instance, Azure OpenAI's embedding model text-embedding-ada-002 can handle up to 8,191 tokens, which is roughly 6,000 words (a token averages about four characters, or roughly three-quarters of a word). So, for optimal embeddings, it's crucial that our text stays within this limit.

Popular chunking strategies

  • Fixed-size chunking: This is a very common approach that defines a fixed chunk size (for example, 200 words), which is enough to capture the semantic meaning of a paragraph, and incorporates an overlap of about 10–15% as an input to the vector embedding generation model. Chunking data with a slight overlap between chunks ensures context preservation; it's advisable to begin with roughly 10% overlap. Below is a snippet of code that demonstrates fixed-size chunking with LangChain:

from langchain.text_splitter import TokenTextSplitter

text = (
    "Ladies and Gentlemen, esteemed colleagues, and honored guests. "
    "Esteemed leaders and distinguished members of the community. "
    "Esteemed judges and advisors. My fellow citizens. Last year, "
    "unprecedented challenges divided us. This year, we stand united, "
    "ready to move forward together"
)

# Split into chunks of 20 tokens with a 5-token overlap
text_splitter = TokenTextSplitter(chunk_size=20, chunk_overlap=5)
texts = text_splitter.split_text(text)
print(texts)

The output is the following:

['Ladies and Gentlemen, esteemed colleagues, and honored guests. Esteemed leaders and distinguished members', 'emed leaders and distinguished members of the community. Esteemed judges and advisors. My fellow citizens.', '. My fellow citizens. Last year, unprecedented challenges divided us. This year, we stand united,', ', we stand united, ready to move forward together']

  • Variable-size chunking: Variable-size chunking refers to the dynamic segmentation of data or text into varying-sized components, as opposed to fixed-size divisions. This approach accommodates the diverse structures and characteristics present in different types of data.
  • Sentence splitting: Sentence transformer models are neural architectures optimized for embedding at the sentence level. For example, BERT works best when text is chunked at the sentence level. Tools such as NLTK and spaCy provide functions to split the sentences within a text (see the sentence-splitting sketch after this list).
  • Specialized chunking: Documents such as research papers have a structured organization of sections, and formats such as Markdown have their own syntax; both call for specialized chunking that properly separates sections/pages and yields contextually relevant chunks.
  • Code chunking: When embedding code into your vector DB, this technique can be invaluable. LangChain supports code chunking for numerous languages. Below is a code snippet for chunking your Python code:

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)

PYTHON_CODE = """
class SimpleCalculator:
    def add(self, a, b):
        return a + b

    def subtract(self, a, b):
        return a - b

# Using the SimpleCalculator
calculator = SimpleCalculator()
sum_result = calculator.add(5, 3)
diff_result = calculator.subtract(5, 3)
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

The output is the following:

[Document(page_content='class SimpleCalculator:\n    def add(self, a, b):'),
 Document(page_content='return a + b'),
 Document(page_content='def subtract(self, a, b):'),
 Document(page_content='return a - b'),
 Document(page_content='# Using the SimpleCalculator'),
 Document(page_content='calculator = SimpleCalculator()'),
 Document(page_content='sum_result = calculator.add(5, 3)'),
 Document(page_content='diff_result = calculator.subtract(5, 3)')]
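As referenced in the sentence-splitting bullet above, here is a minimal sketch that splits text into sentence-level chunks with NLTK; the sample text is illustrative, and the tokenizer download is a one-time setup step.

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download (newer NLTK versions may require "punkt_tab")

text = (
    "Last year, unprecedented challenges divided us. "
    "This year, we stand united. We are ready to move forward together."
)

# Each sentence becomes its own chunk, ready to be embedded individually
sentence_chunks = sent_tokenize(text)
print(sentence_chunks)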

Chunking considerations

Chunking strategies vary based on the data type and format and the chosen embedding model. For instance, code requires a distinct chunking approach compared to unstructured text. While models such as text-embedding-ada-002 excel with 256- and 512-token-sized chunks, our understanding of chunking is ever-evolving. Moreover, preprocessing plays a crucial role before chunking: you can optimize your content by removing unnecessary text, such as stop words and special symbols, that adds noise. For the latest techniques, we suggest regularly checking the text splitters section of the LangChain documentation to ensure you employ the best strategy for your needs (Split by tokens, from LangChain: https://python.langchain.com/docs/modules/data_connection/document_transformers/split_by_token).


Case study – Global chat application deployment by a multinational organization

A global firm recently launched an advanced internal chat application featuring a Q&A support chatbot. This innovative tool, deployed across various Azure regions, integrates several large language models, including the specialized finance model BloombergGPT. To meet specific organizational requirements, bespoke plugins were developed, including an integration with ServiceNow that empowers the chatbot to streamline ticket generation and oversee incident actions.

In terms of data refinement, the company meticulously preprocessed its knowledge base (KB) information, eliminating duplicates, special symbols, and stop words. The KB consisted of answers to frequently asked questions and general information for various support-related questions. They employed fixed-size chunking approaches, exploring varied chunk sizes, before embedding the data into Azure AI Search. Their methodology utilized Azure OpenAI's text-embedding-ada-002 models in tandem with the cosine similarity metric and Azure AI Search's vector search capabilities.

From their extensive testing, they discerned optimal results with a chunk size of 512 tokens and a 10% overlap. Moreover, they adopted an ANN vector search methodology using cosine similarity. They also incorporated hybrid search, which combined keyword and semantic search with a semantic re-ranker. Their RAG workflow, drawing context from Azure AI Search and the GPT-3.5 Turbo 16K model, proficiently generated responses to customer support inquiries. They implemented caching techniques using Azure Cache for Redis and rate-limiting strategies using Azure API Management to optimize costs.

The integration of the support Q&A chatbot significantly streamlined the multinational firm’s operations, offering around-the-clock, consistent, and immediate responses to queries, thereby enhancing user satisfaction. This not only brought about substantial cost savings by reducing human intervention but also ensured scalability to handle global demands. By automating tasks such as ticket generation, the firm gained deeper insights into user interactions, allowing for continuous improvement and refinement of their services.

Summary

In this chapter, we explored the RAG approach, a powerful method for leveraging your data to craft personalized experiences and reduce hallucinations, while also addressing the training limitations inherent in LLMs. Our journey began with an examination of foundational concepts such as vectors and databases, with a special focus on vector DBs. We understood the critical role that vector DBs play in the development of RAG-based applications, also highlighting how they can enhance LLM responses through effective chunking strategies. The discussion also covered practical insights on building engaging RAG experiences, evaluating them through prompt flow, and included a hands-on lab available on GitHub to apply what we've learned.

In the next chapter, we will introduce another popular technique designed to minimize hallucinations and more easily steer the responses of LLMs. We will cover prompt engineering strategies, empowering you to fully harness the capabilities of your LLMs and engage more effectively with AI. This exploration will provide you with the tools and knowledge to enhance your interactions with AI, ensuring more reliable and contextually relevant outputs.


Token limits in ChatGPT models

Token limits vary depending on the model. As of February 2024, the token limits for the GPT-4 family of models range from 8,192 to 128,000 tokens. The limit applies to the sum of prompt and completion tokens in an API call; for the GPT-4-32K model, this sum cannot exceed 32,768 tokens, so if the prompt is 30,000 tokens, the response cannot be more than 2,768 tokens. GPT-4 Turbo 128K is the most recent model as of February 2024, with a 128,000-token limit, which is close to 300 pages of text in a single prompt and completion. This is a massive context window compared to its predecessor models.

Though this can be a technical limitation, there are creative ways to work around it, such as chunking and condensing your prompts. We discussed chunking strategies in Chapter 4, which can help you address token limitations.
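One simple way to stay within these limits is to count tokens before making the call. Here is a minimal sketch using the tiktoken library; the prompt text and the 32,768-token budget are illustrative.

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4-32k")
prompt = "Summarize the attached quarterly report in five bullet points."

prompt_tokens = len(encoding.encode(prompt))
context_limit = 32_768                 # GPT-4-32K prompt + completion budget
max_completion = context_limit - prompt_tokens
print(f"Prompt uses {prompt_tokens} tokens; up to {max_completion} remain for the completion.")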

The following figure shows various models and token limits:

Model                      Token limit
GPT-3.5-turbo              4,096
GPT-3.5-turbo-16k          16,384
GPT-3.5-turbo-0613         4,096
GPT-3.5-turbo-16k-0613     16,384
GPT-4                      8,192
GPT-4-0613                 8,192
GPT-4-32K                  32,768
GPT-4-32K-0613             32,768
GPT-4-Turbo 128K           128,000

Figure 5.4 – Models and associated Token Limits

For the latest updates on model limits for newer versions of models, please check the OpenAI website.

Tokens and cost considerations

The cost of using ChatGPT or similar models via an API is often tied to the number of tokens processed, encompassing both the input prompts and the model’s generated responses.

In terms of pricing, providers typically have a per-token charge, leading to a direct correlation between conversation length and cost; the more tokens processed, the higher the cost. The latest cost updates can be found on the OpenAI website.

From an optimization perspective, understanding this cost-token relationship can guide more efficient API usage. For instance, creating more succinct prompts and configuring the model for brief yet effective responses can help control token count and, consequently, manage expenses.
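As a rough illustration, token counts can be turned into a cost estimate; the per-1,000-token prices below are placeholders, so always substitute the current figures from the OpenAI pricing page.

import tiktoken

# Hypothetical prices per 1,000 tokens; replace with current values from the pricing page
PRICE_PER_1K_PROMPT = 0.01
PRICE_PER_1K_COMPLETION = 0.03

encoding = tiktoken.encoding_for_model("gpt-4")
prompt = "Draft a short, friendly reminder email about tomorrow's 9 AM stand-up."
completion = "Hi team, a quick reminder that our stand-up is tomorrow at 9 AM. See you there!"

prompt_tokens = len(encoding.encode(prompt))
completion_tokens = len(encoding.encode(completion))
cost = (prompt_tokens / 1000) * PRICE_PER_1K_PROMPT + \
       (completion_tokens / 1000) * PRICE_PER_1K_COMPLETION
print(f"{prompt_tokens} prompt + {completion_tokens} completion tokens -> ${cost:.5f} estimated")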

We hope you now have a good understanding of the key components of a prompt. Now, you are ready to learn about prompt engineering. In the next section, we will explore the details of prompt engineering and effective strategies, enabling you to maximize the potential of your prompt contents through the one-shot and few-shot learning approaches.


Prompt parameters

ChatGPT prompt parameters are variables that you can set in the API calls. They allow users to influence the model’s output, customizing the behavior of the model to better fit specific applications or contexts. The following table shows some of the most important parameters of a ChatGPT API call:

Figure 5.6 – Essential Prompt Parameters

In this section, only the top parameters for building an effective prompt are highlighted. For a full list of parameters, refer to the OpenAI API reference (https://platform.openai.com/docs/api-reference).

ChatGPT roles

System message

This is the part where you design your metaprompts. Metaprompts help set the initial context, theme, and behavior of the ChatGPT API to guide the model's interactions with the user, thus setting roles or response styles for the assistant.

Metaprompts are structured instructions or guidelines that dictate how the system should interpret and respond to user requests. These metaprompts are designed to ensure that the system’s outputs adhere to specific policies, ethical guidelines, or operational rules. They’re essentially “prompts about how to handle prompts,” guiding the system in generating responses, handling data, or interacting with users in a way that aligns with predefined standards.

The following table is a metaprompt framework that you can follow to design the ChatGPT system message:

Figure 5.7 – Elements of a Metaprompt

User

The messages from the user serve as prompts or remarks that the assistant is expected to react to or engage with, and they establish the anticipated scope of queries that may come from the user.
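Bringing the two roles together, here is a minimal sketch of an API call that sets a metaprompt in the system message and a question in the user message, assuming the OpenAI Python SDK (v1) with an API key configured; the model name and parameter values are illustrative.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": (
                "You are an AI nutrition consultant. Answer only questions about "
                "nutrition, and politely decline anything outside that scope."
            ),
        },
        {"role": "user", "content": "What should I eat before a morning run?"},
    ],
    temperature=0.7,   # higher values increase creativity and variability
    top_p=1.0,
    max_tokens=200,    # caps the length of the completion
)
print(response.choices[0].message.content)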


Techniques for effective prompt engineering

In the past two years, a wide array of prompt engineering techniques have been developed. This section focuses on the essential ones, offering key strategies that you might find indispensable for daily interactions with ChatGPT and other LLM-based applications.

N-shot prompting

N-shot prompting is a term used in the context of prompting large language models, particularly for zero-shot or few-shot learning tasks. It is also called in-context learning and refers to the technique of providing the model with example prompts along with corresponding responses within the prompt to steer the model's behavior toward more accurate responses.

The “N” in “N-shot” refers to the number of example prompts provided to the model. For instance, in a one-shot learning scenario, only one example prompt and its response are given to the model. In an N-shot learning scenario, multiple example prompts and responses are provided.

While ChatGPT works great with zero-shot prompting, it may sometimes be useful to provide examples for a more accurate response. Let’s see some examples of zero-shot and few-shot prompting:

Figure 5.8 – N-shot prompting examples
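In API calls, few-shot examples are typically supplied as prior user/assistant message pairs that the model imitates. Here is a minimal sketch, again assuming the OpenAI Python SDK (v1) with an API key configured, with made-up review examples:

from openai import OpenAI

client = OpenAI()

# Two example pairs (the "shots") precede the real question
messages = [
    {"role": "system", "content": "Classify the sentiment of each review as Positive or Negative."},
    {"role": "user", "content": "The battery lasts all day and the screen is gorgeous."},
    {"role": "assistant", "content": "Positive"},
    {"role": "user", "content": "It stopped working after a week and support never replied."},
    {"role": "assistant", "content": "Negative"},
    {"role": "user", "content": "Setup was painless and it runs quietly."},
]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages, temperature=0)
print(response.choices[0].message.content)  # expected: "Positive"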

Chain-of-thought (CoT) prompting

Chain-of-thought prompting refers to a sequence of intermediate reasoning steps, significantly boosting the capability of large language models to tackle complex reasoning tasks. By presenting a few chain-of-thought demonstrations as examples in the prompts, the models handle intricate reasoning tasks more proficiently:

Figure 5.9 – Chain-of-Thought Prompting Examples

Figure sourced from https://arxiv.org/pdf/2201.11903.pdf.

Program-aided language (PAL) models

Program-aided language (PAL) models, also called program-of-thought (PoT) prompting, incorporate additional task-specific instructions, pseudo-code, rules, or programs alongside free-form text to guide the behavior of a language model:

Figure 5.10 – Program-aided language prompting examples

Figure sourced from https://arxiv.org/abs/2211.10435.

In this section, we have not explored all prompt engineering techniques, only the most important ones; there are numerous variants of these techniques, as illustrated in the following figure from the research paper A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications (https://arxiv.org/pdf/2402.07927.pdf). This paper provides an extensive inventory of prompt engineering strategies across various application areas, showcasing the evolution and breadth of this field over the last four years:

Figure 5.11 – Taxonomy of prompt engineering techniques across multiple application domains


Prompt engineering best practices

In the following list, we outline additional best practices to optimize and enhance your experience with prompt creation:

  • Clarity and precision for accurate responses: Ensure that prompts are clear, concise, and specific, avoiding ambiguity or multiple interpretations:

Figure 5.12 – Best practice: clarity and precision

  • Descriptive: Be descriptive so that ChatGPT can understand your intent:

Figure 5.13 – Best practice: be descriptive

  • Format the output: Mention the format of the output, which can be bullet points, paragraphs, sentences, tables, and languages, such as XML, HTML, and JSON. Use examples to articulate the desired output.
  • Adjust the Temperature and Top_p parameters for creativity: As indicated in the parameters section, modifying Temperature and Top_p can significantly influence the variability of the model's output. In scenarios that call for creativity and imagination, raising the temperature proves beneficial. On the other hand, when dealing with legal applications that demand a reduction in hallucinations, a lower temperature becomes advantageous.
  • Use syntax as separators in prompts: For a more effective output, use """ or ### to separate the instruction and input data, as in this example:

Example:

Convert the text below to Spanish

Text: """

{text input here}

"""

  • The order of prompt elements matters: It has been found, in certain instances, that giving an instruction before an example can improve the quality of your outputs. Additionally, the order of examples can affect the output of prompts.
  • Use guiding words: This helps steer the model toward a specific structure, such as the text highlighted in the following:

Example:

#Create a basic Python function that

#1. Requests the user to enter a temperature in Celsius

#2. Converts the Celsius temperature to Fahrenheit

def ctf():

  • Instead of saying what not to provide, give alternative recommendations: Provide an alternative path if ChatGPT is unable to perform a task, such as in the following highlighted message:

Example:

System Message: You are an AI nutrition consultant that provides nutrition consultation based on the health and wellness goals of the customer. Please note that any questions or inquiries beyond the scope of nutrition consultation will NOT be answered and will instead receive the response: "Sorry! This question falls outside my domain of expertise!"

Customer: How do I invest in 401K?

Nutrition AI Assistant: “Sorry! This question falls outside my domain of expertise!”

  • Provide example-based prompts: This helps the language model learn from specific instances and patterns. Start with a zero-shot, then a few-shot, and if neither of them works, then fine-tune the model.
  • Ask ChatGPT to provide citations/sources: When asking ChatGPT to provide information, you can ask it to answer only using reliable sources and to cite the sources:

Figure 5.14 – Best practice: provide citations

  • Break down a complex task into simpler tasks: See the following example:

Figure 5.15 – Best practice: break down a complex task