
Case study – Global chat application deployment by a multinational organization

A global firm recently launched an advanced internal chat application featuring a Q&A support chatbot. This innovative tool, deployed across various Azure regions, integrates several large language models, including the specialized finance model, BloombergGPT. To meet specific organizational requirements, bespoke plugins were developed, including an integration with ServiceNow that empowers the chatbot to streamline ticket generation and oversee incident actions.

In terms of data refinement, the company meticulously preprocessed its knowledge base (KB), eliminating duplicates, special symbols, and stop words. The KB consisted of answers to frequently asked questions and general information addressing various support-related queries. They employed fixed chunking approaches, exploring varied chunk sizes, before embedding the data into Azure AI Search. Their methodology utilized Azure OpenAI's text-embedding-ada-002 model in tandem with the cosine similarity metric and Azure AI Search's vector search capabilities.

From their extensive testing, they discerned optimal results with a chunk size of 512 tokens and a 10% overlap. They adopted an approximate nearest neighbor (ANN) vector search methodology using cosine similarity, and also incorporated hybrid search, combining keyword and semantic search with the Semantic Reranker. Their RAG workflow, drawing context from Azure AI Search's vector search and the GPT-3.5 Turbo 16K model, proficiently generated responses to customer support inquiries. They implemented caching techniques using Azure Cache for Redis and rate-limiting strategies using Azure API Management to optimize costs.
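As a rough illustration of the fixed chunking approach described above, the following minimal sketch splits text into 512-token chunks with a 10% overlap using the tiktoken library. The encoding name is the one commonly associated with text-embedding-ada-002; the pipeline details are our assumptions, not the firm's actual code:

import tiktoken

def chunk_text(text, chunk_size=512, overlap_ratio=0.10):
    # cl100k_base is the encoding commonly used with text-embedding-ada-002
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    stride = int(chunk_size * (1 - overlap_ratio))  # 10% overlap means a step of ~460 tokens
    chunks = []
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + chunk_size]
        chunks.append(encoding.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

Each resulting chunk would then be embedded and pushed to the search index; the overlap ensures that sentences falling on a chunk boundary still appear intact in at least one chunk.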

The integration of the support Q&A chatbot significantly streamlined the multinational firm’s operations, offering around-the-clock, consistent, and immediate responses to queries, thereby enhancing user satisfaction. This not only brought about substantial cost savings by reducing human intervention but also ensured scalability to handle global demands. By automating tasks such as ticket generation, the firm gained deeper insights into user interactions, allowing for continuous improvement and refinement of their services.

Summary

In this chapter, we explored the RAG approach, a powerful method for leveraging your data to craft personalized experiences, reduce hallucinations, and address the training limitations inherent in LLMs. Our journey began with an examination of foundational concepts such as vectors and databases, with a special focus on vector databases. We saw the critical role that vector DBs play in the development of RAG-based applications and highlighted how they can enhance LLM responses through effective chunking strategies. The discussion also covered practical insights on building engaging RAG experiences and evaluating them through prompt flow, and included a hands-on lab available on GitHub to apply what we've learned.

In the next chapter, we will introduce another popular technique designed to minimize hallucinations and more easily steer the responses of LLMs. We will cover prompt engineering strategies, empowering you to fully harness the capabilities of your LLMs and engage more effectively with AI. This exploration will provide you with the tools and knowledge to enhance your interactions with AI, ensuring more reliable and contextually relevant outputs.

The essentials of prompt engineering

Before discussing prompt engineering, it is important to first understand the foundational components of a prompt. In this section, we'll delve into key concepts such as ChatGPT prompts, completions, and tokens. Grasping what tokens are is pivotal to understanding the model's constraints and managing costs.

ChatGPT prompts and completions

A prompt is an input provided to LLMs, whereas completions refer to the output of LLMs. The structure and content of a prompt can vary based on the type of LLM (e.g., the text or image generation model), specific use cases, and the desired output of the language model.

Completions refer to the responses generated in reply to ChatGPT prompts; essentially, a completion is the answer to your question. Check out the following example to understand the difference between prompts and completions when we prompt ChatGPT with, "What is the capital of India?"

Figure 5.2 – An image showing a sample LLM prompt and completion

Based on the use case, we can leverage one of two ChatGPT API calls, Completions or ChatCompletions, to interact with the model. However, OpenAI recommends using the ChatCompletions API in the majority of scenarios.

Completions API

The Completions API is designed to generate creative, free-form text. You provide a prompt, and the API generates text that continues from it. This is often used for tasks where you want the model to answer a question or generate creative text, such as for writing an article or a poem.

ChatCompletions API

The ChatCompletions API is designed for multi-turn conversations. You send a series of messages instead of a single prompt, and the model generates a message as a response. The messages sent to the model include a role (which can be a system, user, or assistant) and the content of the message. The system role is used to set the behavior of the assistant, the user role is used to instruct the assistant, and the model’s responses are under the assistant role.

The following is a sample ChatCompletions API call (note that the follow-up question comes from the user role):

import openai

openai.api_key = 'your-api-key'

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful sports assistant."},
        {"role": "user", "content": "Who won the cricket world cup in 2011?"},
        {"role": "assistant", "content": "India won the cricket world cup in 2011."},
        {"role": "user", "content": "Where was it played?"}
    ]
)

print(response['choices'][0]['message']['content'])

The main difference between the Completions API and ChatCompletions API is that the Completions API is designed for single-turn tasks, while the ChatCompletions API is designed to handle multiple turns in a conversation, making it more suitable for building conversational agents. However, the ChatCompletions API format can be modified to behave as a Completions API by using a single user message.
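For instance, a minimal sketch of a ChatCompletions call reduced to a single user message, reusing the openai client set up above, behaves like a single-turn Completions call:

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        # A single user message, with no prior turns, mimics a one-shot completion
        {"role": "user", "content": "Write a two-line poem about the monsoon."}
    ]
)
print(response['choices'][0]['message']['content'])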

Important note

The Completions API, launched in June 2020, initially offered a freeform text interface to OpenAI's language models. However, experience has shown that structured prompts often yield better outcomes. The chat-based approach, especially through the ChatCompletions API, excels in addressing a wide array of needs, offering enhanced flexibility and specificity and reducing prompt injection risks. Its design supports multi-turn conversations and a variety of tasks, enabling developers to create advanced conversational experiences. Hence, OpenAI announced that it would deprecate some of the older models that use the Completions API and, moving forward, would invest in the ChatCompletions API to optimize its use of compute capacity. While the Completions API will remain accessible, it will be labeled as "legacy" in the OpenAI developer documentation.

Tokens

Understanding the concept of tokens is essential, as it helps us better comprehend restrictions such as model limits and manage costs when utilizing ChatGPT.

A ChatGPT token is a unit of text that ChatGPT's language model uses to understand and generate language. In ChatGPT, a token is a sequence of characters that the model uses to generate new sequences of tokens and form a coherent response to a given prompt. The models use tokens to represent words, phrases, and other language elements. Tokens do not necessarily begin or end at word boundaries; they can include trailing spaces, subwords, and punctuation, too.

As stated on the OpenAI website, tokens can be thought of as pieces of words. Before the API processes the prompts, the input is broken down into tokens.

To understand tokens in terms of lengths, the following is used as a rule of thumb:

  • 1 token ~= 4 chars in English
  • 1 token ~= ¾ words
  • 100 tokens ~= 75 words
  • 1–2 sentences ~= 30 tokens
  • 1 paragraph ~= 100 tokens
  • 1,500 words ~= 2048 tokens
  • 1 US page (8 ½” x 11”) ~= 450 tokens (assuming ~1800 characters per page)

For example, this famous quote from Thomas Edison (“Genius is one percent inspiration and ninety-nine percent perspiration.”) has 14 tokens:

Figure 5.3 – Tokenization of sentence

We used the OpenAI Tokenizer tool to calculate the tokens; the tool can be found at https://platform.openai.com/tokenizer. An alternative way to tokenize text programmatically is to use the Tiktoken library on GitHub, which can be found at https://github.com/openai/tiktoken.
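As a small sketch of the programmatic route, assuming tiktoken is installed (pip install tiktoken):

import tiktoken

# Look up the encoding that matches a given chat model
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
quote = "Genius is one percent inspiration and ninety-nine percent perspiration."
tokens = encoding.encode(quote)
print(len(tokens))  # prints the token count for the quote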

Token limits in ChatGPT models

Depending on the model, the token limits will vary. As of February 2024, the token limits for the family of GPT-4 models ranged from 8,192 to 128,000 tokens. The sum of prompt and completion tokens for an API call cannot exceed the model's limit; for the GPT-4-32K model, for example, that limit is 32,768 tokens, so if the prompt is 30,000 tokens, the response cannot be more than 2,768 tokens. GPT-4 Turbo 128K was the most recent model as of February 2024, with a limit of 128,000 tokens, which is close to 300 pages of text in a single prompt and completion. This is a massive context window compared to its predecessor models.

Though this can be a technical limitation, there are creative ways to work within it, such as chunking and condensing your prompts; the small sketch below illustrates the basic budget arithmetic. We discussed chunking strategies in Chapter 4, which can help you address token limitations.
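A trivial sketch of that arithmetic: the completion budget is simply the model limit minus the prompt length. The numbers mirror the GPT-4-32K example above:

def completion_budget(prompt_tokens, model_limit=32768):
    # Prompt tokens plus completion tokens cannot exceed the model limit
    return model_limit - prompt_tokens

print(completion_budget(30000))  # 2768 tokens left for the response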

The following figure shows various models and token limits:

Model                      Token Limit
GPT-3.5-turbo              4,096
GPT-3.5-turbo-16k          16,384
GPT-3.5-turbo-0613         4,096
GPT-3.5-turbo-16k-0613     16,384
GPT-4                      8,192
GPT-4-0613                 8,192
GPT-4-32K                  32,768
GPT-4-32K-0613             32,768
GPT-4-Turbo 128K           128,000

Figure 5.4 – Models and associated Token Limits

For the latest updates on model limits for newer versions of models, please check the OpenAI website.

Tokens and cost considerations

The cost of using ChatGPT or similar models via an API is often tied to the number of tokens processed, encompassing both the input prompts and the model’s generated responses.

In terms of pricing, providers typically have a per-token charge, leading to a direct correlation between conversation length and cost; the more tokens processed, the higher the cost. The latest cost updates can be found on the OpenAI website.

From an optimization perspective, understanding this cost-token relationship can guide more efficient API usage. For instance, creating more succinct prompts and configuring the model for brief yet effective responses can help control token count and, consequently, manage expenses.
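As an illustration, the usage field returned with every ChatCompletion response can be multiplied by per-token rates to estimate spend. The rates below are placeholders for illustration only, not actual OpenAI pricing, and response refers to a call like the ones shown earlier:

# Placeholder rates for illustration only; check OpenAI's pricing page for real values
PROMPT_RATE_PER_1K = 0.0015
COMPLETION_RATE_PER_1K = 0.0020

usage = response['usage']  # includes prompt_tokens and completion_tokens
cost = (usage['prompt_tokens'] / 1000) * PROMPT_RATE_PER_1K + \
       (usage['completion_tokens'] / 1000) * COMPLETION_RATE_PER_1K
print(f"Estimated call cost: ${cost:.6f}")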

We hope you now have a good understanding of the key components of a prompt. Now, you are ready to learn about prompt engineering. In the next section, we will explore the details of prompt engineering and effective strategies, enabling you to maximize the potential of your prompt contents through the one-shot and few-shot learning approaches.

What is prompt engineering?

Prompt engineering is the art of crafting or designing prompts to unlock desired outcomes from large language models or AI systems. The concept of prompt engineering revolves around the fundamental idea that the quality of your response is intricately tied to the quality of the question you pose. By strategically engineering prompts, one can influence the generated outputs and improve the overall performance and usefulness of the system. In this section, we will learn about the necessary elements of effective prompt design, prompt engineering techniques, best practices, bonus tips, and tricks.

Elements of a good prompt design

Designing a good prompt is important because it significantly influences the output of a language model such as GPT. The prompt provides the initial context, sets the task, guides the style and structure of the response, reduces ambiguities and hallucinations, and supports the optimization of resources, thereby reducing costs and energy use. In this section, let’s understand the elements of good prompt design.

The foundational elements of a good prompt include instructions, questions, input data, and examples:

  • Instructions: The instructions in a prompt refer to the specific guidelines or directions given to a language model within the input text to guide the kind of response it should produce.
  • Questions: Questions in a prompt refer to queries or interrogative statements included in the input text. The purpose of these questions is to instruct the language model to provide a response to the query. To obtain results, either a question or an instruction is mandatory.
  • Input data: The purpose of input data is to provide any additional supporting context when prompting the LLM. It could be used to provide new information the model has not previously been trained on for more personalized experiences.
  • Examples: The purpose of examples in a prompt is to provide specific instances or scenarios that illustrate the desired behavior or response from ChatGPT. You can input a prompt that includes one or more examples, typically in the form of input-output pairs.

The following table shows how to build effective prompts using the aforementioned prompt elements:

Figure 5.5 – Sample Prompt formula consisting of prompt elements with examples
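To make the formula concrete, here is a small, purely illustrative sketch that assembles the four elements into a single prompt string:

prompt = (
    "Instruction: Classify the sentiment of the review as Positive or Negative.\n"  # instruction
    "Example: 'The battery lasts all day.' -> Positive\n"                           # example (one-shot)
    "Review: 'The screen cracked within a week.'\n"                                 # input data
    "Question: What is the sentiment of this review?"                               # question
)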

Prompt parameters

ChatGPT prompt parameters are variables that you can set in the API calls. They allow users to influence the model’s output, customizing the behavior of the model to better fit specific applications or contexts. The following table shows some of the most important parameters of a ChatGPT API call:

Figure 5.6 – Essential Prompt Parameters

In this section, only the top parameters for building an effective prompt are highlighted. For a full list of parameters, refer to the OpenAI API reference (https://platform.openai.com/docs/api-reference).
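A brief sketch of how a few of these parameters are set on a call; the values are illustrative, not recommendations:

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in one sentence."}],
    temperature=0.2,  # lower values make the output more focused and deterministic
    max_tokens=60,    # caps the length of the completion
    top_p=1.0,        # nucleus sampling threshold; usually left at 1.0 when tuning temperature
    n=1               # number of completions to return
)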

ChatGPT roles

System message

This is the part where you design your metaprompts. Metaprompts help to set the initial context, theme, and behavior of the ChatGPT API to guide the model's interactions with the user, thus setting roles or response styles for the assistant.

Metaprompts are structured instructions or guidelines that dictate how the system should interpret and respond to user requests. These metaprompts are designed to ensure that the system’s outputs adhere to specific policies, ethical guidelines, or operational rules. They’re essentially “prompts about how to handle prompts,” guiding the system in generating responses, handling data, or interacting with users in a way that aligns with predefined standards.

The following table is a metaprompt framework that you can follow to design the ChatGPT system message:

Figure 5.7 – Elements of a Metaprompt

User

The messages from the user serve as prompts or remarks that the assistant is expected to react to or engage with. This establishes the anticipated scope of queries that may come from the user.

Assistant

While not mandatory, assistant messages can be included to illustrate the preferred conduct or response patterns of the chatbot.

Now let's look at an example. If you are developing an application for a fitness AI assistant, a well-rounded system message might look as follows:

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": """You are a fitness AI assistant. Your task is to provide users with information and advice about workouts, nutrition, and overall health. You will help users by answering their questions to the best of your abilities and providing suggestions based on their input. Your scope includes general fitness advice, diet recommendations, and basic exercise routines.

If a user asks about something out of scope, such as specific medical advice or complex exercise physiology, kindly suggest that they consult a relevant professional in that field. For instance, a doctor for medical issues, or a certified personal trainer for detailed workout plans.

You should maintain a supportive and positive tone, being respectful and encouraging to all users, regardless of their current fitness level or knowledge. Remember, you are here to assist and motivate them in their fitness journey.

Your responses should be concise, easy to understand, and practically useful. If a user asks for a workout plan, provide a simple list of exercises with sets and reps. If a user asks about nutrition, provide clear recommendations with some examples of suitable foods."""
        },
        {"role": "user", "content": "I'm new to fitness, can you suggest a simple workout plan to get started?"},
        {
            "role": "assistant",
            "content": """Absolutely, I'm here to help you get started on your fitness journey. Here's a simple workout plan you can begin with:

1. Warm-up: 5-10 minutes of brisk walking or light jogging
2. Squats: 2 sets of 10 reps
3. Push-ups: 2 sets of 10 reps (you can start with knee push-ups if full push-ups are too challenging)
4. Lunges: 2 sets of 10 reps per leg
5. Plank: Hold for 30 seconds, repeat 2 times
6. Cool down: 5-10 minutes of stretching

Remember, the key to a successful workout routine is consistency. Start with what you can manage and gradually increase the intensity as you get stronger. It's always important to listen to your body and adjust the workouts according to your comfort level. Lastly, don't forget to hydrate and eat a balanced diet to fuel your workouts.

Please consult a fitness professional if you're unsure about the correct form for any of these exercises to avoid injury. Enjoy your workout!"""
        }
    ]
)

The following link provides great examples of “Act As” prompts for you to leverage in your system message:

Awesome ChatGPT Prompts (https://github.com/f/awesome-chatgpt-prompts/).

Techniques for effective prompt engineering

In the past two years, a wide array of prompt engineering techniques has been developed. This section focuses on the essential ones, offering key strategies that you might find indispensable for daily interactions with ChatGPT and other LLM-based applications.

N-shot prompting

N-shot prompting is a term used in the context of steering large language models, particularly for zero-shot or few-shot learning tasks. It is also called in-context learning and refers to the technique of providing the model with example prompts and their corresponding responses directly in the prompt, at inference time, to steer the model's behavior toward more accurate responses.

The “N” in “N-shot” refers to the number of example prompts provided to the model. For instance, in a one-shot learning scenario, only one example prompt and its response are given to the model. In an N-shot learning scenario, multiple example prompts and responses are provided.

While ChatGPT works great with zero-shot prompting, it may sometimes be useful to provide examples for a more accurate response. Let’s see some examples of zero-shot and few-shot prompting:

Figure 5.8 – N-shot prompting examples
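In the ChatCompletions format, few-shot examples are typically supplied as prior user/assistant turns, as in this illustrative sketch:

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You classify support tickets as Billing, Technical, or Other."},
        # Few-shot examples provided as earlier turns
        {"role": "user", "content": "I was charged twice this month."},
        {"role": "assistant", "content": "Billing"},
        {"role": "user", "content": "The app crashes when I upload a file."},
        {"role": "assistant", "content": "Technical"},
        # The actual query the model should now classify
        {"role": "user", "content": "How do I update my payment method?"}
    ]
)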

Chain-of-thought (CoT) prompting

Chain-of-thought prompting refers to a sequence of intermediate reasoning steps, significantly boosting the capability of large language models to tackle complex reasoning tasks. By presenting a few chain-of-thought demonstrations as examples in the prompts, the models proficiently handle intricate reasoning tasks:

Figure 5.9 – Chain-of-Thought Prompting Examples

Figure sourced from https://arxiv.org/pdf/2201.11903.pdf.
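A one-shot chain-of-thought sketch in the spirit of that figure; the worked example demonstrates the reasoning steps the model is expected to imitate:

cot_prompt = (
    "Q: A cafeteria had 23 apples. They used 20 for lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A: They started with 23 apples. They used 20, leaving 23 - 20 = 3. "
    "They bought 6 more, so 3 + 6 = 9. The answer is 9.\n"
    "Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. "
    "How many tennis balls does he have now?\n"
    "A:"  # the model is expected to continue with step-by-step reasoning
)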

Program-aided language (PAL) models

Program-aided language (PAL) modeling, also called program-of-thought (PoT) prompting, is a technique that incorporates additional task-specific instructions, pseudo-code, rules, or programs alongside free-form text to guide the behavior of a language model:

Figure 5.10 – Program-aided language prompting examples

Figure sourced from https://arxiv.org/abs/2211.10435.
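A PAL-style sketch following the pattern of that paper (not its exact prompts): the demonstration answer is a short Python program, and the model's generated program is then executed locally to obtain the final answer:

pal_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. "
    "How many tennis balls does he have now?\n"
    "# Python program that computes the answer\n"
    "initial_balls = 5\n"
    "bought_balls = 2 * 3\n"
    "answer = initial_balls + bought_balls\n"
    "\n"
    "Q: A library had 120 books, lent out 45, and received 30 donations. "
    "How many books does it have now?\n"
    "# Python program that computes the answer\n"
)
# The model's completion is itself a small program; executing it yields the answer:
# scope = {}; exec(model_completion, scope); print(scope["answer"])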

In this section, we have explored only the most important prompt engineering techniques, not all of them; there are numerous variants of these techniques, as illustrated in the following figure from the research paper A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications (https://arxiv.org/pdf/2402.07927.pdf). This paper provides an extensive inventory of prompt engineering strategies across various application areas, showcasing the evolution and breadth of this field over the last four years:

Figure 5.11 – Taxonomy of prompt engineering techniques across multiple application domains