Pieter van Noordennen

A PM's Guide to Building with LLMs, Part 1: Basic Building Blocks.

Almost every product manager is being asked to be an 'AI Product Manager' these days. This series will get you started understanding what that means.


What are we doing with AI? —Every CEO everywhere

If you work in product or growth, chances are you’re being asked to think about how you can leverage AI (specifically, today’s crop of Large Language Models) to enhance your product.

While many PMs I know use ChatGPT daily, few understand the basic building blocks of creating AI-driven products — and, more importantly, how they differ from traditional software engineering approaches.

Today's LLMs and their associated agents are probabilistic models, a different paradigm from deterministic API calls and microservices. They require careful thought about use cases, UX choices, and pipeline construction to deliver results that users will love.

So how do you develop products when the outputs of your core systems are predictive?

To me, it starts with understanding how apps are created with LLMs. In this series, I’ll give a high-level overview of the systems, pipelines, and components used in building products with popular LLMs like GPT-4 and Claude.

**Note:** I'll finalize the Table of Contents once I'm deeper into the series. Things change so fast here in AI Builder Land that I'm one announcement away from obsolescence at all times.

So let’s get started.

User Queries and API Calls

[Workflow diagram: user query → API call → response]

Every good Comp Sci class starts with basic I/O. In the case of LLMs, our inputs are User Queries: a question or request the user wants the LLM to handle. Why is the sky blue? Can you create a diet plan for me (p.s., I don't like tomatoes)? Query the database and tell me how many customers are named 'Fred'.

Outputs are the responses from the LLM. Here we're specifically talking about text-to-text generation, though the basic principles apply to text-to-image, voice-to-text, and most other use cases for the current crop of Generative AI models.

And while chatbots are the preferred UX for many an AI application, product teams need to be a little craftier than just iFraming a ChatGPT window into their product if they want to wow users.

Enter the APIs. The major model companies like OpenAI and Anthropic have robust and constantly improving API interfaces to interact with models (and get charged on a per-token basis). For all other models, there’s Hugging Face, the open-source model repository that’s reared recent successes like MosaicML’s MPT and MistralAI’s suite of specialized small models.

We’ll get into model selection in the next section, but it’s worth knowing that most engineering teams will employ one of the open-source abstraction layers, LangChain and LlamaIndex being the two most popular, to manage programmatic API calls to the LLMs. These orchestration packages come with a raft of other useful capabilities for generating agents, using tools, and evaluating outputs. We’ll go deeper on these advanced capabilities in the next article.
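
To make that concrete, here's a minimal sketch of the query-in, response-out loop using the same LangChain wrapper that appears later in this article (it assumes your OpenAI API key is set in the environment, and the model name is just an example):

from langchain.llms import OpenAI

# The abstraction layer handles authentication, the HTTP call, and retries
llm = OpenAI(model_name="gpt-3.5-turbo-instruct")

# User query in, generated text out
response = llm("Why is the sky blue?")
print(response)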

The big cloud platforms are also getting into the AI DevX platform space: AWS with Bedrock, CloudFlare with Workers AI, NVIDIA with DGX … the list goes on.

While you likely won’t be making these types of tech platform choices, it’s good to study up on these tools if you want to have a productive conversation about your underlying tech stack.

Model Selection

[AI tech stack diagram]

I’ll make the assumption that you’re not on a product team that is spending half-a-million bucks to create your own Foundational Model. (If you are, I wish you luck — and please give me your thoughts on Model Evaluation.)

This means you'll be using someone else's model (GPT-4, Claude 2, Llama 2, etc.) to generate whatever outputs you're looking for, and fine-tuning it with your own data.

Model selection is a critical decision for a product team to make, for a few reasons.

  1. Cost: One key thing to remember about AI development is that the marginal costs aren't zero, unlike in most traditional software development, where we take near-zero marginal cost for granted. GPU shortages, usage-based pricing, and ease of use all allow model companies to command premiums for their models. Which foundation model (and associated ecosystem) you choose will be the single most impactful choice you make in terms of cost.
  2. Differentiation: Which model you choose and how you plan to tune it for your use case is also at the heart of differentiation. Specialized models continue to proliferate, but many successful products have launched simply with a novel use of GPT-4. And, more commonly, you’ll likely have a few (hundred) competitors creating a “thin GPT-4 wrapper” version of your product to simply try and out-market you. Be clear with your team what gives you a right to win, and choose a model that best supports that strategy.
  3. Use Case: LLMs are no different than other ML models in that they each have trade-offs depending on the use case. Some models tend to hallucinate more than others. Some are better at math. I know at least one that is trained on …

While pro/con commentary on specific models is outside the scope of this article (and would be out-of-date before I published it), here are a few ways of thinking about model selection criteria:

Open Source or Not: There's a lot of debate about how "open" open-source models really are, but it is true that there are models like Llama 2 or Dolly that are free to use and available through Hugging Face. That sounds appealing when you're spending thousands of dollars in monthly OpenAI bills. But the old adage that "open source is only free if your time has no value" applies here as well: the developer ecosystems around paid models tend to be better.

Data Privacy: A current sales tactic of some large cloud providers is to point out that OpenAI may use your internal data for training if you use ChatGPT. I think this is just FUD marketing, to be honest, as OpenAI's APIs and Enterprise licenses come with data privacy safeguards. Still, selecting a model is the same as buying any other piece of software: be careful who you entrust with your (users') data.

Large vs Small: LLMs come in two main sizes: Large models like GPT-4 and Claude 2 are believed to have parameter counts in the hundreds of billions or more (the exact figures aren't public); small models like MosaicML's MPT weigh in at closer to 7 billion parameters. Small-model proponents claim their performance rivals the large models and that they can be fine-tuned faster and less expensively. Large models seem to have the lion's share of the market now, but many investors (not to mention Databricks, which acquired MosaicML for $1.3 billion) are betting on small-model specialization. If you have highly specialized proprietary data and use cases, fine-tuning a small foundation model may be a worthy experiment. For all others, one of the major foundation models should suffice.

Specialized vs General Purpose: As mentioned, specialized models hold a lot of promise, though most builders I know tend towards the ease and performance of general purpose large models. Companies like MistralAI and Wizard are releasing some interesting models to try in areas like code generation and storytelling.

The bottom line in model selection is to trust your engineers and clearly define your use cases, data, and differentiation story.

If you feel like model shopping, the Hugging Face Open LLM Leaderboard is a good place to start.
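
If you do want to kick the tires on an open-source model, here's a rough sketch of what that looks like with Hugging Face's transformers library (the model name is just an example; most 7-billion-parameter models want a decent GPU, and some, like MPT, require trust_remote_code=True):

from transformers import pipeline

# Downloads the model weights from Hugging Face the first time it runs
generator = pipeline(
    "text-generation",
    model="mosaicml/mpt-7b-instruct",
    trust_remote_code=True,
)

result = generator("Why is the sky blue?", max_new_tokens=100)
print(result[0]["generated_text"])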

And if all of this has given you a sense of dread and panic, just use GPT-4 and get some sleep.

Model Configuration and Parameters

from langchain.llms import OpenAI

# temperature=0 keeps responses consistent; max_tokens caps the length of each completion
llm = OpenAI(temperature=0, max_tokens=400)

While LLMs come pre-trained, there are some common configuration parameters that can shape the model's output. The exact parameter names and behavior vary a bit across models, but generally there are three that shape your outputs and that you should be aware of when testing:

Model Temperature: Basically, how much randomness the model samples with and how varied responses should be. A temperature of 0 means responses will be more consistent, if a bit duller. A temperature of 1 means more variability but also a higher potential for inaccuracy. In text-to-image models like Midjourney, this is called Chaos.

Max Token Length: Tokens are the chunks of text the model reads and writes; in English, a token is roughly three-quarters of a word. Token counts matter for a couple of reasons: you're billed per token (on both input and output), and every model has a maximum context window, so overly long prompts and responses get truncated or rejected. (There's a quick token-counting sketch below.)

Presence Penalty: Less common. Scale of -2 to 2. This penalizes tokens that have already appeared in the response, nudging the model toward new topics; a higher presence penalty reduces repetition.
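
Because both billing and context limits are denominated in tokens rather than words, it helps to be able to count them. Here's a quick sketch using OpenAI's tiktoken library (the sample string is arbitrary):

import tiktoken

# Use the tokenizer that matches the model you're calling
encoding = tiktoken.encoding_for_model("gpt-4")

text = "What is the airspeed velocity of an unladen swallow?"
tokens = encoding.encode(text)
print(len(tokens))  # token count is usually a bit higher than the word count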

Model configuration parameters are something your engineering team can tweak to vary responses. Their impact is relatively limited compared to prompt engineering and fine-tuning, but they are a good place to start since they are easy to change.
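
Putting those parameters together, here's a small sketch using the same LangChain wrapper as above (the parameter names follow the OpenAI API; other providers expose similar knobs under different names):

from langchain.llms import OpenAI

# Higher temperature for more varied copy, a length cap, and a nudge away from repetition
creative_llm = OpenAI(temperature=0.9, max_tokens=400, presence_penalty=0.5)
print(creative_llm("Write three taglines for a birdwatching app."))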

Prompts

So much has been written on prompting techniques that I'm loath to even include it here. But the truth is that your prompts are likely to be the main driver of success or failure in your initial product development — they really do matter.

I'm no prompt expert myself, and there's still a lot of emerging research on new prompting techniques.

Basic prompts break into two categories: Zero Shot and Few Shot.

Zero-shot prompts give the LLM no context before asking it a question. This is like the first time you interact with a new chat on ChatGPT. It’s simple and easy to implement, but responses may lack expertise and specificity.

User Query: What is the airspeed velocity of an unladen swallow? 
GPT-4 Response: The question "What is the air-speed velocity of an unladen swallow?" is a famous comedic line from the 1975 film "Monty Python and the Holy Grail." In the film, the question is posed as a sort of riddle, and the answer is never given directly.

However, if we're talking about the actual bird, the European Swallow (Hirundo rustica), researchers have estimated that the average cruising airspeed for a European Swallow is roughly 11 meters per second.

Few Shot prompts provide examples of the type of response you want the LLM to give.

User Prompt

You are the character of King Arthur in Monty Python's the Holy Grail. Answer questions as if you are that character. 
Q: Stop. What... is your name?
A: It is 'Arthur', King of the Britons.
Q: What... is your quest?
A: To seek the Holy Grail.
Q: What... is the air-speed velocity of an unladen swallow?
A: 

Response

What do you mean? An African or European swallow?

In case you don't get the reference, it's from Monty Python and the Holy Grail; this will be important in the next set of articles.
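
In code, you'd typically build a few-shot prompt like that from a template rather than hard-coding it, so the user's question can be slotted in at runtime. Here's a minimal sketch with LangChain's PromptTemplate (the question variable is just illustrative):

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

template = """You are the character of King Arthur in Monty Python's the Holy Grail. Answer questions as if you are that character.
Q: Stop. What... is your name?
A: It is 'Arthur', King of the Britons.
Q: What... is your quest?
A: To seek the Holy Grail.
Q: {question}
A:"""

prompt = PromptTemplate(input_variables=["question"], template=template)
llm = OpenAI(temperature=0)

# The few-shot examples steer the model toward in-character answers
print(llm(prompt.format(question="What... is the air-speed velocity of an unladen swallow?")))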

There's other research on what types of prompts elicit the best responses: setting a role, saying please, Chain of Thought, Chain of Density. Your mileage may vary. Optimizing your prompts is a good use of time early on, but it reaches a point of diminishing returns.

My best advice on prompts is to get them to a "good enough" place so you can put outputs in front of users, and to set up a staging/sandbox environment to experiment with new prompts.

Wrapping Up

Hopefully you now have a better understanding of what lives under the hood of an LLM-based application (as opposed to just using a chatbot).

In the next article, I’ll cover more advanced techniques, like using your own data to tailor responses, shaping the outputs you get from the LLM, and chaining together queries and API calls in a pipeline to improve results.

Thanks for reading and keep building.

Was this helpful? Let me know and please share!