Andrew Ng Agentic AI: Lesson 1
Lesson 1 of Professor Andrew Ng's Agentic AI course
What is Agentic AI and why are Agentic AI workflows so powerful?
The way that many of us use large language models or LLMs today is by prompting it to, say, write an essay for us on a certain topic X. I think of that as akin to going to a human—or in this case, going to an AI—and asking it to please type out an essay for me by writing from the first word to the last word all in one go and without ever using backspace.
It turns out that we as people, we don't do our best writing like that by being forced to write in this completely linear order—and nor do AI models. But despite the difficulty of being constrained to write in this way, our LLMs do surprisingly well.
In contrast, with an agentic workflow, this is what the process might look like: You may ask it to first write an essay outline on a certain topic, then ask if it needs to do any web research. And after doing some web research and maybe downloading some web pages, then to write the first draft and then to read the first draft and see what parts need revision or more research, and then revise the draft and so on.
This type of workflow is more akin to doing some thinking and some research, then doing some revision, and then doing some more thinking and so on. And with this iterative process, it turns out that an agentic workflow can take longer, but it delivers a much better work product.
So an agentic AI workflow is a process where an LLM-based app executes multiple steps to complete a task. In this example, you might use an LLM to write the first essay outline and then you might use an LLM to decide what search terms to type into a web search engine—or really, what search terms to call a web search API with—in order to get back relevant web pages. Based on that, you can feed the downloaded web pages into an LLM to have it write the first draft and then maybe use another LLM to reflect and decide what needs more revision.
Depending on how you design this workflow, perhaps you may even add a human-in-the-loop step where the LLM has the option to request human review, maybe of some key facts. And based on that, it may then revise the draft and this process results in a much better work output.
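The essay workflow described above can be sketched in a few lines. This is only a rough illustration under stated assumptions: `llm` and `search` are hypothetical callables you would supply (any chat-completion wrapper and any web search wrapper), the prompts are made up for this example, and the optional human review step is left out.

```python
from typing import Callable, List

# Assumed interfaces (not from the course): an LLM maps a prompt to text,
# and a search function maps a query to a list of page texts.
LLM = Callable[[str], str]
Search = Callable[[str], List[str]]


def write_essay(topic: str, llm: LLM, search: Search, max_revisions: int = 1) -> str:
    """Outline -> search terms -> fetch pages -> first draft -> reflect and revise."""
    outline = llm(f"Write an essay outline on: {topic}")

    # Ask the LLM for search terms, one per line, then call the search tool.
    queries = llm(f"List web search terms, one per line, for this outline:\n{outline}")
    pages: List[str] = []
    for query in queries.splitlines():
        if query.strip():
            pages.extend(search(query.strip()))

    draft = llm(
        "Write a first draft.\nOutline:\n" + outline + "\nSources:\n" + "\n".join(pages)
    )

    # Reflection loop: critique the draft, then revise it.
    for _ in range(max_revisions):
        critique = llm(f"Which parts of this draft need revision?\n{draft}")
        draft = llm(f"Revise the draft.\nDraft:\n{draft}\nCritique:\n{critique}")
    return draft
```

With `max_revisions=1` this makes five LLM calls instead of one, which is why it takes longer than direct generation.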
One of the key skills you learn in this course is how to take a complex task like writing an essay and break it down into smaller steps for agentic workflows to execute one step at a time to then get the work output that you want. And knowing how to decompose the task into steps and how to build the components to execute the individual steps well turns out to be a tricky but important skill that will determine your ability to build agentic workflows for a huge range of exciting applications.
In this course, a running example that we'll use, and something that you'll build alongside me, is a research agent. So here's an example of what it will look like: You can enter a research topic like “How do I build a new rocket company to compete with SpaceX?” I don't personally want to compete with SpaceX, but if you want to, you can try asking a research agent to help with your background research.
So this agent starts by planning out what research to do, including calling a web search engine to download some web pages, then synthesizes and ranks findings, drafts an outline, has an editor agent review for coherence, and finally generates a comprehensive markdown report, which it has done here: building a new rocket company to compete with SpaceX, with an intro, background, findings, and so on.
I think it points out appropriately that this is going to be a tough startup to build, so I'm not personally planning to do this, but if you want to tackle something like this, maybe a research agent like this could help you with some initial research. And by finding and downloading multiple sources and deeply thinking about it, this actually ends up with a much more thoughtful report than just prompting an LLM to write an essay for you would.
One of the reasons I'm excited about this is because in my work, I've ended up building quite a few specialized research agents, be it in legal documents for complex legal compliance, or for some healthcare sectors, or some business product research areas. And so I hope that by working through this example, you not only learn how to build agentic workflows for many other applications, but that some of the ideas in building research agents will be directly useful to you if you ever need to build a custom research agent yourself.
Now, one of the often discussed areas of AI agents is how autonomous are they? What you just saw here was a relatively complex, highly autonomous agentic AI workflow, but there are also other simpler workflows that are incredibly valuable.
Let's go on to the next video to talk about the degree to which agentic workflows can be autonomous, which will give you a framework to think about how you might go about building different applications and how easy or difficult they might be.
See you in the next video.
Degrees of Autonomy
Agents can be autonomous to different degrees. A few years ago, I noticed within the AI community that there was a growing, controversial debate about what an agent is: some people would write a paper saying, I built an agent, and others would say, no, that's not really a true agent. And I felt this debate was unnecessary, which is why I started using the term agentic, because I thought if we use it as an adjective rather than a binary (it's either an agent or not), then we're going to have to acknowledge that systems can be agentic to different degrees.
Let's just call it all agentic and move on with the real work of building these systems, rather than debating, you know, is this sufficiently autonomous to be an agent or not?
I remember when I prepared a talk on agentic reasoning, one of my team members actually came to me and said, hey, Andrew, we don't need yet another word. You know, we have agent, why are you making up another word, agentic? But I decided to use it anyway. And then later on, I wrote an article in the DeepLearning.AI newsletter, The Batch, and also posted on social media, saying that instead of arguing over which work to include or exclude as being a true agent, let's acknowledge the different degrees to which systems can be agentic. And I think this helped move past the debate on what is a true agent and let us just focus on actually building them.
Some agents can be less autonomous. So take the example of writing an essay about black holes. You can have a relatively simple agent to come up with a few web search terms or web search queries. Then you can hard code in that you call a web search engine, fetch some web pages, and then use that to write an essay. And this is an example of a less autonomous agent with a fully deterministic sequence of steps. And this will work okay.
In terms of notational convention, throughout this course, I'll use the red color, as you see here on the left, to denote the user input, such as a user query in this case, or in later examples, maybe the input document into an agentic workflow. The gray boxes denote calls to an LLM, and the green boxes, like the web search and the web fetch boxes that you see here, indicate steps where other software is being used to carry out an action, such as a web search API call or executing code to fetch the contents of a website.
Then an agent can be more autonomous, where, given a request to write an essay about black holes, perhaps you let the LLM decide: does it want to do a web search, or does it want to search recent news sources, or does it want to search for recent research papers on the website arXiv? Based on that, maybe in this example, the LLM (not the human engineer, but the LLM) chooses to call a web search engine, and then after that, you may let the LLM decide how many web pages it wants to fetch, or, if it fetches a PDF, whether it needs to call a function, also called a tool, to convert the PDF to text. And in this case, maybe it fetches the top few web pages, then it can write an essay, decide whether to reflect and improve, and maybe even go back to fetch more web pages, and then finally produce an output.
And so even for this example of a research agent, we can see that some agents can be less autonomous, with a linear sequence of steps to be executed, determined by a programmer, and some can be more autonomous, where you trust the LLM to make more decisions, and the exact sequence of steps that happens may be even determined by the LLM, rather than in advance by the programmer.
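To make the contrast concrete, here is a minimal sketch of the two ends described above. Everything in it is an assumption for illustration: `llm`, `web_search`, `web_fetch`, and the entries in `tools` are hypothetical callables, and the `'TOOL_NAME: input'` reply format is just one simple way an LLM's choice could be parsed.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]   # assumed: prompt in, text out
Tool = Callable[[str], str]  # assumed: string input, string result


def less_autonomous(topic: str, llm: LLM, web_search: Tool, web_fetch: Tool) -> str:
    """Fixed sequence chosen by the programmer; the LLM only generates text."""
    query = llm(f"Give one web search query for an essay on {topic}")
    url = web_search(query)   # hard-coded: always search
    page = web_fetch(url)     # hard-coded: always fetch the result
    return llm(f"Write an essay on {topic} using:\n{page}")


def more_autonomous(topic: str, llm: LLM, tools: Dict[str, Tool], max_steps: int = 5) -> str:
    """The LLM picks the next tool (or FINISH) at every step."""
    notes: List[str] = []
    for _ in range(max_steps):
        choice = llm(
            f"Task: essay on {topic}. Notes so far: {notes}. "
            f"Reply with 'TOOL_NAME: input' using one of {sorted(tools)}, or 'FINISH'."
        ).strip()
        if choice == "FINISH":
            break
        name, _, arg = choice.partition(":")
        tool = tools.get(name.strip())
        if tool is None:
            # Record the bad choice so the LLM can see it on the next step.
            notes.append(f"unknown tool {name.strip()!r}")
            continue
        notes.append(tool(arg.strip()))
    return llm(f"Write an essay on {topic} using these notes:\n" + "\n".join(notes))
```

In the first function the order of steps is visible in the code. In the second it only exists at run time, which is where both the flexibility and the unpredictability come from.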
So for less autonomous systems, you will usually have all the steps predetermined in advance, and any functions it calls, like web search (we'll call that tool use, as you'll learn in the third module of this course), might be hard-coded by the human engineer, by you or me, and most of the autonomy is in what text the LLM generates.
At the other end of the spectrum would be highly autonomous agents, where the agent makes many decisions autonomously, including, for example, deciding what is the sequence of steps it will carry out in order to write the essay. And there are some highly autonomous agents that can even write new functions, or sometimes create new tools that they can then execute.
And somewhere in between are semi-autonomous agents, where it can make some decisions, choose tools, but the tools are usually more predefined.
As you look at different examples in this course, you'll learn how to build applications anywhere on this spectrum of less to more highly autonomous, and you'll find that there are tons of applications at the less autonomous end of the spectrum that are very valuable and are being built for tons of businesses today. At the same time, there are also applications being worked on at the more highly autonomous end of the spectrum, but those are usually less easily controllable and a little bit more unpredictable, and there's also a lot of active research to figure out how to build these more highly autonomous agents.
And with that, let's go on to the next video to dive deeper into this and to hear about some of the benefits of using agents and why they allow us to do things that just were not possible with earlier generations of LLM-based applications.
Benefits of Agentic AI
I think the one biggest benefit of agentic workflows is that they allow you to do many tasks effectively that just previously were not possible. But there are other benefits as well, including parallelism that lets you do certain things quite fast, as well as modularity that lets you combine best-of-breed components from many different places to build an effective workflow. Let's take a look.
My team collected some data on a coding benchmark that tests the ability of different LLMs to write code to carry out certain tasks. The benchmark used in this case is called HumanEval, and it turns out that GPT-3.5 (the model that the first publicly available version of ChatGPT was based on), if asked to write the code directly, to just type out the computer program, gets 40% right on this benchmark. This is the pass@k metric. GPT-4 is a much better model. Its performance leaps to 67% with this also non-agentic workflow. But it turns out that as large as the improvement was from GPT-3.5 to GPT-4, that improvement is dwarfed by what you can achieve by wrapping GPT-3.5 within an agentic workflow.
Using different agentic techniques, which you'll learn about later in this course, you can prompt GPT-3.5 to write code and then maybe reflect on the code and figure out if you can improve it. And using techniques like that, you can actually get GPT-3.5 to reach much higher levels of performance. And similarly, GPT-4 used in the context of an agentic workflow also does much better. So even with today's best LLMs, an agentic workflow lets you get much better performance.
In fact, what we saw in this example was the improvement from one generation of model to another, which is huge, is still not as big a difference as implementing an agentic workflow on the previous generation of model.
Another benefit of using agentic workflows is that they can parallelize some tasks and thus do certain things much faster than a human. For example, if you ask an agentic workflow to write an essay about black holes, you might be able to have three LLMs run in parallel to generate ideas for web search terms to type into the search engine. Based on the first web search, it may identify, say, three top results to fetch. And based on the second web search, it may identify a second set of web pages to fetch and so on.
And it turns out that whereas a human doing this research would have to read these nine web pages sequentially or one at a time, when you're using an agentic workflow, you can actually parallelize all nine web page downloads and then finally feed all these things into an LLM to write an essay. So even though agentic workflows do take longer than truly non-agentic workflows or by direct generation by just prompting a single time, if you were to compare this type of agentic workflow to how a human would have to go about the task, the ability to parallelize downloading lots of web pages can actually let it do certain tasks much faster than the non-parallel sequential way that a single human might process this data.
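The parallel download step can be sketched with Python's standard `concurrent.futures` module. This is a minimal illustration, not the course's implementation: `fetch` is a hypothetical function you would supply (for example, a wrapper around an HTTP client), and the default of nine workers simply mirrors the nine pages in the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fetch_all(urls: List[str], fetch: Callable[[str], str], max_workers: int = 9) -> List[str]:
    """Fetch every URL concurrently; results come back in the same order as `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even though the calls overlap in time.
        return list(pool.map(fetch, urls))
```

Threads suit this step because downloading is I/O-bound: while one request waits on the network, the others keep going. The combined results can then be fed to the LLM in a single prompt.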
To build on this example, it turns out one of the things I often do when building agentic workflows is look at the individual components like the LLM and add or swap out components. So for example, maybe I look at the web search engine I use up here and I might decide that I want to swap in a new web search engine. When building agentic workflows, there are actually multiple web search engines including Google, which you can access via Serper, as well as others like Bing, DuckDuckGo, Tavily, You.com. There are actually quite a lot of options for web search engines designed for LLMs to use.
Or maybe instead of just doing three web searches, maybe on this step we can swap in a new news search engine so we can find out what's the latest news on recent breakthroughs on black hole science. And lastly, instead of using the same LLM for all of the different steps, I will often try out different large language models and maybe try out different LLM providers to see which one gives the best result for different steps of this system.
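One way to keep components swappable is to pass them in rather than hard-code them. The sketch below is an assumption about how that might look, not something shown in the course: `Components`, `run`, and the field names are all hypothetical, and each field is just a callable you would wrap around a real search API or LLM provider.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Components:
    """The pluggable pieces of the workflow (hypothetical names)."""
    search: Callable[[str], List[str]]  # web search, news search, ...
    query_llm: Callable[[str], str]     # model used to generate search terms
    writer_llm: Callable[[str], str]    # model used to write the essay


def run(topic: str, c: Components) -> str:
    """Same workflow regardless of which search engine or models are plugged in."""
    query = c.query_llm(f"Search query for: {topic}")
    sources = c.search(query)
    return c.writer_llm(f"Essay on {topic} from: {sources}")


def with_search(base: Components, new_search: Callable[[str], List[str]]) -> Components:
    """Return a copy of `base` with only the search engine swapped."""
    return replace(base, search=new_search)
```

Because only one field changes at a time, you can compare the output of two configurations and attribute any difference to the component you swapped.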
So to summarize, the main reason I use agentic workflows is that they just give much better performance on many different applications. But in addition, they can also parallelize some tasks that humans would otherwise have to do sequentially. And the modular design of many agentic workflows also lets us add or update tools and sometimes swap out models.
We've talked a lot about the key components of building agentic workflows. Let's now take a look at a range of Agentic AI applications to give you a sense of the sorts of things people are already building and the sorts of things you'll build yourself. Let's go on to the next video.
Agentic AI Applications
Let's take a look at some examples of Agentic AI applications.
One task that many businesses carry out is invoice processing. So given an invoice like this, you might want to write software to extract the most important fields, which for this application, let's say is the biller, that would be TechFlow Solutions, the biller address, the amount due, which is $3,000, and the due date, which looks like it is August 20th, 2025. So in many finance departments, maybe a human would look at invoices and identify the most important fields, who do we need to pay by when, and record these in a database to make sure that payment is issued in time.
If you were to implement this with an agentic workflow, you might do so like this: You would input an invoice, then call a PDF-to-text conversion API to turn the PDF into maybe formatted text, such as markdown text, for the LLM to ingest. Then the LLM will look at the PDF and figure out, is this actually an invoice or is this some other type of document that it should just ignore? And if it is an invoice, then it will pull out the required fields as well as use an API or use a tool to update the database in order to save the most important fields in the database records.
So one aspect of this agentic workflow is that there is a clear process to follow: identify the required fields and record them in the database. Tasks like these, with a clear process you want followed, tend to be maybe easier for agentic workflows to carry out because it leads to a relatively step-by-step way to reliably carry out this task.
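The invoice steps above map onto code fairly directly. The following is a rough sketch under several assumptions: `pdf_to_text`, `llm`, and `save_record` are hypothetical callables standing in for a conversion API, an LLM, and a database tool; the field names and prompts are invented for illustration; and the LLM is assumed to return valid JSON when asked.

```python
import json
from typing import Callable, Dict, Optional

# Field names are assumptions chosen to match the example invoice.
REQUIRED_FIELDS = ("biller", "biller_address", "amount_due", "due_date")


def process_invoice(
    pdf_bytes: bytes,
    pdf_to_text: Callable[[bytes], str],
    llm: Callable[[str], str],
    save_record: Callable[[Dict[str, str]], None],
) -> Optional[Dict[str, str]]:
    """Convert the PDF, check that it is an invoice, extract fields, save them."""
    text = pdf_to_text(pdf_bytes)

    # Step 1: ignore documents that are not invoices.
    verdict = llm(f"Is this document an invoice? Answer YES or NO.\n{text}")
    if verdict.strip().upper() != "YES":
        return None

    # Step 2: extract the required fields (json.loads raises on malformed output).
    raw = llm(f"Extract {list(REQUIRED_FIELDS)} as a JSON object.\n{text}")
    fields = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    # Step 3: record only the required fields in the database.
    record = {name: fields[name] for name in REQUIRED_FIELDS}
    save_record(record)
    return record
```

Here the programmer decides when the database tool is called, which puts this sketch at the less autonomous end of the spectrum.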
Here's another example, maybe just a little bit harder: So if you want to build an agent to respond to basic customer order inquiries, then the steps might be to extract the key information, so figure out what exactly did the customer order, what's the customer's name, then look up the relevant customer records, and then finally draft a response for a human to review before the email response is sent to the customer.
So again, there's a clear process here and we will implement this step-by-step, where we take the email, feed it to an LLM to verify or to extract the order details, and assuming the customer email is about an order, the LLM might then choose to call an orders database to then pull up that information. That information then goes to the LLM to then draft an email response, and the LLM might choose to use a request review tool that, say, puts this draft email from the LLM into a queue for humans to review, so it can then be sent out after a human has reviewed and approved it.
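The human review step can be pictured as a queue that sits between the LLM and the outgoing mail. This is a minimal sketch with hypothetical names throughout (`Draft`, `ReviewQueue`, `handle_order_email`, and the `llm`, `lookup_order`, and `send` callables); a real system would persist the queue and record who approved each draft.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Optional


@dataclass
class Draft:
    to: str
    body: str
    approved: bool = False


class ReviewQueue:
    """Holds drafts until a human approves them (hypothetical request-review tool)."""

    def __init__(self) -> None:
        self._pending: Deque[Draft] = deque()

    def request_review(self, draft: Draft) -> None:
        self._pending.append(draft)

    def approve_next(self, send: Callable[[Draft], None]) -> Draft:
        """Called on behalf of the human reviewer; only now is the email sent."""
        draft = self._pending.popleft()
        draft.approved = True
        send(draft)
        return draft


def handle_order_email(
    sender: str,
    email_text: str,
    llm: Callable[[str], str],
    lookup_order: Callable[[str], str],
    queue: ReviewQueue,
) -> Optional[Draft]:
    """Extract the order, look it up, draft a reply, and queue it for review."""
    order_id = llm(
        f"Extract the order number from this email, or reply NONE.\n{email_text}"
    ).strip()
    if order_id == "NONE":
        return None  # not about an order; nothing to draft
    record = lookup_order(order_id)
    body = llm(
        f"Draft a reply to the customer.\nEmail:\n{email_text}\nOrder record:\n{record}"
    )
    draft = Draft(to=sender, body=body)
    queue.request_review(draft)
    return draft
```

Nothing in `handle_order_email` can send mail on its own; sending happens only inside `approve_next`.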
Customer order inquiry agents like these are being built and deployed in many businesses today.
To look at a more challenging example, if you want to build a customer service agent to respond not just to questions about an order they placed, but to respond to a more general set of questions, anything a customer may ask, then maybe the customer will ask, do you have any black jeans or blue jeans? And to answer this question, you need to maybe make multiple API calls to your database to first check the inventory for black jeans, then check inventory for blue jeans, and then respond to the customer.
So this is an example of a more challenging query, where given a user input, you actually have to plan out the sequence of database queries to check for inventory. Or if a user asks, I'd like to return the beach towel I bought, then to answer this, maybe we need to verify that the customer actually bought a beach towel, and then double check the return policy. Maybe our policy allows returns only within 30 days of the date of purchase, and only if the towel is unused. And if the return is allowed, then have the agent issue a return packing slip, and also set the database record to return pending.
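The three return steps (verify the purchase, check the policy, issue the slip and update the record) could each be a small tool that the agent decides to call. The sketch below shows only those tools, not the LLM planning around them, and rests on assumptions: the in-memory `orders` dictionary, its keys, and the returned messages are all hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, Tuple

RETURN_WINDOW = timedelta(days=30)  # from the example policy: 30 days, unused only


def return_allowed(purchase_date: date, today: date, unused: bool) -> bool:
    """Policy check: item is unused and within 30 days of purchase."""
    return unused and timedelta(0) <= today - purchase_date <= RETURN_WINDOW


def process_return(
    customer_id: str,
    item: str,
    orders: Dict[Tuple[str, str], dict],
    today: date,
    unused: bool,
) -> str:
    """Verify the purchase, check the policy, then mark the order as return pending."""
    order = orders.get((customer_id, item))
    if order is None:
        return "no matching purchase found"
    if not return_allowed(order["purchased"], today, unused):
        return "return not allowed under policy"
    order["status"] = "return pending"
    return f"return packing slip issued for {item}"
```

Keeping the policy in plain code rather than in a prompt means the 30-day rule is applied the same way every time, while the LLM is left to decide which of these steps a given customer message calls for.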
So in this example, if the required steps to process the customer request are not known ahead of time, then it results in a more challenging process, where the LLM-based application has to decide for itself that these are the three steps needed in order to respond appropriately to this task. But you'll learn about some of the latest work on how to approach this type of problem too.
And to give one last example of maybe an especially difficult type of agent to build, there's a lot of work on computer use by agents, in which agents will attempt to use a web browser and read a web page to figure out how to carry out a complex task. In this example, I've asked an agent to check whether seats are available on two specific United Airlines flights from San Francisco to Washington DC, or the DCA airport. The agent has access to a web browser it can use to carry out this task.
And in the video here, you can see it navigating the United website independently, clicking on page elements and filling in the text fields on the page to carry out the search that I requested. As it works, the agent reasons over the content of the page to figure out the actions it needs to take to complete the task, and what it should do next.
In this case, the agent has some trouble checking flights on the United site, and instead decides to navigate to the Google Flights website to search for available flights. On Google Flights, you see here it finds several flight options that match the user's query, and the agent then picks one and is taken back to the United website, where it looks like it's now on the correct web page, and so is able to determine that yes, there are seats available on the flights that I asked about.
So computer use is an exciting cutting-edge area of research right now, and many companies are trying to get computer use agents to work. While the agent you saw here did eventually figure out the answer, I often see agents having trouble using web browsers well. For example, if a web page is slow to load, an agent may fail to understand what's going on, and many web pages are still beyond agents' abilities to parse or to read accurately. But I think computer use agents, even though not yet reliable enough to use in mission-critical applications today, are an exciting and important area of future development.
So when I'm considering building Agentic AI workflows, the tasks that are easier will tend to be ones where there is a clear step-by-step process, or where a business already has a standard procedure, a standard operating procedure, to follow. It can be quite a lot of work to take that procedure and codify it in an AI agent, but that tends to lead to easier implementations.
One thing that makes it easier is if you are using text-only assets, because LLMs, or language models, have grown up really processing text, and if you need to process other input modalities, it may well be doable, but it maybe gets a little bit harder. And on the harder end of the spectrum, if the steps needed to carry out a task are not known ahead of time, like you saw for the more advanced customer service agent, then the agent may need to plan or solve as it goes, and this tends to be harder and more unpredictable and less reliable. And then as mentioned, if it needs to accept rich multimodal inputs such as sound, vision, audio, that also tends to be less reliable than if it only had to process text.
So I hope that gives you a sense of the types of applications you might build with agentic workflows. When implementing one of these things yourself, one of the most important skills is to look at a complex workflow and figure out what are the individual steps so you can implement an agentic workflow to execute those steps one at a time.
In the next video, we'll talk about task decomposition, that is, given a complex thing you want to do, like write a research report or have a customer agent get back to customers, how do you break that down into discrete steps to try to implement an agentic workflow? Let's go see that in the next video.
Task Decomposition: Identifying the steps in a workflow
People and businesses do a lot of stuff. How do you take this useful stuff that we do and break it down into discrete steps for the agentic workflow to follow? Let's take a look.
Take the example of building a research agent. If you want an AI system to write an essay on a topic X, one thing you could do is prompt an LLM to have it generate an output directly. But if you were to do this for topics that you want deeply researched, you may find that the LLM output covers only the surface level points, or maybe covers only the obvious facts, but doesn't go as deep into the subject as you want it to.
In this case, you might then reflect on how you as a human would write an essay on a certain topic. Would you just sit down and start writing, or would you take multiple steps, such as first write an essay outline, and then search the web, and then based on the input from the web search, write the essay.
As I take a task and decompose it into steps, one question I'm always asking myself is, if I look at these steps one, two, and three, can each of them be done either by an LLM, or by a short piece of code, or by a function call, or by a tool? In this case, I think an LLM can maybe write a decent outline on many topics that I would want it to help me think through. So I'd say we're probably okay on the first step, and then I know how to use an LLM to generate search terms to search the web. So, I would say the second step is also doable, and then based on web search, I think an LLM could take the web search results as input and write an essay. And so, this would be a reasonable first attempt at an agentic workflow for writing an essay that goes deeper than just direct generation.
But if I were to then implement this agentic workflow and look at the results, maybe you find that the results still aren't good enough. It's still not yet as deeply thoughtful. Maybe the essays feel a little bit disjointed. This has actually happened to me. I once built a research agent using this workflow, but when I read the output, it felt a bit disjointed. You know, the start of the article didn't feel completely consistent with the middle, didn't feel completely consistent with the end.
In this case, what you might do is then reflect on how you would change the workflow if you as a human found that the essay is a little bit disjointed. One thing you could do is take the third step and further decompose "write the essay" into additional steps. So, instead of writing the essay in one go, you might instead have it write the first draft, and then consider what parts need revision, and then revise the draft. And this would be how I as a human might go about it, to not just write the final essay at my first attempt, but write the first draft and then read over it, which is another step that the LLM is pretty decent at. And then based on my own critique of my own essay, I'll revise the draft.
So to recap, I started off with direct generation, just one step, decided it wasn't good enough, and so broke that down into three steps, and then maybe decided that still isn't good enough, and took one of the steps and further broke it down or decomposed it into three more steps, resulting in this more complex, richer process for generating an essay. And depending on how satisfied you are with the results of this process, you may choose to even modify this essay generation process further.
Let's look at the second example of how to decompose complex tasks into smaller steps. Take the example of responding to basic customer order inquiries. The first step that a human customer service specialist might carry out might be to first extract the key information, such as who is this email from, what did they order, and what is the order number. And these are things that an LLM could do. So I could just say, let's have an LLM do that. The second step would be to then find the relevant customer records. So to write and generate the relevant database queries to pull up the order of what the customer had ordered and when it shipped and so on. I think an LLM with the ability to call a function to query the orders database should be able to do that. And lastly, having pulled up the customer record or the customer order record, I might then write and send a response back to the customer. And I think with the information we pulled up, this third step is also doable with an LLM if I give it the option to call an API to send an email.
So this would be another example of taking a task of responding to customer email and breaking it down into three individual steps where I can look at each of these steps and say, yep, I think an LLM or one LLM with the ability to call a function to query a database or send an email should be able to do that.
Just one last example, for invoice processing. After a PDF invoice has been converted to text, the first step is to pull out the required information, the name of the biller, the address, the due date, the amount due, and so on. An LLM should be able to do that. And then if I want to check that the information was extracted and save it in a new database entry, then I think an LLM should be able to help me call a function to update the database record. And so to implement this, we implement an agentic workflow to carry out basically these two steps.
When building agentic workflows, I think of myself as having a number of building blocks. One important building block would be large language models or maybe large multimodal models if I want to try to process images or audio as well. And LLMs are good at generating text, deciding what to call, maybe extracting information. For some highly specialized tasks, I might also use some other AI models, such as an AI model for converting a PDF to text or for text-to-speech or for image analysis.
In addition to AI models, I also have access to a number of software tools, including different APIs that I can call to do web search, to get maybe real-time weather data, to send emails, check calendar, and so on. And I might also have tools to retrieve information, to pull up data from a database, or to implement RAG, or retrieval augmented generation, where I can look up a large text database and find the most relevant text. Or I might also have tools to execute code. And this is a tool that lets an LLM write code and then run the code on your computer to do a huge range of things.
In case some of these tools seem a bit foreign to you, don't worry about it. We'll go through the most important tools in much greater detail in a later module. But I think of a lot of my work when I'm building an agentic workflow as looking at the work that the person or business is doing and then trying to figure out with these building blocks, how can I sequence these building blocks together in order to carry out the tasks that I want my system to carry out. And this is why having a good understanding of what building blocks are available, which I hope you have a better sense of by the end of this course as well, will allow you to better envision what agentic workflows you can build by combining these building blocks together.
So to summarize, one of the key skills in building agentic workflows is to look at a bunch of stuff that maybe someone does and to identify the discrete steps that it could be implemented with. And when I'm looking at the individual discrete steps, one question I'm always asking myself is, can this step be implemented with either an LLM or with one of the tools such as an API or a function call that I have access to? And in case the answer is no, I'll then often ask myself, how would I as a human do this step? And is it possible to decompose this further or break this down into even smaller steps that then maybe is more amenable to implementation with an LLM or with one of the software tools that I have?
So I hope this gives you a rough sense of how to think about task decomposition. In case you feel like you don't fully have it yet, don't worry about it. We'll go through many more examples in this course and you'll have a much better understanding of this by the end of this course. But it turns out that as you build agentic workflows, you'll find that often you build an initial task decomposition, an initial agentic workflow, and then you want to keep on iterating and improving on it quite a few times until it delivers the level of performance that you want.
And to drive this improvement process, which I've found important for many projects, one of the key skills is to know how to evaluate your agentic workflow. So in the next video, we'll talk about evaluations or evals, a key component of how you can build, and then also keep on improving, your workflows to get the performance that you want. Let's talk about evals in the next video.
Evaluating Agentic AI (evals)
I've worked with many different teams on building agentic workflows, and I've found that one of the biggest predictors for whether someone is able to do it really well versus be less efficient at it is whether or not they're able to drive a really disciplined evaluation process. So, your ability to drive evals for your agentic workflow makes a huge difference in your ability to build them effectively.
In this video, we'll take a quick overview of how to build evals, and this is a subject that we'll actually go into much deeper in a later module in this course. So, let's take a look.
After building an agentic workflow like this one for responding to customer order inquiries, it turns out that it's very difficult to know in advance what could go wrong. And so, rather than trying to build evaluations in advance, what I recommend is that you just look at the outputs and manually look for things that you wish it was doing better.
For example, maybe you read a lot of outputs and find that it is unexpectedly mentioning your competitors more than it should. Many businesses don't want their agents to mention competitors because it just creates an awkward situation. And if you read some of these outputs, maybe you find that it sometimes says, "I'm glad you shopped with us. We're much better than our competitor, ComproCo." Or maybe sometimes it says, "Sure, it should be fun. Unlike RivalCo, we make returns easy." And you may look at this and go, gee, I really don't want this to mention competitors.
This is an example of a problem that is really hard to anticipate in advance of building this agentic workflow. So, the best practice is really to build it first and then examine it to figure out where it is not yet satisfactory, and then to find ways to evaluate as well as improve the system to eliminate the ways that it is still not yet satisfactory.
Assuming your business considers it an error or a mistake to mention competitors in this way, then as you work on eliminating these competitor mentions, one way to track progress is to add an evaluation, or an eval, to track how often this error occurs. So, if you have a named list of competitors like ComproCo, RivalCo, and the other co, then you can actually write code to search your outputs for mentions of these competitors by name and count, as a fraction of the overall responses, how frequently the workflow mistakenly mentions competitors.
One nice thing about the problem of competitor mentions is it's an objective metric, meaning either the competitor was mentioned or not. And for objective criteria, you can write code to check for how often this specific error occurs.
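As a rough illustration of the objective check described above, here is a minimal Python sketch that counts what fraction of responses name a competitor. The competitor list, the function names, and the sample responses are assumptions made up for this example; a real eval would run over logged outputs from your own workflow.

```python
import re

# Hypothetical competitor names, borrowed from the example responses above.
COMPETITORS = ["ComproCo", "RivalCo"]


def mentions_competitor(response: str, competitors=COMPETITORS) -> bool:
    """Return True if the response names any competitor (case-insensitive, whole word)."""
    return any(
        re.search(rf"\b{re.escape(name)}\b", response, flags=re.IGNORECASE)
        for name in competitors
    )


def competitor_mention_rate(responses, competitors=COMPETITORS) -> float:
    """Fraction of responses that mention at least one competitor."""
    if not responses:
        return 0.0
    flagged = sum(mentions_competitor(r, competitors) for r in responses)
    return flagged / len(responses)


# Illustrative outputs only; these are not real logs.
sample_responses = [
    "I'm glad you shopped with us. We're much better than our competitor, ComproCo.",
    "Unlike RivalCo, we make returns easy.",
    "Your order shipped yesterday and should arrive on Friday.",
    "You can return the item within 30 days.",
]

print(competitor_mention_rate(sample_responses))  # 0.5
```

Tracking this one number across versions of the workflow gives a simple way to see whether changes are actually reducing competitor mentions.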
But because LLMs output free text, there are also going to be criteria by which you want to evaluate this output that may be more subjective and where it's harder to just write code to output a black and white score. In this case, using an LLM as a judge is a common technique to evaluate the output. So, for example, if you're building a research agent to do research on different topics, then you can use another LLM and prompt it to maybe, say, assign the following essay a quality score between 1 and 5, where 1 is the worst and 5 is the best essay.
Here, I'm using a Python expression to mean copy-paste the generated essay into this. So, you can prompt the LLM to read the essay and assign it a quality score. Then I'm going to ask the research agent to write a number of different research reports, for example, on recent developments in black hole science or using robots to harvest fruit. And then in this example, maybe the judge LLM assigns the essay on black holes a score of 3, the essay on robot harvesting a score of 4, and as you work on improving your research agent, hopefully you see these scores go up over time.
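The judge prompt described above might be sketched in Python as follows. This is only an illustration under stated assumptions: `call_llm` is a placeholder for whatever function sends a prompt to your model and returns its text reply, `fake_judge` is a stand-in so the example runs without any API, and the "reply with the number only" instruction and the score parsing are additions for this sketch, not something the lecture specifies.

```python
import re
from typing import Callable, Optional

# Prompt wording follows the lecture; the final instruction line is an assumption.
JUDGE_PROMPT = (
    "Assign the following essay a quality score between 1 and 5, "
    "where 1 is the worst and 5 is the best essay. "
    "Reply with the number only.\n\n"
    "Essay:\n{essay}"
)


def build_judge_prompt(essay: str) -> str:
    """Copy the generated essay into the judge prompt."""
    return JUDGE_PROMPT.format(essay=essay)


def parse_score(reply: str) -> Optional[int]:
    """Pull the first standalone digit from 1 to 5 out of the judge's reply."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge_essay(essay: str, call_llm: Callable[[str], str]) -> Optional[int]:
    """Ask the judge model for a score; returns None if no score is found."""
    return parse_score(call_llm(build_judge_prompt(essay)))


def fake_judge(prompt: str) -> str:
    # Stand-in for a real model call; it always answers with a 3.
    return "Score: 3"


print(judge_essay("Recent developments in black hole science ...", fake_judge))  # 3
```

Returning `None` when no score can be parsed keeps a malformed judge reply from silently counting as a low or high score.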
It turns out, by the way, that LLMs are actually not that good at these 1 to 5 scale ratings. You can give it a shot, but I personally tend not to use this technique that much myself. In a later module, you'll learn some better techniques for having an LLM output more accurate scores than asking it for a rating on a 1 to 5 scale, although some people will do this, maybe as an initial cut at an LLM-as-judge type of eval.
Just to give a preview of some of the agentic AI evals you'll learn about later in this course, you've already heard me talk about how you can write code to evaluate objective criteria, such as whether it mentioned a competitor or not, or use an LLM as a judge for more subjective criteria, such as the quality of an essay. But later, you'll learn about two major types of evals. One is end-to-end evals, where you measure the output quality of the entire agent; the other is component-level evals, where you might measure the quality of the output of a single step in the agentic workflow. It turns out that these are useful for driving different parts of your development process.
One thing I do a lot as well is just examine the intermediate outputs, or sometimes we call these the traces of the LLM, in order to understand where it is falling short of my expectations. And we call this error analysis, where we just read through the intermediate outputs of every single step to try to spot opportunities for improvement. And it turns out being able to do evals and error analysis is a really key skill.
So we'll have much more to say about this in the fourth module in this course.
We're nearly at the end of this first module. Before moving on, I just want to share with you what I think are the most important design patterns for building agentic workflows. Let's go take a look at that in the next video.
Agentic Design Patterns
We build agentic workflows by taking building blocks and putting them together to sequence out these complex workflows. In this video, I'd like to share with you a few of the key design patterns, which are patterns for how you can think about combining these building blocks into more complex workflows. Let's take a look.
I think four key design patterns for building agentic workflows are reflection, tool-use, planning, and multi-agent collaboration. Let me briefly go over what they mean, and then we'll actually go through most of these in-depth later in this course as well.
The first of the major design patterns is reflection. So I might go to an LLM agent and ask it to write code, and it turns out that an LLM might then generate code like this. It defines here a Python function to do a certain task. I could then construct a prompt that looks like this. I can say, here's code intended for a certain task, and then copy-paste whatever the LLM had just output back into this prompt. And then I ask it to check the code carefully for correctness, style, and efficiency, and give constructive criticism. And it turns out that the same LLM prompted this way may be able to point out some problems with the code. And if I then take this critique and feed it back to the model to say, looks like this is a bug, could you change the code to fix it? Then it may actually come up with a better version of the code.
To give a preview of tool use, if you're able to run the code and see where the code fails, then feeding that back to the LLM can also help it iterate and generate a much better, say, v3 (version 3) of the code. So reflection is a common design pattern where you can ask the LLM to examine its own outputs, or maybe bring in some external sources of information, such as running the code and seeing if it generates any error messages, and use that as feedback to iterate again and come up with a better version of its output. And this design pattern isn't magic. It does not result in everything working 100% of the time. But sometimes it can give a nice bump in the performance of your system.
Now, I've drawn this as if it was a single LLM that I'm prompting, but to foreshadow multi-agent workflows, instead of having the same model critique itself, you can imagine having a critique agent. And all that is, is an LLM that's been prompted with instructions like, your role is to critique code, here's code intended for a task, check the code carefully, and so on. And this second agent, the critique agent, may point out errors or run unit tests. And by having two simulated agents, where each agent is just an LLM prompted to take on a certain persona, you can have them go back and forth to iterate to get a better output.
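The generate, critique, and revise loop described above can be sketched roughly as follows. Everything here is an illustrative assumption: `generate` and `critique` stand for calls to an LLM (the same model or two differently prompted ones), the revise prompt wording is invented for this sketch, and the fake functions exist only so that the example runs without a model.

```python
from typing import Callable, List

# Critique prompt wording follows the lecture; the revise prompt is an assumption.
CRITIQUE_PROMPT = (
    "Here's code intended for a task:\n{code}\n\n"
    "Check the code carefully for correctness, style, and efficiency, "
    "and give constructive criticism."
)
REVISE_PROMPT = (
    "Here is the code:\n{code}\n\nHere is feedback on it:\n{critique}\n\n"
    "Rewrite the code to address the feedback."
)


def reflect(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str], str],
    rounds: int = 2,
) -> List[str]:
    """Produce a first draft, then alternate critique and revision `rounds` times."""
    draft = generate(task)
    history = [draft]
    for _ in range(rounds):
        feedback = critique(CRITIQUE_PROMPT.format(code=draft))
        draft = generate(REVISE_PROMPT.format(code=draft, critique=feedback))
        history.append(draft)
    return history


def make_fake_generator() -> Callable[[str], str]:
    """Stand-in for the coding model: returns a new version label on each call."""
    version = 0

    def generate(prompt: str) -> str:
        nonlocal version
        version += 1
        return f"# code v{version}"

    return generate


def fake_critic(prompt: str) -> str:
    # Stand-in for the critique model or critique agent.
    return "Possible bug: the empty-input case is not handled."


versions = reflect("Write a function that ...", make_fake_generator(), fake_critic)
print(versions)  # ['# code v1', '# code v2', '# code v3']
```

Passing the same model for both `generate` and `critique` corresponds to a single LLM reflecting on its own output; passing a separately prompted critic corresponds to the two-agent variant.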
In addition to the reflection pattern, the second important design pattern is tool use. Today, LLMs can be given tools, meaning functions that they can call in order to get work done. For example, if you ask an LLM, what's the best coffee maker according to reviewers, and you give it a web search tool, then it can actually search the internet to find much better answers. Or take a code execution tool. If you ask a math question like, if I invest $100 with compound interest, what do I have at the end? It can then write code and execute it to compute an answer. Today, different developers have given LLMs many different tools for everything from math or data analysis, to gathering information by fetching things from the web or from various databases, to interfacing with productivity apps like email, calendar, and so on, as well as processing images and much more. And the ability of an LLM to decide what tools to use, meaning what functions to call, lets the model get a lot more done.
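To make the idea of a tool as a callable function concrete, here is a small Python sketch. It is not how any particular LLM API represents tool calls: the JSON shape, the `TOOLS` registry, and `run_tool_call` are assumptions for illustration, and the model's output is hard-coded. The interest rate and number of years are also made-up values, since the question in the lecture does not specify them.

```python
import json


def compound_interest(principal: float, annual_rate: float, years: int) -> float:
    """Value of `principal` after `years` of annual compounding at `annual_rate`."""
    return round(principal * (1 + annual_rate) ** years, 2)


# Hypothetical registry mapping tool names to the functions the model may call.
TOOLS = {"compound_interest": compound_interest}


def run_tool_call(tool_call_json: str):
    """Parse a tool call emitted by the model and dispatch it to the named function."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


# Stand-in for what a model might emit after deciding to use the tool.
# The 5% rate and 10-year horizon are assumed values for this example.
model_output = (
    '{"name": "compound_interest", '
    '"arguments": {"principal": 100, "annual_rate": 0.05, "years": 10}}'
)

print(run_tool_call(model_output))  # 162.89
```

The key point is the division of labor: the model only decides which function to call and with what arguments, while ordinary code does the actual computation.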
The third of the four design patterns is planning. This is an example from a paper called HuggingGPT, in which you ask a system to please generate an image where a girl is reading a book and her pose is the same as the boy in the image, and then please describe the new image with your voice. A model can then automatically decide that to carry out this task, it first needs to find a pose determination model to figure out the pose of the boy, then a pose-to-image model to generate a picture of a girl, then an image-to-text model, and then finally a text-to-speech model. And so in planning, an LLM decides what sequence of actions it needs to take. In this case, it is a sequence of API calls, so that it can carry out the right steps in the right order to complete the task. So rather than the developer hard-coding the sequence of steps in advance, this actually lets the LLM decide what steps to take. Agents that plan today are harder to control and somewhat more experimental, but sometimes they can give really delightful results.
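One way to picture planning is as an LLM producing an ordered list of step names that ordinary code then executes. The sketch below is a toy illustration and not the HuggingGPT implementation: the four functions are string-returning stubs rather than real models, the plan is hard-coded where a real system would ask an LLM to produce it, and all names are hypothetical.

```python
from typing import Callable, Dict, List


# Stub "models": each just wraps its input in a label so the data flow is visible.
def pose_detection(image: str) -> str:
    return f"pose({image})"


def pose_to_image(pose: str) -> str:
    return f"image_of_girl_reading[{pose}]"


def image_to_text(image: str) -> str:
    return f"caption({image})"


def text_to_speech(text: str) -> str:
    return f"audio({text})"


MODELS: Dict[str, Callable[[str], str]] = {
    "pose_detection": pose_detection,
    "pose_to_image": pose_to_image,
    "image_to_text": image_to_text,
    "text_to_speech": text_to_speech,
}


def execute_plan(plan: List[str], initial_input: str) -> List[str]:
    """Run the steps in order, feeding each step's output into the next one."""
    outputs = []
    current = initial_input
    for step in plan:
        if step not in MODELS:
            raise ValueError(f"Plan refers to an unknown step: {step}")
        current = MODELS[step](current)
        outputs.append(current)
    return outputs


# Stand-in for a plan an LLM might return for the request in the example.
plan_from_llm = ["pose_detection", "pose_to_image", "image_to_text", "text_to_speech"]

results = execute_plan(plan_from_llm, "boy.jpg")
print(results[-1])  # audio(caption(image_of_girl_reading[pose(boy.jpg)]))
```

Validating each step name against a fixed registry is one simple guard against a plan that refers to a tool that does not exist, which matters because planned sequences are harder to predict than hard-coded ones.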
And then finally, multi-agent workflows. Just as a human manager might hire a number of others to work together on a complex project, in some cases it might make sense for you to hire a set of multiple agents, maybe each of which specializes in a different role, and have them work together to accomplish a complex task. The picture you see here on the left is taken from a project called ChatDev, which is a software framework created by Chen Qian and collaborators. In ChatDev, multiple agents with different roles, like chief executive officer, programmer, tester, designer, and so on, collaborate together as if they were a virtual software company and can collaboratively complete a range of software development tasks.
Let's consider another example. If you want to write a marketing brochure, maybe you think of hiring a team of three people, such as a researcher to do online research, a marketer to write the marketing text, and then finally an editor to edit and polish the text. And so in a similar way, you might consider building a multi-agent workflow in which you have a simulated research agent, a simulated marketer agent, and a simulated editor agent that then come together to carry out this task for you. Multi-agent workflows are more difficult to control since you don't always know ahead of time what the agents will do, but research has shown that they can result in better outcomes for many complex tasks, including things like writing biographies or deciding on chess moves to make in a game. You'll learn more about multi-agent workflows later in this course as well.
And so with that, I hope you have a sense of what agentic workflows can do, as well as of the key challenges of finding building blocks and putting them together, maybe via these design patterns, in order to implement an agentic workflow. And of course, also developing evals so you can see how well your system is doing and keep on improving it. In the next module, I'd like to share with you a deep dive into the first of these design patterns, reflection, and you'll find that it's a maybe surprisingly simple technique to implement that can sometimes give the performance of your system a very nice bump. So let's go on to the next module to learn about the reflection design pattern.
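The brochure example above could be sketched as three role-prompted agents run in sequence. This is only one simple way to wire agents together, and every detail is an assumption for illustration: the `Agent` class, the role prompts, and `fake_llm`, which stands in for a real model call so that the example runs on its own.

```python
from typing import Callable, List, Tuple


class Agent:
    """An LLM prompted to take on a role; `call_llm` is a placeholder model call."""

    def __init__(self, name: str, role_prompt: str, call_llm: Callable[[str], str]):
        self.name = name
        self.role_prompt = role_prompt
        self.call_llm = call_llm

    def run(self, task_input: str) -> str:
        prompt = f"{self.role_prompt}\n\nInput:\n{task_input}"
        return self.call_llm(prompt)


def run_pipeline(agents: List[Agent], brief: str) -> List[Tuple[str, str]]:
    """Pass the brief through each agent in turn; return (agent name, output) pairs."""
    transcript = []
    current = brief
    for agent in agents:
        current = agent.run(current)
        transcript.append((agent.name, current))
    return transcript


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model: reports which role prompt it was given.
    role = prompt.splitlines()[0]
    return f"[output produced under role: {role}]"


team = [
    Agent("researcher", "You are a researcher. Gather relevant facts.", fake_llm),
    Agent("marketer", "You are a marketer. Write the marketing text.", fake_llm),
    Agent("editor", "You are an editor. Edit and polish the text.", fake_llm),
]

for name, output in run_pipeline(team, "Brochure for a new line of sunglasses"):
    print(name, "->", output)
```

Here the order of agents is fixed by the developer; workflows in which the agents themselves decide who acts next are more flexible but, as noted above, harder to control.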