吴恩达agentic-第二节
吴恩达教授agentic课程的第2节
Reflection Design Pattern
Reflection to Improve Outputs of a Task
Reflection lets an LLM improve its own outputs, just as humans reflect and revise their drafts.
Example:
- Write an email → produce version 1 (v1)
- Reflect → find unclear phrasing, typos, missing signature
- Revise → produce version 2 (v2)
LLMs can follow the same loop:
- Generate v1 (email, code, etc.)
- Pass v1 into another prompt
- Reflect and improve to create v2
Different models can handle each step:
- Generation → creative or direct model
- Reflection → reasoning or analytical model
External feedback increases effectiveness:
- Example: run code → capture errors → feed results back → fix bugs
Key points
- Reflection adds modest but consistent performance gains
- External information strengthens the reflection process
- Reflection ≠ perfection, but it improves clarity, accuracy, and completeness
Why Not Just Direct Generation?
Zero-shot prompting = one-step generation with no examples.
Example:
- “Write an essay about black holes.”
- “Write a Python function for compound interest.”
Reflection vs direct generation:
- Reflection often outperforms zero-shot on diverse tasks.
- Research shows reflection boosts success across models like GPT-3.5 and GPT-4.
When reflection helps
- Generating structured data (HTML, JSON)
- Multi-step instructions (e.g., how to brew tea)
- Creative generation (domain names, brand names)
Reflection prompt examples
- Check tone, facts, and clarity in emails
- Evaluate if domain names are easy to pronounce and have no negative meanings
Prompt writing tips
- Use verbs like “review,” “reflect,” “check,” “verify”
- Specify clear criteria (tone, factual accuracy, structure)
- Study prompts from good open-source implementations
Chart Generation Workflow
Reflection also improves visual outputs.
Example workflow:
- Generate chart code (Python) → produce stacked bar plot (v1)
- Feed both code + generated image to a multimodal LLM
- Reflect visually → suggest better plot (e.g., grouped bars)
- Generate improved visualization (v2)
Effective reflection prompts
- Assign clear roles (“expert data analyst”)
- Include context: code, data, and image
- Define criteria: readability, clarity, completeness
Model choice
- Generation: GPT-4o, GPT-5, etc.
- Reflection: reasoning model or visual model
Evaluating the Impact of Reflection
Reflection adds latency but often improves accuracy. Always test if it’s worth keeping.
Example: Database query evaluation
- LLM writes SQL → run without reflection
- LLM writes SQL → reflect, rewrite → run again
- Compare accuracy vs ground truth answers
Result example:
- No reflection → 87% correct
- With reflection → 95% correct
Evaluation methods
- Objective tasks: use automated tests (e.g., SQL results)
- Subjective tasks: use LLM-as-judge with rubrics
Rubric-based evaluation
- Avoid pairwise comparison bias
- Use binary (0/1) scoring for consistent evaluation
- Criteria: clear title, proper labels, correct chart type, etc.
Why this matters
- Enables prompt optimization
- Provides reproducible benchmarks
- Keeps improvements data-driven
Using External Feedback
Reflection with new information outperforms reflection using only prior context.
Performance curve:
- Zero-shot → plateaus fast
- Reflection → moderate gain
- Reflection + external feedback → strong improvement
Examples of external feedback
- Run code → capture runtime errors → feed back
- Detect banned terms → feed detection results
- Web search for fact-checking → supply verified info
- Word count validation → enforce length limits
Design pattern
- Generate → produce output
- Collect external feedback → new data
- Reflect with feedback → improved output
Reflection is a system-level loop:
- LLM learns from reality
- Developer measures and tunes
- System continuously improves
Summary
Reflection design pattern:
- Mimics human revision
- Works across text, code, and images
- Improves quality when guided by clear criteria or external feedback
Core principle Add structured feedback loops to any generation system.