LLM Agents

expert · LLM Agents
The Synthetic Dataset Trap
You generate synthetic training data using GPT-4 to fine-tune a smaller model. The fine-tuned model scores well on your eval set but poorly on new benchmarks. Why?
data contamination · synthetic data · decontamination · capability vs. memorization

The Trap

Frontier models generate reasoning data that often already contains benchmark knowledge, so without overlap detection you're silently training on your evaluation data — inflating scores without real capability gains.

Correct Approach

Implement decontamination: check for n-gram overlap between synthetic data and all evaluation sets. Use diverse generation prompts to avoid mode collapse. Validate on held-out human-written test sets that the teacher model hasn't seen. Track capability vs. memorization with perturbation tests.
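A minimal sketch of the n-gram overlap check described above. The n-gram size and overlap threshold here are illustrative defaults, not established constants, and real pipelines would normalize text more aggressively before comparing.

```python
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Lowercased word n-grams, a common unit for contamination checks."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(sample: str, eval_set: Iterable[str],
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a synthetic sample if a large fraction of its n-grams
    appear verbatim in any evaluation example."""
    sample_grams = ngrams(sample, n)
    if not sample_grams:
        return False  # too short to compare at this n
    for eval_text in eval_set:
        overlap = len(sample_grams & ngrams(eval_text, n))
        if overlap / len(sample_grams) >= threshold:
            return True
    return False
```

Samples that trip the threshold against any eval set are dropped before fine-tuning; paraphrase-level contamination needs embedding similarity on top of this.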

Follow-up Questions

  • How do you detect subtle contamination beyond exact n-gram matches?
  • What's the minimum synthetic-to-organic data ratio you'd recommend?
hard · LLM Agents
The DOM Context Trap
Your web browsing agent is given the full accessibility tree of a page but takes 30+ seconds per action and often clicks the wrong element. How do you fix this?
DOM pruning · state-space explosion · multi-modal agents · spatial hashing

The Trap

Feeding the full accessibility tree into an LLM creates a state-space explosion where invisible artifacts dominate attention and latency scales with page complexity rather than task complexity.

Correct Approach

Filter the DOM to only visible, interactive elements. Use spatial hashing to reduce the search space. Implement a two-stage approach: first identify the relevant region (using screenshot + vision), then operate on the filtered sub-tree. Cache DOM structure across steps.
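The filtering and spatial-hashing steps can be sketched as follows. The element record fields (`tag`, `visible`, `bbox`) are assumptions for this sketch; a real browser driver would supply its own schema.

```python
# Tags an agent can usefully act on; everything else is pruned.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}

def filter_dom(elements):
    """Keep only visible, interactive elements with a nonzero bounding box."""
    kept = []
    for el in elements:
        if not el.get("visible", False):
            continue
        if el.get("tag") not in INTERACTIVE_TAGS:
            continue
        x, y, w, h = el.get("bbox", (0, 0, 0, 0))
        if w <= 0 or h <= 0:
            continue  # rendered but zero-area artifacts
        kept.append(el)
    return kept

def spatial_bucket(el, cell=200):
    """Coarse spatial hash: grid-bucket elements so that once the vision
    stage identifies a screen region, only that bucket is searched."""
    x, y, _, _ = el["bbox"]
    return (int(x) // cell, int(y) // cell)
```

Latency then scales with the number of actionable elements in the relevant region, not with total page complexity.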

Follow-up Questions

  • How would you handle dynamically loaded content (infinite scroll)?
  • What are the trade-offs of screenshot-based vs DOM-based navigation?
expert · LLM Agents
The Semantic Leakage Trap
Your chatbot has a system prompt forbidding financial advice, but users keep getting it to discuss stock picks by rephrasing questions. Stronger system prompts don't help. Why?
prompt injection · attention mechanics · defense in depth · output filtering

The Trap

System prompts are soft constraints, not hard rules: instruction tokens and user tokens share the same attention budget, and no architectural mechanism privileges the system prompt over the rest of the context. Sufficiently crafted user inputs can out-compete it for attention weight and steer the model anyway, which is why stacking more instructions doesn't help.

Correct Approach

Layer defenses: (1) input classifier to detect intent before it reaches the model, (2) output filter to catch policy violations post-generation, (3) fine-tune the model on refusal examples for your specific domain, (4) use structured generation to constrain output format.
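A minimal sketch of layers (1) and (2) as a pipeline around the model call. The regex patterns stand in for trained intent/violation classifiers, which is what a production system would actually use; all pattern lists here are illustrative assumptions.

```python
import re

# Illustrative stand-ins for trained classifiers.
INTENT_PATTERNS = [r"\bstock picks?\b", r"\bshould i (buy|sell)\b"]
OUTPUT_PATTERNS = [r"\b(buy|sell)\b.*\b(shares?|stock)\b"]

REFUSAL = "I can't provide financial advice."

def classify_input(user_msg: str) -> bool:
    """Stage 1: flag financial-advice intent before the model sees it."""
    return any(re.search(p, user_msg.lower()) for p in INTENT_PATTERNS)

def filter_output(model_reply: str) -> bool:
    """Stage 2: catch policy violations after generation."""
    return any(re.search(p, model_reply.lower()) for p in OUTPUT_PATTERNS)

def guarded_reply(user_msg: str, generate) -> str:
    """`generate` is the underlying model call (any str -> str callable)."""
    if classify_input(user_msg):
        return REFUSAL
    reply = generate(user_msg)
    if filter_output(reply):
        return REFUSAL
    return reply
```

The key property is that a rephrased jailbreak must now defeat two independent checks, neither of which competes for attention inside the model's context.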

Follow-up Questions

  • How does fine-tuning on refusals differ from adding more system prompt text?
  • What metrics would you use to evaluate guardrail effectiveness?
hard · LLM Agents
The Context Pollution Trap
Your LLM agent uses a ReAct loop with tool calls. After 8 steps, answer quality drops sharply even though the context window isn't full. What's happening and how do you fix it?
attention budget · context compaction · ReAct loop · sliding window

The Trap

Tool traces, dead-end reasoning, and irrelevant snippets silently dominate the attention budget, leaving the model with less signal when it actually needs to reason.

Correct Approach

Implement context compaction: summarize completed tool results, prune dead-end traces, and use a sliding window that keeps only the most recent N observations plus the original goal. Consider hierarchical summarization where intermediate results are compressed before re-injection.
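The compaction scheme above can be sketched as a function that rebuilds the message list each step. The step schema (`observation`, `dead_end`) and the truncation-based `summarize` default are assumptions; in practice `summarize` would be an LLM call.

```python
def compact_context(goal: str, steps: list, keep_recent: int = 3,
                    summarize=lambda s: s["observation"][:80]) -> list:
    """Sliding-window compaction for a ReAct trace: keep the original goal,
    one-line summaries of old steps, and the last N observations in full."""
    old, recent = steps[:-keep_recent], steps[-keep_recent:]
    messages = [{"role": "system", "content": goal}]
    for s in old:
        if s.get("dead_end"):
            continue  # prune abandoned reasoning branches entirely
        messages.append({"role": "assistant",
                         "content": f"[summary] {summarize(s)}"})
    for s in recent:
        messages.append({"role": "assistant", "content": s["observation"]})
    return messages
```

The goal is pinned at the front so compaction never discards the task itself, only the exhaust from reaching it.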

Follow-up Questions

  • How would you measure attention waste quantitatively?
  • What's the trade-off between aggressive pruning and losing context?
hard · LLM Agents
The Lost-in-the-Middle Trap
You built a RAG system that retrieves 20 documents and stuffs them into a 128K context window. Retrieval recall is 95%, but the final answers miss key facts from the middle documents. What's wrong?
positional bias · RAG · re-ranking · map-reduce summarization

The Trap

Even with massive context windows, transformer attention degrades when critical constraints sit among thousands of irrelevant tokens. The model attends strongly to the beginning and end, losing signal in the middle.

Correct Approach

Re-rank retrieved documents so the most relevant are at the start. Reduce total context by filtering aggressively (quality > quantity). Use map-reduce summarization: process each doc independently, then combine summaries. Consider recursive retrieval with focused sub-queries.
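A minimal sketch of the re-rank-and-truncate step. Term overlap stands in for a real cross-encoder re-ranker score; the function shape, not the scoring heuristic, is the point.

```python
def rerank_and_truncate(docs, query_terms, top_k=5):
    """Order retrieved docs by relevance (here: query-term overlap, a
    stand-in for a cross-encoder) and keep only the best few, so key
    facts land at the start of the prompt instead of the middle."""
    def score(doc):
        return len(set(doc.lower().split()) & set(query_terms))
    return sorted(docs, key=score, reverse=True)[:top_k]
```

Dropping from 20 documents to the top 5 trades a little recall for much better attention on what remains, which is usually the right trade when recall is already 95%.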

Follow-up Questions

  • How does RoPE (rotary positional encoding) affect this problem?
  • Would you prefer fewer high-quality chunks or many lower-quality ones?
hard · LLM Agents
The Static Benchmark Trap
Your agent scores 92% on a curated evaluation suite but fails on real user requests. What explains the gap?
evaluation robustness · distribution shift · dynamic benchmarks · failure taxonomy

The Trap

An agent that performs well on curated demonstrations but collapses after a single perturbation has learned sequence replay, not policy robustness. Static benchmarks don't capture distribution shift or adversarial inputs.

Correct Approach

Use dynamic evaluation: randomize task parameters, introduce perturbations, test with out-of-distribution inputs. Build a live eval pipeline that continuously samples real user interactions. Track failure modes by category rather than aggregate accuracy.
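The task-parameter randomization can be sketched as a perturbation step applied before each eval run. The task schema here (`template`, `slot_options`) is an assumption for this sketch, not a standard format.

```python
import random

def perturb_task(task: dict, rng: random.Random) -> dict:
    """Randomize surface parameters of an eval task so an agent cannot
    pass by replaying a memorized trajectory."""
    perturbed = dict(task)
    perturbed["slots"] = {k: rng.choice(opts)
                          for k, opts in task["slot_options"].items()}
    perturbed["prompt"] = task["template"].format(**perturbed["slots"])
    return perturbed
```

Seeding the RNG per eval run keeps results reproducible while still varying the tasks between runs; scoring by failure category (wrong element, wrong slot, timeout) then tells you which perturbations the agent can't absorb.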

Follow-up Questions

  • How do you prevent benchmark contamination in training data?
  • What's a good ratio of synthetic to organic test cases?
