Frontier models generate reasoning data that often already contains benchmark knowledge, so without overlap detection you're silently training on your evaluation data — inflating scores without real capability gains.
Implement decontamination: check for n-gram overlap between synthetic data and all evaluation sets. Use diverse generation prompts to avoid mode collapse. Validate on held-out human-written test sets that the teacher model hasn't seen. Track capability vs. memorization with perturbation tests.
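A minimal sketch of the n-gram decontamination step, assuming word-level n-grams and an overlap threshold; the function names, `n=8`, and the 10% cutoff are illustrative choices, not a standard:

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(sample, eval_texts, n=8):
    """Fraction of the sample's n-grams that appear in any eval text."""
    sample_grams = ngrams(sample, n)
    if not sample_grams:
        return 0.0
    eval_grams = set()
    for text in eval_texts:
        eval_grams |= ngrams(text, n)
    return len(sample_grams & eval_grams) / len(sample_grams)

def decontaminate(synthetic, eval_sets, n=8, threshold=0.1):
    """Drop synthetic samples whose n-gram overlap with any eval set exceeds the threshold."""
    eval_texts = [t for eval_set in eval_sets for t in eval_set]
    return [s for s in synthetic
            if contamination_score(s, eval_texts, n) < threshold]
```

In practice the eval n-gram set would be built once and stored (e.g. hashed into a Bloom filter) rather than rebuilt per sample.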
Feeding the full accessibility tree into an LLM creates a state-space explosion: hidden and off-screen nodes dominate the attention budget, and latency scales with page complexity rather than task complexity.
Filter the DOM to only visible, interactive elements. Use spatial hashing to reduce the search space. Implement a two-stage approach: first identify the relevant region (using screenshot + vision), then operate on the filtered sub-tree. Cache DOM structure across steps.
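The visibility filter can be sketched over a DOM-like dict tree; the node schema (`visible`, `width`, `height`, `children`) and the interactive-tag list are assumptions for illustration:

```python
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}

def filter_dom(node, keep=None):
    """Flatten a DOM-like tree into only its visible, interactive elements."""
    if keep is None:
        keep = []
    visible = (node.get("visible", True)
               and node.get("width", 1) > 0
               and node.get("height", 1) > 0)
    if visible and node.get("tag") in INTERACTIVE_TAGS:
        keep.append({"tag": node["tag"],
                     "text": node.get("text", ""),
                     "id": node.get("id")})
    if visible:  # prune hidden subtrees entirely -- their children never reach the model
        for child in node.get("children", []):
            filter_dom(child, keep)
    return keep
```

The filtered list is what the second stage operates on, keeping prompt size proportional to actionable elements rather than raw page size.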
Softmax attention assigns nonzero weight to every token in context, so injected tokens inevitably influence the reasoning path. System prompts are soft constraints — they compete with user tokens for attention weight, and sufficiently crafted inputs can override them.
Layer defenses: (1) input classifier to detect intent before it reaches the model, (2) output filter to catch policy violations post-generation, (3) fine-tune the model on refusal examples for your specific domain, (4) use structured generation to constrain output format.
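The layering can be sketched as a wrapper around the model call; the regex patterns here are toy stand-ins for real classifiers, and `guarded_call` is a hypothetical name:

```python
import re

# Toy heuristics -- production systems would use trained classifiers, not regexes.
INJECTION_PATTERNS = [r"ignore (all )?previous instructions", r"you are now"]
POLICY_PATTERNS = [r"api[_-]?key\s*[:=]"]

def classify_input(user_msg):
    """Stage 1: flag likely injection attempts before they reach the model."""
    return any(re.search(p, user_msg, re.IGNORECASE) for p in INJECTION_PATTERNS)

def filter_output(response):
    """Stage 2: catch policy violations in the generated text."""
    return any(re.search(p, response, re.IGNORECASE) for p in POLICY_PATTERNS)

def guarded_call(user_msg, model_fn):
    """Wrap a model call with input and output checks; refuse on either."""
    if classify_input(user_msg):
        return "Refused: input flagged by classifier."
    response = model_fn(user_msg)
    if filter_output(response):
        return "Refused: output violated policy."
    return response
```

Fine-tuning on refusals and structured generation sit inside `model_fn` itself; the wrapper only covers layers (1) and (2).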
Tool traces, dead-end reasoning, and irrelevant snippets silently dominate the attention budget, leaving the model with less signal when it actually needs to reason.
Implement context compaction: summarize completed tool results, prune dead-end traces, and use a sliding window that keeps only the most recent N observations plus the original goal. Consider hierarchical summarization where intermediate results are compressed before re-injection.
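A minimal sketch of the sliding-window compaction, assuming observations arrive as strings; the first-sentence summarizer is a crude placeholder for a real LLM summarization call:

```python
def compact_context(goal, observations, window=3, summarize=None):
    """Keep the original goal, summaries of old observations, and the last `window` raw ones."""
    # Placeholder summarizer: keep only the first sentence of each old observation.
    summarize = summarize or (lambda obs: obs.split(".")[0] + ".")
    old, recent = observations[:-window], observations[-window:]
    return {
        "goal": goal,                          # always re-injected verbatim
        "summaries": [summarize(o) for o in old],  # compressed history
        "recent": recent,                      # full-fidelity recent steps
    }
```

Hierarchical summarization extends this by re-summarizing the `summaries` list itself once it grows past a budget.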
Even with massive context windows, transformer attention degrades when critical constraints sit among thousands of irrelevant tokens. The model attends strongly to the beginning and end, losing signal in the middle.
Re-rank retrieved documents so the most relevant are at the start. Reduce total context by filtering aggressively (quality > quantity). Use map-reduce summarization: process each doc independently, then combine summaries. Consider recursive retrieval with focused sub-queries.
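The re-ranking and map-reduce steps can be sketched as follows; term-overlap scoring stands in for a real cross-encoder re-ranker, and the `summarize` callable is assumed to be provided:

```python
def rerank(query, docs):
    """Order docs by query-term overlap so the most relevant land at the start of context."""
    query_terms = set(query.lower().split())
    def score(doc):
        return len(query_terms & set(doc.lower().split()))
    return sorted(docs, key=score, reverse=True)

def map_reduce_summarize(docs, summarize):
    """Map: summarize each doc independently. Reduce: concatenate the summaries."""
    return " ".join(summarize(doc) for doc in docs)
```

Because each doc is summarized in isolation, no single call has to attend across the full retrieved set, which sidesteps the lost-in-the-middle failure.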
An agent that performs on curated demonstrations but collapses after one perturbation has learned sequence replay, not policy robustness. Static benchmarks don't capture distribution shift or adversarial inputs.
Use dynamic evaluation: randomize task parameters, introduce perturbations, test with out-of-distribution inputs. Build a live eval pipeline that continuously samples real user interactions. Track failure modes by category rather than aggregate accuracy.
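A toy harness illustrating the idea, with a randomized arithmetic task standing in for real agent tasks; `make_task` and the failure categories are hypothetical:

```python
import random
from collections import Counter

def make_task(rng):
    """Sample fresh task parameters so the agent can't replay a memorized sequence."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    return {"prompt": f"What is {a} + {b}?", "answer": a + b}

def evaluate(agent, n=100, seed=0):
    """Run the agent on n randomized tasks; bucket failures by category, not just accuracy."""
    rng = random.Random(seed)
    failures, correct = Counter(), 0
    for _ in range(n):
        task = make_task(rng)
        try:
            out = agent(task["prompt"])
        except Exception:
            failures["crash"] += 1
            continue
        if out == task["answer"]:
            correct += 1
        else:
            failures["wrong_answer"] += 1
    return {"accuracy": correct / n, "failures": dict(failures)}
```

The same skeleton works for a live pipeline: replace `make_task` with sampled real user interactions and add perturbation wrappers around the prompt.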
LLM Agents — 6 questions