When defects are smaller than the effective receptive field of early conv layers, they get averaged out during downsampling. The model literally can't see them by the time features reach the classifier.
Use feature pyramid networks (FPN) to preserve multi-scale features. Add skip connections from early layers. Use dilated/atrous convolutions to increase receptive field without losing resolution. Consider a two-stage approach: detect regions of interest at full resolution, then classify.
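The dilation trade-off can be checked with a back-of-envelope receptive-field calculation. A minimal sketch (the layer configurations below are illustrative, not taken from any specific model):

```python
def receptive_field(layers):
    """layers: list of (kernel, stride, dilation) per conv layer.
    Returns (receptive field in input pixels, output stride)."""
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1      # dilation enlarges the kernel footprint
        rf += (k_eff - 1) * jump     # RF growth scales with cumulative stride
        jump *= s
    return rf, jump

# Three 3x3 stride-2 convs: RF of 15 pixels, but features are 8x downsampled.
plain = receptive_field([(3, 2, 1)] * 3)
# Three 3x3 stride-1 convs with dilations 1, 2, 4: the same RF of 15,
# with the feature map still at full resolution.
dilated = receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)])
```

The dilated stack matches the strided stack's receptive field while keeping output stride 1, which is exactly what sub-stride defects need to survive into later layers.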
Fine-tuning the ViT won't break the ceiling because the bottleneck isn't perception — it's the missing symbolic bridge between visual claims and language reasoning. Scaling the encoder amplifies features the LLM can't use.
Add an explicit alignment module (Q-Former, linear projection with learned queries) between vision and language. Use contrastive pre-training (CLIP-style) to align visual and text embeddings. Fine-tune the projection layer, not the frozen encoder. Consider chain-of-thought visual reasoning.
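For the contrastive pre-training step, here is a minimal pure-Python sketch of the CLIP-style symmetric InfoNCE objective. The fixed temperature and tiny 2-D embeddings are assumptions for illustration; real implementations learn the temperature and use large batches of projected encoder outputs:

```python
import math

def clip_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings: matched
    (image, text) pairs share an index; all other pairs are negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    def norm(v):
        n = math.sqrt(dot(v, v))
        return [x / n for x in v]
    img = [norm(v) for v in img_embs]
    txt = [norm(v) for v in txt_embs]
    # cosine-similarity logits, sharpened by the temperature
    logits = [[dot(i, t) / temperature for t in txt] for i in img]
    def ce(row, target):  # numerically stable cross-entropy
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        return lse - row[target]
    n = len(img)
    i2t = sum(ce(logits[k], k) for k in range(n)) / n
    t2i = sum(ce([logits[j][k] for j in range(n)], k) for k in range(n)) / n
    return (i2t + t2i) / 2

# Aligned pairs drive the loss toward zero; mismatched pairs are penalized.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

During alignment training, only the projection parameters feeding `img_embs` receive gradients; the frozen encoder and LLM stay fixed.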
Some augmentations violate domain invariants. Flipping a chest X-ray swaps the heart to the wrong side — creating impossible training examples that confuse the model about anatomical structure.
Use domain-aware augmentation: only apply transforms that preserve label validity. For medical imaging, elastic deformations and intensity scaling are safe; horizontal flips and extreme rotations are not. Validate augmented samples with domain experts. Use learned augmentation (AutoAugment) constrained to valid transforms.
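One way to enforce label validity is a per-domain whitelist that the augmentation pipeline cannot bypass. A minimal sketch, where the domain names, transform names, and `SAFE` policy table are all hypothetical placeholders for expert-validated policies:

```python
import random

# Hypothetical per-domain whitelist of label-preserving transforms.
# "hflip" is deliberately absent for chest X-rays: mirroring puts the
# heart on the wrong side, an anatomically impossible training example.
SAFE = {
    "chest_xray": {"intensity_scale", "elastic_deform", "crop"},
    "natural_images": {"intensity_scale", "crop", "hflip", "rotate"},
}

def augment(img, domain, transforms, seed=0):
    """Apply only the transforms whitelisted for this domain; skip the rest."""
    rng = random.Random(seed)
    applied = []
    for name, fn in transforms:
        if name in SAFE[domain]:
            img = fn(img, rng)
            applied.append(name)
    return img, applied

def intensity_scale(img, rng):
    s = rng.uniform(0.9, 1.1)
    return [[p * s for p in row] for row in img]

def hflip(img, rng):
    return [row[::-1] for row in img]

pipeline = [("intensity_scale", intensity_scale), ("hflip", hflip)]
_, xray_applied = augment([[0, 1], [2, 3]], "chest_xray", pipeline)
_, photo_applied = augment([[0, 1], [2, 3]], "natural_images", pipeline)
```

The same pipeline definition can then be shared across projects, with the whitelist as the single point of domain review.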
Standard object detectors downsample aggressively (typically to a 32x output stride) for speed, destroying small-object features before they ever reach the detection heads.
Use SAHI (Slicing Aided Hyper Inference) — tile the image into overlapping patches, run detection on each, merge results with NMS. Train with mosaic augmentation at higher resolution. Add extra detection heads for small scales. Consider using a super-resolution preprocessor for tiny regions.
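The slicing-and-merging core of SAHI is easy to sketch. This is a simplified re-implementation, not the library's actual API, and `fake_detect` is a hypothetical stand-in for a real detector:

```python
def tile_starts(length, tile, step):
    """Start offsets of overlapping tiles covering [0, length)."""
    starts = list(range(0, max(length - tile, 0) + 1, step))
    if starts[-1] + tile < length:
        starts.append(length - tile)  # extra tile to cover the far edge
    return starts

def iou(a, b):
    """IoU of two (x, y, w, h, score) boxes."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def nms(dets, thr=0.5):
    """Greedy NMS: keep the highest-scoring boxes, drop heavy overlaps."""
    keep = []
    for d in sorted(dets, key=lambda d: d[4], reverse=True):
        if all(iou(d, k) < thr for k in keep):
            keep.append(d)
    return keep

def sliced_inference(w, h, detect_fn, tile=512, overlap=0.2):
    """Run detect_fn on each overlapping tile, shift boxes back to
    global coordinates, and merge cross-tile duplicates with NMS."""
    step = int(tile * (1 - overlap))
    dets = []
    for ty in tile_starts(h, tile, step):
        for tx in tile_starts(w, tile, step):
            for (x, y, bw, bh, score) in detect_fn(tx, ty, tile):
                dets.append((x + tx, y + ty, bw, bh, score))
    return nms(dets)

# Hypothetical detector: "finds" one ground-truth object whenever it
# lies fully inside the tile, returning tile-local coordinates.
OBJECT = (300, 100, 30, 30)

def fake_detect(tx, ty, size):
    ox, oy, ow, oh = OBJECT
    if tx <= ox and ox + ow <= tx + size and ty <= oy and oy + oh <= ty + size:
        return [(ox - tx, oy - ty, ow, oh, 0.9)]
    return []

# Four overlapping tiles each see the object; NMS merges them to one box.
merged = sliced_inference(800, 600, fake_detect)
```

Because each tile is run at the detector's native input size, a 30-pixel object occupies the same fraction of the input as a much larger object would in the full frame, so it survives the 32x downsampling.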
Computer Vision — 4 questions