When defects are smaller than the effective receptive field of early conv layers, they get averaged out during downsampling. The model literally can't see them by the time features reach the classifier.
Use feature pyramid networks (FPN) to preserve multi-scale features. Add skip connections from early layers. Use dilated/atrous convolutions to increase receptive field without losing resolution. Consider a two-stage approach: detect regions of interest at full resolution, then classify.
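The dilation trade-off can be checked with a back-of-envelope receptive-field calculation. A minimal sketch (the layer configurations below are illustrative, not taken from any specific model):

```python
def receptive_field(layers):
    """layers: list of (kernel, stride, dilation) per conv layer.
    Returns (receptive field in input pixels, output stride)."""
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1      # dilation enlarges the kernel footprint
        rf += (k_eff - 1) * jump     # RF growth scales with cumulative stride
        jump *= s
    return rf, jump

# Three 3x3 stride-2 convs: RF of 15 pixels, but features are 8x downsampled.
plain = receptive_field([(3, 2, 1)] * 3)
# Three 3x3 stride-1 convs with dilations 1, 2, 4: the same RF of 15,
# with the feature map still at full resolution.
dilated = receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)])
```

The dilated stack matches the strided stack's receptive field while keeping output stride 1, which is exactly what sub-stride defects need to survive into later layers.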
Fine-tuning the ViT won't break the ceiling because the bottleneck isn't perception — it's the missing symbolic bridge between visual claims and language reasoning. Scaling the encoder amplifies features the LLM can't use.
Add an explicit alignment module (Q-Former, linear projection with learned queries) between vision and language. Use contrastive pre-training (CLIP-style) to align visual and text embeddings. Fine-tune the projection layer, not the frozen encoder. Consider chain-of-thought visual reasoning.
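For the contrastive pre-training step, here is a minimal pure-Python sketch of the CLIP-style symmetric InfoNCE objective. The fixed temperature and tiny 2-D embeddings are assumptions for illustration; real implementations learn the temperature and use large batches of projected encoder outputs:

```python
import math

def clip_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings: matched
    (image, text) pairs share an index; all other pairs are negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    def norm(v):
        n = math.sqrt(dot(v, v))
        return [x / n for x in v]
    img = [norm(v) for v in img_embs]
    txt = [norm(v) for v in txt_embs]
    # cosine-similarity logits, sharpened by the temperature
    logits = [[dot(i, t) / temperature for t in txt] for i in img]
    def ce(row, target):  # numerically stable cross-entropy
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        return lse - row[target]
    n = len(img)
    i2t = sum(ce(logits[k], k) for k in range(n)) / n
    t2i = sum(ce([logits[j][k] for j in range(n)], k) for k in range(n)) / n
    return (i2t + t2i) / 2

# Aligned pairs drive the loss toward zero; mismatched pairs are penalized.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

During alignment training, only the projection parameters feeding `img_embs` receive gradients; the frozen encoder and LLM stay fixed.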
Some augmentations violate domain invariants. Flipping a chest X-ray swaps the heart to the wrong side — creating impossible training examples that confuse the model about anatomical structure.
Use domain-aware augmentation: only apply transforms that preserve label validity. For medical imaging, elastic deformations and intensity scaling are safe; horizontal flips and extreme rotations are not. Validate augmented samples with domain experts. Use learned augmentation (AutoAugment) constrained to valid transforms.
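One way to enforce label validity is a per-domain whitelist that the augmentation pipeline cannot bypass. A minimal sketch, where the domain names, transform names, and `SAFE` policy table are all hypothetical placeholders for expert-validated policies:

```python
import random

# Hypothetical per-domain whitelist of label-preserving transforms.
# "hflip" is deliberately absent for chest X-rays: mirroring puts the
# heart on the wrong side, an anatomically impossible training example.
SAFE = {
    "chest_xray": {"intensity_scale", "elastic_deform", "crop"},
    "natural_images": {"intensity_scale", "crop", "hflip", "rotate"},
}

def augment(img, domain, transforms, seed=0):
    """Apply only the transforms whitelisted for this domain; skip the rest."""
    rng = random.Random(seed)
    applied = []
    for name, fn in transforms:
        if name in SAFE[domain]:
            img = fn(img, rng)
            applied.append(name)
    return img, applied

def intensity_scale(img, rng):
    s = rng.uniform(0.9, 1.1)
    return [[p * s for p in row] for row in img]

def hflip(img, rng):
    return [row[::-1] for row in img]

pipeline = [("intensity_scale", intensity_scale), ("hflip", hflip)]
_, xray_applied = augment([[0, 1], [2, 3]], "chest_xray", pipeline)
_, photo_applied = augment([[0, 1], [2, 3]], "natural_images", pipeline)
```

The same pipeline definition can then be shared across projects, with the whitelist as the single point of domain review.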
Standard object detectors downsample aggressively (typically to a 32x output stride) for speed, destroying small-object features before they ever reach the detection heads.
Use SAHI (Slicing Aided Hyper Inference) — tile the image into overlapping patches, run detection on each, merge results with NMS. Train with mosaic augmentation at higher resolution. Add extra detection heads for small scales. Consider using a super-resolution preprocessor for tiny regions.
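The slicing-and-merging core of SAHI is easy to sketch. This is a simplified re-implementation, not the library's actual API, and `fake_detect` is a hypothetical stand-in for a real detector:

```python
def tile_starts(length, tile, step):
    """Start offsets of overlapping tiles covering [0, length)."""
    starts = list(range(0, max(length - tile, 0) + 1, step))
    if starts[-1] + tile < length:
        starts.append(length - tile)  # extra tile to cover the far edge
    return starts

def iou(a, b):
    """IoU of two (x, y, w, h, score) boxes."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def nms(dets, thr=0.5):
    """Greedy NMS: keep the highest-scoring boxes, drop heavy overlaps."""
    keep = []
    for d in sorted(dets, key=lambda d: d[4], reverse=True):
        if all(iou(d, k) < thr for k in keep):
            keep.append(d)
    return keep

def sliced_inference(w, h, detect_fn, tile=512, overlap=0.2):
    """Run detect_fn on each overlapping tile, shift boxes back to
    global coordinates, and merge cross-tile duplicates with NMS."""
    step = int(tile * (1 - overlap))
    dets = []
    for ty in tile_starts(h, tile, step):
        for tx in tile_starts(w, tile, step):
            for (x, y, bw, bh, score) in detect_fn(tx, ty, tile):
                dets.append((x + tx, y + ty, bw, bh, score))
    return nms(dets)

# Hypothetical detector: "finds" one ground-truth object whenever it
# lies fully inside the tile, returning tile-local coordinates.
OBJECT = (300, 100, 30, 30)

def fake_detect(tx, ty, size):
    ox, oy, ow, oh = OBJECT
    if tx <= ox and ox + ow <= tx + size and ty <= oy and oy + oh <= ty + size:
        return [(ox - tx, oy - ty, ow, oh, 0.9)]
    return []

# Four overlapping tiles each see the object; NMS merges them to one box.
merged = sliced_inference(800, 600, fake_detect)
```

Because each tile is run at the detector's native input size, a 30-pixel object occupies the same fraction of the input as a much larger object would in the full frame, so it survives the 32x downsampling.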
Computer Vision — 4 questions