The base model's tokenizer was trained on English-heavy data. Hindi words are split into many subword tokens, increasing sequence length and degrading attention patterns. Fine-tuning can't fix a fundamentally mismatched vocabulary.
Extend the tokenizer vocabulary with high-frequency Hindi tokens and resize the embedding layer. Initialize new token embeddings as averages of their subword constituents. Alternatively, use a tokenizer-free approach (byte-level) or a model pre-trained on balanced multilingual data.
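The subword-average initialization can be sketched as follows. This is a minimal illustration with a toy embedding table and made-up subword pieces, not a real tokenizer; in practice you would use a library's equivalents (e.g. Hugging Face's `tokenizer.add_tokens(...)` and `model.resize_token_embeddings(...)`) and read the embedding matrix from the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a pretrained embedding table and a subword vocabulary.
# In a real setup these come from the model and tokenizer.
vocab = {"nam": 0, "##aste": 1, "##e": 2}
old_embeddings = rng.normal(size=(len(vocab), 8))  # (old_vocab_size, dim)

def init_new_token(word_pieces, embeddings, vocab):
    """Initialize a new token's embedding as the mean of the embeddings
    of the subword pieces it previously tokenized into."""
    ids = [vocab[p] for p in word_pieces]
    return embeddings[ids].mean(axis=0)

# Add a high-frequency word as a single token and grow the table by one row.
new_vec = init_new_token(["nam", "##aste"], old_embeddings, vocab)
embeddings = np.vstack([old_embeddings, new_vec[None, :]])
vocab["namaste"] = embeddings.shape[0] - 1
```

Averaging the constituents keeps the new embedding in the same region of the space the model already understands, so fine-tuning starts from a sensible point instead of random noise.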
If the same named entities appear in both the train and test sets (e.g., 'Google' in both), the model memorizes entity strings rather than learning contextual patterns, and it fails on unseen entity mentions.
Use entity-level splits: ensure no entity string appears in both train and test. Evaluate on zero-shot entities specifically. Augment training with entity replacement (swap 'Google' with 'Acme Corp' while preserving context). Use gazetteer features as soft signals, not hard rules.
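An entity-level split and entity-replacement augmentation can be sketched as below. The data format (`text`/`entities` dicts) and the policy of dropping examples that mix held-in and held-out entities are illustrative assumptions, not a fixed recipe.

```python
import random

def entity_split(examples, test_frac=0.4, seed=0):
    """Split so that no entity string appears in both train and test."""
    entities = sorted({e for ex in examples for e in ex["entities"]})
    rng = random.Random(seed)
    rng.shuffle(entities)
    held_out = set(entities[: max(1, int(len(entities) * test_frac))])
    train, test = [], []
    for ex in examples:
        ents = set(ex["entities"])
        if ents <= held_out:
            test.append(ex)
        elif ents.isdisjoint(held_out):
            train.append(ex)
        # Examples mixing held-in and held-out entities are dropped
        # to keep the split strictly entity-disjoint (a design choice).
    return train, test

def swap_entity(text, old, new):
    """Entity-replacement augmentation: same context, new surface form."""
    return text.replace(old, new)

examples = [
    {"text": "Google released a model.", "entities": ["Google"]},
    {"text": "Acme Corp sued Google.", "entities": ["Acme Corp", "Google"]},
    {"text": "Tata Motors opened a plant.", "entities": ["Tata Motors"]},
]
train, test = entity_split(examples)
```

Evaluating on the held-out-entity portion specifically measures zero-shot generalization rather than string memorization.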
Anisotropic embedding spaces concentrate all vectors in a narrow cone, making cosine similarity artificially high for everything. This is common with raw outputs of pretrained transformers that have not been fine-tuned for similarity.
Apply whitening/normalization to the embedding space (zero-mean, unit-variance per dimension). Fine-tune with contrastive loss (SimCSE, InfoNCE) using hard negatives. Take the [CLS] token or mean-pooled embedding from a fine-tuned model, not from a raw pretrained one. Consider Matryoshka embeddings for flexible dimensionality.
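The whitening fix and the anisotropy symptom it addresses can be demonstrated numerically. This is a self-contained sketch on synthetic vectors (random noise plus a large shared offset, simulating the narrow-cone effect), not an extract from any particular model.

```python
import numpy as np

def whiten(X, eps=1e-8):
    """ZCA-style whitening: zero mean, identity covariance."""
    mu = X.mean(axis=0)
    Xc = X - mu
    U, S, _ = np.linalg.svd(Xc.T @ Xc / len(X))
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return Xc @ W

def mean_cosine(X):
    """Average pairwise cosine similarity; values near 1 signal anisotropy."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    n = len(X)
    return (sims.sum() - n) / (n * (n - 1))  # exclude the diagonal

rng = np.random.default_rng(0)
# Simulate an anisotropic space: random vectors plus a large shared offset.
X = rng.normal(size=(200, 32)) + 5.0
print(mean_cosine(X))          # close to 1: everything looks "similar"
print(mean_cosine(whiten(X)))  # near 0 after whitening
```

The shared offset dominates every vector's direction, which is exactly why cosine similarity saturates; removing the mean and equalizing variance restores a usable similarity scale.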
Linear position interpolation compresses all positions uniformly, degrading the local attention patterns that were learned at the original scale. The model can 'see' far but loses the ability to reason about nearby tokens precisely.
Use NTK-aware interpolation or YaRN: scale high-frequency RoPE components less than low-frequency ones, preserving local attention while extending reach. Alternatively, use ALiBi which generalizes to longer sequences by design. Fine-tune on a mix of short and long documents.
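The frequency-dependent effect of NTK-aware scaling versus linear interpolation can be seen directly in the RoPE frequencies. A minimal sketch, assuming the standard RoPE parameterization (base 10000) and the common NTK base adjustment `base * scale**(dim/(dim-2))`; the dimension and scale factor below are arbitrary examples.

```python
import numpy as np

def rope_inv_freq(dim, base=10000.0):
    """Per-pair rotation frequencies used by RoPE."""
    return base ** (-np.arange(0, dim, 2) / dim)

def ntk_inv_freq(dim, scale, base=10000.0):
    """NTK-aware scaling: raise the base so low-frequency components
    stretch a lot while high-frequency (local) ones barely move."""
    new_base = base * scale ** (dim / (dim - 2))
    return new_base ** (-np.arange(0, dim, 2) / dim)

dim, scale = 64, 4.0
orig = rope_inv_freq(dim)
linear = orig / scale            # linear interpolation compresses everything 4x
ntk = ntk_inv_freq(dim, scale)

ratio = ntk / orig               # 1.0 = component unchanged
print(ratio[0])                  # highest frequency: ~1.0, local attention kept
print(ratio[-1])                 # lowest frequency: ~1/scale, range extended
```

The exponent `dim/(dim - 2)` is chosen so the lowest-frequency component is scaled by exactly `1/scale` (matching linear interpolation's reach) while the highest-frequency component is left untouched, which is the core of the NTK-aware and YaRN family of methods.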
NLP — 4 questions