NLP

hard NLP
The Tokenizer Mismatch Trap
You fine-tune a multilingual model on code-switched text (English + Hindi). The model outputs garbled Hindi tokens despite good English performance. The training data is clean. What's wrong?
Tags: tokenizer coverage, vocabulary extension, multilingual NLP, subword segmentation

The Trap

The base model's tokenizer was trained on English-heavy data. Hindi words are split into many subword tokens, increasing sequence length and degrading attention patterns. Fine-tuning can't fix a fundamentally mismatched vocabulary.
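The mismatch above is usually quantified as tokenizer "fertility": the average number of subword tokens produced per word. A minimal sketch with a hypothetical toy tokenizer (illustration only, not a real vocabulary):

```python
def fertility(words, tokenize):
    """Average number of subword tokens produced per word.
    High fertility for a language means longer sequences and
    weaker per-token representations for that language."""
    return sum(len(tokenize(w)) for w in words) / len(words)

# Hypothetical tokenizer behaviour: English words stay whole,
# the Hindi word shatters into four pieces.
toy_vocab = {"the": ["the"], "model": ["model"],
             "नमस्ते": ["न", "म", "स्", "ते"]}
tokenize = toy_vocab.get

english_fertility = fertility(["the", "model"], tokenize)
hindi_fertility = fertility(["नमस्ते"], tokenize)
```

A fertility of ~1 for English versus 3-4+ for Hindi is exactly the symptom described: each Hindi word costs several tokens, and attention spreads over fragments that carry little meaning on their own.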

Correct Approach

Extend the tokenizer vocabulary with high-frequency Hindi tokens and resize the embedding layer. Initialize new token embeddings as averages of their subword constituents. Alternatively, use a tokenizer-free approach (byte-level) or a model pre-trained on balanced multilingual data.
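In the Hugging Face ecosystem this is `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(...)`. The averaging step for the new embedding rows can be sketched in plain NumPy; the pieces and vectors below are toy values, not real model weights:

```python
import numpy as np

def init_new_embedding(word, old_embed, old_tokenize):
    """Initialize a newly added token's embedding as the mean of the
    embeddings of the subword pieces the old tokenizer produced,
    so the new row starts near the region the model already uses."""
    pieces = old_tokenize(word)
    return np.mean([old_embed[p] for p in pieces], axis=0)

# Toy setup: the old tokenizer splits one Hindi word into three pieces.
old_embed = {
    "ste":  np.array([1.0, 1.0]),
    "##na": np.array([1.0, 0.0]),
    "##ma": np.array([0.0, 1.0]),
}
old_tokenize = lambda w: ["ste", "##na", "##ma"]  # stand-in tokenizer

new_vec = init_new_embedding("नमस्ते", old_embed, old_tokenize)
```

Starting from the subword centroid rather than a random vector keeps the new token roughly where the model already "expects" that word's meaning, which markedly speeds up fine-tuning of the new rows.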

Follow-up Questions

  • How do you decide which tokens to add to the vocabulary?
  • What's the fertility rate and why does it matter?
medium NLP
The Named Entity Leakage Trap
Your NER model achieves 97% F1 on the test set but drops to 60% on new documents from the same domain. The test set was properly held out. What's going on?
Tags: entity-level splits, memorization vs generalization, entity augmentation, NER evaluation

The Trap

If the same named entities appear in both train and test sets (e.g., 'Google' appears in both), the model memorizes entity strings rather than learning contextual patterns. It fails on unseen entity mentions.

Correct Approach

Use entity-level splits: ensure no entity string appears in both train and test. Evaluate on zero-shot entities specifically. Augment training with entity replacement (swap 'Google' with 'Acme Corp' while preserving context). Use gazetteer features as soft signals, not hard rules.
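A minimal sketch of an entity-level split, assuming hypothetical data structures (a real pipeline would handle examples that mix held-out and training entities more carefully than routing them all to test):

```python
import random

def entity_level_split(examples, test_frac=0.2, seed=0):
    """Split NER examples so no entity string appears in both train
    and test. `examples` is a list of (tokens, entities) pairs,
    where `entities` is a set of surface strings."""
    rng = random.Random(seed)
    all_entities = sorted({e for _, ents in examples for e in ents})
    rng.shuffle(all_entities)
    n_test = max(1, int(len(all_entities) * test_frac))
    held_out = set(all_entities[:n_test])

    train, test = [], []
    for ex in examples:
        _, ents = ex
        # Any example touching a held-out entity goes to the test side,
        # so train never sees those entity strings.
        (test if ents & held_out else train).append(ex)
    return train, test, held_out

examples = [
    (["Google", "hired", "Alice"], {"Google", "Alice"}),
    (["Acme", "sued", "Google"], {"Acme", "Google"}),
    (["Bob", "joined", "Acme"], {"Bob", "Acme"}),
]
train, test, held_out = entity_level_split(examples)
```

The invariant worth testing in your own pipeline is the one this sketch enforces: no training example contains any held-out entity string.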

Follow-up Questions

  • How do you handle partial entity overlap (e.g., 'New York' vs 'New York Times')?
  • What evaluation metrics best capture NER generalization?
hard NLP
The Embedding Collapse Trap
Your sentence embedding model produces high similarity scores (>0.85) for completely unrelated sentences. Cosine similarity is nearly useless for retrieval. What happened?
Tags: anisotropy, contrastive learning, SimCSE, embedding normalization

The Trap

Anisotropic embedding spaces concentrate all vectors in a narrow cone, so cosine similarity is artificially high for every pair. This is common with raw pretrained transformer outputs that were never fine-tuned for similarity.

Correct Approach

Apply whitening/normalization to the embedding space (zero-mean, unit-variance per dimension). Fine-tune with a contrastive loss (SimCSE, InfoNCE) using hard negatives. Take [CLS] or mean-pooled embeddings from the fine-tuned model, not from a raw pretrained one. Consider Matryoshka embeddings for flexible dimensionality.
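The effect of the normalization step can be demonstrated on synthetic data: shift random vectors by a large common offset to fake an anisotropic cone, then standardize per dimension. (This is the simple per-dimension variant; full whitening would also decorrelate dimensions via PCA/SVD.)

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance per dimension."""
    mu = X.mean(axis=0, keepdims=True)
    sigma = X.std(axis=0, keepdims=True) + 1e-8
    return (X - mu) / sigma

def mean_cosine(X):
    """Mean off-diagonal pairwise cosine similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    n = len(X)
    return (sims.sum() - n) / (n * (n - 1))

rng = np.random.default_rng(0)
# Simulate anisotropy: random vectors plus a large shared offset,
# so every vector lives in the same narrow cone.
X = rng.normal(size=(200, 64)) + 10.0

before = mean_cosine(X)               # near 1.0: everything "similar"
after = mean_cosine(standardize(X))   # near 0.0: similarity is usable
```

The shared offset dominates every vector's direction, which is exactly why unrelated sentences score >0.85; removing the per-dimension mean restores a discriminative similarity.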

Follow-up Questions

  • How do hard negatives improve contrastive learning?
  • What's the difference between in-batch and mined hard negatives?
expert NLP
The Positional Encoding Cliff
Your model handles 4K-token documents well but falls apart at 8K tokens, even though you extended the context window using interpolation. Why?
Tags: RoPE, NTK interpolation, YaRN, ALiBi, context extension

The Trap

Linear position interpolation compresses all positions, degrading local attention patterns that were learned at the original scale. The model can 'see' far but loses ability to reason about nearby tokens precisely.

Correct Approach

Use NTK-aware interpolation or YaRN: scale high-frequency RoPE components less than low-frequency ones, preserving local attention while extending reach. Alternatively, use ALiBi which generalizes to longer sequences by design. Fine-tune on a mix of short and long documents.
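The NTK-aware adjustment can be sketched directly on the RoPE frequencies: instead of dividing positions by the scale factor, the rotary base is enlarged so low-frequency components stretch by roughly the full factor while the highest frequencies barely move. (Toy dimensions below; YaRN refines this with per-band interpolation and an attention temperature.)

```python
import numpy as np

def rope_freqs(dim, base=10000.0):
    """Per-pair rotary frequencies theta_i = base^(-2i/dim)."""
    return base ** (-np.arange(0, dim, 2) / dim)

def ntk_freqs(dim, scale, base=10000.0):
    """NTK-aware context extension: enlarge the base so the lowest
    frequency stretches by ~`scale` while the highest is untouched."""
    new_base = base * scale ** (dim / (dim - 2))
    return rope_freqs(dim, new_base)

dim, scale = 64, 2.0            # e.g. extending 4K -> 8K context
orig = rope_freqs(dim)
ntk = ntk_freqs(dim, scale)

# Effective wavelength stretch per frequency component:
ratio = orig / ntk              # 1.0 at the high-frequency end,
                                # `scale` at the low-frequency end
```

This non-uniform stretch is the whole point: local attention (high-frequency components) keeps its learned resolution, while long-range positions (low-frequency components) are compressed to fit the extended window.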

Follow-up Questions

  • How does NTK interpolation differ from linear interpolation?
  • What's the maximum reliable extension factor for each method?
