Unscaled features with vastly different ranges (e.g., age 0-100 vs salary 10K-500K) create elongated contours in the loss landscape. Gradient descent oscillates along the steep dimension and crawls along the flat one.
Apply feature scaling: StandardScaler (zero mean, unit variance) for roughly normally distributed features, MinMaxScaler for features with known bounds, RobustScaler for outlier-heavy data. Always fit the scaler on the training data only, then transform both train and test sets with those statistics -- fitting on test data leaks information. For tree-based models, scaling is unnecessary -- they split on thresholds, not distances.
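A minimal sklearn sketch of the fit-on-train, transform-on-test pattern (the toy age/salary numbers are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data with vastly different ranges: age (0-100) vs salary (10K-500K).
X_train = np.array([[25, 40_000], [47, 120_000], [62, 300_000], [33, 65_000]], dtype=float)
X_test = np.array([[29, 55_000]], dtype=float)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit ONLY on training data
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print(X_train_scaled.mean(axis=0))  # ~[0, 0]
print(X_train_scaled.std(axis=0))   # ~[1, 1]
```

Note that `X_test_scaled` need not have zero mean or unit variance; it is standardized with the training set's mean and variance, which is exactly the point.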
Prompt engineering has diminishing returns for structured tasks with clear labels. Complex prompts add latency, cost, and brittleness. Fine-tuning on even 500 labeled examples usually outperforms prompting for classification.
Decision framework: Use prompting for exploratory tasks, few-shot prototyping, and open-ended generation. Fine-tune when you have labeled data (>200 examples), need consistent structured output, or cost/latency matters. Consider distillation: use a large model to label data, then fine-tune a small model on those labels.
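The decision framework can be roughly encoded as a helper function; the `choose_approach` name, its parameters, and the branch ordering are hypothetical, with only the >200-example threshold taken from the text:

```python
def choose_approach(n_labeled: int, needs_structured_output: bool = False,
                    cost_or_latency_sensitive: bool = False) -> str:
    """Hypothetical helper encoding the framework above (illustrative only)."""
    if n_labeled > 200 and (needs_structured_output or cost_or_latency_sensitive):
        return "fine-tune"
    if cost_or_latency_sensitive:
        # Not enough labels: have a large model label data, then fine-tune small.
        return "distill"
    return "prompt"

print(choose_approach(500, needs_structured_output=True))   # fine-tune
print(choose_approach(50, cost_or_latency_sensitive=True))  # distill
print(choose_approach(50))                                  # prompt
```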
With 0.3% fraud rate, a model that predicts 'not fraud' for everything achieves 99.7% accuracy. Accuracy is a meaningless metric for imbalanced classification.
Switch metrics: use precision-recall AUC, F1-score, or the Matthews correlation coefficient. Apply class rebalancing: SMOTE for oversampling, or class weights in the loss function. Use anomaly-detection approaches (e.g., Isolation Forest) as a complement. Set business-relevant thresholds by weighing the cost of a false negative against that of a false positive.
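The accuracy trap and the metric fix can be seen in a few lines with sklearn (synthetic labels matching the 0.3% fraud rate):

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# 0.3% fraud rate: 3 fraud cases in 1,000 transactions.
y_true = [0] * 997 + [1] * 3
y_pred = [0] * 1000          # degenerate model: always predicts "not fraud"

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
mcc = matthews_corrcoef(y_true, y_pred)
print(acc)   # 0.997 -- looks excellent
print(f1)    # 0.0   -- reveals the model catches zero fraud
print(mcc)   # 0.0
```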
Deep decision trees memorize training data by creating hyper-specific rules for individual examples. Increasing depth makes this worse -- it's adding complexity to an already overfit model.
Constrain the tree: limit max_depth (3-10 typical), set min_samples_leaf (5-20), use min_samples_split. Better yet, use ensemble methods: Random Forest (reduces variance via bagging) or Gradient Boosting (reduces bias via sequential learning). Use cross-validation to tune hyperparameters.
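A short sklearn sketch comparing an unconstrained tree, a constrained tree, and a bagged ensemble under cross-validation (the synthetic dataset and specific hyperparameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "deep tree": DecisionTreeClassifier(random_state=0),  # grows until it memorizes
    "constrained tree": DecisionTreeClassifier(max_depth=5, min_samples_leaf=10,
                                               random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.3f}")
```

On most runs the constrained tree and the forest generalize better than the unconstrained tree, illustrating the variance reduction the text describes.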
Naively switching to a cheaper model drops quality. The real issue is that most requests don't need the full power of GPT-4 -- many are cacheable, classifiable by difficulty tier, or can be handled by a fine-tuned smaller model.
Layer the solution: (1) Semantic cache -- identical/similar questions return cached responses (50-70% cache hit rate typical). (2) Model routing -- easy questions to GPT-3.5/Haiku, hard ones to GPT-4/Opus. (3) Fine-tune a small model on your specific domain using GPT-4 outputs as training data. (4) Batch non-urgent requests. Target roughly $0.01/request, which works out to about $3K/month at 10K requests/day.
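The cache layer can be sketched with a toy stand-in: real semantic caches compare embedding vectors, but here stdlib string similarity (`difflib`) plays that role, and the `SimilarityCache` class, its threshold, and the sample Q&A are all illustrative assumptions:

```python
from __future__ import annotations

from difflib import SequenceMatcher


class SimilarityCache:
    """Toy semantic-cache sketch: difflib similarity stands in for
    embedding cosine similarity (an assumption for illustration)."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self._store: dict[str, str] = {}

    def get(self, question: str) -> str | None:
        q = question.strip().lower()
        for cached_q, answer in self._store.items():
            if SequenceMatcher(None, q, cached_q).ratio() >= self.threshold:
                return answer  # cache hit: skip the expensive model call
        return None  # cache miss: route to a model tier

    def put(self, question: str, answer: str) -> None:
        self._store[question.strip().lower()] = answer


cache = SimilarityCache()
cache.put("What is your refund policy?", "Refunds within 30 days.")
print(cache.get("what is your refund policy"))    # hit despite case/punct drift
print(cache.get("How do I reset my password?"))   # miss -> fall through to routing
```

On a miss, the caller would fall through to step (2), the difficulty-tier router, and store the model's answer back into the cache.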
AI Engineering — 5 questions