Core Models
📈 Regression
Multiple Linear Regression
Chapter 6 · Supervised
ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ
Assumptions / Key Concepts
- → Linearity between predictors and outcome
- → Independence of observations
- → Homoscedasticity (constant error variance)
- → Normality of residuals
- → No multicollinearity among predictors
Key Parameters
Coefficients (β) · Intercept (β₀) · Regularization (Ridge/Lasso)
✅ Pros
- • Highly interpretable
- • Fast to train
- • Strong baseline model
- • Coefficient = feature impact
❌ Cons
- • Assumes linearity
- • Sensitive to outliers
- • Struggles with irrelevant features
- • Multicollinearity inflates variance
Evaluation Metrics
RMSE · MAE · R² · Adjusted R²
Python — scikit-learn
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
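The metrics listed below can be computed directly; a minimal sketch on synthetic data (variable names and the noise level are illustrative). Adjusted R² has no scikit-learn helper, so it is derived from its formula 1 − (1−R²)(n−1)/(n−p−1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic data: y = 3 + 2*x1 - x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

rmse = mean_squared_error(y, y_pred) ** 0.5   # same units as y
mae = mean_absolute_error(y, y_pred)          # robust to outliers
r2 = r2_score(y, y_pred)                      # proportion of variance explained

# Adjusted R^2 penalizes extra predictors (n samples, p features)
n, p = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```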
🔍 Class / Reg
K-Nearest Neighbors (k-NN)
Chapter 7 · Supervised · Lazy Learner
d(p,q) = √(Σ(pᵢ − qᵢ)²) [Euclidean]
Predict: majority vote (class) or mean (reg)
Assumptions / Key Concepts
- → No explicit training phase — stores all data
- → Feature scale matters: normalize/standardize
- → Smaller k → complex boundary; larger k → smoother
- → Distance metric choice affects results significantly
Key Parameters
k (n_neighbors) · distance metric · weights (uniform/distance)
✅ Pros
- • No training time
- • Naturally multi-class
- • Non-parametric
- • Intuitive concept
❌ Cons
- • Slow at prediction O(n)
- • High memory usage
- • Sensitive to irrelevant features
- • Curse of dimensionality
Evaluation Metrics
Accuracy · Confusion Matrix · F1 · RMSE
Python — scikit-learn
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
# Scale features first: StandardScaler()
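Since feature scale matters for k-NN, a pipeline keeps the scaler fit on training data only; a minimal sketch on a synthetic dataset (all sizes and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline: scaler is fit on the training split only, then reused at predict time
knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=5, weights="distance"))
knn.fit(X_train, y_train)
acc = knn.score(X_test, y_test)
```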
🎲 Classification
Naive Bayes Classifier
Chapter 8 · Supervised · Probabilistic
P(C|X) ∝ P(C) × ∏ P(xᵢ|C)
Posterior ∝ Prior × Likelihood
Assumptions / Key Concepts
- → Features are conditionally independent given the class
- → Variants: Gaussian (continuous), Multinomial (counts), Bernoulli (binary)
- → Laplace smoothing handles zero-probability problem
Key Parameters
Prior P(C) · var_smoothing (Gaussian) · alpha (Laplace smoothing, Multinomial)
✅ Pros
- • Very fast training & prediction
- • Works well with high-dim data
- • Handles missing values
- • Good for text classification
❌ Cons
- • Independence rarely holds
- • Poor probability calibration
- • Struggles with correlated features
- • Zero-frequency problem
Evaluation Metrics
Accuracy · Precision · Recall · F1 · ROC-AUC
Python — scikit-learn
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)
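For the text-classification use case, the Multinomial variant pairs with a bag-of-words vectorizer; a toy sketch (corpus and labels are made up for illustration; alpha is the Laplace smoothing term):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (illustrative labels: 1 = spam, 0 = ham)
texts = ["win free prize now", "meeting at noon", "free cash win",
         "lunch with team", "claim your free prize", "project update attached"]
labels = [1, 0, 1, 0, 1, 0]

# CountVectorizer -> word counts; alpha=1.0 avoids zero probabilities
clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(texts, labels)
pred = clf.predict(["free prize inside"])[0]
```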
🌳 Class / Reg
Classification & Regression Trees (CART)
Chapter 9 · Supervised · Tree-Based
Gini = 1 − Σ pᵢ² [Classification]
MSE = (1/n) Σ(yᵢ − ȳ)² [Regression]
Split: minimize impurity at each node
Assumptions / Key Concepts
- → Recursive binary splitting on feature thresholds
- → Non-parametric — no distributional assumptions
- → Pruning (pre/post) controls overfitting
- → Handles both numeric and categorical features
Key Parameters
max_depth · min_samples_split · criterion (gini/entropy; squared_error for regression) · ccp_alpha (pruning)
✅ Pros
- • Highly interpretable (visual)
- • No feature scaling needed
- • Handles mixed data types
- • Captures non-linear patterns
❌ Cons
- • Prone to overfitting
- • High variance (unstable)
- • Biased toward high-cardinality features
- • Not globally optimal splits
Evaluation Metrics
Accuracy · Confusion Matrix · RMSE · R²
Python — scikit-learn
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=5, criterion='gini')
model.fit(X_train, y_train)
# Visualize: plot_tree(model)
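The effect of cost-complexity pruning (ccp_alpha) shows up when comparing an unpruned and a pruned tree; a minimal sketch on the Iris data (the ccp_alpha value is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unpruned tree: grows until leaves are pure (overfitting risk)
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Post-pruning: larger ccp_alpha collapses weak branches into fewer leaves
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

n_full, n_pruned = full.get_n_leaves(), pruned.get_n_leaves()
```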
📊 Classification
Logistic Regression
Chapter 10 · Supervised · Probabilistic
P(y=1|X) = σ(z) = 1 / (1 + e⁻ᶻ)
z = β₀ + β₁x₁ + … + βₙxₙ
log-odds = ln(p / (1 − p)) = z
Assumptions / Key Concepts
- → Linear relationship between log-odds and features
- → Independence of observations
- → No severe multicollinearity
- → Large sample size recommended
- → Outcome is binary (or ordinal/multinomial with extensions)
Key Parameters
C (inverse regularization strength) · penalty (l1/l2/elasticnet) · solver · max_iter
✅ Pros
- • Probabilistic output (0–1)
- • Interpretable coefficients
- • Fast training
- • Regularization available
❌ Cons
- • Assumes linear log-odds
- • Poor with non-linear boundaries
- • Sensitive to outliers
- • Requires feature scaling
Evaluation Metrics
Accuracy · Precision · Recall · F1 · ROC-AUC · Log-Loss
Python — scikit-learn
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)
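Coefficient interpretability comes via odds ratios: exp(β) is the multiplicative change in the odds per unit increase of a feature (here, per standard deviation after scaling). A minimal sketch on the breast-cancer dataset:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale first so coefficients are comparable across features
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(C=1.0, max_iter=1000))
clf.fit(X, y)

beta = clf.named_steps["logisticregression"].coef_[0]
odds_ratios = np.exp(beta)  # >1 raises the odds of class 1, <1 lowers them
```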
🔬 Classification
Discriminant Analysis (LDA / QDA)
Chapter 12 · Supervised · Generative
LDA: δₖ(x) = xᵀΣ⁻¹μₖ − ½μₖᵀΣ⁻¹μₖ + log πₖ
QDA: relaxes equal Σ assumption per class
Maximizes between-class / within-class variance
Assumptions / Key Concepts
- → LDA: Multivariate normality + equal covariance matrices across classes
- → QDA: Multivariate normality; each class has its own covariance matrix
- → Features should not be perfectly correlated
- → LDA also serves as a dimensionality reduction technique
Key Parameters
solver (svd/lsqr/eigen) · shrinkage (LDA) · n_components (LDA dim. red.) · reg_param (QDA)
✅ Pros
- • Works well with small samples
- • LDA also reduces dimensions
- • Probabilistic class boundaries
- • Efficient computation
❌ Cons
- • Normality assumption often violated
- • Sensitive to outliers
- • LDA: equal covariance required
- • Struggles with many features
Evaluation Metrics
Accuracy · Confusion Matrix · ROC-AUC · Precision / Recall
Python — scikit-learn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(X_train, y_train)
X_reduced = lda.transform(X_train) # dim. reduction
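To see whether QDA's per-class covariance pays off, both can be cross-validated side by side; a minimal sketch on synthetic data (the generator settings are illustrative and avoid collinear features, which would make the covariance estimates singular):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

# n_redundant=0 avoids collinear features (singular covariance matrices)
X, y = make_classification(n_samples=400, n_features=6,
                           n_informative=4, n_redundant=0, random_state=0)

lda_acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()
qda_acc = cross_val_score(QuadraticDiscriminantAnalysis(), X, y, cv=5).mean()
```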
Model Comparison at a Glance
| Model | Task | Parametric? | Scaling? | Prob. Output? | Interpretable? | Non-linear? | Train Speed | Pred. Speed | Best For |
|---|---|---|---|---|---|---|---|---|---|
| Linear Regression | Regression | Yes | Helps | No | High | No | Fast | Fast | Continuous prediction, baseline |
| k-NN | Class / Reg | No | Required | Soft | Medium | Yes | None | Slow O(n) | Small datasets, local patterns |
| Naive Bayes | Classification | Yes | Not needed | Yes | Medium | No | Very Fast | Very Fast | Text, high-dim, streaming |
| CART | Class / Reg | No | Not needed | Soft | High | Yes | Medium | Fast | Mixed data, explainability |
| Logistic Regression | Classification | Yes | Required | Yes | High | No | Fast | Fast | Binary outcome, probability needed |
| Discriminant Analysis | Classification | Yes | Helps | Yes | Medium | QDA: Yes | Fast | Fast | Small samples, dim. reduction |
Evaluation Metrics
Accuracy
Classification
(TP + TN) / (TP + TN + FP + FN)
Overall correctness. Misleading with imbalanced classes.
Precision
Classification
TP / (TP + FP)
Of predicted positives, how many are truly positive? Minimize false alarms.
Recall (Sensitivity)
Classification
TP / (TP + FN)
Of all actual positives, how many were caught? Minimize missed cases.
F1 Score
Classification
2 × (P × R) / (P + R)
Harmonic mean of Precision & Recall. Best for imbalanced data.
ROC-AUC
Classification
Area under ROC curve (TPR vs FPR)
Threshold-independent. AUC=1 perfect; AUC=0.5 random.
Log-Loss
Classification
−(1/n) Σ [y·log(p) + (1−y)·log(1−p)]
Penalizes confident wrong predictions. Lower is better.
RMSE
Regression
√( (1/n) Σ (yᵢ − ŷᵢ)² )
Same units as target. Penalizes large errors more than MAE.
MAE
Regression
(1/n) Σ |yᵢ − ŷᵢ|
Mean Absolute Error. Robust to outliers. Easy to interpret.
R²
Regression
1 − SS_res / SS_tot
Proportion of variance explained. R²=1 perfect; R²=0 no better than mean.
Adjusted R²
Regression
1 − (1−R²)(n−1)/(n−p−1)
Penalizes adding irrelevant predictors. Use when comparing models with different feature counts.
Confusion Matrix Structure
| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| **Actual: Positive** | TP (True Positive) | FN (False Negative) |
| **Actual: Negative** | FP (False Positive) | TN (True Negative) |
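The cell definitions can be checked in code. Note that scikit-learn orders labels ascending, so `confusion_matrix` returns `[[TN, FP], [FN, TP]]`, with the positive class in the last row/column rather than the first (the toy labels below are illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# ravel() flattens [[TN, FP], [FN, TP]] row by row
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```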
Model Selection Guide
Outcome Type
Continuous → Multiple Linear Regression
Binary → Logistic Regression, LDA, Naive Bayes
Multi-class → CART, k-NN, LDA, Naive Bayes
Interpretability Priority
Linear Regression — coefficient = direct effect
Logistic Regression — log-odds interpretation
CART — visual tree structure
No Feature Scaling Needed
CART — distance-free splits
Naive Bayes — probability-based
⚠ k-NN, Logistic Reg, LDA benefit from scaling
Probabilistic Output Needed
Logistic Regression — calibrated probabilities
Naive Bayes — posterior probabilities
LDA / QDA — posterior via Bayes
Small Sample Size
LDA — stable with few observations
Linear Regression — low parameters
Logistic Regression — with regularization
High-Dimensional Data
Naive Bayes — scales well
Logistic Regression + L1 (Lasso)
Apply PCA first to reduce dimensions
Non-Linear Boundaries
CART — axis-aligned splits
k-NN — flexible local boundaries
QDA — quadratic boundaries
Mixed Data Types
CART — handles numeric & categorical
Naive Bayes — with appropriate variant
⚠ Encode categoricals for other models
| Scenario | First Choice | Alternative | Avoid |
|---|---|---|---|
| Predict house price | Linear Regression | Regression Tree | Naive Bayes (classification only) |
| Spam detection (text) | Naive Bayes | Logistic Regression | k-NN (too slow, high-dim) |
| Customer churn (binary) | Logistic Regression | CART | Linear Regression (binary outcome) |
| Iris flower classification | LDA | k-NN | Linear Regression (multi-class) |
| Medical diagnosis (explain) | CART | Logistic Regression | k-NN (black box, slow) |
| Real-time recommendation | Naive Bayes | Logistic Regression | k-NN (slow prediction) |
Dimension Reduction — PCA (Chapter 4)
What is PCA?
Principal Component Analysis transforms correlated features into a smaller set of uncorrelated components that capture maximum variance in the data.
Key Steps
- → Standardize features (zero mean, unit variance)
- → Compute covariance matrix
- → Extract eigenvectors (principal components)
- → Project data onto top k components
When to Use
- → Too many features (curse of dimensionality)
- → Multicollinearity in regression
- → Visualization (reduce to 2D/3D)
- → Speed up model training
Python
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# pca.explained_variance_ratio_
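Choosing k is usually done from the cumulative explained variance; a minimal sketch on the digits dataset picks the smallest k that captures 95% of the variance (the threshold is illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)                        # keep all components
cumvar = np.cumsum(pca.explained_variance_ratio_)
k = int(np.argmax(cumvar >= 0.95)) + 1           # smallest k reaching 95%
```

Equivalently, `PCA(n_components=0.95)` lets scikit-learn pick that k directly.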
Standard ML Workflow (Data Mining Process — Chapter 2)
📦 Business Understanding (Ch.1–2) → 🗄️ Data Collection (Ch.2) → 🧹 Data Preprocessing (Ch.2) → 📊 Data Visualization (Ch.3) → 🔻 Dimension Reduction (Ch.4) → 🤖 Model Training (Ch.6–12) → 📏 Model Evaluation (Ch.5) → ⚙️ Tuning & Selection (Ch.5) → 🚀 Deployment & Insights (Project)
Train / Val / Test Split
60% Train · 20% Val · 20% Test
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
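A 60/20/20 split takes two calls, since train_test_split only splits once; a minimal sketch (the array contents are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)

# First carve off 20% for test, then split the rest 75/25 -> 60/20/20 overall
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
```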
Cross-Validation
k-Fold CV averages performance across k splits — more reliable than a single split.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
Overfitting vs Underfitting
Overfitting
Low train error, high test error. Model memorized noise.
Underfitting
High train & test error. Model too simple.
Fix overfitting: prune tree, increase k in k-NN, regularize regression, reduce features.
Preprocessing Checklist
- ✓ Handle missing values (impute or drop)
- ✓ Encode categoricals (OneHot / Label)
- ✓ Scale features (StandardScaler / MinMaxScaler)
- ✓ Remove or cap outliers
- ✓ Check class imbalance (SMOTE / class_weight)
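The checklist maps naturally onto a ColumnTransformer, so every step is fit on training data only and reused at predict time; a minimal sketch (column names and imputation strategies are illustrative; imbalance handling such as SMOTE is out of scope here):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50_000, 60_000, np.nan, 42_000],
    "city": ["NY", "LA", "NY", "SF"],
})

# Numeric columns: impute missing values, then scale
numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
# Categorical columns: impute, then one-hot encode
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("onehot", OneHotEncoder(handle_unknown="ignore"))])

prep = ColumnTransformer([("num", numeric, ["age", "income"]),
                          ("cat", categorical, ["city"])])
X = prep.fit_transform(df)  # 2 scaled numeric cols + 3 one-hot city cols
```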