Core Models
📈 Regression
Multiple Linear Regression
Chapter 6 · Supervised
ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ
Assumptions / Key Concepts
- → Linearity between predictors and outcome
- → Independence of observations
- → Homoscedasticity (constant error variance)
- → Normality of residuals
- → No multicollinearity among predictors
Key Parameters
Coefficients (β) · Intercept (β₀) · Regularization (Ridge/Lasso)
✅ Pros
- • Highly interpretable
- • Fast to train
- • Strong baseline model
- • Coefficient = feature impact
❌ Cons
- • Assumes linearity
- • Sensitive to outliers
- • Struggles with irrelevant features
- • Multicollinearity inflates variance
Evaluation Metrics
RMSE · MAE · R² · Adjusted R²
Python — scikit-learn
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
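The metrics listed below can be computed directly; a minimal sketch on synthetic data (variable names and the noise level are illustrative). Adjusted R² has no scikit-learn helper, so it is derived from its formula 1 − (1−R²)(n−1)/(n−p−1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic data: y = 3 + 2*x1 - x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

rmse = mean_squared_error(y, y_pred) ** 0.5   # same units as y
mae = mean_absolute_error(y, y_pred)          # robust to outliers
r2 = r2_score(y, y_pred)                      # proportion of variance explained

# Adjusted R^2 penalizes extra predictors (n samples, p features)
n, p = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```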
🔍 Class / Reg
K-Nearest Neighbors (k-NN)
Chapter 7 · Supervised · Lazy Learner
d(p,q) = √(Σ(pᵢ − qᵢ)²) [Euclidean]
Predict: majority vote (class) or mean (reg)
Assumptions / Key Concepts
- → No explicit training phase — stores all data
- → Feature scale matters: normalize/standardize
- → Smaller k → complex boundary; larger k → smoother
- → Distance metric choice affects results significantly
Key Parameters
k (n_neighbors) · distance metric · weights (uniform/distance)
✅ Pros
- • No training time
- • Naturally multi-class
- • Non-parametric
- • Intuitive concept
❌ Cons
- • Slow at prediction O(n)
- • High memory usage
- • Sensitive to irrelevant features
- • Curse of dimensionality
Evaluation Metrics
Accuracy · Confusion Matrix · F1 · RMSE
Python — scikit-learn
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
# Scale features first: StandardScaler()
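Since feature scale matters for k-NN, a pipeline keeps the scaler fit on training data only; a minimal sketch on a synthetic dataset (all sizes and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline: scaler is fit on the training split only, then reused at predict time
knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=5, weights="distance"))
knn.fit(X_train, y_train)
acc = knn.score(X_test, y_test)
```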
🎲 Classification
Naive Bayes Classifier
Chapter 8 · Supervised · Probabilistic
P(C|X) ∝ P(C) × ∏ P(xᵢ|C)
Posterior ∝ Prior × Likelihood
Assumptions / Key Concepts
- → Features are conditionally independent given the class
- → Variants: Gaussian (continuous), Multinomial (counts), Bernoulli (binary)
- → Laplace smoothing handles zero-probability problem
Key Parameters
Prior P(C) · var_smoothing (Gaussian) · alpha (Laplace smoothing, Multinomial)
✅ Pros
- • Very fast training & prediction
- • Works well with high-dim data
- • Handles missing values
- • Good for text classification
❌ Cons
- • Independence rarely holds
- • Poor probability calibration
- • Struggles with correlated features
- • Zero-frequency problem
Evaluation Metrics
Accuracy · Precision · Recall · F1 · ROC-AUC
Python — scikit-learn
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)
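For the text-classification use case, the Multinomial variant pairs with a bag-of-words vectorizer; a toy sketch (corpus and labels are made up for illustration; alpha is the Laplace smoothing term):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (illustrative labels: 1 = spam, 0 = ham)
texts = ["win free prize now", "meeting at noon", "free cash win",
         "lunch with team", "claim your free prize", "project update attached"]
labels = [1, 0, 1, 0, 1, 0]

# CountVectorizer -> word counts; alpha=1.0 avoids zero probabilities
clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(texts, labels)
pred = clf.predict(["free prize inside"])[0]
```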
🌳 Class / Reg
Classification & Regression Trees (CART)
Chapter 9 · Supervised · Tree-Based
Gini = 1 − Σ pᵢ² [Classification]
MSE = (1/n) Σ(yᵢ − ȳ)² [Regression]
Split: minimize impurity at each node
Assumptions / Key Concepts
- → Recursive binary splitting on feature thresholds
- → Non-parametric — no distributional assumptions
- → Pruning (pre/post) controls overfitting
- → Handles both numeric and categorical features
Key Parameters
max_depth · min_samples_split · criterion (gini/entropy; squared_error for regression) · ccp_alpha (pruning)
✅ Pros
- • Highly interpretable (visual)
- • No feature scaling needed
- • Handles mixed data types
- • Captures non-linear patterns
❌ Cons
- • Prone to overfitting
- • High variance (unstable)
- • Biased toward high-cardinality features
- • Not globally optimal splits
Evaluation Metrics
Accuracy · Confusion Matrix · RMSE · R²
Python — scikit-learn
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=5, criterion='gini')
model.fit(X_train, y_train)
# Visualize: plot_tree(model)
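The effect of cost-complexity pruning (ccp_alpha) shows up when comparing an unpruned and a pruned tree; a minimal sketch on the Iris data (the ccp_alpha value is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unpruned tree: grows until leaves are pure (overfitting risk)
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Post-pruning: larger ccp_alpha collapses weak branches into fewer leaves
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

n_full, n_pruned = full.get_n_leaves(), pruned.get_n_leaves()
```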
📊 Classification
Logistic Regression
Chapter 10 · Supervised · Probabilistic
P(y=1|X) = σ(z) = 1 / (1 + e⁻ᶻ)
z = β₀ + β₁x₁ + … + βₙxₙ
log-odds = ln(p / (1 − p)) = z
Assumptions / Key Concepts
- → Linear relationship between log-odds and features
- → Independence of observations
- → No severe multicollinearity
- → Large sample size recommended
- → Outcome is binary (or ordinal/multinomial with extensions)
Key Parameters
C (inverse regularization strength) · penalty (l1/l2/elasticnet) · solver · max_iter
✅ Pros
- • Probabilistic output (0–1)
- • Interpretable coefficients
- • Fast training
- • Regularization available
❌ Cons
- • Assumes linear log-odds
- • Poor with non-linear boundaries
- • Sensitive to outliers
- • Requires feature scaling
Evaluation Metrics
Accuracy · Precision · Recall · F1 · ROC-AUC · Log-Loss
Python — scikit-learn
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)
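Coefficient interpretability comes via odds ratios: exp(β) is the multiplicative change in the odds per unit increase of a feature (here, per standard deviation after scaling). A minimal sketch on the breast-cancer dataset:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale first so coefficients are comparable across features
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(C=1.0, max_iter=1000))
clf.fit(X, y)

beta = clf.named_steps["logisticregression"].coef_[0]
odds_ratios = np.exp(beta)  # >1 raises the odds of class 1, <1 lowers them
```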
🔬 Classification
Discriminant Analysis (LDA / QDA)
Chapter 12 · Supervised · Generative
LDA: δₖ(x) = xᵀΣ⁻¹μₖ − ½μₖᵀΣ⁻¹μₖ + log πₖ
QDA: relaxes equal Σ assumption per class
Maximizes between-class / within-class variance
Assumptions / Key Concepts
- → LDA: Multivariate normality + equal covariance matrices across classes
- → QDA: Multivariate normality; each class has its own covariance matrix
- → Features should not be perfectly correlated
- → LDA also serves as a dimensionality reduction technique
Key Parameters
solver (svd/lsqr/eigen) · shrinkage (LDA) · n_components (LDA dim. red.) · reg_param (QDA)
✅ Pros
- • Works well with small samples
- • LDA also reduces dimensions
- • Probabilistic class boundaries
- • Efficient computation
❌ Cons
- • Normality assumption often violated
- • Sensitive to outliers
- • LDA: equal covariance required
- • Struggles with many features
Evaluation Metrics
Accuracy · Confusion Matrix · ROC-AUC · Precision / Recall
Python — scikit-learn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(X_train, y_train)
X_reduced = lda.transform(X_train) # dim. reduction
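To see whether QDA's per-class covariance pays off, both can be cross-validated side by side; a minimal sketch on synthetic data (the generator settings are illustrative and avoid collinear features, which would make the covariance estimates singular):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

# n_redundant=0 avoids collinear features (singular covariance matrices)
X, y = make_classification(n_samples=400, n_features=6,
                           n_informative=4, n_redundant=0, random_state=0)

lda_acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()
qda_acc = cross_val_score(QuadraticDiscriminantAnalysis(), X, y, cv=5).mean()
```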
Model Comparison at a Glance
| Model | Task | Parametric? | Scaling? | Prob. Output? | Interpretable? | Non-linear? | Train Speed | Pred. Speed | Best For |
|---|---|---|---|---|---|---|---|---|---|
| Linear Regression | Regression | Yes | Helps | No | High | No | Fast | Fast | Continuous prediction, baseline |
| k-NN | Class / Reg | No | Required | Soft | Medium | Yes | None | Slow O(n) | Small datasets, local patterns |
| Naive Bayes | Classification | Yes | Not needed | Yes | Medium | No | Very Fast | Very Fast | Text, high-dim, streaming |
| CART | Class / Reg | No | Not needed | Soft | High | Yes | Medium | Fast | Mixed data, explainability |
| Logistic Regression | Classification | Yes | Required | Yes | High | No | Fast | Fast | Binary outcome, probability needed |
| Discriminant Analysis | Classification | Yes | Helps | Yes | Medium | QDA: Yes | Fast | Fast | Small samples, dim. reduction |
Evaluation Metrics
Accuracy
Classification
(TP + TN) / (TP + TN + FP + FN)
Overall correctness. Misleading with imbalanced classes.
Precision
Classification
TP / (TP + FP)
Of predicted positives, how many are truly positive? Minimize false alarms.
Recall (Sensitivity)
Classification
TP / (TP + FN)
Of all actual positives, how many were caught? Minimize missed cases.
F1 Score
Classification
2 × (P × R) / (P + R)
Harmonic mean of Precision & Recall. Best for imbalanced data.
ROC-AUC
Classification
Area under ROC curve (TPR vs FPR)
Threshold-independent. AUC=1 perfect; AUC=0.5 random.
Log-Loss
Classification
−(1/n) Σ [y·log(p) + (1−y)·log(1−p)]
Penalizes confident wrong predictions. Lower is better.
RMSE
Regression
√( (1/n) Σ (yᵢ − ŷᵢ)² )
Same units as target. Penalizes large errors more than MAE.
MAE
Regression
(1/n) Σ |yᵢ − ŷᵢ|
Mean Absolute Error. Robust to outliers. Easy to interpret.
R²
Regression
1 − SS_res / SS_tot
Proportion of variance explained. R²=1 perfect; R²=0 no better than mean.
Adjusted R²
Regression
1 − (1−R²)(n−1)/(n−p−1)
Penalizes adding irrelevant predictors. Use when comparing models with different feature counts.
Confusion Matrix Structure
| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| **Actual: Positive** | TP (True Positive) | FN (False Negative) |
| **Actual: Negative** | FP (False Positive) | TN (True Negative) |
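The cell definitions can be checked in code. Note that scikit-learn orders labels ascending, so `confusion_matrix` returns `[[TN, FP], [FN, TP]]`, with the positive class in the last row/column rather than the first (the toy labels below are illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# ravel() flattens [[TN, FP], [FN, TP]] row by row
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```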
Model Selection Guide
Outcome Type
Continuous → Multiple Linear Regression
Binary → Logistic Regression, LDA, Naive Bayes
Multi-class → CART, k-NN, LDA, Naive Bayes
Interpretability Priority
Linear Regression — coefficient = direct effect
Logistic Regression — log-odds interpretation
CART — visual tree structure
No Feature Scaling Needed
CART — distance-free splits
Naive Bayes — probability-based
⚠ k-NN, Logistic Reg, LDA benefit from scaling
Probabilistic Output Needed
Logistic Regression — calibrated probabilities
Naive Bayes — posterior probabilities
LDA / QDA — posterior via Bayes
Small Sample Size
LDA — stable with few observations
Linear Regression — low parameters
Logistic Regression — with regularization
High-Dimensional Data
Naive Bayes — scales well
Logistic Regression + L1 (Lasso)
Apply PCA first to reduce dimensions
Non-Linear Boundaries
CART — axis-aligned splits
k-NN — flexible local boundaries
QDA — quadratic boundaries
Mixed Data Types
CART — handles numeric & categorical
Naive Bayes — with appropriate variant
⚠ Encode categoricals for other models
| Scenario | First Choice | Alternative | Avoid |
|---|---|---|---|
| Predict house price | Linear Regression | Regression Tree | Naive Bayes (classification only) |
| Spam detection (text) | Naive Bayes | Logistic Regression | k-NN (too slow, high-dim) |
| Customer churn (binary) | Logistic Regression | CART | Linear Regression (binary outcome) |
| Iris flower classification | LDA | k-NN | Linear Regression (multi-class) |
| Medical diagnosis (explain) | CART | Logistic Regression | k-NN (black box, slow) |
| Real-time recommendation | Naive Bayes | Logistic Regression | k-NN (slow prediction) |
Dimension Reduction — PCA (Chapter 4)
What is PCA?
Principal Component Analysis transforms correlated features into a smaller set of uncorrelated components that capture maximum variance in the data.
Key Steps
- → Standardize features (zero mean, unit variance)
- → Compute covariance matrix
- → Extract eigenvectors (principal components)
- → Project data onto top k components
When to Use
- → Too many features (curse of dimensionality)
- → Multicollinearity in regression
- → Visualization (reduce to 2D/3D)
- → Speed up model training
Python
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# pca.explained_variance_ratio_
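Choosing k is usually done from the cumulative explained variance; a minimal sketch on the digits dataset picks the smallest k that captures 95% of the variance (the threshold is illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)                        # keep all components
cumvar = np.cumsum(pca.explained_variance_ratio_)
k = int(np.argmax(cumvar >= 0.95)) + 1           # smallest k reaching 95%
```

Equivalently, `PCA(n_components=0.95)` lets scikit-learn pick that k directly.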
Standard ML Workflow (Data Mining Process — Chapter 2)
📦 Business Understanding (Ch.1–2) → 🗄️ Data Collection (Ch.2) → 🧹 Data Preprocessing (Ch.2) → 📊 Data Visualization (Ch.3) → 🔻 Dimension Reduction (Ch.4) → 🤖 Model Training (Ch.6–12) → 📏 Model Evaluation (Ch.5) → ⚙️ Tuning & Selection (Ch.5) → 🚀 Deployment & Insights (Project)
Train / Val / Test Split
60% Train · 20% Val · 20% Test
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
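A 60/20/20 split takes two calls, since train_test_split only splits once; a minimal sketch (the array contents are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)

# First carve off 20% for test, then split the rest 75/25 -> 60/20/20 overall
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
```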
Cross-Validation
k-Fold CV averages performance across k splits — more reliable than a single split.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
Overfitting vs Underfitting
Overfitting
Low train error, high test error. Model memorized noise.
Underfitting
High train & test error. Model too simple.
Fix overfitting: prune tree, increase k in k-NN, regularize regression, reduce features.
Preprocessing Checklist
- ✓ Handle missing values (impute or drop)
- ✓ Encode categoricals (OneHot / Label)
- ✓ Scale features (StandardScaler / MinMaxScaler)
- ✓ Remove or cap outliers
- ✓ Check class imbalance (SMOTE / class_weight)
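The checklist maps naturally onto a ColumnTransformer, so every step is fit on training data only and reused at predict time; a minimal sketch (column names and imputation strategies are illustrative; imbalance handling such as SMOTE is out of scope here):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50_000, 60_000, np.nan, 42_000],
    "city": ["NY", "LA", "NY", "SF"],
})

# Numeric columns: impute missing values, then scale
numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
# Categorical columns: impute, then one-hot encode
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("onehot", OneHotEncoder(handle_unknown="ignore"))])

prep = ColumnTransformer([("num", numeric, ["age", "income"]),
                          ("cat", categorical, ["city"])])
X = prep.fit_transform(df)  # 2 scaled numeric cols + 3 one-hot city cols
```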