BIA 5302 — Machine Learning Models Cheat Sheet

Machine Learning & Programming 1  ·  Humber College 2025–2026  ·  Shmueli et al. (2019)

📘 Shmueli et al. (2019) · 🐍 Python + scikit-learn · 6 Core Models
Core Models
📈
Multiple Linear Regression
Chapter 6 · Supervised
Regression
ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ
Assumptions / Key Concepts
  • Linearity between predictors and outcome
  • Independence of observations
  • Homoscedasticity (constant error variance)
  • Normality of residuals
  • No multicollinearity among predictors
Key Parameters
Coefficients (β) · Intercept (β₀) · Regularization (Ridge/Lasso)
✅ Pros
  • Highly interpretable
  • Fast to train
  • Strong baseline model
  • Coefficient = feature impact
❌ Cons
  • Assumes linearity
  • Sensitive to outliers
  • Struggles with irrelevant features
  • Multicollinearity inflates variance
Evaluation Metrics
RMSE · MAE · Adjusted R²
Python — scikit-learn
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
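The metrics listed above can be computed directly from the predictions; a minimal sketch using toy arrays as stand-ins for `y_test` and `y_pred`, with adjusted R² computed from the formula given in the metrics section:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Toy values standing in for y_test / y_pred above
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.1, 6.5, 9.4])

rmse = np.sqrt(mean_squared_error(y_true, y_hat))  # same units as the target
mae = mean_absolute_error(y_true, y_hat)           # robust to outliers
r2 = r2_score(y_true, y_hat)

# Adjusted R² penalizes extra predictors: n = observations, p = predictors
n, p = len(y_true), 2
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

Note that adjusted R² is always at most R², which is what makes it the safer yardstick when comparing models with different numbers of features.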
🔍
K-Nearest Neighbors (k-NN)
Chapter 7 · Supervised · Lazy Learner
Class / Reg
d(p, q) = √( Σ (pᵢ − qᵢ)² )  [Euclidean]
Predict: majority vote (classification) or mean (regression)
Assumptions / Key Concepts
  • No explicit training phase; stores all data
  • Feature scale matters: normalize/standardize
  • Smaller k → complex boundary; larger k → smoother
  • Distance metric choice affects results significantly
Key Parameters
k (n_neighbors) · distance metric · weights (uniform/distance)
✅ Pros
  • No training time
  • Naturally multi-class
  • Non-parametric
  • Intuitive concept
❌ Cons
  • Slow at prediction O(n)
  • High memory usage
  • Sensitive to irrelevant features
  • Curse of dimensionality
Evaluation Metrics
Accuracy · Confusion Matrix · F1 · RMSE
Python — scikit-learn
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
# Scale features first: StandardScaler()
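Because feature scale matters for k-NN, wrapping the scaler and classifier in a pipeline keeps the scaler fitted on training data only. A sketch on the built-in iris dataset (chosen here purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scaling inside the pipeline avoids leaking test-set statistics into training
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
acc = knn.score(X_test, y_test)
```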
🎲
Naive Bayes Classifier
Chapter 8 · Supervised · Probabilistic
Classification
P(C|X) ∝ P(C) × ∏ P(xᵢ|C)
Posterior ∝ Prior × Likelihood
Assumptions / Key Concepts
  • Features are conditionally independent given the class
  • Variants: Gaussian (continuous), Multinomial (counts), Bernoulli (binary)
  • Laplace smoothing handles the zero-probability problem
Key Parameters
Prior P(C) · var_smoothing (Gaussian) · alpha (Laplace smoothing, Multinomial)
✅ Pros
  • Very fast training & prediction
  • Works well with high-dimensional data
  • Handles missing values
  • Good for text classification
❌ Cons
  • Independence rarely holds
  • Poor probability calibration
  • Struggles with correlated features
  • Zero-frequency problem
Evaluation Metrics
Accuracy · Precision · Recall · F1 · ROC-AUC
Python — scikit-learn
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)
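For text data, the Multinomial variant with Laplace smoothing (`alpha`) is the usual choice. A sketch on a tiny invented corpus (the texts and labels are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus: 1 = spam, 0 = ham
texts = ["win cash prize now", "meeting agenda attached",
         "claim your free prize", "project status update"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X_counts = vec.fit_transform(texts)  # word-count features

# alpha is the Laplace smoothing term that avoids zero probabilities
nb = MultinomialNB(alpha=1.0)
nb.fit(X_counts, labels)
pred = nb.predict(vec.transform(["free cash prize"]))
```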
🌳
Classification & Regression Trees (CART)
Chapter 9 · Supervised · Tree-Based
Class / Reg
Gini = 1 − Σ pᵢ²  [Classification]
MSE = (1/n) Σ (yᵢ − ȳ)²  [Regression]
Split: minimize impurity at each node
Assumptions / Key Concepts
  • Recursive binary splitting on feature thresholds
  • Non-parametric: no distributional assumptions
  • Pruning (pre/post) controls overfitting
  • Handles both numeric and categorical features
Key Parameters
max_depth · min_samples_split · criterion (gini/entropy/mse) · ccp_alpha (pruning)
✅ Pros
  • Highly interpretable (visual)
  • No feature scaling needed
  • Handles mixed data types
  • Captures non-linear patterns
❌ Cons
  • Prone to overfitting
  • High variance (unstable)
  • Biased toward high-cardinality features
  • Splits are not globally optimal
Evaluation Metrics
Accuracy · Confusion Matrix · RMSE
Python — scikit-learn
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=5, criterion='gini')
model.fit(X_train, y_train)
# Visualize: plot_tree(model)
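Cost-complexity pruning via `ccp_alpha` is one way to rein in the overfitting noted above; a sketch comparing an unpruned and a pruned tree on the built-in breast-cancer dataset (chosen only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fully grown tree vs. cost-complexity pruned tree (larger ccp_alpha = smaller tree)
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

n_leaves_full = full.get_n_leaves()
n_leaves_pruned = pruned.get_n_leaves()
```

The unpruned tree typically fits the training set perfectly (the memorized-noise symptom from the overfitting section); pruning trades a little training fit for a simpler, lower-variance tree.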
📊
Logistic Regression
Chapter 10 · Supervised · Probabilistic
Classification
P(y=1|X) = σ(z) = 1 / (1 + e⁻ᶻ)
z = β₀ + β₁x₁ + … + βₙxₙ
log-odds = ln(p / (1−p)) = z
Assumptions / Key Concepts
  • Linear relationship between log-odds and features
  • Independence of observations
  • No severe multicollinearity
  • Large sample size recommended
  • Outcome is binary (or ordinal/multinomial with extensions)
Key Parameters
C (inverse regularization) · penalty (l1/l2/elasticnet) · solver · max_iter
✅ Pros
  • Probabilistic output (0–1)
  • Interpretable coefficients
  • Fast training
  • Regularization available
❌ Cons
  • Assumes linear log-odds
  • Poor with non-linear boundaries
  • Sensitive to outliers
  • Benefits strongly from feature scaling
Evaluation Metrics
Accuracy · Precision · Recall · F1 · ROC-AUC · Log-Loss
Python — scikit-learn
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)
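Each coefficient is a change in log-odds, so exponentiating it gives an odds ratio. A sketch (breast-cancer data used only as an example; features standardized so the units are comparable):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # scaling helps the solver converge

model = LogisticRegression(C=1.0, max_iter=1000).fit(X_scaled, y)

# exp(beta) turns a log-odds coefficient into an odds ratio:
# the odds multiply by this factor per one-standard-deviation increase
odds_ratios = np.exp(model.coef_[0])
```

An odds ratio above 1 means the feature pushes toward the positive class; below 1, away from it.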
🔬
Discriminant Analysis (LDA / QDA)
Chapter 12 · Supervised · Generative
Classification
LDA: δₖ(x) = xᵀΣ⁻¹μₖ − ½ μₖᵀΣ⁻¹μₖ + log πₖ
QDA: relaxes the equal-Σ assumption (one Σₖ per class)
Maximizes between-class / within-class variance
Assumptions / Key Concepts
  • LDA: multivariate normality + equal covariance matrices across classes
  • QDA: multivariate normality; each class has its own covariance matrix
  • Features should not be perfectly correlated
  • LDA also serves as a dimensionality reduction technique
Key Parameters
solver (svd/lsqr/eigen) · shrinkage (LDA) · n_components (LDA dim. red.) · reg_param (QDA)
✅ Pros
  • Works well with small samples
  • LDA also reduces dimensions
  • Probabilistic class boundaries
  • Efficient computation
❌ Cons
  • Normality assumption often violated
  • Sensitive to outliers
  • LDA: equal covariance required
  • Struggles with many features
Evaluation Metrics
Accuracy · Confusion Matrix · ROC-AUC · Precision / Recall
Python — scikit-learn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(X_train, y_train)
X_reduced = lda.transform(X_train) # dim. reduction
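QDA drops the equal-covariance assumption, and `reg_param` shrinks each class covariance estimate to keep it stable when a class has few observations. A sketch on iris (an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# reg_param regularizes each per-class covariance matrix (0 = no shrinkage)
qda = QuadraticDiscriminantAnalysis(reg_param=0.1)
scores = cross_val_score(qda, X, y, cv=5)  # stratified 5-fold CV
mean_acc = scores.mean()
```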
Model Comparison at a Glance
| Model | Task | Parametric? | Scaling? | Prob. Output? | Interpretable? | Non-linear? | Train Speed | Pred. Speed | Best For |
|---|---|---|---|---|---|---|---|---|---|
| Linear Regression | Regression | Yes | Helps | No | High | No | Fast | Fast | Continuous prediction, baseline |
| k-NN | Class / Reg | No | Required | Soft | Medium | Yes | None | Slow O(n) | Small datasets, local patterns |
| Naive Bayes | Classification | Yes | Not needed | Yes | Medium | No | Very Fast | Very Fast | Text, high-dim, streaming |
| CART | Class / Reg | No | Not needed | Soft | High | Yes | Medium | Fast | Mixed data, explainability |
| Logistic Regression | Classification | Yes | Required | Yes | High | No | Fast | Fast | Binary outcome, probability needed |
| Discriminant Analysis | Classification | Yes | Helps | Yes | Medium | QDA: Yes | Fast | Fast | Small samples, dim. reduction |
Evaluation Metrics
Accuracy
Classification
(TP + TN) / (TP + TN + FP + FN)
Overall correctness. Misleading with imbalanced classes.
Precision
Classification
TP / (TP + FP)
Of predicted positives, how many are truly positive? Minimize false alarms.
Recall (Sensitivity)
Classification
TP / (TP + FN)
Of all actual positives, how many were caught? Minimize missed cases.
F1 Score
Classification
2 × (P × R) / (P + R)
Harmonic mean of Precision & Recall. Best for imbalanced data.
ROC-AUC
Classification
Area under ROC curve (TPR vs FPR)
Threshold-independent. AUC=1 perfect; AUC=0.5 random.
Log-Loss
Classification
−(1/n) Σ [y·log(p) + (1−y)·log(1−p)]
Penalizes confident wrong predictions. Lower is better.
RMSE
Regression
√( (1/n) Σ (yᵢ − ŷᵢ)² )
Same units as target. Penalizes large errors more than MAE.
MAE
Regression
(1/n) Σ |yᵢ − ŷᵢ|
Mean Absolute Error. Robust to outliers. Easy to interpret.
R²
Regression
1 − SS_res / SS_tot
Proportion of variance explained. R²=1 perfect; R²=0 no better than mean.
Adjusted R²
Regression
1 − (1−R²)(n−1)/(n−p−1)
Penalizes adding irrelevant predictors. Use when comparing models with different feature counts.
Confusion Matrix Structure
|                  | Predicted: Positive | Predicted: Negative |
|------------------|---------------------|---------------------|
| Actual: Positive | TP (True Positive)  | FN (False Negative) |
| Actual: Negative | FP (False Positive) | TN (True Negative)  |
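The cell counts can be read straight out of scikit-learn, but note its layout: rows are actual, columns are predicted, so for binary labels `ravel()` yields TN, FP, FN, TP. A sketch on toy labels invented for illustration:

```python
from sklearn.metrics import confusion_matrix, f1_score

# Toy actual vs. predicted labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# sklearn layout: rows = actual, columns = predicted -> [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                      # TP / (TP + FP)
recall = tp / (tp + fn)                         # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
```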
Model Selection Guide
Outcome Type
Continuous → Multiple Linear Regression
Binary → Logistic Regression, LDA, Naive Bayes
Multi-class → CART, k-NN, LDA, Naive Bayes
Interpretability Priority
Linear Regression — coefficient = direct effect
Logistic Regression — log-odds interpretation
CART — visual tree structure
No Feature Scaling Needed
CART — distance-free splits
Naive Bayes — probability-based
⚠ k-NN, Logistic Reg, LDA benefit from scaling
Probabilistic Output Needed
Logistic Regression — calibrated probabilities
Naive Bayes — posterior probabilities
LDA / QDA — posterior via Bayes
Small Sample Size
LDA — stable with few observations
Linear Regression — low parameters
Logistic Regression — with regularization
High-Dimensional Data
Naive Bayes — scales well
Logistic Regression + L1 (Lasso)
Apply PCA first to reduce dimensions
Non-Linear Boundaries
CART — axis-aligned splits
k-NN — flexible local boundaries
QDA — quadratic boundaries
Mixed Data Types
CART — handles numeric & categorical
Naive Bayes — with appropriate variant
⚠ Encode categoricals for other models
| Scenario | First Choice | Alternative | Avoid |
|---|---|---|---|
| Predict house price | Linear Regression | Regression Tree | Naive Bayes (classification only) |
| Spam detection (text) | Naive Bayes | Logistic Regression | k-NN (too slow, high-dim) |
| Customer churn (binary) | Logistic Regression | CART | Linear Regression (binary outcome) |
| Iris flower classification | LDA | k-NN | Linear Regression (multi-class) |
| Medical diagnosis (explainability) | CART | Logistic Regression | k-NN (black box, slow) |
| Real-time recommendation | Naive Bayes | Logistic Regression | k-NN (slow prediction) |
Dimension Reduction — PCA (Chapter 4)
What is PCA?

Principal Component Analysis transforms correlated features into a smaller set of uncorrelated components that capture maximum variance in the data.

Key Steps
  • Standardize features (zero mean, unit variance)
  • Compute the covariance matrix
  • Extract eigenvectors (principal components)
  • Project data onto the top k components
When to Use
  • Too many features (curse of dimensionality)
  • Multicollinearity in regression
  • Visualization (reduce to 2D/3D)
  • Speed up model training
Python
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# pca.explained_variance_ratio_
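A common follow-up is choosing the number of components from the cumulative explained-variance curve; a sketch picking the smallest k that covers 95% of the variance (breast-cancer data as an arbitrary example):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # standardize before PCA

pca = PCA().fit(X_scaled)                     # keep all components for now
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components explaining >= 95% of total variance
k = int(np.searchsorted(cum_var, 0.95) + 1)
```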
Standard ML Workflow (Data Mining Process — Chapter 2)
1. 📦 Business Understanding (Ch.1–2)
2. 🗄️ Data Collection (Ch.2)
3. 🧹 Data Preprocessing (Ch.2)
4. 📊 Data Visualization (Ch.3)
5. 🔻 Dimension Reduction (Ch.4)
6. 🤖 Model Training (Ch.6–12)
7. 📏 Model Evaluation (Ch.5)
8. ⚙️ Tuning & Selection (Ch.5)
9. 🚀 Deployment & Insights (Project)
Train / Val / Test Split
60% Train · 20% Validation · 20% Test
from sklearn.model_selection import train_test_split
# Split off 20% test, then carve 20% validation out of the rest (60/20/20)
X_tmp, X_te, y_tmp, y_te = train_test_split(X, y, test_size=0.2)
X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25)
Cross-Validation

k-Fold CV averages performance across k splits — more reliable than a single split.

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
Overfitting vs Underfitting
Overfitting
Low train error, high test error. Model memorized noise.
Underfitting
High train & test error. Model too simple.
Fix overfitting: prune tree, increase k in k-NN, regularize regression, reduce features.
Preprocessing Checklist
  • Handle missing values (impute or drop)
  • Encode categoricals (OneHot / Label)
  • Scale features (StandardScaler / MinMaxScaler)
  • Remove or cap outliers
  • Check class imbalance (SMOTE / class_weight)
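The checklist steps compose naturally into a ColumnTransformer, so each transformer is fit on training data only. A sketch on a tiny invented array (values and column roles are hypothetical):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny invented data: column 0 numeric (with a missing value), column 1 categorical
X_raw = np.array([
    [50_000, "east"],
    [62_000, "west"],
    [np.nan, "east"],
    [48_000, "north"],
], dtype=object)

# Numeric path: impute missing values, then scale
numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])

prep = ColumnTransformer(
    [("num", numeric, [0]),
     ("cat", OneHotEncoder(handle_unknown="ignore"), [1])],
    sparse_threshold=0)  # force a dense output array

X_ready = prep.fit_transform(X_raw)  # 1 scaled column + 3 one-hot columns
```

In practice this `prep` step would be chained with a model in one Pipeline so the whole preprocessing fits inside cross-validation.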