Linear regression is designed for predicting continuous values, not categorical outcomes. Here's a breakdown of the issues:
⚠️ 1. Inappropriate Output Range
- Linear regression outputs any real number, from −∞ to +∞.
- For binary classification, we need outputs between 0 and 1, to represent probabilities.
To compensate, a threshold (e.g., 0.5) is chosen:
- Output ≥ 0.5 → class 1
- Output < 0.5 → class 0
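As a minimal sketch of the thresholding idea above, the following fits a one-feature least-squares line to a small, made-up dataset and then applies the 0.5 cutoff. The data (`hours`, `passed`) and the helper names (`fit_line`, `predict`, `classify`) are illustrative assumptions, not taken from any particular library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical toy data: hours studied -> failed (0) or passed (1)
hours = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 0, 1, 1, 1]

slope, intercept = fit_line(hours, passed)

def predict(x):
    """Raw regression output: any real number, not a probability."""
    return slope * x + intercept

def classify(x, threshold=0.5):
    """Turn the raw output into a class label using the threshold."""
    return 1 if predict(x) >= threshold else 0

print(predict(0))    # negative: below the valid probability range
print(predict(10))   # greater than 1: above the valid probability range
print([classify(x) for x in hours])
```

The raw outputs at `x = 0` and `x = 10` fall outside the 0 to 1 range, so they cannot be read as probabilities. Only the thresholded label is usable.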
But this introduces problems:
📉 2. Poor Decision Boundary with More Data
- With limited data, a best-fit line might seem to work.
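- As more data is added (e.g., outliers or new samples further along the x-axis), the fitted line tilts to accommodate it.
- The threshold itself stays at 0.5, but the x-value where the line crosses it (the decision boundary) moves.
- Points that were classified correctly may now be misclassified.
➡️ Adding data can make the classifier worse, even when the new points clearly belong to one class.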
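The shift can be reproduced with a small sketch, using made-up data and a hypothetical `fit_line` helper. Three extra class-1 points far along the x-axis are added to a six-point dataset, and the x-value where the fitted line crosses 0.5 is compared before and after.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def boundary(slope, intercept, threshold=0.5):
    """x-value where the fitted line crosses the threshold."""
    return (threshold - intercept) / slope

# Hypothetical toy data; the last three points of the larger set are
# clear class-1 samples lying far to the right.
xs_small = [1, 2, 3, 4, 5, 6]
ys_small = [0, 0, 0, 1, 1, 1]
xs_large = xs_small + [20, 25, 30]
ys_large = ys_small + [1, 1, 1]

slope_a, intercept_a = fit_line(xs_small, ys_small)
slope_b, intercept_b = fit_line(xs_large, ys_large)

boundary_before = boundary(slope_a, intercept_a)
boundary_after = boundary(slope_b, intercept_b)

# The point x = 4 has true label 1 in both datasets.
label_before = 1 if slope_a * 4 + intercept_a >= 0.5 else 0
label_after = 1 if slope_b * 4 + intercept_b >= 0.5 else 0

print(boundary_before, boundary_after)
print(label_before, label_after)
```

In this toy setup the boundary moves from 3.5 to roughly 4.3, so the point at `x = 4` flips from class 1 to class 0 even though none of the original labels changed.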
🔁 3. Linear Assumptions Don't Match Classification Needs
- Linear regression tries to minimize squared error.
- But classification aims to minimize classification error (i.e., misclassifications).
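- These objectives differ: squared error also penalizes outputs far above 1 or below 0, even when they fall on the correct side of the threshold.
- As a result, the line that minimizes squared error is not necessarily the one that misclassifies the fewest points.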
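To make the mismatch concrete, here is a small sketch on made-up data that compares two lines: the least-squares fit, and a hypothetical alternative line chosen only for illustration. For each, it reports mean squared error and the number of misclassifications at a 0.5 threshold. All names (`fit_line`, `mse`, `errors`, `alt_line`) are assumptions for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(line, xs, ys):
    """Mean squared error of the raw line outputs against the 0/1 labels."""
    slope, intercept = line
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def errors(line, xs, ys, threshold=0.5):
    """Number of points whose thresholded label differs from the true label."""
    slope, intercept = line
    return sum((1 if slope * x + intercept >= threshold else 0) != y
               for x, y in zip(xs, ys))

# Hypothetical toy data with three class-1 points far to the right
xs = [1, 2, 3, 4, 5, 6, 20, 25, 30]
ys = [0, 0, 0, 1, 1, 1, 1, 1, 1]

ols_line = fit_line(xs, ys)   # minimizes squared error on all nine points
alt_line = (9 / 35, -0.4)     # illustrative alternative: crosses 0.5 at x = 3.5

mse_ols, mse_alt = mse(ols_line, xs, ys), mse(alt_line, xs, ys)
errors_ols, errors_alt = errors(ols_line, xs, ys), errors(alt_line, xs, ys)

print(mse_ols, errors_ols)
print(mse_alt, errors_alt)
```

On this data the least-squares line has the lower squared error but misclassifies one point, while the alternative line has a much larger squared error and misclassifies none. The squared-error objective prefers the first line because the far-right points, though correctly classified by the second line, produce outputs well above 1.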