Linear Regression Assumptions:
- Linearity: The relationship between X and Y is linear
- Homoscedasticity: The variance of the residuals is the same for every value of X
- Independence: Observations are independent of each other
- No influential outliers: The data contain no extreme values, which would otherwise have a disproportionate influence on the model
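
The homoscedasticity assumption can be checked roughly by comparing how spread out the residuals are at low and high values of X. The sketch below is only a heuristic, not a formal test: the helper name `residual_variance_ratio`, the split at the median of X, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def residual_variance_ratio(x, y):
    """Hypothetical helper: residual variance in the upper half of x
    divided by residual variance in the lower half.
    A ratio near 1 is consistent with homoscedasticity."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Ordinary least-squares straight line (degree-1 polynomial)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    # Sort by x, then split the residuals into a low-x and a high-x half
    order = np.argsort(x)
    half = len(x) // 2
    low = residuals[order[:half]]
    high = residuals[order[half:]]
    return np.var(high) / np.var(low)

# Synthetic data (assumed for illustration): same line, two noise patterns
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
y_const = 2 * x + 1 + rng.normal(0, 1.0, size=x.size)      # constant noise
y_grow = 2 * x + 1 + rng.normal(0, 1.0, size=x.size) * x   # noise grows with x

ratio_const = residual_variance_ratio(x, y_const)
ratio_grow = residual_variance_ratio(x, y_grow)
print(f"constant noise: ratio = {ratio_const:.2f}")
print(f"growing noise:  ratio = {ratio_grow:.2f}")
```

A ratio well above 1 suggests the residual spread grows with X. Plotting residuals against X usually gives a clearer picture than a single number.
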
Common Failure Cases:
- Heteroscedastic Data: When the residual variance increases with X, OLS estimates stay unbiased but become inefficient, and the usual standard errors are unreliable
- Non-linear Relationships: A straight-line fit can't capture curvature (e.g., quadratic patterns)
- Outliers: Extreme values pull the line toward them, distorting the fit
- Clustered Data: Distinct groups of points may indicate a need for separate models or additional features
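
Two of these failure cases are easy to reproduce with small, noise-free data sets. The sketch below is illustrative only: the data values are made up, and the degree-2 `np.polyfit` call is a polynomial regression swapped in for comparison, not part of simple linear regression.

```python
import numpy as np

# --- Outlier: one extreme value changes the fitted slope ---
x = np.arange(10, dtype=float)
y_clean = 2.0 * x + 1.0            # points lie exactly on y = 2x + 1
y_outlier = y_clean.copy()
y_outlier[-1] = 60.0               # last point would be 19 on the true line

slope_clean, intercept_clean = np.polyfit(x, y_clean, 1)
slope_out, intercept_out = np.polyfit(x, y_outlier, 1)
print(f"clean fit:    slope = {slope_clean:.2f}, intercept = {intercept_clean:.2f}")
print(f"with outlier: slope = {slope_out:.2f}, intercept = {intercept_out:.2f}")

# --- Curvature: a straight line fitted to a quadratic pattern ---
xq = np.linspace(-3, 3, 50)
yq = xq ** 2

line_coeffs = np.polyfit(xq, yq, 1)   # straight line
quad_coeffs = np.polyfit(xq, yq, 2)   # adds an x^2 term

# Sum of squared residuals for each fit
ssr_line = float(np.sum((yq - np.polyval(line_coeffs, xq)) ** 2))
ssr_quad = float(np.sum((yq - np.polyval(quad_coeffs, xq)) ** 2))
print(f"straight-line SSR: {ssr_line:.3f}")
print(f"quadratic SSR:     {ssr_quad:.3e}")
```

In the first case a single point is enough to move the slope away from 2. In the second, the straight line leaves a large residual sum that disappears once a squared term is added as a feature.
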
Metrics:
- Mean Squared Error (MSE): Average of squared residuals (lower is better)
- R² Value: Proportion of variance explained by the model (higher is better, max 1.0)
- Sum of Squared Residuals: Sum of the squared vertical distances between the points and the fitted line
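
The three metrics are closely related: MSE is the sum of squared residuals divided by the number of points, and R² compares that sum with the total variation of Y around its mean. A minimal sketch follows; the function name `regression_metrics` and the sample data are assumptions for illustration.

```python
import numpy as np

def regression_metrics(y, y_pred):
    """Hypothetical helper returning SSR, MSE and R^2 for a set of predictions."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y - y_pred
    ssr = float(np.sum(residuals ** 2))          # sum of squared residuals
    mse = ssr / y.size                           # average squared residual
    ss_tot = float(np.sum((y - y.mean()) ** 2))  # total variation around the mean
    r2 = 1.0 - ssr / ss_tot                      # undefined if y is constant (ss_tot == 0)
    return {"ssr": ssr, "mse": mse, "r2": r2}

# Made-up sample data lying close to y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

slope, intercept = np.polyfit(x, y, 1)           # least-squares straight line
metrics = regression_metrics(y, slope * x + intercept)
print(metrics)
```

Because MSE and the sum of squared residuals differ only by the factor 1/n, they rank fits on the same data set identically. R² is the one that is comparable across data sets with different scales.
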
Try different data distributions to see how well linear regression performs in each scenario!