Ordinary Least Squares (OLS) is a statistical method for estimating the relationship between an observed outcome variable and one or more predictors. It fits a mathematical model that minimizes the sum of the squared differences between actual data points and the values predicted by a line. Marketers use OLS to forecast business outcomes, such as predicting turnover based on sales volume or analyzing how specific variables drive growth.
What is Ordinary Least Squares (OLS)?
OLS is a specific type of linear least squares method used to estimate unknown parameters within a regression model. It evaluates the relationship between independent quantitative variables and a single dependent variable. By minimizing the sum of squared errors (SSE), OLS calculates a "best fit" line that represents the data with the smallest possible discrepancy.
In some contexts, OLS is used interchangeably with the term "linear regression." It is frequently used in fields like economics to verify historical laws, such as how Okun's law describes the linear dependence of GDP growth on changes in the unemployment rate.
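The "best fit" line described above can be computed in a few lines of NumPy. This is a minimal sketch using made-up ad-spend and conversion numbers; the closed-form normal equations are the standard OLS solution:

```python
import numpy as np

# Toy dataset (illustrative values): ad spend in $1,000s vs. conversions.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix with an intercept column of ones.
X = np.column_stack([np.ones_like(x), x])

# Closed-form OLS estimate: solve (X'X) beta = X'y.
beta = np.linalg.solve(X.T @ X, X.T @ y)
intercept, slope = beta
print(intercept, slope)
```

Here the fitted slope tells you how many additional conversions each unit of spend predicts, and the intercept is the baseline when spend is zero.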
Why Ordinary Least Squares (OLS) matters
- Prediction accuracy: It minimizes the vertical distance between data points and the regression line to ensure the model fits the input dataset as closely as possible.
- Business forecasting: Marketers use it to predict a company’s turnover from historical sales data; the same technique supports forecasting in fields as varied as meteorology and biology.
- Significance testing: It helps determine if a specific variable (like ad spend) actually has explanatory power in predicting an outcome (like conversions) or if the result is due to random chance.
- Model evaluation: Through metrics like R-squared, OLS tells you exactly what percentage of the variation in your data is explained by your chosen variables.
How Ordinary Least Squares (OLS) works
The method operates through a specific mathematical process to find the "beta" coefficients of a linear equation.
- Requirement of Linearity: OLS assumes the dependent variable is a linear function of the regressors. While the relationship must be linear in parameters, the variables themselves can be non-linear (e.g., using a variable's square to create a quadratic model).
- Minimizing Squared Residuals: The model identifies the "residuals," which are the distances between observed and predicted values. It squares these distances and seeks the line where their sum is the absolute lowest.
- Coefficient Estimation: OLS estimates the coefficients with a closed-form expression (in matrix notation, β̂ = (XᵀX)⁻¹Xᵀy). Among the modeling assumptions, linearity and exogeneity are the most important for ensuring an unbiased answer.
- Assessing Fit: Once the line is drawn, the "Coefficient of Determination" (R-squared) evaluates the goodness-of-fit. A value close to 1 indicates the model explains the data well.
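The four steps above can be traced end to end in code. This sketch, on made-up data, computes the coefficients, the residuals, the minimized sum of squared residuals, and the R-squared goodness-of-fit:

```python
import numpy as np

# Illustrative observed data with a roughly linear trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)  # coefficient estimation
y_hat = X @ beta                          # predicted values on the line
residuals = y - y_hat                     # observed minus predicted
sse = np.sum(residuals ** 2)              # the quantity OLS minimizes
sst = np.sum((y - y.mean()) ** 2)         # total variation in y
r_squared = 1 - sse / sst                 # coefficient of determination
print(r_squared)
```

An R-squared near 1, as here, indicates the line explains almost all of the variation in y.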
Best practices
- Screen for multicollinearity: Ensure that your independent variables are not highly correlated with each other. If predictors are redundant, the model becomes unstable and precision drops.
- Validate via Residual Plots: Always plot your residuals. An ideal plot shows a random scatter of points; a "fan shape" indicates non-constant variance (heteroscedasticity) that can make tests untrustworthy.
- Use Adjusted R-squared: When adding multiple variables to a model, rely on Adjusted R-squared rather than standard R-squared. Adjusted R-squared penalizes the inclusion of variables that do not actually improve predictive power.
- Check for influential observations: Look for "leverage points" or outliers. Values with high leverage can significantly pull the regression line away from the rest of the data, potentially leading to erroneous conclusions.
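The multicollinearity screen in the first practice above is commonly done with variance inflation factors (VIFs). This is a sketch on simulated data, where `x2` is deliberately built as a near-copy of `x1`:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a predictor matrix
    (no intercept column). A VIF above roughly 5-10 flags a predictor
    that is largely redundant with the others."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(target)), others])
        # Regress column j on the remaining predictors.
        b = np.linalg.lstsq(A, target, rcond=None)[0]
        resid = target - A @ b
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1 -> collinear
x3 = rng.normal(size=200)                  # independent predictor
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

The redundant pair produces very large VIFs, while the independent predictor stays near 1, which is the pattern to act on before trusting coefficient estimates.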
Common mistakes
- Mistake: Assuming correlation equals causation. Even if OLS shows a strong relationship, it does not prove one variable causes the other.
- Mistake: Ignoring rounding errors in data preparation. Fix: Use precise measurements. Small variations in how data is converted (e.g., rounding inches to centimeters) can have a real effect on the calculated coefficients.
- Mistake: Using OLS for binary outcomes. Fix: Linear regression is not appropriate for outcomes that are strictly 0 or 1; use logistic models instead.
- Mistake: Misinterpreting an R-squared of 0. Fix: Understand that an R-squared of 0 means your independent variables have no linear explanatory power for the variation in your dependent variable; a strong non-linear relationship could still exist.
Examples
- Macroeconomics: OLS is used to construct regression lines for quarterly differences in economic growth and unemployment to visualize the strength of their relationship.
- Simple Forecasting: A practitioner might use OLS to predict the future height of a plant from days of sun exposure, where a constant value (the plant's starting height) is added to the growth rate multiplied by time.
- Physical Data: Researchers used OLS to model average heights and weights for American women aged 30 to 39 using data from the 1975 World Almanac, discovering a quadratic relationship between height and weight.
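A quadratic relationship like the height-weight one can still be fit with OLS, because the model stays linear in its parameters. The sketch below uses synthetic data generated from a hypothetical quadratic curve (not the almanac figures) and recovers it:

```python
import numpy as np

rng = np.random.default_rng(42)
height = np.linspace(1.47, 1.83, 15)  # meters; illustrative range
# Hypothetical quadratic curve plus noise (coefficients are made up).
true_curve = 128.0 - 143.0 * height + 62.0 * height ** 2
weight = true_curve + rng.normal(scale=0.3, size=height.size)

# Columns 1, h, h^2 keep the model linear in the parameters b0, b1, b2.
X = np.column_stack([np.ones_like(height), height, height ** 2])
b = np.linalg.solve(X.T @ X, X.T @ weight)
fitted = X @ b
```

The squared-height column is just another regressor to OLS; the estimated `b[2]` recovers the curvature of the underlying relationship.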
Ordinary Least Squares (OLS) vs. Other Methods
| Feature | OLS Regression | Maximum Likelihood (MLE) | GMM Estimator |
|---|---|---|---|
| Primary Goal | Minimize sum of squared errors | Maximize probability of observed data | Match sample moments to population moments |
| Assumption Requirements | Fewer distributional assumptions | Requires a fully specified error distribution (often normality) | Requires valid moment conditions (e.g., exogenous instruments) |
| Small Sample Performance | Unbiased if exogeneity holds | Often biased in finite samples | Depends on choice of weighting matrix |
| Optimality | BLUE (Best Linear Unbiased Estimator) | Asymptotically efficient | Asymptotically efficient with the optimal weighting matrix |
FAQ
What makes a regression line "best" in OLS? The "best" line is defined as the one that provides the smallest discrepancy between observed data and the model. Specifically, it minimizes the sum of the squared vertical distances between each data point and the corresponding point on the regression line.
What are residuals in OLS? Residuals are the distances between the actual data points and the fitted model. They represent the part of the variability in the data that the model was unable to capture. If residuals show a pattern, it usually suggests that the model is missing a key explanatory variable.
What is homoscedasticity, and why is it important? Homoscedasticity means the variance of your error terms is constant across all levels of your independent variables. If the variance changes (heteroscedasticity), your standard errors may be wrong, leading to misleading results in confidence intervals and significance tests.
Can OLS handle non-linear relationships? Yes, but only if the model remains linear in its parameters. You can transform your variables (such as squaring them or taking a logarithm) to fit a curved line to your data, which is common in polynomial regression.
Are there alternatives to OLS when assumptions are broken? If errors are correlated or variance is not constant, practitioners may use Generalized Least Squares (GLS). If the data contains many collinear variables, Partial Least Squares (PLS) or regularization methods like Ridge or Lasso regression are preferred.
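Of the alternatives above, ridge regression has a closed form very similar to OLS, which makes the contrast easy to see. This sketch, on simulated near-collinear data, shows how the penalty keeps coefficient estimates stable where plain OLS would be erratic:

```python
import numpy as np

def ridge(X, y, alpha=1.0):
    """Closed-form ridge estimate: beta = (X'X + alpha*I)^(-1) X'y.
    The alpha*I penalty term stabilizes the solve when X'X is
    near-singular, i.e., when predictors are highly collinear."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=1e-6, size=100)  # almost a perfect copy of x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=100)    # outcome driven by x1 alone

b_ridge = ridge(X, y, alpha=0.1)
```

Ridge cannot tell the twin predictors apart either, but instead of exploding, it splits the effect between them: the individual coefficients stay bounded and their sum still recovers the true combined effect.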