
Chapter 3: Linear Regression

Chapter Introduction

Two Hundred Years of Regression

The story of linear regression begins with a young mathematician at the turn of the nineteenth century. Carl Friedrich Gauss, working in 1801 at age 24 to predict the orbit of the asteroid Ceres, developed the method of least squares — the same computational backbone that powers every regression output you will run in this chapter. His insight was elegant: given noisy measurements, find the line that minimizes the sum of squared errors. He published the method in 1809, almost as an afterthought, in a book about celestial mechanics, unaware that he had created the most widely used statistical technique in the history of science.

Seventy-six years later, the British scientist Francis Galton was studying the heights of parents and their adult children. He noticed something surprising: tall parents tend to have children who are tall, but not quite as tall; short parents tend to have children who are short, but not quite as short. He called this phenomenon “regression to mediocrity” — the tendency of extreme values to pull back toward the average across generations. The word “regression” stuck, even though modern regression does far more than describe hereditary reversion. Galton’s student Karl Pearson formalized the correlation coefficient and the mathematics of simple linear regression in 1896, and Ronald A. Fisher in the 1920s extended the framework to multiple predictors, analysis of variance, and the significance tests you will use throughout this chapter.

Why Regression Dominates Business Analytics

Regression is not merely a statistical technique — it is the language that business uses to quantify relationships. McKinsey studies of Fortune 500 analytics practices consistently find that regression (linear and logistic) is the most frequently deployed model in production environments, ahead of decision trees, neural networks, and everything else. Why? Because regression produces interpretable coefficients: a single number that says “each additional dollar of median household income in a neighborhood adds $X to expected pharmacy profit, holding all other demographic features constant.” That ceteris paribus interpretation is what makes regression indispensable in strategy, marketing, operations, and finance.

The finance applications are particularly powerful, and carry Nobel Prize provenance. William Sharpe shared the 1990 Nobel Memorial Prize in Economic Sciences largely for the Capital Asset Pricing Model (CAPM) — a two-parameter regression that defined how Wall Street would price risk for the next three decades. CAPM’s slope coefficient, beta (\(\beta\)), became the lingua franca of equity analysis: aggressive stocks with \(\beta > 1\) amplify market swings; defensive stocks with \(\beta < 1\) dampen them. Every Bloomberg terminal displays beta. Every equity research report references it.

Eugene Fama shared the 2013 Nobel Prize (with Lars Peter Hansen and Robert Shiller) in large part for demonstrating that CAPM was incomplete. With Kenneth French, Fama documented that small-cap stocks and value stocks earned returns that the market factor alone could not explain. The resulting Fama-French factor models — first three factors, then five — replaced CAPM as the benchmark for risk adjustment in academic research and in institutional portfolio management. Today, factor investing based on Fama-French principles accounts for over $1 trillion in assets under management at firms like AQR Capital Management, Dimensional Fund Advisors (DFA), and BlackRock.

What You Will Learn

This chapter follows the natural workflow of applied regression analysis:

  1. Model — write down the mathematical relationship between variables
  2. Estimate — fit the model using OLS (Ordinary Least Squares)
  3. Infer — test hypotheses and build confidence intervals for coefficients
  4. Diagnose — check the LINE assumptions that make inference valid
  5. Predict — generate point forecasts and prediction intervals for new observations
  6. Select — choose the right set of predictors using AIC, BIC, and adjusted \(R^2\)

Each step connects to a real finance or business application. You will fit CAPM to NVDA stock returns, test whether NVDA has earned Jensen’s alpha, extend to the Fama-French five-factor model, and predict profits for a pharmacy chain using multiple regression. By the end, you will be able to read a regression output table the way a quant reads a Bloomberg screen — fluently, critically, and with precise understanding of what every number means.

Why This Chapter Is Weighted at 30%

Regression is the foundation on which Chapters 4 through 6 rest. Clustering (Chapter 4) uses distance metrics that assume the same geometric intuition as regression residuals. Classification models (Chapter 5) are direct extensions of the regression framework to binary outcomes. Time series forecasting (Chapter 6) adapts regression to handle serial dependence. If you understand regression deeply — not just the mechanics but the statistical logic — every subsequent topic becomes easier. The 30% weight reflects this foundational status. Master regression, and you have mastered the core reasoning pattern of quantitative business analysis.

The big picture

Model → Estimate → Infer → Diagnose → Predict → Select.

Every section in this chapter maps to one step in this workflow. Keep the pipeline in mind as you progress.


Why Predict?

Why it matters

Prediction = data-driven decisions.

Instead of guessing, we let the data tell us what to expect.

Today’s finance questions:

  • “Is NVDA aggressive or defensive?”
  • “Which forces drive NVDA returns?”
  • “Does NVDA earn alpha?”

Visual idea: CAPM fits a line through NVDA returns vs. market returns — slope = beta, intercept = alpha.

1. Visualizing Association

Background: The History of Correlation

The idea of measuring association between two variables has a surprisingly rich intellectual history. Francis Galton’s famous quincunx (a peg-board that demonstrated the normal distribution physically) was not just a parlor trick — it was Galton’s attempt to visualize how two generations of heights co-varied. In 1885, Galton drew scatter plots of parent heights against child heights and noticed the now-famous regression-to-the-mean effect. He labeled the slope of this scatter plot the “index of co-relation.”

It was his student Karl Pearson who cleaned up the mathematics in 1896, deriving the formula we use today. The Pearson correlation \(r\) has a beautiful theoretical justification: it is the cosine of the angle between the two mean-centered data vectors in \(n\)-dimensional space. That geometric interpretation is why \(r\) is bounded in \([-1, +1]\) — it follows directly from the Cauchy-Schwarz inequality, one of the most fundamental inequalities in mathematics:

\[|\langle \mathbf{u}, \mathbf{v} \rangle| \le \|\mathbf{u}\| \cdot \|\mathbf{v}\|\]

applied to the centered vectors \(\mathbf{u} = (x_1 - \bar{x}, \ldots, x_n - \bar{x})\) and \(\mathbf{v} = (y_1 - \bar{y}, \ldots, y_n - \bar{y})\).

The most important warning in statistics: correlation is not causation. Galton’s discovery that taller parents have taller children is a correlation. Whether height is caused by genetics, nutrition, or social environment requires a different type of evidence. The computer scientist and philosopher Judea Pearl (Turing Award, 2011) formalized this distinction through his do-calculus: the difference between observing that \(X = x\) and intervening to set \(X = x\) is the difference between correlation and causation. In finance, correlation between two stock returns does not mean one causes the other — they may both respond to a common factor (the market). This is precisely why CAPM uses regression (which quantifies a directional relationship) rather than correlation alone.

Warning

The classic cautionary tale: ice cream sales correlate with drowning rates. Both rise in summer. Neither causes the other. Always ask: is there a confounding variable?

Correlation Coefficient

The Pearson correlation \(r\) measures the strength of linear association:

\[ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot \sum_{i=1}^{n}(y_i - \bar{y})^2}} \]

\(r\) value Interpretation
\(-1\) Perfect negative linear
\(-0.5\) Strong negative
\(0\) No linear association
\(+0.5\) Strong positive
\(+1\) Perfect positive linear
Key takeaway

\(r\) ranges from \(-1\) to \(+1\). Close to \(\pm 1\) = strong linear relationship. Close to \(0\) = weak or no linear relationship.

Load Data & Compute Correlation

The book ships with real NVDA and SPY daily closes for 2023–2024 as a small CSV next to this page. The analysis below runs entirely in your browser — no external download, no API call.
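
The executable cell is not reproduced here, but the idea is a few lines of pandas. The sketch below assumes the bundled file is called nvda_spy_2023_2024.csv with columns Date, NVDA, and SPY (adjusted closes) — the actual file name and column labels shipped with the book may differ.

import pandas as pd

# Load the bundled daily closes (file name and column names are assumptions)
prices = pd.read_csv("nvda_spy_2023_2024.csv", parse_dates=["Date"], index_col="Date")

# Convert closing prices into simple daily returns
returns = prices[["NVDA", "SPY"]].pct_change().dropna()

# Pearson correlation matrix of the two return series
print(returns.corr())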

What just happened

The browser fetched a small CSV (~25 KB) hosted alongside this book, parsed it into a pandas DataFrame, and computed daily returns + correlation — all in your browser.

Interpretation. The correlation matrix shows how tightly NVDA and SPY daily returns move together over 2023–2024. A typical value of \(r \approx 0.74\) indicates a strong positive linear association; squaring it gives \(r^2 \approx 0.55\), so roughly 55% of the variance in NVDA’s daily returns is associated with movements in the S&P 500. The rest is idiosyncratic — NVDA-specific news such as earnings surprises, GPU demand reports, and regulatory announcements. Notice that the diagonal is always 1 (a return is perfectly correlated with itself). The off-diagonal \(r\) is symmetric: correlation between NVDA and SPY equals correlation between SPY and NVDA.

The key limitation: this single number compresses the entire relationship into a scalar. It tells you the strength of the linear association, but not the slope (how much NVDA moves per 1% SPY move — that requires regression), nor the direction of any causal mechanism.

Simulated NVDA vs SPY in Pyodide

Scatter Plot: NVDA vs. SPY

Key takeaway

Slope \(>1\): NVDA moves more than the market.

Interpretation. The scatter plot with fitted line makes the relationship visceral. Each dot is one trading day; its horizontal position is the SPY return and its vertical position is the NVDA return. The red regression line — slope \(\approx 1.9\) in this simulation — slices through the cloud at a steep angle. On a day when SPY drops 2%, the line predicts NVDA falls roughly \(-3.8\%\). The scatter around the line is the idiosyncratic (firm-specific) noise that the market cannot predict.

Notice that the cloud is elongated diagonally from lower-left to upper-right — this is positive correlation made visual. If the slope were exactly 1.0, the cloud would run along a 45-degree angle and NVDA would behave exactly like the market (a perfect index fund). The slope exceeding 1.0 is the geometric fingerprint of an aggressive stock — one that amplifies both market rallies and market selloffs.

In practice

Every equity analyst at Goldman Sachs, Morgan Stanley, and JPMorgan runs this exact chart when initiating coverage on a new stock. The slope (beta) goes directly into the valuation model. A stock with \(\beta = 2\) gets a higher required return (cost of equity) in the discounted cash flow model — investors demand compensation for taking on extra market risk.

Warning

Common pitfall: Don’t confuse correlation with slope. A high correlation (\(r\) near 1) does not mean the slope is near 1. Two variables can be perfectly correlated but have a slope of 0.1 or 10. Regression gives you the magnitude of the relationship; correlation only gives relative strength.

2. Simple Linear Regression: CAPM

Background: Sharpe’s Revolution

In 1964, William Sharpe published “Capital Asset Prices: A Theory of Market Equilibrium under Conditions of Risk” in the Journal of Finance. It was a remarkable paper for its simplicity: with one slope coefficient, it claimed to explain why different stocks should earn different expected returns in equilibrium. The key insight was that only systematic risk — risk correlated with the market — commands a premium. Idiosyncratic risk (the scatter around the regression line) can be diversified away by holding a portfolio, so the market pays nothing for bearing it.

CAPM represented the first rigorous translation of risk into a quantifiable, tradeable concept. Before Sharpe, “risk” in finance was largely qualitative. After Sharpe, every stock had a number — its beta — and that number determined its cost of equity. The implication for corporate finance was profound: a project funded by a high-beta firm needs a higher hurdle rate than the same project funded by a low-beta utility.

The model attracted fierce criticism. Richard Roll’s critique (1977) pointed out that the “market portfolio” in CAPM theory is unobservable — it should include every investable asset (stocks, bonds, real estate, human capital), not just the S&P 500. Using SPY as a proxy introduces measurement error that makes any empirical test of CAPM inconclusive. Despite this, beta became (and remains) the default risk measure in equity research, corporate finance, and regulatory rate-setting for utilities worldwide. Sharpe shared the 1990 Nobel Prize with Harry Markowitz (portfolio theory) and Merton Miller (capital structure) — the three pillars of modern financial economics.

In practice

At Goldman Sachs equity research, every stock in coverage has a Bloomberg-supplied beta prominently displayed in the first row of each research note. Portfolio managers use betas to construct “market-neutral” portfolios (long low-beta, short high-beta) that profit from mispricing without taking on net market exposure. CAPM’s simplicity is a feature, not a bug — it communicates risk in a common language the entire financial industry speaks.

Simple Linear Regression Model

\[ y = \alpha + \beta x + \epsilon \]

Geometric view: the regression line passes through the data; \(\alpha\) is its intercept, \(\beta\) is its slope, and \(\epsilon_i\) is the vertical distance from each point to the line.

  • \(\alpha\) = intercept — predicted \(y\) when \(x=0\)
  • \(\beta\) = slope — change in \(y\) per unit \(x\)
  • \(\epsilon\) = error — what the model cannot explain

The Capital Asset Pricing Model (CAPM)

CAPM is a simple linear regression of a stock’s excess return on the market:

\[ Y = \alpha + \beta\, X + \epsilon \]

where \(Y = R_{\text{NVDA}} - R_f\) (NVDA excess return) and \(X = R_m - R_f\) (market excess return).

  • \(\alpha\) (Jensen’s alpha) = return above what CAPM predicts
    • \(\alpha > 0\): NVDA outperforms the market
    • \(\alpha = 0\): no extra reward (efficient market)
  • \(\beta\) (market beta) = sensitivity to market
    • \(\beta > 1\): aggressive — amplifies market
    • \(\beta < 1\): defensive — dampens market

Prepare the Data

Why it matters

We subtract \(R_f\) because CAPM measures the premium over T-bills.

Fit the CAPM with statsmodels

Key takeaway

Three steps: (1) sm.add_constant(X) adds \(\alpha\), (2) sm.OLS(y, X) defines the model, (3) .fit() estimates.

Why it matters

.fit() uses Ordinary Least Squares (OLS) — the same method from ISOM 2500. It finds \(\hat{\alpha}, \hat{\beta}\) that minimise \(\sum (y_i - \hat{y}_i)^2\).
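
A minimal sketch of the preparation and the fit, reusing the returns DataFrame from the correlation step and assuming a flat daily T-bill rate stands in for \(R_f\) (the book’s live cell may use an actual risk-free series):

import statsmodels.api as sm

rf_daily = 0.05 / 252                                  # assumed flat risk-free rate (~5% annualized)

# Excess returns: subtract the risk-free rate from both series
y = returns["NVDA"] - rf_daily                         # NVDA excess return
mkt_excess = (returns["SPY"] - rf_daily).rename("Mkt_excess")

# (1) add the alpha column, (2) define the model, (3) estimate by OLS
X = sm.add_constant(mkt_excess)
model = sm.OLS(y, X).fit()
print(model.summary())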

Interpretation. The model.summary() output is dense — here is how to read it systematically:

  • \(\hat{\beta} \approx 2.33\): NVDA moves approximately 2.33% for every 1% move in the market. In real money terms: if you hold $10,000 of NVDA and the market drops 10%, you expect to lose approximately $2,330 — before adding in the idiosyncratic component. This is the leverage embedded in holding an aggressive tech stock.
  • \(\hat{\alpha} \approx 0.0036\) (daily): Annualized simply (multiplying by 252), \(0.0036 \times 252 \approx 0.91\), or roughly 91% per year — implausibly large for a real stock, but consistent with the alpha built into this simulation. Whether it is statistically significant (p < 0.05) determines whether you can claim NVDA genuinely outperforms its risk-adjusted benchmark.
  • \(R^2 \approx 0.33\): The market explains about 33% of NVDA’s daily return variance. This means 67% of day-to-day NVDA movement is driven by firm-specific news — GPU shipment data, data center contracts, analyst upgrades, and so on. That 67% is diversifiable in a portfolio.
  • \(n = 400\) observations (80% of 500 days): sufficient for the \(t\)-distribution to be reliable and for the CLT to kick in even if residuals are not perfectly normal.
Warning

Common pitfall: \(p(\alpha) < 0.05\) in a simulated model is not the same as finding real alpha. In the simulation, the true alpha was set to 0.0036, so of course it shows up as significant. In real data, most stocks do NOT have significant Jensen’s alpha — this is a direct implication of the efficient market hypothesis.

Reading the CAPM Output

coef std err t P>|t|
const (\(\alpha\)) 0.0036 0.001 2.71 0.007
Mkt_excess (\(\beta\)) 2.332 0.166 14.04 0.000
\(R^2\) 0.331
  • \(\beta = 2.33\): NVDA moves 2.33% per 1% market move
  • \(p(\beta) = 0.000\): market is significant
  • \(p(\alpha) = 0.007\): \(\alpha\) is significant
Why it matters

\(R^2 = 0.331\): market explains 33.1% of NVDA variation. Remaining 66.9% = idiosyncratic risk.

Variation Decomposition

In any linear regression, the total variation in \(Y\) decomposes as:

\[ \underbrace{SST}_{\sum(y_i - \bar{y})^2} \;=\; \underbrace{SSR}_{\sum(\hat{y}_i - \bar{y})^2} \;+\; \underbrace{SSE}_{\sum(y_i - \hat{y}_i)^2} \qquad R^2 = \frac{SSR}{SST} \]

  • SST = total variation in \(Y\)
  • SSR = variation explained by \(X\)
  • SSE = variation not explained (residual)
Why it matters

In CAPM: SSR = systematic risk (market), SSE = idiosyncratic risk (NVDA-specific).
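
The decomposition can be verified directly from the fitted model — a sketch reusing model and y from the CAPM fit above:

import numpy as np

y_hat = model.fittedvalues

sst = np.sum((y - y.mean()) ** 2)       # total variation in excess returns
ssr = np.sum((y_hat - y.mean()) ** 2)   # variation explained by the market
sse = np.sum((y - y_hat) ** 2)          # residual (idiosyncratic) variation

print(f"SST = {sst:.6f},  SSR + SSE = {ssr + sse:.6f}")
print(f"R^2 = SSR/SST = {ssr / sst:.3f}  (compare model.rsquared = {model.rsquared:.3f})")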

Interpretation. The variation decomposition is the balance sheet of regression. If SST = total variance in NVDA returns, then:

  • SSR is the share the market explains — systematic risk that cannot be diversified away. Portfolio managers cannot eliminate this by adding more stocks to a portfolio.
  • SSE is idiosyncratic variance — the part unique to NVDA. By holding 20–30 stocks, a portfolio manager can reduce SSE to near zero across the portfolio. This is the mathematical foundation of diversification.

A practical implication: if \(R^2 = 0.33\), then only 33% of NVDA’s total daily volatility comes from market moves. The other 67% — NVDA-specific news — is the kind of risk that a stock picker is paid to forecast. A quant fund trying to predict NVDA’s return has 67% of the variance “available” to model with NVDA-specific signals.

3. Inference on Regression Coefficients

Background: Fisher, Neyman-Pearson, and the p-Value Wars

The significance test is one of the most misunderstood tools in science. It has two origins, often conflated. Ronald A. Fisher (1890–1962) introduced the \(p\)-value as a continuous measure of evidence: a small \(p\) suggests the data are incompatible with the null hypothesis, and the researcher should update their beliefs accordingly. Fisher never advocated for a hard 0.05 threshold — he saw it as one useful benchmark among many.

Jerzy Neyman and Egon Pearson (1933) proposed a different framework: binary decision-making with explicit control of Type I error (false positive rate \(\alpha\)) and Type II error (false negative rate \(\beta\)). Their approach requires specifying \(H_0\) and \(H_a\) before seeing the data and then making a binary reject/fail-to-reject decision. This is the framework behind “significance at the 5% level.”

Modern practice unhelpfully blends both frameworks. Researchers compute Fisher’s \(p\)-value but make Neyman-Pearson binary decisions — a hybrid neither inventor endorsed.

The replication crisis in social science (2010s) exposed the dangers of misusing \(p\)-values. Studies with \(p < 0.05\) failed to replicate at rates exceeding 50% in psychology and 30% in economics. The causes were multiple: \(p\)-hacking (testing many hypotheses and reporting only significant ones), underpowered studies (small \(n\)), and confusion about what \(p\) actually means.

In finance, the replication crisis is real too. Harvey, Liu and Zhu (2016, Review of Financial Studies) surveyed 316 claimed “factors” in the equity premium literature and concluded that the \(t\)-statistic threshold for declaring a new factor significant should be raised from 2.0 to at least 3.0, given the multiple testing problem. The lesson for CAPM: a \(p\)-value is evidence, not proof.

Warning

What a \(p\)-value is NOT: It is NOT the probability that the null hypothesis is true. It is NOT the probability that your result occurred by chance. It IS the probability of observing data this extreme (or more extreme) if \(H_0\) were true. These distinctions are not pedantic — confusing them leads to systematic over-confidence in regression results.

LINE Assumptions for Inference

For \(t\)-tests and confidence intervals to be valid, the regression must satisfy:

  • L Linearity — \(E[Y|X] = \alpha + \beta X\)
  • I Independence — \(\epsilon_i\) are independent
  • N Normality — \(\epsilon_i \sim N(0, \sigma^2)\)
  • E Equal variance — \(\text{Var}(\epsilon_i) = \sigma^2\)
Important

If LINE holds: \(\hat{\alpha}\) and \(\hat{\beta}\) follow \(t\)-distributions, \(p\)-values are trustworthy, CIs have correct coverage. If violated → \(p\)-values and CIs may be misleading.

Sampling Distributions of \(\hat{\alpha}\) and \(\hat{\beta}\)

The OLS estimates are random variables — different samples give different estimates.

Let \(s^2 = \dfrac{SSE}{n-2}\) (residual variance) and \(S_{xx} = \sum_{i}(x_i - \bar{x})^2\).

Standard errors:

\[ SE(\hat{\beta}) = \frac{s}{\sqrt{S_{xx}}} \qquad SE(\hat{\alpha}) = s\,\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}} \]

Why it matters

\(SE(\hat{\beta})\) shrinks when \(S_{xx}\) is large (wide spread in \(x\)), \(s\) is small (tight residuals), or \(n\) is large. More data spread \(\Rightarrow\) sharper slope estimate.
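
These formulas can be checked against what statsmodels reports — a sketch reusing model, y, and the mkt_excess predictor series from the CAPM fit:

import numpy as np

resid = model.resid
n = int(model.nobs)

s2 = np.sum(resid ** 2) / (n - 2)                      # residual variance s^2 = SSE/(n-2)
s_xx = np.sum((mkt_excess - mkt_excess.mean()) ** 2)   # S_xx = spread of the predictor

se_beta = np.sqrt(s2 / s_xx)
se_alpha = np.sqrt(s2 * (1 / n + mkt_excess.mean() ** 2 / s_xx))

print(f"SE(beta):  formula {se_beta:.4f}  vs statsmodels {model.bse['Mkt_excess']:.4f}")
print(f"SE(alpha): formula {se_alpha:.4f}  vs statsmodels {model.bse['const']:.4f}")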

Why Do Standard Errors Matter?

Coefficients tell you what the relationship is. Standard errors tell you how much to trust it.

1. They power every \(t\)-test:

\[ t_{\hat{\alpha}} = \frac{\hat{\alpha}}{SE(\hat{\alpha})} \qquad t_{\hat{\beta}} = \frac{\hat{\beta}}{SE(\hat{\beta})} \]

2. They determine CI width:

\[ \hat{\alpha} \pm t_{0.025}\!\cdot\! SE(\hat{\alpha}) \quad \hat{\beta} \pm t_{0.025}\!\cdot\! SE(\hat{\beta}) \]

Key takeaway

Coefficient = point estimate. SE = margin of error.

model.summary() Output

coef std err t P>|t| [0.025, 0.975]
const (\(\alpha\)) 0.0036 0.001 2.706 0.007 0.001, 0.006
Mkt_excess (\(\beta\)) 2.3320 0.166 14.038 0.000 2.005, 2.659
  • \(H_0: \alpha = 0\) vs \(H_a: \alpha \ne 0\) (two-tailed)
  • \(H_0: \beta = 0\) vs \(H_a: \beta \ne 0\) (two-tailed)
Key takeaway

Each row tests whether the coefficient is significantly different from zero. The P>|t| column gives the two-tailed \(p\)-value.

\(t\)-Tests for Slope and Intercept

Under LINE, each estimate follows a \(t\)-distribution:

Slope (\(\beta\)):

\[ t_{\hat{\beta}} = \frac{\hat{\beta}}{SE(\hat{\beta})} \;\sim\; t_{n-2} \qquad H_0: \beta = 0 \]

\(p\)-value \(< 0.05\) \(\Rightarrow\) \(x\) is a significant predictor.

Intercept (\(\alpha\)):

\[ t_{\hat{\alpha}} = \frac{\hat{\alpha}}{SE(\hat{\alpha})} \;\sim\; t_{n-2} \qquad H_0: \alpha = 0 \]

Important

In CAPM, the slope test (\(H_0:\beta=0\)) answers: “Does the market explain NVDA at all?” The intercept test (\(H_0:\alpha=0\)) answers: “Does NVDA earn a free lunch above market risk?”

What Is a \(p\)-Value? (Two-Tailed)

Why it matters

The \(p\)-value is the probability of seeing a result as extreme or more, assuming \(H_0\) is true.

Decision rule:

  • \(p < 0.05\) \(\Rightarrow\) reject \(H_0\) (significant)
  • \(p \ge 0.05\) \(\Rightarrow\) fail to reject \(H_0\)

Example: \(p = 0.003\) for \(\hat{\beta}\) \(\Rightarrow\) only 0.3% chance of this slope if \(\beta = 0\) \(\Rightarrow\) \(X\) matters.

Key takeaway

Small \(p\) \(\Rightarrow\) unlikely under \(H_0\) \(\Rightarrow\) reject. \(p\) is not \(P(H_0 \text{ is true})\).

One-Tailed \(p\)-Value

Sometimes we test a direction, not just “different from zero”.

Right-tailed: \(H_0: \beta \le 0\) vs \(H_a: \beta > 0\)

  • If \(t > 0\): \(p_{\text{one}} = p_{\text{two}} / 2\)
  • If \(t < 0\): \(p_{\text{one}} = 1 - p_{\text{two}} / 2\)

Left-tailed: \(H_0: \beta \ge 0\) vs \(H_a: \beta < 0\)

  • If \(t < 0\): \(p_{\text{one}} = p_{\text{two}} / 2\)
  • If \(t > 0\): \(p_{\text{one}} = 1 - p_{\text{two}} / 2\)
Key takeaway

statsmodels reports two-tailed \(p\). For one-tailed: divide by 2 only if the sign of \(t\) matches \(H_a\).

Testing \(\alpha > 0\) and \(\beta > 1\) in CAPM

Does NVDA earn positive alpha?

\(H_0: \alpha \le 0\) vs \(H_a: \alpha > 0\)

Is NVDA more aggressive than the market?

\(H_0: \beta \le 1\) vs \(H_a: \beta > 1\)

Key takeaway

Default test: \(H_0: \beta = 0\) (does \(x\) matter?). To test a threshold like \(\beta = 1\), shift the null and recompute \(t = (\hat{\beta} - 1)/SE(\hat{\beta})\).
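
Both directional tests take only a few lines — a sketch reusing the fitted CAPM model (parameter names assume the const / Mkt_excess columns used above):

from scipy import stats

df_resid = int(model.nobs) - 2

# (1) H0: alpha <= 0  vs  Ha: alpha > 0  (right-tailed)
t_alpha = model.params["const"] / model.bse["const"]
p_alpha_one = 1 - stats.t.cdf(t_alpha, df_resid)

# (2) H0: beta <= 1  vs  Ha: beta > 1  (shift the null to 1 before computing t)
t_beta = (model.params["Mkt_excess"] - 1) / model.bse["Mkt_excess"]
p_beta_one = 1 - stats.t.cdf(t_beta, df_resid)

print(f"alpha > 0 : t = {t_alpha:.2f}, one-tailed p = {p_alpha_one:.4f}")
print(f"beta  > 1 : t = {t_beta:.2f}, one-tailed p = {p_beta_one:.4f}")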

Critical Value Approach

Instead of comparing \(p\) to \(0.05\), compare \(|t|\) to the critical value \(t_{\text{crit}}\).

Two-tailed (\(H_0: \beta = 0\)): Reject if \(|t| > t_{0.025,\,n-2}\)

One-tailed right (\(H_0: \beta \le 0\)): Reject if \(t > t_{0.05,\,n-2}\)

Key takeaway

Two equivalent approaches: (1) \(p\)-value \(< 0.05\), or (2) \(|t| > t_{\text{crit}}\). Both give the same conclusion. For large \(n\), \(t_{\text{crit}} \approx 1.96\) (two-tailed) or \(1.645\) (one-tailed).

Confidence Intervals for \(\alpha\) and \(\beta\)

A 95% confidence interval is constructed so that, across repeated samples, 95% of such intervals contain the true parameter:

\[ \text{CI for } \alpha:\quad \hat{\alpha} \;\pm\; t_{0.025,\,n-2} \cdot SE(\hat{\alpha}) \]

\[ \text{CI for } \beta:\quad \hat{\beta} \;\pm\; t_{0.025,\,n-2} \cdot SE(\hat{\beta}) \]

Key takeaway

CI for \(\beta\): \([2.01, 2.66]\) — we are 95% confident NVDA’s market sensitivity is between 2.01 and 2.66. CI for \(\alpha\): \([0.001, 0.006]\) — does not contain 0 \(\Rightarrow\) significant alpha.

Interpretation. The 95% CI for \(\beta\) of \([2.01, 2.66]\) carries a precise meaning: if we repeated this estimation on 100 different 400-day samples drawn from the same process, approximately 95 of those intervals would contain the true \(\beta\). This particular interval does not contain 1.0, which confirms statistically that NVDA is more aggressive than the market — we can reject the null \(\beta = 1\) at the 5% level.

In real-money terms: the CI width is \(2.66 - 2.01 = 0.65\). On a \(-10\%\) market day, that uncertainty translates to a range of \([-26.6\%, -20.1\%]\) for NVDA’s expected loss — a \(6.5\) percentage point range. That is the estimation uncertainty priced into any NVDA options position, and it is why options traders watch standard errors, not just point estimates.

The CI for \(\alpha\) not containing 0 is noteworthy: it means this simulated NVDA has statistically significant positive alpha. In real markets, this is rare and transient. When it does occur, it attracts capital until the alpha is arbitraged away — which is precisely what the efficient market hypothesis predicts.

Warning

Common pitfall — misreading p-values as probability of \(H_0\): A student seeing \(p(\alpha) = 0.007\) might say “there is a 0.7% chance NVDA’s true alpha is zero.” This is wrong. The \(p\)-value assumes \(H_0\) is true and computes the probability of the observed data. It says nothing about the probability that \(H_0\) is true — that requires a Bayesian posterior. The correct reading is: “If true alpha were zero, there is only a 0.7% chance of observing an estimated alpha this large or larger.”

Reading the Statsmodels CI Output

model.summary() already reports everything:

coef std err t P>|t| [0.025 0.975]
const (\(\alpha\)) 0.0036 0.001 2.71 0.007 0.001 0.006
Mkt_excess (\(\beta\)) 2.332 0.166 14.04 0.000 2.005 2.659
  • coef: \(\hat{\alpha}\) or \(\hat{\beta}\)
  • std err: \(SE\)
  • t: coef / std err
  • P>|t|: two-sided p-value
  • [0.025, 0.975]: 95% CI
Why it matters

CI contains \(0\) \(\Leftrightarrow\) \(p\)-value \(> 0.05\). They carry the same information — always check both.

Try It! — Inference on CAPM

Try it!

Using the CAPM model on NVDA data:

  1. What is the 95% CI for \(\beta\) (market sensitivity)?
  2. Can you reject \(H_0: \beta = 1\) at 5% significance? Hint: \(t = (\hat{\beta} - 1) / SE(\hat{\beta})\). Compare to \(t_{0.025, n-2} \approx 1.96\).
  3. What does the CI for \(\alpha\) tell you about NVDA’s risk-adjusted performance?
  4. Re-run for TSLA. Compare \(\beta_{\text{TSLA}}\) vs \(\beta_{\text{NVDA}}\) — which is more aggressive?

4. CI for \(\mu_Y\) vs. Prediction Interval

Background: Uncertainty About Means vs. Uncertainty About Individuals

The distinction between a confidence interval for a mean and a prediction interval for an individual observation is one of the most important — and most frequently confused — concepts in applied statistics. It maps directly onto a fundamental business distinction: are you making a portfolio decision (about the average across many instances) or a single-bet decision (about one specific case)?

A mutual fund manager asking “what is NVDA’s average return on strong market days?” needs a CI for \(\mu_Y\). As the fund holds NVDA for thousands of trading days, the law of large numbers causes the realized average to converge toward the true mean — the individual noise cancels out. The CI for \(\mu_Y\) narrows toward zero as the sample size grows: in the limit, the manager learns the true conditional mean exactly.

A risk manager asking “what is NVDA’s worst likely loss tomorrow?” needs a prediction interval. Tomorrow is a single observation. No amount of historical data eliminates the irreducible randomness of \(\epsilon_{t+1}\). The PI never collapses — its half-width never falls below \(z_{\alpha/2} \cdot \sigma\), the standard deviation of an individual observation, no matter how large \(n\) grows.

This asymmetry matters enormously for financial risk management. Value-at-Risk (VaR), the standard risk measure at banks, is essentially a quantile of the prediction interval — it estimates the loss that will not be exceeded with 95% (or 99%) probability on a single future day. Using the CI for \(\mu_Y\) instead of the PI would grotesquely understate the risk, producing intervals ten times too narrow.

The same distinction appears in operations and marketing. A retailer predicting demand for a product category across hundreds of stores should use CI for \(\mu\). A retailer predicting demand at a single new store — used to decide its inventory position — should use the PI.

In practice

Healthcare analytics provides a clear example. A hospital administrator estimating the average length of stay for knee replacement patients (to plan staffing levels) should use the CI for \(\mu_Y\). A surgeon telling this specific patient how long they will likely stay uses the PI. The PI is always wider — because a single patient has their own idiosyncratic biology, not just the average.

Two Questions, Two Intervals

Given a new predictor value \(x^*\) (e.g., market return \(= +1\%\)), we ask two different questions:

CI for \(\mu_Y \mid x^*\)

  • Question: What is the average NVDA return on all days when the market returns \(x^*\)?
  • Target: the mean \(\mu_Y = \alpha + \beta x^*\)
  • Use when: estimating a population average.

Prediction Interval for \(Y\)

  • Question: What will NVDA return tomorrow, given the market returns \(x^*\)?
  • Target: a single future observation \(Y = \mu_Y + \epsilon\)
  • Use when: forecasting a specific future value.

The Formulas — One Extra Term

Both intervals are centered at \(\hat{y}^* = \hat{\alpha} + \hat{\beta} x^*\).

CI for \(\mu_Y\):

\[ \hat{y}^* \pm t_{\alpha/2,\,n-2} \cdot s\sqrt{\dfrac{1}{n} + \dfrac{(x^*-\bar{x})^2}{S_{xx}}} \]

Prediction Interval:

\[ \hat{y}^* \pm t_{\alpha/2,\,n-2} \cdot s\sqrt{\color{red}{1} + \dfrac{1}{n} + \dfrac{(x^*-\bar{x})^2}{S_{xx}}} \]

Important

The “1” in the PI formula accounts for the individual error \(\epsilon\) in the new observation.

A single observation varies around the mean — even if we knew \(\mu_Y\) exactly, the PI could not shrink to zero.

PI is always wider than CI.

Why PI Is Always Wider

The CI band hugs the regression line tightly; the PI band is much wider because it has to cover individual scatter around the line.

Key takeaway

As \(n \to \infty\), the CI for \(\mu_Y\) narrows to a line (we learn \(\mu_Y\) exactly). The PI never collapses — individual variation \(\sigma^2\) is irreducible.

Python: get_prediction() for Both Intervals
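
Both intervals come out of a single call — a sketch assuming a new market excess return of +1% and the const / Mkt_excess design used above:

import pandas as pd

# New scenario: market excess return = +1% (column order must match the fitted design)
x_new = pd.DataFrame({"const": [1.0], "Mkt_excess": [0.01]})

pred = model.get_prediction(x_new)
frame = pred.summary_frame(alpha=0.05)          # 95% intervals

print(frame[["mean", "mean_ci_lower", "mean_ci_upper"]])   # CI for the conditional mean
print(frame[["obs_ci_lower", "obs_ci_upper"]])              # PI for a single future day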

Key takeaway

mean_ci: CI for \(\mu_Y\) — NVDA’s expected average on +1% market days: \([+1.7\%, +2.1\%]\) obs_ci: PI for individual \(Y\) — tomorrow’s NVDA return: \([-1.1\%, +4.9\%]\)

The PI is roughly 15× wider — individual returns are noisy even when the mean is precise.

Interpretation. The output contains two fundamentally different statements about the same forecast:

  • CI for mean \([+1.7\%, +2.1\%]\): On days when the market gains exactly 1%, the average NVDA return across all such days historically falls in this narrow 0.4-percentage-point band. This precision reflects 400 training observations narrowing down the conditional mean.
  • PI for individual \([-1.1\%, +4.9\%]\): Tomorrow specifically — with its own unique news, sentiment, and idiosyncratic shocks — could land anywhere in a 6-percentage-point range. This range captures the residual standard deviation \(s \approx 0.02\) (2% daily), which cannot be reduced by collecting more data.

The dramatic width difference — roughly 0.4% vs 6% — illustrates why risk management (PI) and strategy evaluation (CI) require entirely different confidence intervals from the same model. A portfolio manager using the CI to size risk would be off by a factor of 15.

Warning

Common pitfall — confusing CI for mean with PI for individual: In financial reporting and business analytics, the CI for \(\mu_Y\) is often presented as if it were a forecast interval for a future observation. This understates uncertainty by a large factor. Always ask: “Am I predicting the mean of many outcomes, or a single specific future value?”

Plot Both Bands Together
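
A sketch of the plot, reusing model, mkt_excess, and y from the CAPM fit — the book’s rendered figure may be styled differently:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Evaluate both intervals on a grid spanning the observed market returns
grid = np.linspace(mkt_excess.min(), mkt_excess.max(), 100)
X_grid = pd.DataFrame({"const": 1.0, "Mkt_excess": grid})
frame = model.get_prediction(X_grid).summary_frame(alpha=0.05)

plt.scatter(mkt_excess, y, s=10, alpha=0.4, label="daily returns")
plt.plot(grid, frame["mean"], color="red", label="fitted line")
plt.fill_between(grid, frame["mean_ci_lower"], frame["mean_ci_upper"],
                 alpha=0.4, label="95% CI for mean")
plt.fill_between(grid, frame["obs_ci_lower"], frame["obs_ci_upper"],
                 alpha=0.15, label="95% PI for individual day")
plt.xlabel("Market excess return")
plt.ylabel("NVDA excess return")
plt.legend()
plt.show()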

Business Interpretation

Use CI for \(\mu_Y\) when…

  • Estimating the average effect
  • Benchmarking: “On average, when the market is up 1%, where should NVDA trade?”
  • Policy evaluation (expected impact)

Use PI when…

  • Forecasting a specific future value
  • Risk sizing: “What is the worst-case NVDA loss tomorrow?”
  • Inventory/demand planning for one period
Try it!

If the market drops 2% tomorrow, use get_prediction() to find: (a) the 95% CI for NVDA’s expected return, and (b) the 95% PI for NVDA’s actual return. Which interval would a risk manager care about?

5. Train/Test Split & Model Evaluation

Background: The Out-of-Sample Discipline

The practice of withholding data for out-of-sample evaluation has its roots in a 1974 paper by statistician Mervyn Stone, who formalized cross-validation as a method for selecting statistical models. Stone’s insight was simple but profound: a model’s performance on the data used to fit it is a biased estimate of performance on new data. The bias is called overfitting, and it grows with model complexity.

The concept became the cornerstone of machine learning. Every production ML system at Google, Amazon, and every quantitative hedge fund withholds a held-out test set that the model never sees during training. The test performance is the only performance metric that matters for deployment. The training performance is essentially irrelevant beyond diagnosing underfitting.

In financial econometrics, out-of-sample testing has special importance. The look-ahead bias in finance — using future information to fit a model and then reporting in-sample performance as a “forecast” — is one of the most pervasive sources of spurious results in the academic factor investing literature. Papers reporting “strategies” that earn 20% annualized returns in-sample routinely fail to deliver out-of-sample once transaction costs and data snooping are accounted for.

Harvey, Liu and Zhu (2016) surveyed 316 factors reported in the literature and estimated that roughly half of them are false discoveries. The discipline of train/test split — running the exact same code on held-out data before claiming any predictive result — is the minimum bar for credibility in quantitative finance.

For time-series data like stock returns, there is an additional constraint: you must not shuffle the data. Unlike cross-sectional data, financial returns have time structure. Using 2024 data to train a model and evaluating it on 2020 data is not just poor practice — it is a direct form of look-ahead bias that renders the entire exercise meaningless. Always train on earlier dates and test on later dates.

In practice

Every systematic trading strategy at Renaissance Technologies, Two Sigma, and Citadel goes through rigorous out-of-sample backtesting before any capital is allocated. “In-sample fit” is not a metric — it is expected to be good by construction. Only out-of-sample Sharpe ratio, maximum drawdown, and hit rate determine whether a strategy moves to production.

Why Split the Data?

Why it matters

Train to learn, test to verify. If you test on training data, you are grading your own homework!

Diagram: full dataset → split into 80% Train (estimate \(\alpha, \beta\)) and 20% Test (predict).

Key rule: Time series — train on earlier dates, test on later dates. Never shuffle!
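
A sketch of the time-ordered 80/20 split, assuming the y and mkt_excess excess-return series built earlier share a DatetimeIndex:

import pandas as pd

# Sort by date, then cut at the 80% mark -- never shuffle a time series
data = pd.DataFrame({"NVDA_excess": y, "Mkt_excess": mkt_excess}).sort_index()

cut = int(len(data) * 0.8)
train, test = data.iloc[:cut], data.iloc[cut:]

print(f"Train: {train.index.min().date()} to {train.index.max().date()} (n = {len(train)})")
print(f"Test : {test.index.min().date()} to {test.index.max().date()} (n = {len(test)})")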

R\(^2\) and RMSE

Two key metrics:

R\(^2\) (R-squared):

\[ R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} \]

  • \(R^2 = 1\): perfect prediction
  • \(R^2 = 0\): no better than mean

RMSE:

\[ \text{RMSE} = \sqrt{\frac{1}{n}\sum (y_i - \hat{y}_i)^2} \]

  • Measured in the same units as \(y\) (here, daily returns)
  • Lower is better
Key takeaway

R\(^2\) = proportion explained. RMSE = typical prediction error. Always report both.

Compute Train and Test Metrics
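
A sketch of the computation, reusing the train and test windows from the split above — fit on the training window only, then score both:

import numpy as np
import statsmodels.api as sm

def rmse(actual, predicted):
    # Typical prediction miss, in the same units as the returns
    return np.sqrt(np.mean((actual - predicted) ** 2))

def r_squared(actual, predicted):
    # 1 - SSE/SST, measured against the mean of the actual values
    return 1 - np.sum((actual - predicted) ** 2) / np.sum((actual - actual.mean()) ** 2)

# Estimate alpha and beta on the training window only
X_train = sm.add_constant(train["Mkt_excess"])
capm = sm.OLS(train["NVDA_excess"], X_train).fit()

# Predict both windows with the same coefficients
pred_train = capm.predict(X_train)
pred_test = capm.predict(sm.add_constant(test["Mkt_excess"]))

print(f"Train R^2 = {r_squared(train['NVDA_excess'], pred_train):.3f}, RMSE = {rmse(train['NVDA_excess'], pred_train):.4f}")
print(f"Test  R^2 = {r_squared(test['NVDA_excess'], pred_test):.3f}, RMSE = {rmse(test['NVDA_excess'], pred_test):.4f}")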

Key takeaway

If train and test R\(^2\) are close, the model generalises well — the CAPM beta estimated on past data is stable.

Interpretation. Compare the two pairs of numbers:

  • Train R² ≈ 0.33, Test R² ≈ 0.32: The test R² is nearly identical to train R². This is exactly what you hope to see — the model’s ability to explain variance is stable across the two time periods. For CAPM with a single predictor and a simple financial relationship, overfitting is not a concern.
  • Train RMSE ≈ 0.020, Test RMSE ≈ 0.021: The prediction error is roughly 2% per trading day on the test set, slightly higher than on the training set (as expected). In a $10,000 position, a 2% daily RMSE translates to a typical prediction error of about $200 per day.

A large train-test gap would signal overfitting: perhaps a model with many predictors that memorized training-set noise without learning generalizable patterns. With CAPM’s single predictor, the model is too simple to overfit but might underfit — leaving systematic patterns unexplained (which is why Fama-French adds more factors).

Warning

Time-series rule: For stock returns, always sort by date and train on earlier data. If you shuffle before splitting, you will leak future information into the training set (look-ahead bias), and your test performance will be artificially inflated.

Overfitting vs. Underfitting

  • Underfitting — cause: model too simple; Train R\(^2\): low; Test R\(^2\): also low; fix: add predictors / nonlinear terms
  • Overfitting — cause: model too complex; Train R\(^2\): high; Test R\(^2\): much lower than Train; fix: remove predictors / regularise

Three regimes:

  • Underfitting (too simple): flat line misses the trend
  • Good fit (just right): gentle smooth trend
  • Overfitting (too complex): wiggly curve chasing noise

Detecting Overfitting: Train vs. Test Gap

As model complexity grows, training error keeps falling but test error follows a U-shape — falling, then rising. The “sweet spot” is the minimum test error.

Important

Warning signs of overfitting:

  • Train R\(^2\) much higher than Test R\(^2\)
  • Adding more variables helps Train but hurts Test
  • Many insignificant predictors in the model
Key takeaway

The goal is the sweet spot: complex enough to capture real patterns, simple enough to generalise. Adjusted R\(^2\) and cross-validation help find it.

6. Assumptions & Residual Analysis

Background: The Gauss-Markov Theorem and What It Guarantees

The theoretical foundation of OLS lies in the Gauss-Markov theorem (due to Gauss, later formalized by Andrey Markov): under the linearity, independence, and equal-variance assumptions — normality is not required for this result — OLS produces the BLUE estimator — Best Linear Unbiased Estimator. “Best” means lowest variance among all linear unbiased estimators. This is not a trivial guarantee: it says you cannot do better than OLS with linear estimators if the assumptions hold.

Breaking down what each assumption guarantees when violated:

L — Linearity: If \(E[\epsilon | x] \neq 0\), then \(\hat{\beta}\) is biased — it systematically over- or underestimates the true slope. The Gauss-Markov theorem fails entirely. No amount of data fixes a misspecified model; you are estimating the wrong thing. This is the most serious violation. Remedy: add polynomial terms, interaction terms, or use a nonlinear model.

I — Independence: Correlated errors (serial correlation in time series) mean the effective sample size is smaller than \(n\). Standard errors computed assuming independence are too small — the model appears more precise than it actually is. \(p\)-values are falsely optimistic. In financial returns, ARCH/GARCH volatility clustering is a direct violation of independence. Remedy: use robust standard errors (Newey-West HAC).

N — Normality: Non-normal residuals distort the finite-sample distributions of the \(t\) and \(F\) statistics; they do not bias \(\hat{\beta}\) itself. For large \(n\), the Central Limit Theorem saves you: \(\hat{\beta}\) is asymptotically normal regardless of \(\epsilon\)’s distribution. The concern is mainly in small samples. Fat tails in stock returns (which are well-documented) mean that confidence intervals may have less than their nominal 95% coverage.

E — Equal variance (homoscedasticity): Heteroscedasticity — where \(\text{Var}(\epsilon_i)\) depends on \(x_i\) — leaves \(\hat{\beta}\) unbiased but makes standard errors wrong. OLS is no longer BLUE (it is inefficient; Weighted Least Squares would do better). More practically, \(t\)-tests and confidence intervals are unreliable. Remedy: White’s robust standard errors (cov_type='HC3' in statsmodels).

This diagnostic framework is not just academic formalism. Before presenting any regression result to a board, a CFO, or a regulator, a competent analyst always runs residual diagnostics and reports the results — explicitly noting any violations and the remedies applied.

Warning

Heteroskedasticity in financial data: Almost all financial return series exhibit heteroscedasticity. Volatility in 2008 was far higher than in 2006. OLS using raw returns as \(y\) will produce standard errors that are wrong. The standard practice is to use HAC-robust standard errors (model.fit(cov_type='HAC', cov_kwds={'maxlags': 5})), which are consistent in the presence of both heteroscedasticity and autocorrelation.

The LINE Assumptions

Regression is trustworthy when residuals (\(e_i = y_i - \hat{y}_i\)) satisfy:

  • L Linearity — \(E[\epsilon \mid x] = 0\)
  • I Independence — residuals uncorrelated
  • N Normality — \(\epsilon \sim N(0, \sigma^2)\)
  • E Equal variance — \(\text{Var}(\epsilon_i) = \sigma^2\) for all \(i\)
Important

For stock returns: I often violated (volatility clustering). N: daily returns have fat tails.

Assumption L: Linearity — Residuals vs. Fitted

  • Good: random scatter around 0
  • Bad: curved (U-shape) pattern
Key takeaway

How to read: Plot residuals (\(e\)) against fitted values (\(\hat{y}\)). Want a random cloud around 0. A curve means the linear model misses a nonlinear pattern — consider adding polynomial or transformed terms.

Assumption I: Independence — Residuals Over Time

  • Good: random fluctuations
  • Bad: volatility clustering (bunches of large residuals)
Key takeaway

How to read: Plot residuals against time. Want no patterns or clusters. Bunches of large residuals = volatility clustering (very common in daily stock returns). Runs of same sign = autocorrelation. Both violate independence \(\Rightarrow\) standard errors are unreliable.

Assumption N: Normality — Q-Q Plot & Histogram

  • Good: points on the diagonal line
  • Bad: S-shape = fat tails
Key takeaway

How to read: In a Q-Q plot, each point compares a sample quantile to the corresponding normal quantile. Points on the diagonal = normal. Deviations at the tails = heavy/fat tails (common in daily stock returns). Matters most for small samples; large \(n\) benefits from the CLT.

Assumption E: Equal Variance — Scale-Location Plot

  • Good (homoscedastic): constant band width
  • Bad (heteroscedastic): fan shape — \(\sigma\) grows with fitted values
Key takeaway

How to read: Residuals should have the same spread across all fitted values. A funnel/fan shape means variance depends on \(x\) (heteroscedasticity) \(\Rightarrow\) OLS standard errors are biased \(\Rightarrow\) \(p\)-values and CIs are unreliable. Fix: use robust (HC) standard errors or transform \(y\).

Checking Assumptions with Python

LINE Diagnostics — CAPM Output
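
The four panels can be reproduced with a sketch like the one below, reusing the fitted model (the book’s figure styling may differ):

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

resid = model.resid
fitted = model.fittedvalues

fig, ax = plt.subplots(2, 2, figsize=(10, 8))

# L: residuals vs fitted -- want a random cloud around zero
ax[0, 0].scatter(fitted, resid, s=10, alpha=0.5)
ax[0, 0].axhline(0, color="red")
ax[0, 0].set_title("Residuals vs Fitted (L)")

# I: residuals over time -- look for runs and volatility clustering
ax[0, 1].plot(resid.values)
ax[0, 1].axhline(0, color="red")
ax[0, 1].set_title("Residuals over Time (I)")

# N: Q-Q plot of residuals against the normal distribution
stats.probplot(resid, dist="norm", plot=ax[1, 0])
ax[1, 0].set_title("Normal Q-Q (N)")

# E: scale-location -- sqrt(|standardized residual|) vs fitted values
std_resid = resid / resid.std()
ax[1, 1].scatter(fitted, np.sqrt(np.abs(std_resid)), s=10, alpha=0.5)
ax[1, 1].set_title("Scale-Location (E)")

plt.tight_layout()
plt.show()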

Four-panel diagnostic plot: residuals vs fitted (top-left), residuals over time (top-right), Q-Q plot (bottom-left), scale-location (bottom-right).

Interpretation. Read each panel systematically:

  • Top-left (L — Linearity): Random scatter around the horizontal red line at zero confirms linearity. If you see a U-shape or inverted-U-shape, the relationship between NVDA and SPY is nonlinear — perhaps requiring a squared term or a regime-switching model. For simulated CAPM data with a linear DGP, this panel should look clean.
  • Top-right (I — Independence): Plot residuals against the observation index (time). Look for “runs” — sequences of same-sign residuals — or volatility clustering (quiet periods followed by stormy periods). In real daily return data, you will almost always see GARCH-type clustering here. If this panel shows bunches of large residuals followed by bunches of small residuals, independence is violated.
  • Bottom-left (N — Normality, Q-Q plot): Points should follow the diagonal line closely. The most common deviation in finance is S-shaped tails (points above the line at both ends), indicating fat tails — more extreme returns than the normal distribution predicts. This is not surprising: stock returns are famously leptokurtic (kurtosis > 3).
  • Bottom-right (E — Equal variance): A horizontal band of constant width confirms homoscedasticity. A funnel shape — wider scatter at higher fitted values — indicates heteroscedasticity. For CAPM, heteroscedasticity often appears because high-beta days have higher variance in NVDA’s returns.
Warning

Outliers, leverage, and influence are different things: An outlier has a large residual (far from the fitted line vertically). A high-leverage point has an unusual \(x\) value (far from \(\bar{x}\)). An influential point has large Cook’s distance — removing it would substantially change \(\hat{\beta}\). A point can be an outlier without being influential (if it is near the middle of the \(x\) range), and it can be influential without being an outlier (if it is extreme in \(x\) and happens to lie exactly on the regression line). Check Cook’s distance with model.get_influence().cooks_distance for any analysis where individual observations might dominate the fit.

What Happens When Assumptions Fail?

  • L Linearity violated — \(\hat{\beta}\) is biased and the model misses the pattern; predictions are systematically wrong, and the tests test the wrong model
  • I Independence violated — SEs are wrong (usually too small); \(p\)-values too optimistic, CIs too narrow
  • N Normality violated — \(t\)/\(F\)-tests are only approximate; unreliable in small samples, OK for large \(n\) (CLT)
  • E Equal variance violated — OLS is inefficient and SEs are biased; CI widths wrong, \(p\)-values unreliable
Important

L and I are the most damaging: they make the estimates themselves wrong or the uncertainty estimates wrong. Always check these first.

Key takeaway

N and E are less severe for large \(n\) (CLT helps). Always check L and I first — they are the hardest to fix.

Assumptions Hold vs. Fail: The Big Picture

When the assumptions hold — and what breaks when they fail:

  • \(\hat{\beta}\) is unbiased — if L fails, \(\hat{\beta}\) is biased
  • SEs are correct — if I or E fails, SEs are wrong
  • \(t\)-tests & \(F\)-tests are exact — if N fails, tests are only approximate
  • CIs have correct coverage — otherwise they may under- or over-cover
  • Predictions are optimal — otherwise OLS is no longer the best estimator
Key takeaway

OLS always gives numbers — but those numbers are only trustworthy if the assumptions approximately hold. Diagnostic plots let you check before trusting the output.

7. Multiple Linear Regression: Fama-French

Background: From One Factor to Five

The move from CAPM (one predictor) to multiple regression parallels the intellectual history of empirical asset pricing. CAPM was dominant through the 1970s. Then systematic anomalies began to accumulate: stocks with low market capitalization (small caps) consistently outperformed CAPM predictions; stocks trading at low price-to-book ratios (value stocks) did the same. These “anomalies” could not be explained by the market factor alone.

Eugene Fama and Kenneth French (1992, Journal of Finance) documented these patterns rigorously and introduced the three-factor model: market factor, SMB (Small-Minus-Big), and HML (High-Minus-Low book-to-market). Each factor is a zero-cost long-short portfolio: go long small-cap stocks and short large-cap stocks (SMB); go long value stocks and short growth stocks (HML). These portfolios earned positive returns over long periods — returns that CAPM, pricing only systematic market risk, could not explain.

Fama and French’s interpretation was academic: size and value are proxies for risk factors not captured by the market. Their critics — led by behavioral economists like Robert Shiller (who shared the 2013 Nobel with Fama) — argued these returns reflect investor irrationality: small caps and value stocks are neglected or unfashionable, and their anomalous returns reflect sentiment rather than risk.

This debate has never been fully resolved, but it has been empirically productive. Fama and French extended their model to five factors in 2015, adding: - RMW (Robust-Minus-Weak profitability): profitable firms outperform unprofitable ones - CMA (Conservative-Minus-Aggressive investment): firms with low capital expenditure outperform those investing heavily

The five-factor model currently explains ~90% of cross-sectional return variation across US stocks — far more than CAPM’s ~70%. It has become the standard benchmark for evaluating whether a new trading strategy earns genuine alpha or simply captures known risk premia.

In institutional investing, Fama-French factors are not just academic benchmarks — they are products. Dimensional Fund Advisors (DFA), co-founded by Eugene Fama’s collaborators, manages over $700 billion by systematically tilting portfolios toward small-cap and value stocks. AQR Capital Management, co-founded by Cliff Asness, runs a broader factor strategy (“AQR Style Premia”) based on momentum, value, carry, and defensive factors, managing over $100 billion. BlackRock’s iShares runs dozens of factor ETFs. The multiple regression framework you learn in this section directly underlies the investment processes of firms managing combined assets exceeding $1 trillion.

In practice

The shift from CAPM to Fama-French at institutional asset managers was not just academic: it changed how firms attribute performance. A manager claiming to earn alpha against a one-factor CAPM benchmark might simply be tilting toward small-cap or value stocks. Against a Fama-French five-factor benchmark, genuine stock-picking alpha is far harder to demonstrate. This is why the choice of benchmark model is itself a contentious business and regulatory issue.

Why Do We Need Multiple Regression?

CAPM uses one factor (market return). But NVDA’s return may also depend on:

  • Firm size — small vs large cap
  • Value vs growth orientation
  • Profitability and investment

CAPM \(R^2 \approx 33\%\) — the market alone leaves 67% unexplained. Can we do better?

Important

Omitted variable bias: if a missing variable correlates with both \(X\) and \(Y\), \(\hat{\beta}\) is biased.

Key takeaway

MLR controls for multiple factors, giving each \(\beta_j\) a “holding all else constant” interpretation. CAPM → FF5.

Multiple Linear Regression (MLR)

Simple regression uses one predictor. In practice, outcomes depend on many variables:

\[ y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon \]

  • \(\alpha\) = intercept: predicted \(y\) when all \(x_j = 0\)
  • \(\beta_j\) = partial slope: change in \(y\) per unit change in \(x_j\), holding all other \(x\)’s constant
  • \(\epsilon\) = error (same LINE assumptions as SLR)

Conceptual diagram: \(x_1, x_2, \ldots, x_k\) each carry a weight \(\beta_j\) into the prediction \(\hat{y}\).

Interpreting Coefficients: “Ceteris Paribus”

The key difference from SLR: each \(\beta_j\) is a partial effect:

Why it matters

Example: Suppose we predict house price (in thousands of dollars):

\(\hat{y} = 50 + 0.1 \times \text{sqft} + 15 \times \text{bedrooms}\)

\(\beta_{\text{sqft}} = 0.1\): each extra sq ft adds $100, holding bedrooms constant.

\(\beta_{\text{bed}} = 15\): each extra bedroom adds $15k, holding sqft constant.

Important

Without “holding constant”, \(\beta_j\) in MLR is not the same as \(\beta\) in SLR.

Adding correlated predictors changes all coefficients — this is not a bug, it is controlling for confounders.

Fitting MLR in Python

# General MLR: identical syntax -- just pass multiple columns
# y = b0 + b1*x1 + b2*x2 + ... + bk*xk

X = sm.add_constant(df[["x1", "x2", "x3"]])  # k predictors + intercept column
y = df["y"]                                  # response column
model = sm.OLS(y, X).fit()
print(model.summary())

# Key outputs:
# model.params        -> coefficients (beta_0, ..., beta_k)
# model.pvalues       -> p-value for each beta_j
# model.rsquared_adj  -> Adjusted R-squared
# model.conf_int()    -> 95% CIs for all betas
Key takeaway

The code is identical to SLR — just pass a DataFrame with \(k\) columns instead of one. OLS handles everything.

Live Demo: Predicting Pharmacy Profit

Business question: A pharmacy chain has 111 branches across US metro areas. What local demographics predict per-branch profit?

Predictors (6 features): Income, Disposable Income, Birth Rate, Soc Security, CV Death, % 65 or Older.

Target: Profit per branch.

Why this dataset?

Real business case, small enough to read by eye, large enough to need multiple regression. Profit depends on multiple local features — no single one tells the full story.

Fit the Pharmacy Multiple Regression
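
A sketch of the fit, assuming the branch data loads into a DataFrame whose columns match the feature list above and whose target column is named Profit (the actual file name in the book will differ):

import pandas as pd
import statsmodels.api as sm

pharm = pd.read_csv("pharmacy_branches.csv")   # hypothetical file name

features = ["Income", "Disposable Income", "Birth Rate",
            "Soc Security", "CV Death", "% 65 or Older"]

X = sm.add_constant(pharm[features])
y = pharm["Profit"]

pharm_model = sm.OLS(y, X).fit()
print(pharm_model.summary())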

After running, look at:

  • R-squared — overall fit
  • Adj. R-squared — penalised for k
  • Coefficients under coef — direction & magnitude
  • P>|t| — which features are statistically significant
  • F-statistic — joint significance of all predictors

Interpreting the Pharmacy Model

Reading the coefficients
  • Positive coef = feature increases profit, holding others constant
  • Negative coef = feature decreases profit
  • Insignificant (p > 0.05) = not enough evidence this feature matters once others are accounted for

Interpretation. The pharmacy regression output delivers a complete demographic portrait of pharmacy profitability. Work through the coefficients economically:

  • Income coefficient (typically positive and significant): Each additional $1,000 in median household income adds a predictable dollar amount to expected branch profit, holding birth rate, social security density, and age distribution constant. Higher-income neighborhoods can afford to fill more prescriptions and spend more on over-the-counter products. This is not simply because richer areas have more spending power overall — controlling for other demographics, income has an independent partial effect.
  • % 65 or Older (likely positive): Older populations fill more prescriptions per capita. The coefficient here represents the marginal effect of one additional percentage point in the elderly share, holding income and birth rate constant. A 1-point increase in elderly share in a metro area has a definable dollar impact on branch profit.
  • Birth Rate (ambiguous sign): Areas with high birth rates may have young, generally healthy populations with fewer chronic medications. The sign and significance here tests whether younger demographics are less profitable for pharmacies — a hypothesis with direct implications for site selection.
  • F-statistic (overall significance): If the F-test p-value is below 0.05, at least one of the six demographic features genuinely predicts profit. For this dataset, the F-statistic is typically highly significant — the combined demographic model explains a meaningful share of profit variation.
  • Adjusted \(R^2\) vs \(R^2\): Compare both. If Adj-\(R^2\) is substantially below \(R^2\), some predictors are free-riding — adding noise without explanatory power. Dropping them will raise Adj-\(R^2\).
Warning

Multicollinearity in demographic data: Income, disposable income, and social security density are likely highly correlated across metro areas. This can inflate standard errors and make individual coefficients unstable. Always run VIF diagnostics on demographic regression models. If VIF > 10 for Income and Disposable Income, consider using only one of them (or constructing a composite).

Try it!

Drop the insignificant features and refit. Does Adjusted R² go up or down? What does that tell you about model parsimony?

Predict Profit for a New Branch

Suppose a new branch opens in a city with these demographics. What does the model predict?
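A minimal sketch of the prediction step, reusing pharm_model from the fit above. The demographic values for the candidate city are made up for illustration; get_prediction().summary_frame() returns both the CI for the mean and the PI for an individual branch, which the next panel contrasts.

import pandas as pd
import statsmodels.api as sm

# Hypothetical demographics for the candidate city (same column order as the fit)
new_city = pd.DataFrame([{
    "Income": 62.0, "Disposable Income": 48.0, "Birth Rate": 11.5,
    "Soc Security": 19.0, "CV Death": 320.0, "% 65 or Older": 17.0,
}])
X_new = sm.add_constant(new_city, has_constant="add")

pred = pharm_model.get_prediction(X_new)
print(pred.summary_frame(alpha=0.05))  # mean, mean_ci_*, obs_ci_* columns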

CI vs PI for the business

CI for mean — “If we opened many branches with these demographics, average profit would be in this range.” Used for portfolio decisions.

PI for individual — “This specific branch’s profit will be in this range.” Wider — more cautious for single-branch decisions.

R\(^2\) vs. Adjusted R\(^2\)

Problem: R\(^2\) always increases when you add variables — even useless ones!

R\(^2\):

\[ R^2 = 1 - \frac{SSE}{SST} \]

Adding any variable can never increase SSE, so R\(^2\) never drops.

Adjusted R\(^2\):

\[ \bar{R}^2 = 1 - \frac{(1-R^2)(n-1)}{n-k-1} \]

Penalises for \(k\). Can decrease if a new variable does not help enough.

Key takeaway

Use \(\bar{R}^2\) to compare models with different numbers of predictors.

Multicollinearity

Problem: When predictors are correlated with each other, OLS still works but coefficients become unstable.

Symptoms:

  • Large standard errors \(\Rightarrow\) wide CIs
  • Coefficients flip sign or change wildly when a variable is added/removed
  • High R\(^2\) but many insignificant \(p\)-values

Variance Inflation Factor:

\[ \text{VIF}_j = \frac{1}{1 - R_j^2} \]

where \(R_j^2\) = R\(^2\) from regressing \(x_j\) on all other \(x\)’s.

VIF interpretation:

  • VIF \(< 5\): OK
  • VIF \(5\)–\(10\): Moderate
  • VIF \(> 10\): Severe
Important

Multicollinearity does not bias predictions — it only inflates uncertainty around individual \(\beta_j\)’s.

Detecting Multicollinearity in Python

from statsmodels.stats.outliers_influence import (
    variance_inflation_factor)
import statsmodels.api as sm

# Compute VIF for each predictor (index 0 is the constant, so use i + 1)
X_vif = sm.add_constant(df[["x1", "x2", "x3"]])
for i, col in enumerate(["x1", "x2", "x3"]):
    vif = variance_inflation_factor(X_vif.values, i + 1)
    print(f"{col}: VIF = {vif:.2f}")
Key takeaway

Rule of thumb: If any VIF \(> 10\), consider dropping or combining the offending variable. VIF \(< 5\) is generally safe.

Understanding \(R_j^2\) in VIF — A Simple Example

Setup: 3 predictors — \(x_1\) (age), \(x_2\) (income), \(x_3\) (credit score). To get VIF for \(x_2\): regress \(x_2\) on \(x_1, x_3\), get \(R_2^2\), then \(\text{VIF}_2 = 1/(1 - R_2^2)\).

Predictor \(R_j^2\) VIF Status
\(x_1\) (age) 0.10 1.11 OK
\(x_2\) (income) 0.85 6.67 Warn
\(x_3\) (credit) 0.82 5.56 Warn
Why it matters

\(R_j^2 = 0.85\): 85% of income explained by others \(\Rightarrow\) redundant. \(R_j^2 = 0.10\): age is independent of others.

Key takeaway

High \(R_j^2\) \(\Rightarrow\) \(x_j\) well-predicted by others \(\Rightarrow\) high VIF \(\Rightarrow\) inflated \(SE(\hat{\beta}_j)\). Fix: drop or combine correlated predictors.

Application: From CAPM to Fama-French

CAPM: one factor explains \(\approx 33\%\) of NVDA's daily return variance. What else matters?

Fama-French 5-Factor Model:

  • Mkt-RF: Market excess return
  • SMB: Small-minus-Big (size)
  • HML: High-minus-Low (value)
  • RMW: Robust-minus-Weak (profitability)
  • CMA: Conservative-minus-Aggressive (investment)

\[ R - R_f = \alpha + \beta_1 \text{Mkt} + \beta_2 \text{SMB} + \beta_3 \text{HML} + \beta_4 \text{RMW} + \beta_5 \text{CMA} + \epsilon \]

Why it matters

Each \(\beta_j\) captures a different type of systematic risk. Together they explain more than the market alone.

What Each Factor Means

Factor Long / Short Intuition
Mkt-RF Broad market Core market risk
SMB Small / large caps Small firms tend to outperform
HML High B/M / low B/M Value beats growth
RMW High / low profit Profitable firms earn more
CMA Low / high investment Conservative firms outperform
Key takeaway

For a large-cap growth tech stock like NVDA: expect a large \(\beta_1\), a negative \(\beta_2\) (large cap, not small), and a negative \(\beta_3\) (growth, not value).

Load Fama-French Factors

import pandas_datareader.data as web

# Official Fama-French 5-factor daily data from Ken French's library
ff = web.DataReader("F-F_Research_Data_5_Factors_2x3_daily",
    "famafrench", start="2023-01-01", end="2024-12-31")
ff_factors = ff[0] / 100   # percent -> decimal

# ret = DataFrame of daily stock returns built earlier in the chapter
data_ff = ret[["NVDA"]].join(ff_factors, how="inner")
data_ff["NVDA_excess"] = data_ff["NVDA"] - data_ff["RF"]
print(data_ff.head(3))
Why it matters

pandas_datareader pulls the official Fama-French factors from Ken French’s data library.

Build the Fama-French Regression
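A minimal sketch, reusing data_ff from the loading step above; the factor column names are those returned by pandas_datareader, and model_ff5 is the name used for this fit in the rest of the section.

import statsmodels.api as sm

factors = ["Mkt-RF", "SMB", "HML", "RMW", "CMA"]
X_ff = sm.add_constant(data_ff[factors])   # intercept = alpha
y_ff = data_ff["NVDA_excess"]

model_ff5 = sm.OLS(y_ff, X_ff).fit()
print(model_ff5.summary())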

Key takeaway

The code is identical to simple regression — just pass a list of five factor columns. statsmodels handles everything.

Interpretation. The Fama-French five-factor regression output is read the same way as CAPM — but now with five slopes instead of one. Each coefficient is a partial effect: what happens to NVDA’s excess return when this factor moves by 1%, holding the other four factors constant.

Typical patterns for a large-cap, high-growth technology stock like NVDA:

  • \(\hat{\beta}_{\text{Mkt}} \approx 1.84\): NVDA amplifies market moves nearly 2-to-1, even after controlling for size, value, profitability, and investment.
  • \(\hat{\beta}_{\text{SMB}} \approx -0.32\): Negative loading on SMB means NVDA behaves like a large-cap stock — it underperforms on days when small caps beat large caps. This is consistent with NVDA being a mega-cap in the S&P 500.
  • \(\hat{\beta}_{\text{HML}} \approx -0.65\): Negative loading on HML means NVDA behaves like a growth stock — it underperforms on value-factor up days. Tech stocks trade at high price-to-book ratios, making them anti-correlated with the value premium.
  • \(\hat{\beta}_{\text{RMW}} \approx -0.48\): Negative loading on profitability suggests NVDA is classified as a speculative growth stock — its high price implies investors are paying for future earnings, not current profitability.
  • \(\hat{\alpha} \approx 0.0006\), \(p \approx 0.56\): NVDA’s alpha becomes insignificant once the five factors are controlled. This is the EMH at work: what looked like alpha in CAPM (because CAPM omitted SMB, HML, etc.) is actually just compensation for loading on size, value, and profitability risk factors.

Interpreting the Factor Loadings

Factor \(\beta\) p-value Meaning
const (\(\alpha\)) 0.0006 0.56 No significant alpha
Mkt-RF 1.84 0.000 High market sensitivity
SMB \(-0.32\) 0.03 Behaves like large-cap
HML \(-0.65\) 0.000 Strong growth tilt
RMW \(-0.48\) 0.01 Tilts away from high-profitability firms
CMA \(-0.22\) 0.18 Not significant
Key takeaway

Each \(\beta_j\): “holding other factors constant, a 1-unit increase in this factor changes NVDA excess return by \(\beta_j\).”

Testing Factor Significance

Individual \(t\)-test (same as SLR):

\[ H_0: \beta_j = 0 \quad\text{vs}\quad H_a: \beta_j \ne 0 \qquad t_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \;\sim\; t_{n-k-1} \]

Overall \(F\)-test (are any factors useful?):

\[ H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \qquad F = \frac{SSR/k}{SSE/(n-k-1)} \;\sim\; F_{k,\,n-k-1} \]

Key takeaway

\(F\)-test significant \(\Rightarrow\) at least one factor matters. Individual \(t\)-tests tell you which ones. CMA (\(p = 0.18\)): candidate for dropping.

Interpretation. The F-statistic and individual \(t\)-tests provide complementary information:

  • F-statistic: An F-value of, say, 60 with \(p < 0.001\) means it is essentially impossible that all five factor loadings are zero simultaneously. The five-factor model as a whole is highly significant. This rules out the most degenerate scenario — a model that explains nothing.
  • Individual \(t\)-tests: Now go factor by factor. Mkt-RF (\(p \approx 0.000\)): clearly essential. SMB (\(p \approx 0.03\)): marginally significant. HML (\(p \approx 0.000\)): essential. RMW (\(p \approx 0.01\)): significant. CMA (\(p \approx 0.18\)): not significant at 5%. This is the natural candidate for removal.

The CMA insignificance makes economic sense for NVDA: the Conservative-Minus-Aggressive investment factor penalizes high-capex firms. NVDA’s massive R&D spending and fabless manufacturing model give it a distinctive capex profile, but the CMA exposure may be noise in the simulated data, or genuinely irrelevant for NVDA specifically.

Dropping CMA and re-running as a four-factor model (FF4) is the data-driven recommendation. Confirm with BIC or Adjusted \(R^2\) that FF4 outperforms FF5 on those criteria; the model comparison table at the end of this section does exactly that.

Multicollinearity Check for FF5 Factors

Why it matters

FF5 factors are constructed to be approximately orthogonal, so VIFs are typically low. But HML and CMA can be correlated, because firms’ value and investment characteristics overlap empirically.
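A minimal sketch of the check, reusing data_ff and the same VIF helper as in the generic example earlier:

from statsmodels.stats.outliers_influence import (
    variance_inflation_factor)
import statsmodels.api as sm

factors = ["Mkt-RF", "SMB", "HML", "RMW", "CMA"]
X_vif = sm.add_constant(data_ff[factors])
for i, col in enumerate(factors):               # i + 1 skips the constant
    vif = variance_inflation_factor(X_vif.values, i + 1)
    print(f"{col}: VIF = {vif:.2f}")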

Key takeaway

If VIF \(> 10\): drop or combine. Here all VIFs are low \(\Rightarrow\) multicollinearity is not a problem for NVDA’s factor model.

Interpretation. The VIF output for FF5 factors typically shows values between 1.0 and 2.5 — well below the warning threshold of 5. This is not accidental: Fama and French deliberately constructed these factors to be orthogonal. Each long-short portfolio is designed to capture one dimension of the return cross-section independently of the others.

In contrast, a demographic regression (Income, Disposable Income, Soc Security density) will typically show much higher VIFs because these variables are economically related — rich neighborhoods tend to have high disposable income and high elderly share. High VIFs in the pharmacy regression would explain why income appears insignificant even though you know income predicts pharmacy usage: its effect is being shared with the correlated disposable income variable.

Warning

VIF > 10 does not mean your predictions are wrong. Multicollinearity inflates standard errors on individual coefficients, making their signs unreliable and \(p\)-values misleading. But the model’s overall predictive accuracy (\(R^2\), RMSE, out-of-sample performance) is unaffected by multicollinearity. The pitfall is using a multicollinear model to interpret which predictor “matters most” — the answer changes dramatically when you drop one correlated variable.

CAPM vs. FF5: Does MLR Help?

Key takeaway

Compare using Adjusted R\(^2\) (not R\(^2\), which always increases). If FF5 \(\bar{R}^2 >\) CAPM \(\bar{R}^2\), the extra factors genuinely add explanatory power beyond the market alone.
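A minimal sketch of the comparison, assuming y_ff, data_ff, and model_ff5 from the Fama-French fit above:

import statsmodels.api as sm

# One-factor CAPM fit on the same data for a fair comparison
model_capm = sm.OLS(y_ff, sm.add_constant(data_ff[["Mkt-RF"]])).fit()

print(f"CAPM Adj R^2: {model_capm.rsquared_adj:.3f}")
print(f"FF5  Adj R^2: {model_ff5.rsquared_adj:.3f}")
print(f"CAPM BIC: {model_capm.bic:.1f} | FF5 BIC: {model_ff5.bic:.1f}")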

Interpretation. The CAPM vs FF5 comparison is the central empirical lesson of this chapter. Typical results:

Model \(R^2\) Adj \(R^2\) What it means
CAPM 0.33 0.33 Market explains 33% of NVDA daily variance
FF5 0.58 0.57 Five factors explain 57-58%, a 24-point gain

The 24-percentage-point improvement in adjusted \(R^2\) means that knowing NVDA’s size, value, profitability, and investment exposures adds genuine predictive information beyond the market alone. The improvement is not artifactual (adjusted \(R^2\) already penalizes the four extra parameters). This is the quantitative justification for why institutional managers moved from CAPM to factor models.

From a risk management perspective, the higher-\(R^2\) FF5 model also produces smaller residuals — the idiosyncratic risk after factor adjustment. A lower \(\hat{\sigma}\) means tighter prediction intervals and more precise alpha estimates. The standard error of the FF5 alpha estimate should be smaller than the CAPM alpha standard error, making alpha tests more powerful.

Warning

\(R^2\) always increases when you add variables, so never compare models on raw \(R^2\) alone. Use adjusted \(R^2\), AIC, or BIC. The table above shows that raw \(R^2\) goes from 0.33 to 0.58 (adding four variables always improves raw \(R^2\)), but adjusted \(R^2\) confirms the improvement is genuine: 0.33 to 0.57 after penalizing for the four additional parameters.

CI for \(\mu_Y\) and Prediction Interval in MLR

The SLR formulas extend naturally to MLR. For a new observation \(\mathbf{x}^* = (x_1^*, \ldots, x_k^*)\):

CI for \(\mu_Y \mid \mathbf{x}^*\)

Average NVDA return on all days with these factor values. Width depends on \(SE(\hat{\mu})\) only — shrinks with more data.

Prediction Interval for \(Y\)

Tomorrow’s NVDA return given these factor values. Adds \(\sigma^2\) for individual noise — always wider than CI.
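In statsmodels both intervals come from get_prediction(). A minimal sketch, reusing model_ff5 with an illustrative factor scenario (the values are made up):

import pandas as pd
import statsmodels.api as sm

scenario = pd.DataFrame([{"Mkt-RF": 0.015, "SMB": -0.002, "HML": -0.004,
                          "RMW": 0.000, "CMA": 0.001}])
X_star = sm.add_constant(scenario, has_constant="add")

frame = model_ff5.get_prediction(X_star).summary_frame(alpha=0.05)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper"]])  # CI for mu_Y
print(frame[["obs_ci_lower", "obs_ci_upper"]])            # PI for Y (wider)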

8. Variable Selection

Background: The Information-Theoretic and Bayesian Foundations

Model selection — choosing which predictors to include — is one of the deepest problems in statistics. Two criteria dominate applied practice, each with a distinct theoretical foundation rooted in very different ideas.

AIC (Akaike Information Criterion) was introduced by Hirotugu Akaike in 1973, building on ideas from information theory and Kullback-Leibler divergence. Akaike’s insight was to frame model selection as an estimation problem: which model minimizes the expected information loss when approximating the true data-generating process? The resulting criterion penalizes the log-likelihood by \(2k\) (twice the number of parameters), trading off fit against parsimony. AIC is efficient: in large samples, the model minimizing AIC produces the lowest out-of-sample prediction error. It is the right criterion when your goal is forecasting.

BIC (Bayesian Information Criterion) was introduced by Gideon Schwarz in 1978 using a Bayesian argument. Schwarz asked: which model maximizes the marginal likelihood (the probability of the data under the model, averaged over the prior)? For a flat prior over models, Laplace approximation yields a penalty of \(k \ln n\) — growing with both the number of parameters and the sample size. BIC is consistent: as \(n \to \infty\), BIC selects the true model with probability approaching 1, provided the true model is in the candidate set. It is the right criterion when your goal is identifying the true model structure.

No criterion is simultaneously efficient and consistent — this is a fundamental statistical impossibility (the Yang 2005 impossibility theorem). The practical choice: for forecasting NVDA returns, minimize AIC. For deciding which Fama-French factors genuinely belong in the return-generating process, use BIC (which will typically favor a sparser model, penalizing the extra factors more heavily).

The history of variable selection algorithms: Best subset selection examines all \(2^p\) subsets — feasible for \(p \le 30\) but computationally prohibitive for large \(p\). Forward stepwise selection was introduced as a greedy alternative in the 1960s; backward elimination followed. These greedy algorithms do not guarantee finding the globally optimal model, but they are fast and usually perform well in practice. The 2000s brought LASSO (Tibshirani 1996), which combines continuous variable selection with coefficient shrinkage via an L1 penalty — effectively solving the best subset problem as a convex optimization. The 2010s brought elastic net, group LASSO, and many ML extensions. The examples in this chapter implement the classical best-subset approach in plain Python with statsmodels; for large-\(p\) settings, LASSO (sklearn.linear_model.Lasso) is the modern benchmark.
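For completeness, a minimal LASSO sketch under assumed names (a feature DataFrame X_df and response y; the penalty alpha=0.01 is illustrative and would normally be chosen by cross-validation):

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X_df)   # LASSO needs standardized features
lasso = Lasso(alpha=0.01).fit(X_std, y)

# Variables with non-zero coefficients survive the L1 penalty
selected = [col for col, coef in zip(X_df.columns, lasso.coef_) if coef != 0]
print("Kept by LASSO:", selected)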

In practice

At AQR and BlackRock’s factor research teams, model selection is never done purely by AIC/BIC on a single dataset. Economic intuition (is there a risk-based or behavioral story for this factor?), replication across different markets and time periods, and out-of-sample backtesting all inform which factors make it into a production model. Pure statistical selection from historical data without economic discipline is a recipe for data mining and false discoveries.

Overall F-Test: Is the Model Significant?

Test whether any predictor is useful:

\[ H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \quad\text{vs}\quad H_a: \text{at least one } \beta_j \neq 0 \]

\[ F = \frac{SSR/k}{SSE/(n-k-1)} = \frac{R^2/k}{(1-R^2)/(n-k-1)} \;\sim\; F_{k,\,n-k-1} \]

Key takeaway

The F-test appears in every model.summary(). An insignificant F (p \(>\) 0.05) means there is no evidence that any of your variables explains the response; the model as a whole is not useful.
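A minimal sketch of where to read these numbers programmatically, using the FF5 fit (model_ff5) from earlier:

# Overall F-test, reported on every fitted statsmodels result
print(f"F-statistic: {model_ff5.fvalue:.1f}")
print(f"F-test p-value: {model_ff5.f_pvalue:.2e}")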

Best Subset Selection: The Idea

Goal: Find the single best model of each size \(k = 1, 2, \ldots, p\).

Algorithm:

  1. For each subset size \(k\), enumerate all \(\binom{p}{k}\) models
  2. Keep the best model of each size (lowest SSE) \(\Rightarrow\) \(p\) candidate models
  3. Select among candidates using a criterion that penalises complexity: AIC, BIC, Adj-R\(^2\), or Mallows’ \(C_p\)

Visual: plot every candidate model’s Adj \(R^2\) vs \(k\); an upper-envelope “frontier” gives the best at each size, and the optimal complexity is the peak of that frontier.

Best Subset Selection in Plain Python

The algorithm is short enough to write from scratch in 25 lines of statsmodels and itertools. For each subset size \(k\) we enumerate all \(\binom{p}{k}\) models, fit each with OLS, and record AIC, BIC, and adjusted \(R^2\). Then we pick the overall winner per criterion.
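A minimal sketch of that loop, assuming data_ff with the NVDA_excess response and the five factor columns as candidates:

from itertools import combinations
import pandas as pd
import statsmodels.api as sm

candidates = ["Mkt-RF", "SMB", "HML", "RMW", "CMA"]
y_bs = data_ff["NVDA_excess"]

rows = []
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):      # all C(p, k) models of size k
        fit = sm.OLS(y_bs, sm.add_constant(data_ff[list(subset)])).fit()
        rows.append({"k": k, "vars": subset, "aic": fit.aic,
                     "bic": fit.bic, "adj_r2": fit.rsquared_adj})

results = pd.DataFrame(rows)
print(results.sort_values("aic").head(1))                      # winner by AIC
print(results.sort_values("bic").head(1))                      # winner by BIC
print(results.sort_values("adj_r2", ascending=False).head(1))  # winner by Adj R^2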

Why it matters

BIC penalises complexity more than AIC — it tends to pick a sparser model. They may agree (good) or disagree (check both).

Model Selection Criteria

Recall: \(SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\) (sum of squared residuals), \(k\) = number of predictors.

Criterion Formula Penalises Select
Adj R\(^2\) \(1 - \tfrac{(1-R^2)(n-1)}{n-k-1}\) Weak Maximise
AIC \(n\ln(\tfrac{SSE}{n}) + 2k\) Moderate Minimise
BIC \(n\ln(\tfrac{SSE}{n}) + k\ln n\) Strong Minimise
\(C_p\) \(\tfrac{SSE}{\hat\sigma^2} - n + 2k\) Moderate \(C_p \approx k\)
Key takeaway

AIC optimises prediction accuracy. BIC optimises model identification (consistent). For small samples (\(n < 40\)) use BIC. For forecasting use AIC (or cross-validation).
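The table's AIC and BIC formulas can also be computed by hand. A minimal sketch; note that statsmodels' fit.aic and fit.bic use the log-likelihood version, which differs from these formulas by an additive constant that is identical for every model fit on the same data, so model rankings agree even though the raw numbers do not.

import numpy as np

def aic_sse(sse, n, k):
    # AIC from the residual sum of squares; k = number of predictors
    return n * np.log(sse / n) + 2 * k

def bic_sse(sse, n, k):
    # BIC: same fit term, heavier k*ln(n) penalty
    return n * np.log(sse / n) + k * np.log(n)

# With a fitted statsmodels result `fit`:
#   sse = fit.ssr; n = int(fit.nobs); k = int(fit.df_model)
#   print(aic_sse(sse, n, k), bic_sse(sse, n, k))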

Reading the Criteria Table

What does “Penalises” mean? Each criterion has a fit term (always improves with more \(x\)’s) and a penalty term that punishes complexity. “Penalises” rates how strongly that penalty pushes back.

Why “Select: \(C_p \approx k\)”?

Theorem: if the model is correctly specified, \(E[C_p] \approx k\). So we look for \(C_p\) close to its own \(k\):

  • \(C_p \gg k\) \(\Rightarrow\) missing important variables (biased)
  • \(C_p \approx k\) \(\Rightarrow\) well-specified
  • \(C_p < k\) \(\Rightarrow\) possibly overfitting

What does “consistent” mean for BIC?

As \(n \to \infty\), BIC picks the true model with probability \(\to 1\). AIC does not.

Why? BIC’s penalty \(k\ln n\) grows with \(n\), eventually crushing any spurious variable. AIC’s \(2k\) stays fixed.

Trade-off: AIC is efficient (best predictions); BIC is consistent (right model). No criterion is both.

Key takeaway

For forecasting NVDA returns → AIC. For which factors really drive returns (inference) → BIC.

Model Comparison: CAPM vs. FF4 vs. FF5

Model \(k\) Train Adj R\(^2\) Test R\(^2\) BIC rank
CAPM 1 0.543 0.521 3
FF4 4 0.582 0.563 1
FF5 5 0.581 0.558 2
  • FF4 wins on all criteria
  • Dropping CMA improves test R\(^2\) (less overfitting)
  • CAPM is too simple (high bias)
Key takeaway

Best model = FF4. Extra factors add variance without reducing bias. Occam’s razor confirmed by data.

9. Business Applications

Application 1: Predict NVDA Return on a New Day

Scenario: Market up 1.5% — what do we expect from NVDA?

Why it matters

With \(\beta \approx 2.33\), on a \(+1.5\%\) market day NVDA is predicted to gain \(\approx +3.5\%\).
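A minimal sketch of the prediction code block discussed below, following the FF5 workflow; the day's factor realizations and the risk-free rate are illustrative numbers, and model_ff5 is the fit from earlier in the section.

import pandas as pd
import statsmodels.api as sm

day = pd.DataFrame([{"Mkt-RF": 0.015,   # market up 1.5%
                     "SMB": -0.003,     # small caps slightly down
                     "HML": -0.006,     # growth outperforming value
                     "RMW": 0.000,
                     "CMA": 0.000}])
X_day = sm.add_constant(day, has_constant="add")

excess_hat = float(model_ff5.predict(X_day)[0])  # predicted excess return
rf_today = 0.0002                                # illustrative daily risk-free rate
print(f"Predicted NVDA total return: {excess_hat + rf_today:.2%}")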

Interpretation. This single prediction code block implements the daily workflow of a quantitative equity trader:

  1. Observe the day’s factor realizations (market up 1.5%, small caps slightly down, growth outperforming value).
  2. Plug into the fitted FF5 model to get the point forecast.
  3. Add back \(R_f\) to convert from excess return to total return.

The output, roughly \(+3.5\%\) expected total return, is not a trading signal by itself — it is the benchmark prediction given the day’s factor moves. A trader who thinks NVDA will return \(+5\%\) on this day is implicitly betting on \(+1.5\%\) of positive alpha above what the factors explain. The model makes that alpha explicit and measurable.

In practice

This exact workflow — factor-model expected return as baseline, with position sizing based on the alpha conviction — is the core logic of statistical arbitrage at quantitative equity funds. The model does not need to predict absolute returns accurately; it only needs to identify which stocks will outperform or underperform their factor-predicted returns. That relative signal, aggregated across hundreds of stocks, generates consistent alpha even when individual predictions are noisy.

Application 2: Test for Significant Alpha

Question: Does NVDA consistently outperform the Fama-French model?
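A minimal sketch of the test, reading the intercept ("const") from the fitted FF5 model (model_ff5) above:

# Alpha = the intercept of the factor regression
alpha_hat = model_ff5.params["const"]
alpha_p = model_ff5.pvalues["const"]

print(f"alpha (daily): {alpha_hat:.4f}, annualized ~ {alpha_hat * 252:.1%}")
print(f"p-value: {alpha_p:.3f}")
if alpha_p < 0.05:
    print("Significant alpha beyond the five factor exposures")
else:
    print("No significant alpha: returns explained by the factor loadings")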

Key takeaway

A significant \(\alpha\) means NVDA has genuine excess returns beyond its factor exposures. In an efficient market, this is rare.

Interpretation. The alpha test is the regression’s most consequential output for a fund manager. Break down the logic:

  • \(\hat{\alpha} \approx 0.0006\) daily: annualized, this is approximately \(0.0006 \times 252 \approx 15\%\) per year. That sounds extraordinary — because it is. In the simulation, the DGP has near-zero true alpha, so this result will typically be insignificant (\(p > 0.05\)), which is the correct finding for an efficient market.
  • If \(p > 0.05\): “No significant alpha” means we cannot reject the hypothesis that NVDA’s returns are fully explained by its five-factor risk loadings. This does not mean NVDA has no alpha — it means the data do not provide strong enough evidence to conclude it does. Absence of evidence is not evidence of absence.
  • If \(p < 0.05\): This is the rare and exciting case. It means NVDA has returned more than its risk exposures justify — a genuine free lunch above market, size, value, profitability, and investment risk. In practice, this gets arbitraged away quickly once it becomes known.

The transition from \(p_{\alpha} < 0.05\) in CAPM to \(p_{\alpha} > 0.05\) in FF5 is itself informative: NVDA’s apparent “alpha” over CAPM turns out to be a factor exposure (negative HML loading — growth orientation). Once you control for that, the alpha disappears. This is the Fama-French model’s primary contribution: correctly attributing apparent alpha to risk factor exposures.

Warning

Endogeneity and omitted variable bias in factor models: Even the FF5 model is not the “true” model. If a missing factor (momentum, quality, low-volatility) correlates with both NVDA’s returns and the included factors, then \(\hat{\alpha}\) is biased — omitted-variable bias means OLS estimates are no longer unbiased when a relevant, correlated variable is left out. The alpha estimate from any factor model should be interpreted cautiously, with awareness that the “unexplained” component may reflect missing risk factors rather than genuine skill or market inefficiency.

Try It! — Analyse Your Own Stock

Try it!

Pick any stock (AAPL, TSLA, MSFT, AMD) and work through the full pipeline:

  1. Load its prices from the book’s data/ folder (AAPL, NVDA, SPY are included) or your own CSV with pd.read_csv()
  2. Load FF5 factors from data/ff5_daily_2023_2024.csv
  3. Fit CAPM and FF5 models
  4. Compare Adjusted R\(^2\) — do extra factors help?
  5. Is there significant alpha?
  6. Aggressive (\(\beta>1\)) or defensive (\(\beta<1\))?

Hint: Start with tickers = ["AAPL", "SPY"] and reuse the pipeline.

Takeaway — The Regression Pipeline

Pipeline: Explore → Visualize → Build → Evaluate → Select → Apply

Step Tool
Explore .corr()
Visualize regplot
Build sm.OLS
Evaluate R\(^2\), RMSE
Select p-values, AIC/BIC
Apply .predict()
Key takeaway

SLR = CAPM: one market factor, \(\beta\) measures risk exposure.

MLR = Fama-French: five factors explain more variation in returns.

Variable selection: only keep factors that are statistically significant.

Next: Topic 4 — Clustering (K-Means & Hierarchical).


Chapter Summary

The Complete Regression Workflow

Every regression analysis follows the same seven-step pipeline. Skipping any stage is a professional error.

1. EXPLORE → scatter plots, correlations, descriptive stats
        ↓
2. MODEL → specify y = α + β₁x₁ + ... + βₖxₖ + ε
        ↓
3. ESTIMATE → OLS: minimize Σ(yᵢ - ŷᵢ)²
        ↓
4. INFER → t-tests, p-values, confidence intervals for each βⱼ
        ↓
5. DIAGNOSE → LINE residual plots: L (linearity), I (independence),
              N (normality), E (equal variance)
        ↓
6. SELECT → Compare models: Adj-R², AIC, BIC, Cp
        ↓
7. PREDICT → point forecast ŷ*, CI for μ_Y, PI for individual Y

Top 10 Things to Check in Any Regression Output

When you receive a regression table — in an academic paper, a consultant’s report, or your own code — work through this checklist before trusting any number:

  1. F-statistic and overall p-value: Is the model significant at all? If \(p > 0.05\), stop — none of the variables may matter.
  2. Adjusted R²: How much variation does the model explain, penalized for complexity? Compare across models with different \(k\).
  3. Individual p-values: Which coefficients are statistically significant? Be cautious about over-reliance on a 0.05 threshold in the presence of multiple testing.
  4. Coefficient signs: Do they match economic intuition? A negative income effect on pharmacy profit would be a red flag requiring explanation.
  5. Coefficient magnitudes: Are they economically meaningful? A statistically significant but infinitesimally small coefficient may not matter in practice.
  6. Confidence intervals: Do any CIs contain zero? A CI that barely misses zero is less compelling than one centered far from zero.
  7. VIF for multicollinearity: Any VIF > 10 invalidates individual coefficient interpretation. Report and address.
  8. Residual plots (LINE): Did you check linearity, independence, normality, and equal variance? A regression without diagnostic plots is incomplete.
  9. Train vs. test performance: Is R² on held-out data close to in-sample R²? A large gap signals overfitting.
  10. Sample size and data quality: Were outliers examined? Is the data representative? Are there structural breaks (e.g., the COVID period changing all financial relationships)?

When NOT to Use Linear Regression

Linear regression is powerful but not universal. Recognize when to reach for a different tool:

  • Binary outcome (\(y \in \{0,1\}\)): Use logistic regression. OLS can predict probabilities outside \([0,1]\), producing nonsensical results.
  • Count outcome (\(y = 0, 1, 2, \ldots\)): Use Poisson regression. OLS may predict negative counts.
  • Highly nonlinear relationships: If residual plots show clear curvature and polynomial terms don’t fix it, consider decision trees, gradient boosting (XGBoost), or neural networks. Linear regression imposes a global linear approximation; tree models adapt locally.
  • Very high-dimensional predictors (\(p \gg n\)): Best subset selection is infeasible. Use LASSO, Ridge, or Elastic Net — regularized regression that handles hundreds or thousands of predictors.
  • Time series with complex dynamics: Stock returns with GARCH volatility or mean reversion require GARCH/ARIMA models. OLS applied to serially correlated returns produces unreliable standard errors.
  • Causal inference: If you need to estimate a causal effect (not just a predictive association), linear regression alone is insufficient. You need instrumental variables, difference-in-differences, regression discontinuity, or randomized assignment. The coefficient on an observational regressor is generally not a causal effect.
Warning

The most common misuse of regression in business: treating a statistically significant coefficient as evidence of a causal relationship. Income predicts pharmacy profit, but building a pharmacy in a high-income area will not cause income to rise. The coefficient reflects a correlational pattern, not a causal mechanism. Always ask: could there be a third variable driving both \(X\) and \(Y\)?

What’s Next: Chapter 4 — Clustering

Chapter 3 addressed supervised learning: we always had a target variable \(Y\) (NVDA return, pharmacy profit) and used regression to predict it from \(X\) features. The regression line is supervised by the labels.

Chapter 4 introduces unsupervised learning: there is no \(Y\). Instead, we discover hidden structure in \(X\) alone. K-means clustering partitions observations into groups based on proximity in feature space — no return to predict, no profit to explain, just “which observations are similar to each other?” The distance concept in K-means is the geometric cousin of OLS: both minimize a sum of squared distances, just in different directions.

The business applications shift accordingly: instead of predicting a customer’s lifetime value (regression), we ask “which customers are similar enough to segment together?” Market segmentation, portfolio construction by risk-return profile, and fraud detection all rely on clustering.

Further Reading

For deeper treatment of the topics in this chapter:

  • Wooldridge, J.M. (2019). Introductory Econometrics: A Modern Approach (7th ed.). The standard undergraduate econometrics textbook — thorough on inference, panel data, and time series.
  • Greene, W.H. (2018). Econometric Analysis (8th ed.). The graduate-level reference — covers GLS, IV, panel models, and limited dependent variables rigorously.
  • Stock, J.H. and Watson, M.W. (2019). Introduction to Econometrics (4th ed.). Accessible introduction with strong emphasis on causal inference and policy applications.
  • Fama, E.F. and French, K.R. (2015). “A five-factor asset pricing model.” Journal of Financial Economics, 116(1), 1-22. The original five-factor paper.
  • Harvey, C.R., Liu, Y., and Zhu, H. (2016). “… and the Cross-Section of Expected Returns.” Review of Financial Studies, 29(1), 5-68. The multiple testing problem in factor research.
  • Pearl, J. and Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Accessible introduction to causal inference and the difference between correlation and causation.
  • Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). The bridge between classical regression and modern machine learning — covers LASSO, ridge regression, and variable selection comprehensively.
