
Chapter 2: DataFrames

Chapter Introduction

Data analysis in the real world looks nothing like a clean textbook exercise. Industry surveys of data scientists repeatedly find that roughly 80% of working time is spent cleaning, reshaping, and validating data — not building models. Before any machine learning algorithm runs, before any chart gets drawn, before any insight can be communicated, someone has to wrestle raw data into a usable shape. That someone is you, and the primary weapon is the pandas DataFrame.

A brief history: from relational databases to DataFrames

The conceptual foundation for tabular data was laid in 1970 when Edgar Codd, a computer scientist at IBM, published “A Relational Model of Data for Large Shared Data Banks.” Codd’s insight was deceptively simple: organize data as tables (relations) where rows are observations and columns are attributes. From that paper came SQL, the language that powered every enterprise database for the next fifty years — Oracle, MySQL, PostgreSQL, SQL Server. If you have ever heard of a “database schema,” a “primary key,” or a “JOIN,” you are living inside Codd’s 1970 idea.

SQL is powerful, but it requires a database server, a connection, and a query language separate from your analytical code. In the mid-2000s, Wes McKinney was a quantitative analyst at AQR Capital Management, tired of context-switching between Excel and statistical tools. He began building pandas in 2008 and open-sourced it in 2009: a Python library built around a single central object, the DataFrame. A DataFrame is a relational table — rows are observations, columns are variables — but it lives inside your Python program, responds to Python syntax, and integrates directly with NumPy, matplotlib, scikit-learn, and every other tool in the data science ecosystem.

The killer feature was making DataFrames first-class Python objects that support mixed data types (numbers, strings, dates, booleans in the same table), carry column names and row labels, handle missing values natively, and compute element-wise arithmetic without explicit loops. SQL does roughly the same things, but with pandas you write df['Profit'] = df['Revenue'] - df['Cost'] instead of a SELECT statement — and the result is immediately available for the next line of analysis.

Why pandas conquered analytics

Three properties sealed the victory. First, mixed types: a single DataFrame can hold dates, categorical strings, floats, and integers simultaneously, which no NumPy array can do natively. Second, the labeled index: every row and column has a name, so df.loc['2020-01-15', 'Close'] is readable by a human and unambiguous to the machine. Third, time-awareness: date and time indices unlock resampling, rolling windows, lag operators, and date-range slicing — operations that are painful in SQL and trivial in pandas.

Today, pandas is the lingua franca of data analysis in Python. Every major data science library — scikit-learn, statsmodels, seaborn, plotly, XGBoost, PyTorch Lightning — accepts or returns a DataFrame. Learning pandas fluently is not optional for a data analyst; it is the entry ticket.

The data-to-insight pipeline

Every analysis, from a one-page sales report to a production machine learning system, follows the same six-step pipeline:

  1. Load — pull data from CSV, Excel, a SQL database, or an API.
  2. Explore — understand shape, types, and basic statistics.
  3. Clean — handle missing values, fix dtypes, remove duplicates.
  4. Transform — compute new features, reshape, join tables.
  5. Analyze — run statistics, models, or aggregations.
  6. Visualize — communicate findings with charts.

This chapter covers every step. You will learn to load and explore DataFrames, select and filter rows and columns, engineer new features, handle missing data, plot results, standardize variables, and work with time series. These are the exact operations used daily by analysts at Goldman Sachs, Google, Airbnb, and every quantitative hedge fund.

Why this chapter is weighted at 40% of the course

The skills in this chapter appear in every subsequent topic. Linear regression (Chapter 3) requires feature engineering and train/test splitting — both covered here. Clustering (Chapter 4) requires standardization and plotting — both here. If you are shaky on DataFrames, every downstream technique becomes harder. Invest the time now: every hour here pays back ten hours later.


Table of Contents

  1. Loading and Exploring Data
  2. Selecting Data
  3. Creating and Modifying Data — and Method Chaining
  4. Key Statistical Methods
  5. Handling Missing Values
  6. Plotting with Pandas
  7. Standardization
  8. Time Series Methods

Loading and Exploring Data

Background: where does business data live?

Every real-world dataset starts its life somewhere outside Python. It may be a CSV exported from an ERP system (SAP, Oracle), a table in a MySQL or PostgreSQL database, a REST API response from Stripe or Salesforce, a spreadsheet emailed by a client, or a parquet file sitting in AWS S3. The first step of every analysis is always the same: get the data into a DataFrame.

pd.read_csv() handles the most common case. For databases, pd.read_sql() accepts a SQL query and a connection. For Excel files, pd.read_excel(). For JSON, pd.read_json(). pandas abstracts away the source; once the data is loaded, all subsequent operations are identical regardless of where it came from.

After loading, the very first thing a disciplined analyst does is look at the data. This sounds obvious, but it is routinely skipped — at great cost. Missing values, wrong dtypes, date columns stored as strings, price columns with currency symbols, columns with 90% zeros: none of these show up until you look. The habits of .head(), .shape, .dtypes, and .describe() are not bureaucratic rituals; they are the fastest path to catching problems before they corrupt your results.

Data quality has four dimensions a good analyst checks: completeness (are values missing?), consistency (does the same entity appear with the same label everywhere?), accuracy (are the values plausible — is a human age of 300 in the data?), and timeliness (is the data fresh enough for the question being asked?). None of these can be assessed without first loading and exploring.

In this section you will practice loading a synthetic S&P 500 dataset and running the standard first-look commands. By the end, these five lines — df.head(), df.tail(), df.shape, df.dtypes, df.describe() — should be as automatic as saving a file.

What Is a DataFrame?

Why it matters

A DataFrame is like an Excel spreadsheet in Python — rows, columns, and an index. Each column is a Series.

import pandas as pd

# Load from a local CSV file
df = pd.read_csv("data.csv")

# Load the course's S&P 500 dataset (hosted alongside this book)
df = pd.read_csv("https://busanalytics-book.pages.dev/data/sp500.csv", parse_dates=["Date"], index_col="Date")

Schema illustration:

Index Open High Low Close
2020-01 3244 3285 3214 3258
2020-02 3248 3260 3225 3245
2020-03 3230 3251 3210 3240

Rows = observations; columns = variables (each column is a Series).

First Look at Your Data

Always do this first — just as you say hello when meeting someone new, look at the data before you analyze it.
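
A minimal sketch of the standard first-look commands, assuming df is the S&P 500 DataFrame loaded above:

df.head()       # first 5 rows — do the columns and values look sensible?
df.tail()       # last 5 rows — does the data end where you expect?
df.shape        # (number of rows, number of columns)
df.dtypes       # type of each column — dates should be datetime, prices float
df.columns      # column names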

Key takeaway

Step 1 of every analysis: Load the data, then run .head(), .shape, and .dtypes to understand what you have.

Summary Statistics with .describe()

  • count — number of non-missing values
  • mean — the average; std — the spread around the mean
  • min / max — the extremes
  • 25% / 50% / 75% — the quartiles
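
As a sketch, a single call produces the full summary table for the df loaded earlier:

df.describe()              # count, mean, std, min, quartiles, max for every numeric column
df['Close'].describe()     # the same summary for one column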

Interpreting .describe() output: The summary table tells a rich story about the data’s distribution. count tells you immediately if there are missing values — if it is less than the total number of rows, some observations are NaN. The mean and std together let you check plausibility: for an S&P 500 price series, a mean near 3,200 and a standard deviation near 250 are entirely consistent with the 2019–2021 period. The 25th, 50th, and 75th percentiles reveal shape: if the mean (3,217) is close to the median (50th percentile), the distribution is roughly symmetric; a large gap signals skewness or outliers. The min and max are your outlier flags — if the minimum price is negative or the maximum is ten times the mean, something went wrong at data entry. Get in the habit of reading these numbers before writing a single line of analysis code.

Common pitfall: dtype inference

pd.read_csv() guesses column types automatically. A price column with one bad row containing "N/A" (as a string) will be read as object (text) instead of float64, causing every arithmetic operation to fail silently or raise an error. Always check df.dtypes after loading and explicitly cast suspicious columns: df['Price'] = pd.to_numeric(df['Price'], errors='coerce').

Try It! — Load and Explore

Try it!

Load the S&P 500 data and answer:

  1. How many rows and columns?
  2. What are the column names?
  3. What is the average closing price?
  4. What is the maximum trading volume?

Hint: df.shape, df.columns, df['Close'].mean(), df['Volume'].max()

In practice

At a hedge fund, the first thing a quantitative analyst does with any new data feed is run .describe() on every numeric column and .value_counts() on every categorical column. This “data quality pass” often takes 20–30 minutes and catches problems that would otherwise corrupt weeks of backtest results. At Renaissance Technologies, reportedly, data cleaning consumes more engineering time than model development. The same discipline applies to any analytics role — from e-commerce dashboards at Shopify to risk reports at JPMorgan.

Selecting Data

Background: selection as relational algebra

The theory behind selecting subsets of data was formalized by Codd in the same 1970 paper that gave us relational databases. He called the two fundamental operations projection (selecting columns) and selection (filtering rows). In SQL you write SELECT col1, col2 FROM table WHERE condition. In pandas the equivalent is df[['col1', 'col2']] for projection and df[df['col'] > threshold] for selection. The ideas are identical; the syntax differs.

Understanding this duality matters because it frames the why behind pandas syntax choices. df['Close'] with single brackets returns a Series — a single column vector — because you projected one dimension. df[['Close', 'Volume']] with double brackets returns a DataFrame — a two-dimensional table — because you projected multiple columns. The extra set of brackets is not a quirk; it is the distinction between extracting one attribute (a list) versus keeping a table structure intact.

Row selection comes in two flavors that correspond to different mental models. .loc[] (label-based) says “give me rows where the index label is X” — this is how humans naturally think about data: “give me January 2020 data,” “give me the row for Apple.” .iloc[] (integer-position-based) says “give me row number N” — this is how a computer thinks about data. When your index is a date or a company ticker, .loc[] produces readable code; .iloc[] is best reserved for positional slicing like train/test splits.

In real-world analytics, filtering rows is how you build customer cohorts (users in a certain age group, with above-median spending, who churned in Q3), event studies (market days with returns below −2%), and compliance reports (transactions above regulatory thresholds). Mastering boolean indexing is mastering the language of data segmentation.

Select One Column — Returns a Series

Why it matters

Business: “I only want to see the Revenue column.” Single brackets = single column = Series.
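
A minimal sketch, assuming the S&P 500 DataFrame df from the loading section:

close_prices = df['Close']     # single brackets → a Series
type(close_prices)             # pandas.core.series.Series
close_prices.mean()            # Series methods work directly, no DataFrame needed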

Interpretation: When you extract a single column, you get a pandas Series — a one-dimensional labeled array. The output shows both the data values and the row index (integers by default). Notice that the dtype shown at the bottom is int64 — the column preserves its type even when extracted. This matters because all Series methods (.mean(), .std(), .plot()) now work directly on close_prices without referencing the original DataFrame. In production code, extracting the column you need early and working with it as a named Series makes your code more readable and your intent clearer to colleagues reviewing your work.

Select Multiple Columns — Returns a DataFrame

Key takeaway

Single brackets df['col'] → Series

Double brackets df[['col1','col2']] → DataFrame
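
And the two-column version as a sketch, using the same df:

prices = df[['Close', 'Volume']]    # double brackets → a DataFrame with two columns
type(prices)                        # pandas.core.frame.DataFrame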

Select Rows

Key takeaway

.iloc[] selects by integer position (just like list indexing). Use it when you know the row number.
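
A few positional selections as a sketch:

df.iloc[0]       # first row (returned as a Series)
df.iloc[-1]      # last row
df.iloc[0:5]     # first five rows (a DataFrame)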

Conditional Selection — Filtering Rows

How it works:

  1. df['Return'] < -2 creates a boolean mask (True/False for each row)
  2. df[mask] keeps only the True rows
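
A sketch of this crash-day filter, assuming a Return column (daily change in percent) has already been added to df:

mask = df['Return'] < -2     # boolean Series: True where the day lost more than 2%
crash_days = df[mask]        # keep only the True rows
len(crash_days)              # how many crash days are in the sample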

Interpretation: The crash-day filter implements exactly the kind of event study an equity researcher runs routinely. Calling len(crash_days) tells you the frequency of tail events. In a normally distributed daily return series with mean 0% and standard deviation 1%, we expect about 2.3% of days to fall below −2 — roughly 6 out of every 250 trading days (one year). If you see many more crash days than that in your data, either the distribution is fat-tailed (common in real markets — the S&P 500 has many more large drops than a normal distribution predicts) or there is a data error. This single filter is the starting point for questions like: “Do crash days cluster? Do they follow high-volume days? Does volume spike before or after?”

Common pitfall: SettingWithCopyWarning

When you filter a DataFrame and then try to modify the result, pandas may issue a SettingWithCopyWarning. This happens because the filtered result might be a view of the original data, not an independent copy. To be safe, call .copy() explicitly: crash_days = df[df['Return'] < -2].copy(). After that, modifying crash_days will never affect df.

Multiple Conditions

Key takeaway

Use & for AND, | for OR.

Important: Each condition must be in parentheses!

Try it!

Find customers with income above median AND spending above $500.

Hint: df[(df['Income'] > df['Income'].median()) & (df['Spending'] > 500)]

Label-Based Selection with .loc[]

Key takeaway

.loc[] = label names. .iloc[] = integer positions.

When the index is dates, .loc['2020-01':'2020-03'] is far more readable than .iloc[0:65].
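
A sketch of two-dimensional selection, assuming df has a datetime index that contains these dates:

df.loc['2020-01-02', 'Close']                 # row labeled 2020-01-02, column labeled Close
df.iloc[0, 3]                                 # first row, fourth column by position (Close, in this dataset's column order)
df.loc['2020-01-02':'2020-01-03', 'Close']    # a short date-range slice of one column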

Interpretation: This code block introduces the two-dimensional nature of .loc[] and .iloc[]. The key insight is that both accept [row_selector, column_selector] as a pair. With .loc[], df.loc['2020-01-02', 'Close'] reads as “row labeled 2020-01-02, column labeled Close” — it is self-documenting. With .iloc[], df.iloc[0, 3] means “first row, fourth column” — fast, but opaque. In production data pipelines, .loc[] is almost always preferred because the code remains meaningful even when the underlying data is reordered or filtered. The date-range slicing df.loc['2020-01-02':'2020-01-03'] is a particularly powerful pattern that becomes invaluable when working with years of financial, web, or sensor data.

In practice

At an asset management firm, a quant analyst routinely writes returns = price_df.loc['2015-01-01':'2023-12-31', 'Close'].pct_change() to extract exactly the backtesting window needed. Without a proper datetime index and .loc[], this would require explicit row-number arithmetic that breaks the moment the dataset is updated or extended. Date-labeled indexing is one of the most pragmatically valuable features in pandas for anyone working with financial, economic, or any time-stamped business data.

.loc[] is Inclusive, .iloc[] is Exclusive

Important

.loc['b':'d'] returns rows b, c, d — the end label 'd' is included.

.iloc[1:3] returns positions 1, 2 — the end index 3 is excluded (just like Python list slicing).
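
A tiny DataFrame makes the asymmetry easy to verify (the letter index here is purely illustrative):

import pandas as pd

small = pd.DataFrame({'x': [10, 20, 30, 40, 50]}, index=['a', 'b', 'c', 'd', 'e'])
small.loc['b':'d']    # rows b, c, d — the end label is included
small.iloc[1:3]       # rows b, c only — the end position is excluded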

Interpretation: This asymmetry trips up almost every new pandas user at least once. The reasoning behind it is pragmatic: .loc[] uses index labels, and when you write a date range like '2020-01-01':'2020-12-31' you naturally mean including December 31st. .iloc[] inherits Python’s slice convention (a[1:3] gives positions 1 and 2) to stay consistent with the rest of the language. The practical rule: when in doubt, run the code on a small DataFrame like this one and count the rows returned. Debugging a fence-post error on toy data takes ten seconds; debugging it inside a 500-line pipeline can take hours.

Train / Test Split — Time Series

For time-series data we split chronologically: the earliest observations train the model, the most recent observations test it. Throughout this course we use the 80/20 convention: the first 80% of observations form the training set and the final 20% form the test set.

Why it matters

For time series: never shuffle — temporal order matters. Shuffling would leak future information into the training set.
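
A chronological 80/20 split as a sketch, assuming the rows of df are already sorted by date:

train_size = int(len(df) * 0.8)    # first 80% of observations
train = df.iloc[:train_size]       # earliest data → training set
test = df.iloc[train_size:]        # most recent data → test set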

Background: a brief history of the train/test split

The practice of holding out data for evaluation predates machine learning. Stone (1974) formalized cross-validation as a general method for model selection — the idea that you should assess a model on data it has never seen during fitting. In time-series settings, there is an additional constraint: the train and test sets must respect the arrow of time. If you train on data from 2015–2024 and test on data from 2010, your model has effectively seen the future — a form of data leakage that produces wildly optimistic performance estimates.

The 80/20 split is a rule of thumb. In practice, the split ratio depends on dataset size (more data → you can afford a larger test set), the stability of the underlying process (rapidly changing systems require more recent test data), and computational constraints. Cross-sectional data (companies, customers, products at a single point in time) has no temporal structure, so random shuffling before splitting is not only safe but required to avoid accidental stratification.

Common pitfall: time-series shuffle leakage

For time-series data, never use df.sample(frac=1) before splitting. If you shuffle S&P 500 daily data and use 2020 rows in training while predicting 2019 rows in the test set, your model has “seen the future.” It will appear to predict perfectly — but only because it memorized outcomes from beyond the test period. In trading, this is called look-ahead bias and it is the most common cause of backtests that look great but fail in live trading.

Train / Test Split — Cross-Sectional (Shuffle)

Key takeaway

Cross-sectional data has no time order → shuffle with .sample(frac=1, random_state=42) before splitting. random_state makes the shuffle reproducible.
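
For cross-sectional data the equivalent sketch shuffles first (customers_df is a hypothetical customer table, not part of the S&P 500 dataset):

shuffled = customers_df.sample(frac=1, random_state=42)    # reproducible random shuffle
train_size = int(len(shuffled) * 0.8)
train = shuffled.iloc[:train_size]
test = shuffled.iloc[train_size:]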

Creating and Modifying Data — and Method Chaining

Background: feature engineering as the source of most ML wins

“Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering.” This observation — widely attributed to Andrew Ng, co-founder of Google Brain — captures something important: the quality of a model’s inputs matters more than the sophistication of the algorithm. A simple linear regression on well-engineered features frequently outperforms a deep neural network on raw inputs. The pandas operations in this section — creating computed columns, building categorical indicators, and chaining transformations — are the core tools of feature engineering.

In practice, most features are derived from existing data by combining columns arithmetically, binning continuous variables into categories, encoding time-based patterns, or applying domain-specific transformations. Examples from real pipelines: a retail company adds Margin = (Revenue - Cost) / Revenue * 100 to identify underperforming products; a bank adds Debt_to_Income = TotalDebt / AnnualIncome to predict default risk; a quantitative fund adds Return_5d = Close.pct_change(5) as a momentum signal. None of these features exist in the raw data — they are created by the analyst.

One subtle but important cost of in-place modification (df['col'] = value) is that it mutates the original DataFrame permanently within the notebook session. This is usually what you want, but in complex pipelines where the same DataFrame is used in multiple branches, it can introduce hard-to-trace bugs. A useful discipline is to build a transformation pipeline that always starts from the raw DataFrame and produces a clean, enriched copy, rather than accumulating mutations.

Method chaining, popularized in the R world by Hadley Wickham’s tidyverse (%>% pipe operator), was imported into pandas partly in response to user demand for more readable pipeline syntax. The pandas .pipe() method and the natural chaining of methods like .pct_change().dropna().sort_values().head() achieve the same goal: expressing a multi-step transformation as a single readable statement that mirrors the logical order of the operations.

Add a New Column

Why it matters

Columns are Series, so arithmetic between columns is element-wise (row by row). No loops needed!
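
A sketch of the two derived columns discussed below, assuming the OHLC columns exist in df:

df['Range'] = df['High'] - df['Low']        # intraday range: element-wise subtraction, no loop
df['Mid'] = (df['High'] + df['Low']) / 2    # midpoint of the day's trading range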

Interpretation: The Range = High - Low column quantifies intraday volatility — how much the index moved from its lowest to highest point on each day. A wide Range on a day with a modest Close-to-Open change signals intraday turbulence that settled by the close — useful for options pricing and intraday risk management. The Mid = (High + Low) / 2 is a naive estimate of the “fair value” of the index at the midpoint of the day’s trading range. Notice that each column is created in a single vectorized assignment — pandas broadcasts the arithmetic across all rows simultaneously, which is both faster and more readable than a Python for loop over rows.

In practice

In customer analytics at companies like Netflix or Spotify, feature engineering columns are created from raw event logs. Common examples: days_since_last_login = today - df['last_login_date'], avg_session_length = df['total_watch_time'] / df['session_count'], is_premium = df['plan_type'] == 'Premium'. These derived columns feed directly into churn prediction models, recommendation systems, and A/B test analyses.

Conditional Column

Key takeaway

Use list comprehension to create categorical columns from numerical data. This is data labeling!
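
One possible threshold rule as a sketch, assuming a Return column in percent; the ±1% cutoffs here are illustrative assumptions, not an official definition:

df['Signal'] = ['Strong Buy' if r > 1
                else 'Buy' if r > 0
                else 'Sell' if r > -1
                else 'Strong Sell'
                for r in df['Return']]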

Interpretation: The Signal column produced here is a four-class trading label: Strong Buy / Buy / Sell / Strong Sell. In a supervised machine learning pipeline, this is exactly the kind of target variable you would construct from historical returns and then try to predict using features available before the return was realized (past prices, volume, news sentiment). The multi-level if-elif structure in the list comprehension implements a threshold-based rule — a baseline classifier against which more sophisticated ML models are compared. If your model can’t beat this simple rule, it isn’t learning anything useful.

Drop Columns and Rows

Key takeaway

.drop(columns=...) removes columns by name.

.drop() returns a new DataFrame — the original is unchanged (unless you use inplace=True).

Interpretation: Dropping columns is housekeeping that makes downstream code faster and more readable. A dataset loaded from a financial data vendor might have 80 columns; a particular analysis may use 5. Dropping the irrelevant 75 reduces memory usage and eliminates the risk of accidentally using a correlated feature you did not intend to include. Note that df.drop() returns a new DataFrame without modifying df — this is the pandas default (immutable operations). If you see code using inplace=True, be cautious: it mutates the original DataFrame, which can cause confusing bugs when the same DataFrame is referenced elsewhere in the notebook.

Common pitfall: df[col] == X vs df.query() vs df.loc[] performance

For large DataFrames (millions of rows), the three selection methods have different performance profiles. df[df['col'] == value] is the most common but builds a boolean array over the entire column. df.query("col == value") often runs faster because it uses numexpr under the hood for numeric expressions. df.loc[df['col'] == value] is essentially the same as the bracket form. For most course-scale datasets the difference is negligible, but in production with 10M+ rows, query() can be 2–5x faster on numeric conditions.

Method Chaining — Readable Pipelines

Why it matters

Method chaining reads like plain English: “take Close, compute returns, drop NaN, sort descending, show top 5.”

Each method returns a Series/DataFrame, so the next method can be called immediately. Use parentheses () to wrap multiline chains.
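
The chain described above, written out as a sketch (assuming df has a Close column):

top5 = (
    df['Close']
      .pct_change()                    # daily returns
      .dropna()                        # remove the leading NaN
      .sort_values(ascending=False)    # largest gains first
      .head(5)                         # top 5 days
)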

Key Statistical Methods

Background: what descriptive statistics actually tell you

Summary statistics are the first language of data. Before building any model, a skilled analyst spends time with the numbers: not just computing them mechanically, but asking what each one reveals about the underlying process that generated the data.

The mean is the balance point of a distribution — but it is only meaningful when the distribution is symmetric and has no extreme outliers. The mean income in a room with 99 average workers and one billionaire will suggest everyone is wealthy. The median (50th percentile) is far more robust: it is the middle value regardless of how extreme the tails are. When mean and median diverge significantly, the distribution is skewed — a signal that outliers are distorting the average, and that median-based statistics are more informative for that variable.

The standard deviation measures spread around the mean. In finance, standard deviation of returns is the primary measure of volatility — the risk of an investment. A stock with a daily return std of 2% is far riskier than one with a std of 0.5%, even if both have the same average return. More precisely, annualized volatility is std * sqrt(252) (there are roughly 252 trading days per year) — a number that appears in the Black-Scholes options pricing formula and in every risk management framework.

The percentiles (25th, 50th, 75th) together with min and max paint a picture of the full distribution. The interquartile range (IQR = Q3 − Q1) is a robust spread measure used to detect outliers: values more than 1.5 × IQR below Q1 or above Q3 are flagged as potential outliers by the standard box-plot rule.

Correlation measures linear co-movement between two variables. A value near +1 means they move together; near −1 means they move in opposite directions; near 0 means no linear relationship. Correlation does not imply causation — two variables can be perfectly correlated simply because both are driven by a third variable (confounding).

Column Statistics

Why it matters

Finding the best and worst days is one of the first things analysts do when studying market behavior.
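
A sketch, assuming a Return column (daily percentage change) has been added to df:

df.sort_values('Return', ascending=False).head(5)    # five best days
df.sort_values('Return').head(5)                     # five worst days
df['Return'].idxmax()                                # date of the single best day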

Interpretation: The .sort_values('Return', ascending=False).head(5) output shows the five best trading days in the simulated dataset. In a real S&P 500 return series, the best single days often occur during bear markets — the largest one-day rally in S&P 500 history occurred on October 13, 2008, at the height of the financial crisis, when the index jumped 11.6%. This counterintuitive pattern — that the best and worst days cluster together in high-volatility periods — is a key property of financial returns called volatility clustering, and it has profound implications for risk management. A naive model that removes “outlier” days would discard exactly the observations that matter most for understanding tail risk.

Understanding axis — Rows vs. Columns

axis visualization:

         col A   col B
row 0 |   10  |  20
row 1 |   30  |  40
row 2 |   50  |  60

axis=0  down (collapse rows)
axis=1  right (collapse columns)

Key takeaway

axis=0 collapses rows ⇒ one result per column (default).

axis=1 collapses columns ⇒ one result per row.

axis in Action

Important

Most pandas methods default to axis=0 (operate down each column). You only need to write axis=1 when you want to compute across columns for each row.
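
A sketch using the toy table above:

import pandas as pd

toy = pd.DataFrame({'A': [10, 30, 50], 'B': [20, 40, 60]})
toy.mean(axis=0)    # collapse rows → one mean per column: A 30.0, B 40.0
toy.mean(axis=1)    # collapse columns → one mean per row: 15.0, 35.0, 55.0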

Correlation — Which Variables Move Together?

Key takeaway

.corr() computes pairwise Pearson correlation. Values range from \(-1\) (perfect negative relationship) to \(+1\) (perfect positive relationship).

Business: Is there a correlation between ad spend and revenue?
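
A sketch on the price and volume columns of df:

df[['Open', 'High', 'Low', 'Close', 'Volume']].corr()    # full pairwise correlation matrix
df['Close'].corr(df['Volume'])                           # a single pairwise value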

Interpretation: In this simulated OHLCV dataset, the correlation between Open, High, Low, and Close is very high (near 0.99) — this is expected, because all four prices are anchored to the same underlying market level on any given day and the within-day variation is small relative to the across-day variation. A correlation of 0.99 between price columns is not informative; it just means prices move together over time. However, a high correlation between price and volume would be much more interesting — it might indicate that high prices are associated with increased trading activity (a liquidity signal). The negative correlation between Close and Volume in this synthetic data (Volume was constructed as \(-\text{Close} \times 10^6 + \text{noise}\)) is artificial and not representative of real markets. Always verify that simulated data matches the qualitative properties of the real phenomenon before drawing conclusions.

In practice

Correlation matrices are used routinely in portfolio construction. The covariance matrix (the correlation matrix rescaled by the variables’ standard deviations) is the central input to Markowitz mean-variance optimization and to risk parity strategies. At quantitative hedge funds, correlation matrices of asset returns are estimated daily and used to set position sizes such that no single factor dominates portfolio risk. Pandas’ .corr() is often the first step in such a pipeline, even if the production system uses more sophisticated estimators like the Ledoit-Wolf shrinkage estimator.

Handling Missing Values

Background: the statistics of missingness (Rubin 1976)

Missing data is not just an inconvenience — it is a statistical problem with different implications depending on why the data is missing. Donald Rubin (1976) introduced a taxonomy that is still the foundation of missing-data theory:

MCAR (Missing Completely At Random): The probability of a value being missing is unrelated to any variable in the dataset — and unrelated to the missing value itself. Example: a sensor randomly drops 1% of readings due to a hardware glitch. MCAR data is the easiest to handle: you can drop the rows or impute with the mean and the resulting analysis will be unbiased.

MAR (Missing At Random): The probability of missingness depends on other observed variables, but not on the missing value itself. Example: older survey respondents are less likely to answer income questions, but conditional on age, the missingness is random. This is the most common real-world case. MAR data can be handled correctly with model-based imputation (e.g., predict the missing value from the other columns using a regression or k-NN).

MNAR (Missing Not At Random): The probability of missingness depends on the missing value itself. Example: high-income individuals are less likely to report income precisely because it is high. This is the hardest case — any simple imputation introduces bias. MNAR requires domain knowledge and specialized models.

Business implications of each type: In a credit risk model, if low-income applicants disproportionately skip the “assets” field (MNAR), imputing with the mean will overestimate the creditworthiness of the riskiest borrowers. In a compliance report at a bank, missing transaction amounts may indicate fraud (MNAR) rather than data entry errors (MCAR). Knowing why your data is missing changes what you should do about it.

Mean imputation pitfalls: Replacing missing values with the column mean is the most common approach — and also one of the most dangerous when used thoughtlessly. It reduces variance (all imputed values are the same, compressing the distribution), distorts correlations between columns (the imputed rows now have “average” values in one column but real values in another, which breaks the co-movement structure), and is only correct under MCAR. For large fractions of missing data (>10%), or for MNAR data, mean imputation can introduce more bias than simply dropping the rows.

Why Do We Have NaN?

Why it matters

Real-world data is messy. Sensors fail, surveys get skipped, records get lost. Pandas represents missing values as NaN (Not a Number).
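
A small example table with gaps — the specific numbers are made up for illustration, but the pattern of missing cells matches the discussion below:

import pandas as pd
import numpy as np

sales_df = pd.DataFrame({
    'Sales':  [100.0, np.nan, 150.0, 200.0, np.nan],   # missing in rows 1 and 4
    'Rating': [4.5, 3.8, np.nan, 4.2, 4.0],            # missing in row 2
})
sales_df.isna().sum()           # missing count per column: Sales 2, Rating 1
sales_df.isna().sum().sum()     # total NaN count: 3
sales_df.isna().mean() * 100    # percentage missing per column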

Interpretation: df.isna().sum() gives a per-column missing count, which is often the first diagnostic you run. If Sales shows 2 missing out of 5 rows (40%), that is a significant fraction — dropping those rows would lose nearly half the data. If Rating shows 1 missing out of 5 (20%), the choice between filling and dropping depends on how critical the Rating column is for the analysis. df.isna().sum().sum() gives the total NaN count across the entire DataFrame — a quick overall “health check.” A quick way to compute the missing percentage by column is df.isna().mean() * 100, which is often more interpretable than raw counts when tables have hundreds of rows.

Fill Missing Values
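
Common fill strategies as a sketch, using the sales_df example above:

sales_df['Sales'].fillna(sales_df['Sales'].mean())      # replace with the column mean
sales_df['Sales'].fillna(sales_df['Sales'].median())    # median is more robust to outliers
sales_df['Sales'].ffill()                               # carry the last valid value forward (time series)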

Drop Missing Values
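
And the two dropping variants, applied to the same sales_df sketch:

sales_df.dropna()                    # drop a row if ANY column is NaN — rows 1, 2, 4 disappear
sales_df.dropna(subset=['Sales'])    # drop only where Sales is NaN — row 2 survives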

Interpretation: df.dropna() with no arguments uses a conservative rule: if any value in a row is NaN, the entire row is removed. With the data above, this drops rows 1, 2, and 4 (indices 1, 2, 4) — three out of five rows, leaving only two clean observations. That is too aggressive here; we lose data that is valid in one column just because the other column is missing. The subset=['Sales'] version is more surgical: drop a row only if Sales specifically is NaN. This preserves row 2 (where Rating is NaN but Sales is valid). The general lesson: always use subset= to specify which columns matter for completeness, rather than applying the all-or-nothing default.

In practice

In financial compliance reporting, missing values in transaction records must be handled according to regulatory guidelines — not just analyst preference. At many banks, missing counterparty identifiers must be flagged and escalated, not imputed. Missing trade amounts may trigger automatic rejection. The pandas tools here are the mechanical implementation; the business decision about what “missing” means is made by domain experts (compliance officers, risk managers), not by the data scientist alone.

Key takeaway

When to fill: You need every row and can estimate the missing value reasonably.

When to drop: Missing values are few (\(<5\%\)) and random.

Never: Ignore NaN and hope for the best!

dropna() — Rows vs. Columns with axis

Important

dropna() defaults to axis=0 (drop rows).

dropna(axis=1) drops columns — useful when an entire column is mostly empty.

Use subset= to only check specific columns instead of all.

Common pitfall: dropna() axis confusion

df.dropna(axis=1) drops columns that contain any NaN — not rows. This is almost never what you want when cleaning row-level records. A common mistake is accidentally dropping columns that have even one missing value in thousands of rows, permanently losing those variables from the analysis. If you want to drop columns that are mostly empty, use df.dropna(axis=1, thresh=int(0.9 * len(df))) — this keeps a column only if at least 90% of its values are present.

Plotting with Pandas

Background: exploratory vs. presentation plots

The discipline of data visualization has two distinct modes, a distinction made famous by John Tukey in his 1977 book Exploratory Data Analysis. Exploratory plots are made quickly, for yourself, to understand the data — they do not need to be beautiful. Presentation plots are made carefully, for an audience, to communicate a specific finding — they need to be clear, labeled, and honest.

Most of the time you spend plotting during analysis is exploratory. You want to know: is this distribution skewed? are there outliers? does this variable trend upward over time? does the relationship between X and Y look linear? Answering these questions takes a single line of pandas code and a glance. Spending thirty minutes polishing the aesthetics of an exploratory plot is wasted time.

When you have a finding worth communicating, the investment in presentation quality pays off. A clear chart with a meaningful title, labeled axes, a legend, and an appropriate scale can convey in three seconds what a table of numbers cannot convey in three minutes. Tukey’s observation — that “the greatest value of a picture is when it forces us to notice what we never expected to see” — applies equally to both modes.

When to use each chart type is a decision that trips up many beginners:

  • Line chart: one continuous variable over an ordered axis (time). Use for stock prices, temperature, web traffic, sales trends.
  • Histogram: distribution of one continuous variable. Use to see shape, skewness, outliers. If returns look bell-shaped or fat-tailed, a histogram reveals it.
  • Bar chart: comparing counts or aggregated values across discrete categories. Use for “sales by region,” “users by plan type,” “count of Bull vs Bear days.”
  • Box plot: comparing distributions across groups. Shows median, quartiles, and outliers simultaneously. Use when you want to compare the spread of a variable across categories.

Choosing the wrong chart type does not just make the visualization ugly — it can be misleading. A bar chart of means without error bars suppresses information about variance. A line chart connecting unordered categories implies a trend that does not exist.

One Method, Four Chart Types

Key takeaway

pandas integrates with matplotlib. The .plot() method handles most chart types. Customize with figsize, title, color, xlabel, ylabel.
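
The four chart types as one-liners, assuming df is the S&P 500 DataFrame and a Market column with 'Bull'/'Bear' labels has been derived; run each in its own notebook cell to get four separate figures:

df['Close'].plot(figsize=(10, 4), title='S&P 500 Close')                  # line chart: trend over time
df['Close'].plot(kind='hist', bins=30, title='Distribution of Close')     # histogram: shape of the distribution
df['Market'].value_counts().plot(kind='bar', title='Bull vs Bear days')   # bar chart: category counts
df[['Open', 'Close']].plot(kind='box', title='Open vs Close')             # box plot: spread and outliers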

Interpretation: These four chart types answer four different questions about the same dataset. The line chart of Close shows the path the index took over time — you can see drift, crashes, and recoveries. The histogram of Close shows the distributional shape — is the price distribution symmetric, or do most prices cluster in a narrow range? The bar chart of Market (Bull/Bear) answers “how often are days bullish vs bearish?” — a simple frequency table turned visual. The box plot of Open and Close side by side answers “does the distribution of opening prices differ from closing prices?” — comparing medians and spreads across two columns simultaneously. Each chart takes one line of code; together they give a more complete picture of the data than any single view.

In practice

At an e-commerce company running an A/B test, a data analyst would plot the distribution of purchase values in the control and treatment groups as histograms (to check for outliers before computing means), then plot the daily conversion rate as a line chart over the test period (to check for ramp-up effects or day-of-week patterns), and finally use a bar chart to compare the aggregate conversion rate across segments. These are not cosmetic additions — they are the checks that prevent reporting a statistically significant result that is actually driven by a data artifact.

Four Charts at a Glance

Chart     | Code                 | Use for
Line      | df['Close'].plot()   | Trends over time
Histogram | .plot(kind='hist')   | Distribution of one variable
Bar       | .plot(kind='bar')    | Comparing categories
Box       | .plot(kind='box')    | Spread, median, outliers

Tip: bookmark this reference table.

Try It! — Plot the Data

Try it!

Using the S&P 500 data:

  1. Plot the closing price as a line chart

  2. Plot the distribution of daily returns as a histogram

  3. Count Bull vs Bear days and plot as a bar chart

  4. Bonus: Which month had the highest average closing price?

    Hint: Convert index to datetime, then use .resample('ME').mean()

Standardization

Background: why scale matters for distance-based methods

Many machine learning algorithms are distance-sensitive: they measure how similar two observations are by computing a distance (Euclidean, Manhattan, cosine, etc.) between their feature vectors. When features have wildly different scales — say, a “price” column with values in the thousands and a “number of employees” column with values in the millions — the larger-scale feature dominates every distance calculation simply because of its magnitude, not because it is more informative. The algorithm effectively ignores the small-scale features.

Examples of scale-sensitive algorithms: k-nearest neighbors, k-means clustering (Chapter 4), support vector machines, principal component analysis, and neural networks. Examples of scale-insensitive algorithms: decision trees, random forests, gradient boosting. This is why you will standardize before clustering in Chapter 4 but not before a decision tree.

Two standardization methods dominate in practice:

Z-score (standard scaling): Subtract the mean and divide by the standard deviation. The result has mean 0 and standard deviation 1. This preserves the shape of the original distribution (a normal distribution stays normal; a skewed distribution stays skewed) but puts all features on a common scale. Most appropriate when the data is roughly bell-shaped and when you want outliers to retain their relative position.

Min-Max scaling: Subtract the minimum and divide by the range. The result is bounded in [0, 1]. This is appropriate when you need bounded inputs (e.g., pixel values in image processing, activation functions in neural networks that expect inputs between 0 and 1). The weakness: it is extremely sensitive to outliers. If one observation has an anomalously large value, it compresses all other values toward 0.

In practice, a neural network’s input layer expects features scaled to a consistent range — typically [0, 1] with min-max or roughly [−3, 3] with z-score. Without scaling, gradients during backpropagation can be unstable, and the network converges slowly or not at all. This is why every production ML pipeline includes a scaling step.

Min-Max Standardization — Scale to \([0,1]\)

\[x_{\text{scaled}} = \dfrac{x - \min}{\max - \min}\]

Why it matters

Use min-max when you need bounded output in \([0,1]\) — e.g., neural network inputs, or comparing variables with very different scales (price vs. volume).
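
A sketch applying the formula to the Volume column of df:

vol = df['Volume']
df['Volume_minmax'] = (vol - vol.min()) / (vol.max() - vol.min())    # now bounded in [0, 1]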

Interpretation: After min-max scaling, the output column Volume_minmax contains values in [0, 1] where 0 corresponds to the smallest volume observed and 1 corresponds to the largest. A value of 0.75 means “this day’s volume was 75% of the way from the minimum to the maximum observed volume.” This interpretation is intuitive but fragile: if tomorrow’s volume exceeds the historical maximum, the scaled value will be greater than 1, breaking the [0, 1] guarantee. For this reason, the scaling parameters (min and max) should always be computed on the training data and applied to the test data without recomputing — exactly the protocol that scikit-learn’s MinMaxScaler enforces.

Z-Score Standardization

\[x_{\text{std}} = \dfrac{x - \mu}{\sigma} \qquad \Rightarrow \qquad \mu_{\text{new}} = 0,\; \sigma_{\text{new}} = 1\]

Key takeaway

Min-max: Use when distribution is unknown or bounded output required.

Z-score: Use when data is roughly bell-shaped — preserves relative distances.

Neither: Use when scale is meaningful (e.g., price ratios, percentages).
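
The corresponding z-score sketch (Volume_z is an illustrative column name):

vol = df['Volume']
df['Volume_z'] = (vol - vol.mean()) / vol.std()    # mean ≈ 0, std ≈ 1 after scaling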

Interpretation: The z-score column has mean approximately 0 and standard deviation approximately 1. A value of +2.5 means “this observation is 2.5 standard deviations above average volume” — immediately interpretable in terms of how unusual it is. Under a normal distribution, only about 0.6% of observations lie more than 2.5 standard deviations above the mean, so a z-score of +3 or higher is a genuine anomaly flag. In financial risk management, z-scores are used to identify unusual trading days or abnormal position sizes. In fraud detection, transaction amounts with z-scores above a threshold (say, 4) are automatically flagged for review. The power of the z-score is precisely this: it translates raw numbers into a universal “how unusual is this?” scale.

In practice

In a customer segmentation pipeline at a company like Spotify or Netflix, features like “number of streams per month,” “average session duration,” and “days since last login” have completely different scales. Before running k-means clustering (Chapter 4), all features must be standardized, otherwise the clustering will be driven entirely by whichever variable has the largest raw variance. This is not a theoretical concern — in practice, skipping standardization before k-means typically produces clusters that correspond to the scale of one dominant variable rather than to meaningful customer segments.

Time Series Methods

Background: the special challenges of temporal data

Time series data breaks many assumptions that standard statistical methods rely on. Most regression and machine learning methods assume that observations are independent and identically distributed (i.i.d.) — each row is a random draw from the same population with no relationship to other rows. Time series data violates this assumption in at least two ways.

First, autocorrelation: today’s value is correlated with yesterday’s value. Stock prices are a near-random walk, but even they have serial correlation in volatility (the GARCH effect: high-volatility days cluster together). Sales data has weekly, monthly, and annual cycles. Web traffic has hour-of-day and day-of-week patterns. Ignoring autocorrelation means underestimating uncertainty in forecasts and producing invalid confidence intervals.

Second, non-stationarity: the statistical properties of the series (mean, variance) change over time. A retail sales series might trend upward over ten years (non-stationary mean), and its variance might increase with the level (non-stationary variance). Many statistical models require or implicitly assume stationarity. The most common transformation to induce stationarity in a price series is taking first differences or percentage changes — which is exactly what .pct_change() computes. This is why daily returns (not prices) are almost always the unit of analysis in quantitative finance.

.pct_change() is one of the most-used pandas operations in finance and economics. “What changed?” is almost always a more meaningful question than “what is the level?” A stock at $150 yesterday and $145 today lost 3.3% — that is the economically meaningful quantity. The level $145 depends on where the stock was priced ten years ago; the return −3.3% is comparable to any other stock, to the market index, to a bond yield.

Rolling windows capture the idea of “recent history.” A 20-day rolling average asks: what was the average price over the most recent 20 trading days? The window slides forward one day at a time, producing a smoothed version of the original series that filters out day-to-day noise. In trading, the famous “moving average crossover” signal — buy when the short-term MA crosses above the long-term MA — is computed using exactly this operation. In retail, a 30-day rolling sum of sales tracks recent demand in real time, which Walmart uses to trigger automatic reorders.

pd.to_datetime() — Parse Date Strings

Key takeaway

Always convert date strings to datetime — it unlocks date filtering, resampling, and all time-arithmetic operations.
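
A sketch, assuming a raw file whose Date column was read in as plain strings:

import pandas as pd

raw = pd.read_csv("data.csv")                # Date column arrives as object (string)
raw['Date'] = pd.to_datetime(raw['Date'])    # parse strings into real timestamps
raw = raw.set_index('Date')                  # a DatetimeIndex unlocks date slicing

raw.index.month        # calendar month of each observation (1–12)
raw.index.dayofweek    # 0 = Monday, ..., 6 = Sunday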

Interpretation: The difference between object dtype (string) and DatetimeIndex is the difference between a label and a number. With string dates, pandas cannot compute the number of days between two observations, cannot extract the month, cannot resample from daily to monthly, and cannot do date-range slicing. After pd.to_datetime(), all of these operations become trivial. The df.index.month and df.index.dayofweek attributes allow you to add calendar-based features to your dataset — for example, “is this a Monday?” or “is this in Q4?” — features that are often powerful predictors in retail sales, web traffic, and consumer behavior models.

.shift() and .pct_change()

Key takeaway

.shift(k) lags by \(k\) periods. .pct_change(k) computes \(\frac{x_t - x_{t-k}}{x_{t-k}}\). The first \(k\) rows are NaN.
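
A sketch on a toy price series; the first five prices match the example discussed below, and a sixth is added only so the 5-day return produces one non-NaN value:

import pandas as pd

prices = pd.DataFrame({'Close': [100, 102, 99, 105, 98, 101]})
prices['Prev'] = prices['Close'].shift(1)              # yesterday's close
prices['Return'] = prices['Close'].pct_change()        # (today - yesterday) / yesterday
prices['Return_5d'] = prices['Close'].pct_change(5)    # change over the last 5 rows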

Interpretation: The Return column computed by .pct_change() is the daily percentage change in price. Looking at the first rows: Close moves from 100 → 102 → 99 → 105 → 98, which corresponds to returns of NaN, +2.0%, −2.9%, +6.1%, −6.7%. The first row is NaN because there is no “previous price” to compute a return from — this is expected and should be dropped before further analysis. The 5-day return Return_5d computes how much the price changed over a week: for row 5 (index 5), it compares close[5] to close[0], giving the 5-trading-day performance. This is a standard momentum signal: stocks that gained over the past 5 days tend to continue gaining (in the short run) according to the short-term momentum literature.

In practice

In retail demand forecasting (Walmart, Amazon, JD.com), .pct_change() and .shift() are used to engineer lag features for machine learning models: “what were sales 7 days ago,” “what were sales the same week last year.” These lag features are often the most predictive variables in demand forecasting models. At Amazon, the demand forecasting system uses hundreds of lag and rolling features computed exactly this way to predict product-level demand days in advance and pre-position inventory in warehouses.

Rolling Statistics — Moving Windows

Why it matters

Moving averages smooth noise. Traders use MA-20 (short-term), MA-50 (medium), MA-200 (long-term) as trend signals.

Rolling std = realized volatility — a key risk input.

Key takeaway

.rolling(n).mean() computes the average over the last \(n\) observations.
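
A sketch of the standard rolling computations, assuming df has Close and Return (daily returns) columns:

df['MA_20'] = df['Close'].rolling(20).mean()                   # 20-day moving average
df['MA_50'] = df['Close'].rolling(50).mean()                   # 50-day moving average
df['Vol_20'] = df['Return'].rolling(20).std() * (252 ** 0.5)   # 20-day realized volatility, annualized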

Cumulative Methods

Key takeaway

.cumprod() compounds returns. .cummax() tracks the running peak.

Drawdown \(= \frac{\text{price} - \text{running max}}{\text{running max}}\) is a standard risk metric.
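
A sketch wiring the three methods together on the Close column of df:

ret = df['Close'].pct_change().dropna()                 # daily returns as decimals
cum_return = (1 + ret).cumprod()                        # growth of $1 invested at the start
running_max = cum_return.cummax()                       # highest value reached so far
drawdown = (cum_return - running_max) / running_max     # 0 at new highs, negative below the peak
max_drawdown = drawdown.min()                           # worst peak-to-trough loss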

Interpretation: The drawdown series is one of the most important risk metrics in portfolio management. It answers: “if I had invested at the worst possible time (the peak), how much would I have lost by now?” A drawdown of −0.20 means a 20% loss from the peak — the investor has “drawn down” 20 cents for every dollar invested at the high. The maximum drawdown is drawdown.min() — the worst peak-to-trough loss over the entire history. This number tells a hedge fund investor something that annualized return does not: how much pain they would have had to endure to achieve that return. A fund with a 20% annual return and a 50% maximum drawdown is often less appealing than one with a 15% annual return and a 20% maximum drawdown, because most investors cannot stay disciplined through a 50% loss.

The rolling MA plot overlaid on the price series produces the classic technical analysis chart. The MA-20 (20-day moving average) smooths short-term noise; when price crosses above the MA-20, it is a bullish signal in trend-following strategies. However, it is important to note that moving average crossover strategies do not work consistently in efficient markets — they are prone to “whipsawing” (many false signals in sideways markets) and their apparent backtested profitability often disappears after accounting for transaction costs. The value of the MA visualization is primarily as a noise-reduction tool for visual analysis, not as a standalone trading signal.

In practice

The cumulative return and drawdown series produced here are the two core outputs of any performance attribution report at a hedge fund or wealth management firm. Bloomberg’s portfolio analytics, FactSet’s performance attribution, and internal risk dashboards at every major asset manager display exactly these two time series for every strategy. The pandas code here produces the same quantities that sophisticated financial systems compute — the production versions add benchmark comparisons, risk-adjusted metrics (Sharpe ratio, Calmar ratio), and period-level attribution, but they all start with the same .pct_change(), .cumprod(), and .cummax() building blocks.

Takeaway — The DataFrame Workflow

The 6-step pipeline:

  1. Load — read_csv()
  2. Explore — head(), shape, dtypes
  3. Select — df[cond], .loc[], .iloc[]
  4. Modify — df['new'] = ...
  5. Analyze — describe(), corr()
  6. Plot — .plot()

Key takeaway

Every data analysis follows this pipeline. Today you learned the tools for each step. Practice on your own data!

Chapter Summary

The DataFrame workflow as a pipeline

Every analytical task — whether it is a quarterly sales report, a churn prediction model, or a quantitative trading strategy — follows the same six-stage pipeline: Load → Explore → Clean → Transform → Analyze → Visualize. This chapter covered each stage systematically, and the tools you learned map directly onto them:

Stage           | Tools Covered
Load            | pd.read_csv()
Explore         | .head(), .tail(), .shape, .dtypes, .describe()
Select / Filter | df['col'], df[['c1','c2']], df[cond], .loc[], .iloc[]
Transform       | df['new'] = ..., list-comprehension columns, .drop(), train/test split
Analyze         | .mean(), .std(), .corr(), .pct_change(), .rolling(), .cumprod()
Visualize       | .plot(), kind='hist', kind='bar', kind='box'

The top 10 pandas operations every analyst knows

  1. df.describe() — summary statistics for all numeric columns
  2. df.isna().sum() — count missing values per column
  3. df[condition] — filter rows by boolean expression
  4. df[['c1','c2']] — project specific columns
  5. df.loc[row_label, col_label] — label-based selection
  6. df['new'] = df['a'] + df['b'] — create a computed column
  7. df.dropna(subset=['col']) — drop rows where a key column is missing
  8. df['col'].fillna(df['col'].median()) — impute with median
  9. df['col'].pct_change() — daily returns from prices
  10. df['col'].rolling(20).mean() — 20-period moving average

Decision guide: which method to use when?

For missing values:

  • Missing < 5% and MCAR → dropna()
  • Missing 5–20% and MAR → fillna(mean) or fillna(median)
  • Time series → ffill() (carry forward the last valid value)
  • Missing > 20% → investigate why before acting; consider dropping the column

For row selection:

  • You know the row label (name, date) → .loc[]
  • You know the row position (first 100, last 20) → .iloc[]
  • You want rows satisfying a condition → df[df['col'] > threshold]

For standardization:

  • Algorithm is distance-sensitive (k-means, kNN, SVM, neural net) → standardize first
  • Distribution is roughly normal → z-score
  • Need bounded [0, 1] output → min-max
  • Algorithm is tree-based (decision tree, random forest) → no standardization needed

For train/test split:

  • Time series data → sequential split (iloc[:train_size] / iloc[train_size:]), never shuffle
  • Cross-sectional data → df.sample(frac=1, random_state=42), then split

What’s next: Chapter 3 — Linear Regression

Chapter 3 builds directly on everything in this chapter. You will use .pct_change() to compute stock returns, .dropna() to clean the data, train/test splitting to evaluate your model, and .plot() to visualize fitted vs. actual values. The new tool is the linear regression model itself — but the pandas scaffolding around it is entirely what you learned here. Students who are solid on DataFrames find Chapter 3 straightforward; students who skipped practicing Chapter 2 find it confusing. There is no substitute for running the code cells yourself, on your own data, until the operations feel automatic.

Further reading

  • McKinney, W. (2022). Python for Data Analysis, 3rd ed. O’Reilly. — Written by the creator of pandas; the definitive reference for everything in this chapter and more.
  • Wickham, H. (2014). “Tidy Data.” Journal of Statistical Software, 59(10). — The paper that formalized the idea that data should be organized as “one row per observation, one column per variable” — which is exactly the DataFrame structure pandas is built around.
  • Rubin, D. B. (1976). “Inference and Missing Data.” Biometrika, 63(3), 581–592. — The foundational paper on the MCAR/MAR/MNAR taxonomy for missing data.
  • Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley. — The book that established exploratory visualization as a first-class activity in data analysis, not an afterthought.
