Data Operations Reference Sheet

This reference sheet summarizes some (not all!) of the data manipulation tasks you’ll do most often throughout the course.

Numpy arrays: format + shape

X.shape   # tuple, e.g. (100, 5)
X.ndim    # number of dimensions, e.g. 2
X.dtype   # element type, e.g. float64

Pandas: DataFrame + Series, named columns + named index

df.shape     # (n_rows, n_cols)
df.columns   # column names
df.index     # row labels (may be a RangeIndex, dates, IDs, ...)
df.dtypes    # column data types

Common operations (numpy vs pandas)

import numpy as np
import pandas as pd

X = np.array([[1, 10.0],
              [2, 20.0],
              [3,  np.nan]])

df = pd.DataFrame({"a": [1, 2, 3],
                   "g": ["A", "A", "B"],
                   "b": [10.0, 20.0, np.nan]},
                  index=["r1", "r2", "r3"])

Basic operation (what + why)	How in numpy (example)	How in pandas (example)
Inspect size/shape (sanity-check what you loaded)	`X.shape` \(\to\) `(3, 2)`	`df.shape` \(\to\) `(3, 2)`
Inspect “labels” (know what columns/rows mean)	(no built-in labels)	`df.columns` `df.index`
Select a column (work with one feature)	`x = X[:, 0]` \(\to\) 1D array	`s = df["a"]` \(\to\) Series
Select multiple columns (subset features)	`X2 = X[:, [0, 1]]` \(\to\) 2D array	`df2 = df[["a", "b"]]` \(\to\) DataFrame
Select rows by named index (use meaningful row labels/IDs)	(no built-in labels)	`row = df.loc["r2"]`
Select rows by position (grab a specific sample)	`row = X[1]`	`row = df.iloc[1]`
Filter rows by condition (keep only rows meeting a rule)	`mask = X[:, 0] > 1` `Xpos = X[mask]`	`dfpos = df[df["a"] > 1]`
Sort by a column (e.g. to rank)	`idx = np.argsort(X[:, 0])` `Xs = X[idx]`	`dfs = df.sort_values(by="a")`
Sort while keeping correspondence (sort `X` and `y` together)	`idx = np.argsort(X[:, 0])` `Xs = X[idx]` `ys = y[idx]`	`dfs = df.sort_values("a")` `ys = y.loc[dfs.index]`
Conditionally assign values (create/update a column using a rule)	`X2 = np.where(X > 0, X, 0)`	`df["b2"] = df["b"].where(df["b"] > 0, 0)`
Conditionally get indices (find which rows match a rule)	`idx = np.where(X[:, 0] > 1)[0]` \(\to\) row indices	`idx = df.index[df["a"] > 1]` \(\to\) index labels
Compute summary statistics (describe features)	`m = np.mean(X, axis=0)`	`m = df.mean(numeric_only=True)`
Argmax/argmin (index of max/min; e.g. “best/worst” row)	`i_max = np.argmax(X[:, 0])` `i_min = np.argmin(X[:, 0])`	`i_max = df["a"].idxmax()` `i_min = df["a"].idxmin()`
Group + aggregate (summarize by category)	(not a core numpy pattern)	`means = df.groupby("g")["b"].mean()`
Stack/concat columns (combine features)	`X3 = np.column_stack([X1, X2])`	`df3 = pd.concat([df1, df2], axis=1)`
Reshape 1D \(\leftrightarrow\) 2D (match API expectations like `(n_rows, 1)`)	`x_col = x.reshape(-1, 1)` \(\to\) shape `(n_rows, 1)` `x_row = x.reshape(1, -1)` \(\to\) shape `(1, n_rows)` `x_1d = X.reshape(-1,)` \(\to\) shape `(n_rows*n_cols,)`	`s = df["a"]` \(\to\) Series (1D) `df[["a"]]` \(\to\) DataFrame (2D)
Create a new feature (feature engineering)	`new_feature = X[:,0] * X[:,1]` `Xnew = np.column_stack([X, new_feature])`	`df = df.assign(new_feature=df["a"] * df["b"])`
Ordinal-encode categories (preserve order info)	`map_ = {"low":1,"med":2,"high":3}` `x = np.array([map_[v] for v in vals])`	`map_ = {"low":1,"med":2,"high":3}` `s = df["level"].map(map_)`
One-hot encode categories (no implied ordering)	(not a core numpy pattern)	`df_ohe = pd.get_dummies(df, columns=["level"], dtype=int)`
Read from a file (load data)	`X = np.load("X.npy")` or `np.loadtxt` for text	`df = pd.read_csv("data.csv")` common: `sep`, `header`, `index_col`
Convert between numpy \(\leftrightarrow\) pandas (use the right tool)	`df = pd.DataFrame(X, columns=["a","b"])`	`X = df.to_numpy()`

Inspect size/shape
(sanity-check what you loaded)

X.shape
\(\to\) (3, 2)

df.shape
\(\to\) (3, 2)

Inspect “labels”
(know what columns/rows mean)

(no built-in labels)

df.columns
df.index

Select a column
(work with one feature)

x = X[:, 0]
\(\to\) 1D array

s = df["a"]
\(\to\) Series

Select multiple columns
(subset features)

X2 = X[:, [0, 1]]
\(\to\) 2D array

df2 = df[["a", "b"]]
\(\to\) DataFrame

Select rows by named index
(use meaningful row labels/IDs)

(no built-in labels)

row = df.loc["r2"]

Select rows by position
(grab a specific sample)

row = X[1]

row = df.iloc[1]

Filter rows by condition
(keep only rows meeting a rule)

mask = X[:, 0] > 1
Xpos = X[mask]

dfpos = df[df["a"] > 1]

Sort by a column
(e.g. to rank)

idx = np.argsort(X[:, 0])
Xs = X[idx]

dfs = df.sort_values(by="a")

Sort while keeping correspondence
(sort X and y together)

idx = np.argsort(X[:, 0])
Xs = X[idx]
ys = y[idx]

dfs = df.sort_values("a")
ys = y.loc[dfs.index]

Conditionally assign values
(create/update a column using a rule)

X2 = np.where(X > 0, X, 0)

df["b2"] = df["b"].where(df["b"] > 0, 0)

Conditionally get indices
(find which rows match a rule)

idx = np.where(X[:, 0] > 1)[0]
\(\to\) row indices

idx = df.index[df["a"] > 1]
\(\to\) index labels

Compute summary statistics
(describe features)

m = np.mean(X, axis=0)

m = df.mean(numeric_only=True)

Argmax/argmin
(index of max/min; e.g. “best/worst” row)

i_max = np.argmax(X[:, 0])
i_min = np.argmin(X[:, 0])

i_max = df["a"].idxmax()
i_min = df["a"].idxmin()

Group + aggregate
(summarize by category)

(not a core numpy pattern)

means = df.groupby("g")["b"].mean()

Stack/concat columns
(combine features)

X3 = np.column_stack([X1, X2])

df3 = pd.concat([df1, df2], axis=1)

Reshape 1D \(\leftrightarrow\) 2D
(match API expectations like (n_rows, 1))

x_col = x.reshape(-1, 1) \(\to\) shape (n_rows, 1)
x_row = x.reshape(1, -1) \(\to\) shape (1, n_rows)
x_1d = X.reshape(-1,) \(\to\) shape (n_rows*n_cols,)

s = df["a"] \(\to\) Series (1D)
df[["a"]] \(\to\) DataFrame (2D)

Create a new feature
(feature engineering)

new_feature = X[:,0] * X[:,1]
Xnew = np.column_stack([X, new_feature])

df = df.assign(new_feature=df["a"] * df["b"])

Ordinal-encode categories
(preserve order info)

map_ = {"low":1,"med":2,"high":3}
x = np.array([map_[v] for v in vals])

map_ = {"low":1,"med":2,"high":3}
s = df["level"].map(map_)

One-hot encode categories
(no implied ordering)

(not a core numpy pattern)

df_ohe = pd.get_dummies(df, columns=["level"], dtype=int)

Read from a file
(load data)

X = np.load("X.npy")
or np.loadtxt for text

df = pd.read_csv("data.csv")
common: sep, header, index_col

Convert between numpy \(\leftrightarrow\) pandas
(use the right tool)

df = pd.DataFrame(X, columns=["a","b"])

X = df.to_numpy()

Data Operations Reference Sheet

Numpy arrays: format + `shape`

Pandas: `DataFrame` + `Series`, named columns + named index

Common operations (numpy vs pandas)

Numpy arrays: format + shape

Pandas: DataFrame + Series, named columns + named index

Common operations (numpy vs pandas)

Numpy arrays: format + `shape`

Pandas: `DataFrame` + `Series`, named columns + named index