Day 26: Pandas GroupBy

🎯 Enterprise Objective

Aggregating data is the core of analytics. You must be able to summarize millions of rows into executive dashboards. Today we master the groupby mechanics to slice data by categories, calculate advanced .agg() summaries, and perform SQL-style Window Functions using .transform().

📋 Strategic Overview

#	Topic	Concept
1	GroupBy	Split-Apply-Combine
2	Aggregation	`.agg({'A':'sum'})`
3	Transform	Window Functions

1. The GroupBy Mechanics : Split-Apply-Combine

🔍 What is it?

The .groupby() method is identical to the SQL GROUP BY statement. It follows the Split-Apply-Combine pattern: it splits the data into groups based on a key, applies an aggregation function (like sum or mean), and combines the results into a new DataFrame.

# Calculate the average salary per department
# 1. Split by Dept | 2. Select Salary | 3. Apply mean()
avg_salary = df.groupby('Dept')['Salary'].mean()

💼 Why Data Analysts Care

• Financial Reporting: Calculating total monthly revenue: df.groupby('Month')['Revenue'].sum()

• User Behavior: Finding the maximum session length per user: df.groupby('UserID')['Duration'].max()

⚠️ Unaggregated GroupBys

If you just print df.groupby('Dept'), you will get a memory address. You MUST apply an aggregation function (like .sum()) to actually see data!

In [ ]:

🧪 Concept Checks: `GroupBy`

Q1. Create a df with City and Sales. Group by City and calculate .sum(). Print it.

In [ ]:

Q2. Group by City and calculate .mean(). Print it.

In [ ]:

Q3. Group by City and calculate .count() to see how many transactions happened in each city.

In [ ]:

Q4. Print a raw df.groupby("City") object. Observe the memory address output.

In [ ]:

Q5. Group by City, select the Sales column, and find the .max() value per city.

In [ ]:

2. Multiple Aggregations (.agg) : Advanced Summaries

🔍 What is it?

Sometimes you want multiple statistics at once (e.g., the Mean AND the Max). You can pass a list of functions to the .agg() method. You can also pass a dictionary to apply different functions to different columns!

# Min and Max for Salary
df.groupby('Dept')['Salary'].agg(['min', 'max'])

# Different math for different columns
df.groupby('Dept').agg({'Salary': 'mean', 'Experience': 'max'})

💼 Why Data Analysts Care

• Executive Dashboards: Showing the Total Sales, Average Order Value, and Count of Orders per region in one table

🧠 Pro Tip

GroupBy outputs have the group keys (e.g., 'Dept') as the Index. If you want a flat standard DataFrame, append .reset_index() to the end of your groupby chain!

In [ ]:

🧪 Concept Checks: `Aggregations`

Q1. Group df by Dept. Use .agg() to find the min and max of Salary.

In [ ]:

Q2. Use .agg() with a dictionary to find the sum of Salary and the mean of Experience.

In [ ]:

Q3. Add .reset_index() to your answer in Q2 and observe how Dept becomes a normal column.

In [ ]:

Q4. Create a custom aggregation: df.groupby("Dept")["Salary"].agg(lambda x: x.max() - x.min()) to find the salary range.

In [ ]:

Q5. Group by TWO columns at once: df.groupby(["Dept", "JobTitle"]).sum(). (Assume a JobTitle column exists).

In [ ]:

3. Transform & Apply : Row-Level Group Math

🔍 What is it?

While .agg() collapses data into a summary table, .transform() returns data the exact same shape as the original. This is used for calculating group-level metrics and broadcasting them back to the original rows (like SQL Window Functions).

# Calculate the Dept Mean, and broadcast it to every employee's row
df['Dept_Avg'] = df.groupby('Dept')['Salary'].transform('mean')

# Now you can calculate how much above/below average they are!
df['Above_Avg'] = df['Salary'] - df['Dept_Avg']

💼 Why Data Analysts Care

• Standardization: Z-Score normalization per category: (x - group_mean) / group_std

• Percent to Total: Calculating what percentage an employee's salary contributes to their department's total budget

🧠 Pro Tip

Use .agg() when you want to reduce dimensions (create a summary report). Use .transform() when you want to add new calculated columns to your existing DataFrame.

In [ ]:

🧪 Concept Checks: `Transform`

Q1. Create df with Team and Score. Use .transform("max") to add a column Team_Max_Score.

In [ ]:

Q2. Create a column Is_Top_Scorer which is True if Score == Team_Max_Score.

In [ ]:

Q3. Use .transform("sum") to calculate Team_Total. Then create %_of_Total = Score / Team_Total.

In [ ]:

Q4. What happens if you try to assign .agg("sum") to a new column? Catch the ValueError (shapes don't match).

In [ ]:

Q5. Explain the SQL equivalent of .transform() (Hint: OVER (PARTITION BY ...) Window Functions).

In [ ]:

🛠️ Professional Practice Tasks

Theory is useless without muscle memory. Complete these tasks to solidify your understanding.

Task 1 (Sales Report): Given a DF with Date, Region, Rep, and Sales. Group by Region and calculate the Total Sales and Average Sales. Reset the index.

In [ ]:

Task 2 (Multi-Level Grouping): Group the Sales DF by BOTH Region and Rep. Calculate the sum of Sales. Notice the MultiIndex created. Use .reset_index() to flatten it.

In [ ]:

Task 3 (Custom Aggregation): Write a custom function spread(series) that returns series.max() - series.min(). Pass this function into .agg() when grouping by Region.

In [ ]:

Task 4 (Percent to Total): Given a DF of Category and Revenue. Use .transform('sum') to find the total revenue per category. Create a new column Pct_Rev representing that row's percentage contribution to its category.

In [ ]:

Task 5 (Filtering Groups): Group by Region. Use .filter(lambda g: g['Sales'].sum() > 10000) to drop all rows belonging to regions that didn't meet a 10k quota. (Note: .filter is a powerful GroupBy method!).

In [ ]:

💻 Pure Coding Interview Questions

Q1.

Explain the 'Split-Apply-Combine' paradigm in the context of Pandas GroupBy.

In [ ]:

Q2.

What is the difference between `df.groupby('A').sum()` and `df.groupby('A')['B'].sum()`?

In [ ]:

Q3.

Why does `.groupby()` return a `DataFrameGroupBy` object instead of a DataFrame?

In [ ]:

Q4.

What does `.reset_index()` do after a groupby operation?

In [ ]:

Q5.

How do you group by multiple columns? (e.g., Year and Month).

In [ ]:

Q6.

Explain the difference between `.agg()` and `.transform()`.

In [ ]:

Q7.

How do you apply different aggregation functions to different columns in a single `.agg()` call?

In [ ]:

Q8.

Write code to group by Column A, and get the size (count of rows) of each group using `.size()`.

In [ ]:

Q9.

What is the difference between `.count()` and `.size()` on a GroupBy object?

In [ ]:

Q10.

How do you use `.filter()` on a GroupBy object to drop entire groups based on a condition?

In [ ]:

Q11.

What is a `MultiIndex`? How is it created during a `.groupby()`?

In [ ]:

Q12.

Write code to find the first occurrence (row) of each group. (`.first()`).

In [ ]:

Q13.

How do you calculate a rolling average within groups? (e.g., `df.groupby('User')['Login'].rolling(3).mean()`).

In [ ]:

Q14.

Explain how to use the `as_index=False` parameter in `.groupby()`. What does it replace?

In [ ]:

Q15.

Write a custom lambda function inside `.agg()` to calculate the 90th percentile of a group.

In [ ]:

Q16.

How do you iterate over groups in a GroupBy object? (`for name, group in grouped:`).

In [ ]:

Q17.

What happens if you group by a column that contains `NaN` values? (Hint: `dropna=True` is the default).

In [ ]:

Q18.

Explain how `.transform()` is equivalent to SQL Window Functions (`OVER PARTITION BY`).

In [ ]:

Q19.

Write code to fill `NaN` values in 'Salary' with the mean 'Salary' of that specific 'Department' using `.transform()`.

In [ ]:

Q20.

How do you get the N-th row of each group? (`.nth(n)`).

In [ ]:

Q21.

What is `pd.Grouper` and how is it used for grouping Time Series data by frequency (e.g., Monthly)?

In [ ]:

Q22.

Explain the `nunique()` aggregation function. When is it useful?

In [ ]:

Q23.

How do you pivot a grouped DataFrame from long format to wide format? (`.unstack()`).

In [ ]:

Q24.

Write code to calculate the Cumulative Sum within each group. (`.cumsum()`).

In [ ]:

Q25.

How do you sort the output of a GroupBy aggregation by the aggregated values?

In [ ]:

📊 Day 26 Executive Summary

#	Topic	Key Takeaway
1	GroupBy	`df.groupby('Key')['Value'].sum()` is the most common analytics code
2	Agg	Pass lists `['min', 'max']` or dicts to perform complex math
3	Transform	Returns data in the original shape. Perfect for calculating `% of Total`

✅ Instructor's End-of-Day Checklist

• [ ] I can group by a column and calculate the mean.

• [ ] I can use .agg() for multiple statistics.

• [ ] I understand the difference between .agg() and .transform().

🚀 Continue to Day 27 - Pandas Merging & Joining

📊 Day 26 : Pandas GroupBy

🎯 Enterprise Objective

📋 Strategic Overview

1. The GroupBy Mechanics : Split-Apply-Combine

💼 Why Data Analysts Care

⚠️ Unaggregated GroupBys

🧪 Concept Checks: GroupBy

2. Multiple Aggregations (.agg) : Advanced Summaries

💼 Why Data Analysts Care

🧠 Pro Tip

🧪 Concept Checks: Aggregations

3. Transform & Apply : Row-Level Group Math

💼 Why Data Analysts Care

🧠 Pro Tip

🧪 Concept Checks: Transform

🛠️ Professional Practice Tasks

💻 Pure Coding Interview Questions

Explain the 'Split-Apply-Combine' paradigm in the context of Pandas GroupBy.

What is the difference between df.groupby('A').sum() and df.groupby('A')['B'].sum()?

Why does .groupby() return a DataFrameGroupBy object instead of a DataFrame?

What does .reset_index() do after a groupby operation?

How do you group by multiple columns? (e.g., Year and Month).

Explain the difference between .agg() and .transform().

How do you apply different aggregation functions to different columns in a single .agg() call?

Write code to group by Column A, and get the size (count of rows) of each group using .size().

What is the difference between .count() and .size() on a GroupBy object?

How do you use .filter() on a GroupBy object to drop entire groups based on a condition?

What is a MultiIndex? How is it created during a .groupby()?

Write code to find the first occurrence (row) of each group. (.first()).

How do you calculate a rolling average within groups? (e.g., df.groupby('User')['Login'].rolling(3).mean()).

Explain how to use the as_index=False parameter in .groupby(). What does it replace?

Write a custom lambda function inside .agg() to calculate the 90th percentile of a group.

How do you iterate over groups in a GroupBy object? (for name, group in grouped:).

What happens if you group by a column that contains NaN values? (Hint: dropna=True is the default).

Explain how .transform() is equivalent to SQL Window Functions (OVER PARTITION BY).

Write code to fill NaN values in 'Salary' with the mean 'Salary' of that specific 'Department' using .transform().

How do you get the N-th row of each group? (.nth(n)).

What is pd.Grouper and how is it used for grouping Time Series data by frequency (e.g., Monthly)?

Explain the nunique() aggregation function. When is it useful?

How do you pivot a grouped DataFrame from long format to wide format? (.unstack()).

Write code to calculate the Cumulative Sum within each group. (.cumsum()).

How do you sort the output of a GroupBy aggregation by the aggregated values?

📊 Day 26 Executive Summary

✅ Instructor's End-of-Day Checklist

🧪 Concept Checks: `GroupBy`

🧪 Concept Checks: `Aggregations`

🧪 Concept Checks: `Transform`

What is the difference between `df.groupby('A').sum()` and `df.groupby('A')['B'].sum()`?

Why does `.groupby()` return a `DataFrameGroupBy` object instead of a DataFrame?

What does `.reset_index()` do after a groupby operation?

Explain the difference between `.agg()` and `.transform()`.

How do you apply different aggregation functions to different columns in a single `.agg()` call?

Write code to group by Column A, and get the size (count of rows) of each group using `.size()`.

What is the difference between `.count()` and `.size()` on a GroupBy object?

How do you use `.filter()` on a GroupBy object to drop entire groups based on a condition?

What is a `MultiIndex`? How is it created during a `.groupby()`?

Write code to find the first occurrence (row) of each group. (`.first()`).

How do you calculate a rolling average within groups? (e.g., `df.groupby('User')['Login'].rolling(3).mean()`).

Explain how to use the `as_index=False` parameter in `.groupby()`. What does it replace?

Write a custom lambda function inside `.agg()` to calculate the 90th percentile of a group.

How do you iterate over groups in a GroupBy object? (`for name, group in grouped:`).

What happens if you group by a column that contains `NaN` values? (Hint: `dropna=True` is the default).

Explain how `.transform()` is equivalent to SQL Window Functions (`OVER PARTITION BY`).

Write code to fill `NaN` values in 'Salary' with the mean 'Salary' of that specific 'Department' using `.transform()`.

How do you get the N-th row of each group? (`.nth(n)`).

What is `pd.Grouper` and how is it used for grouping Time Series data by frequency (e.g., Monthly)?

Explain the `nunique()` aggregation function. When is it useful?

How do you pivot a grouped DataFrame from long format to wide format? (`.unstack()`).

Write code to calculate the Cumulative Sum within each group. (`.cumsum()`).