โณ Loading Python Engine...

๐Ÿ“Š Day 26 : Pandas GroupBy

๐ŸŽฏ Enterprise Objective

Aggregating data is the core of analytics. You must be able to summarize millions of rows into executive dashboards. Today we master the groupby mechanics to slice data by categories, calculate advanced .agg() summaries, and perform SQL-style Window Functions using .transform().

๐Ÿ“‹ Strategic Overview

#TopicConcept
1GroupBySplit-Apply-Combine
2Aggregation.agg({'A':'sum'})
3TransformWindow Functions

1. The GroupBy Mechanics : Split-Apply-Combine

๐Ÿ” What is it?

The .groupby() method is identical to the SQL GROUP BY statement. It follows the Split-Apply-Combine pattern: it splits the data into groups based on a key, applies an aggregation function (like sum or mean), and combines the results into a new DataFrame.

# Calculate the average salary per department
# 1. Split by Dept | 2. Select Salary | 3. Apply mean()
avg_salary = df.groupby('Dept')['Salary'].mean()

๐Ÿ’ผ Why Data Analysts Care

โ€ข Financial Reporting: Calculating total monthly revenue: df.groupby('Month')['Revenue'].sum()

โ€ข User Behavior: Finding the maximum session length per user: df.groupby('UserID')['Duration'].max()

โš ๏ธ Unaggregated GroupBys

If you just print df.groupby('Dept'), you will get a memory address. You MUST apply an aggregation function (like .sum()) to actually see data!

In [ ]:

๐Ÿงช Concept Checks: GroupBy

Q1. Create a df with City and Sales. Group by City and calculate .sum(). Print it.

In [ ]:

Q2. Group by City and calculate .mean(). Print it.

In [ ]:

Q3. Group by City and calculate .count() to see how many transactions happened in each city.

In [ ]:

Q4. Print a raw df.groupby("City") object. Observe the memory address output.

In [ ]:

Q5. Group by City, select the Sales column, and find the .max() value per city.

In [ ]:

2. Multiple Aggregations (.agg) : Advanced Summaries

๐Ÿ” What is it?

Sometimes you want multiple statistics at once (e.g., the Mean AND the Max). You can pass a list of functions to the .agg() method. You can also pass a dictionary to apply different functions to different columns!

# Min and Max for Salary
df.groupby('Dept')['Salary'].agg(['min', 'max'])

# Different math for different columns
df.groupby('Dept').agg({'Salary': 'mean', 'Experience': 'max'})

๐Ÿ’ผ Why Data Analysts Care

โ€ข Executive Dashboards: Showing the Total Sales, Average Order Value, and Count of Orders per region in one table

๐Ÿง  Pro Tip

GroupBy outputs have the group keys (e.g., 'Dept') as the Index. If you want a flat standard DataFrame, append .reset_index() to the end of your groupby chain!

In [ ]:

๐Ÿงช Concept Checks: Aggregations

Q1. Group df by Dept. Use .agg() to find the min and max of Salary.

In [ ]:

Q2. Use .agg() with a dictionary to find the sum of Salary and the mean of Experience.

In [ ]:

Q3. Add .reset_index() to your answer in Q2 and observe how Dept becomes a normal column.

In [ ]:

Q4. Create a custom aggregation: df.groupby("Dept")["Salary"].agg(lambda x: x.max() - x.min()) to find the salary range.

In [ ]:

Q5. Group by TWO columns at once: df.groupby(["Dept", "JobTitle"]).sum(). (Assume a JobTitle column exists).

In [ ]:

3. Transform & Apply : Row-Level Group Math

๐Ÿ” What is it?

While .agg() collapses data into a summary table, .transform() returns data the exact same shape as the original. This is used for calculating group-level metrics and broadcasting them back to the original rows (like SQL Window Functions).

# Calculate the Dept Mean, and broadcast it to every employee's row
df['Dept_Avg'] = df.groupby('Dept')['Salary'].transform('mean')

# Now you can calculate how much above/below average they are!
df['Above_Avg'] = df['Salary'] - df['Dept_Avg']

๐Ÿ’ผ Why Data Analysts Care

โ€ข Standardization: Z-Score normalization per category: (x - group_mean) / group_std

โ€ข Percent to Total: Calculating what percentage an employee's salary contributes to their department's total budget

๐Ÿง  Pro Tip

Use .agg() when you want to reduce dimensions (create a summary report). Use .transform() when you want to add new calculated columns to your existing DataFrame.

In [ ]:

๐Ÿงช Concept Checks: Transform

Q1. Create df with Team and Score. Use .transform("max") to add a column Team_Max_Score.

In [ ]:

Q2. Create a column Is_Top_Scorer which is True if Score == Team_Max_Score.

In [ ]:

Q3. Use .transform("sum") to calculate Team_Total. Then create %_of_Total = Score / Team_Total.

In [ ]:

Q4. What happens if you try to assign .agg("sum") to a new column? Catch the ValueError (shapes don't match).

In [ ]:

Q5. Explain the SQL equivalent of .transform() (Hint: OVER (PARTITION BY ...) Window Functions).

In [ ]:

๐Ÿ› ๏ธ Professional Practice Tasks

Theory is useless without muscle memory. Complete these tasks to solidify your understanding.

Task 1 (Sales Report): Given a DF with Date, Region, Rep, and Sales. Group by Region and calculate the Total Sales and Average Sales. Reset the index.

In [ ]:

Task 2 (Multi-Level Grouping): Group the Sales DF by BOTH Region and Rep. Calculate the sum of Sales. Notice the MultiIndex created. Use .reset_index() to flatten it.

In [ ]:

Task 3 (Custom Aggregation): Write a custom function spread(series) that returns series.max() - series.min(). Pass this function into .agg() when grouping by Region.

In [ ]:

Task 4 (Percent to Total): Given a DF of Category and Revenue. Use .transform('sum') to find the total revenue per category. Create a new column Pct_Rev representing that row's percentage contribution to its category.

In [ ]:

Task 5 (Filtering Groups): Group by Region. Use .filter(lambda g: g['Sales'].sum() > 10000) to drop all rows belonging to regions that didn't meet a 10k quota. (Note: .filter is a powerful GroupBy method!).

In [ ]:

๐Ÿ’ป Pure Coding Interview Questions

Q1.

Explain the 'Split-Apply-Combine' paradigm in the context of Pandas GroupBy.

In [ ]:

Q2.

What is the difference between df.groupby('A').sum() and df.groupby('A')['B'].sum()?

In [ ]:

Q3.

Why does .groupby() return a DataFrameGroupBy object instead of a DataFrame?

In [ ]:

Q4.

What does .reset_index() do after a groupby operation?

In [ ]:

Q5.

How do you group by multiple columns? (e.g., Year and Month).

In [ ]:

Q6.

Explain the difference between .agg() and .transform().

In [ ]:

Q7.

How do you apply different aggregation functions to different columns in a single .agg() call?

In [ ]:

Q8.

Write code to group by Column A, and get the size (count of rows) of each group using .size().

In [ ]:

Q9.

What is the difference between .count() and .size() on a GroupBy object?

In [ ]:

Q10.

How do you use .filter() on a GroupBy object to drop entire groups based on a condition?

In [ ]:

Q11.

What is a MultiIndex? How is it created during a .groupby()?

In [ ]:

Q12.

Write code to find the first occurrence (row) of each group. (.first()).

In [ ]:

Q13.

How do you calculate a rolling average within groups? (e.g., df.groupby('User')['Login'].rolling(3).mean()).

In [ ]:

Q14.

Explain how to use the as_index=False parameter in .groupby(). What does it replace?

In [ ]:

Q15.

Write a custom lambda function inside .agg() to calculate the 90th percentile of a group.

In [ ]:

Q16.

How do you iterate over groups in a GroupBy object? (for name, group in grouped:).

In [ ]:

Q17.

What happens if you group by a column that contains NaN values? (Hint: dropna=True is the default).

In [ ]:

Q18.

Explain how .transform() is equivalent to SQL Window Functions (OVER PARTITION BY).

In [ ]:

Q19.

Write code to fill NaN values in 'Salary' with the mean 'Salary' of that specific 'Department' using .transform().

In [ ]:

Q20.

How do you get the N-th row of each group? (.nth(n)).

In [ ]:

Q21.

What is pd.Grouper and how is it used for grouping Time Series data by frequency (e.g., Monthly)?

In [ ]:

Q22.

Explain the nunique() aggregation function. When is it useful?

In [ ]:

Q23.

How do you pivot a grouped DataFrame from long format to wide format? (.unstack()).

In [ ]:

Q24.

Write code to calculate the Cumulative Sum within each group. (.cumsum()).

In [ ]:

Q25.

How do you sort the output of a GroupBy aggregation by the aggregated values?

In [ ]:

๐Ÿ“Š Day 26 Executive Summary

#TopicKey Takeaway
1GroupBydf.groupby('Key')['Value'].sum() is the most common analytics code
2AggPass lists ['min', 'max'] or dicts to perform complex math
3TransformReturns data in the original shape. Perfect for calculating % of Total

โœ… Instructor's End-of-Day Checklist

โ€ข [ ] I can group by a column and calculate the mean.

โ€ข [ ] I can use .agg() for multiple statistics.

โ€ข [ ] I understand the difference between .agg() and .transform().