โณ Loading Python Engine...

๐Ÿ“Š Day 24 : Pandas Selection

๐ŸŽฏ Enterprise Objective

Knowing how to navigate tables is the most heavily tested skill in Data Analyst interviews. Today we master precise row/column extraction using .loc and .iloc, SQL-style filtering using Boolean Masks and .query(), and leaderboard generation via sorting.

๐Ÿ“‹ Strategic Overview

#TopicConcept
1Loc/IlocExact Indexing
2Masking/QueryConditional Filtering
3Sortingsort_values, nlargest

1. Loc and Iloc : Row and Column Selection

๐Ÿ” What is it?

Selecting data in Pandas requires precision. .iloc[] selects by integer position (like standard Python lists). .loc[] selects by index label and column name. Both follow the format [rows, columns].

AccessorParadigmExampleMeaning
.ilocInteger Positiondf.iloc[0:5, 0:2]First 5 rows, first 2 columns
.locLabel / Namedf.loc[:, 'Age':'City']All rows, columns from Age to City

๐Ÿ’ผ Why Data Analysts Care

โ€ข Feature Selection: Extracting the target variable Y = df['Price'] and the features X = df.iloc[:, :-1] for Machine Learning

โš ๏ธ Inclusive Loc

A major 'gotcha': Slicing with .iloc[0:5] is EXCLUSIVE of 5 (returns 0,1,2,3,4). But slicing with .loc['A':'C'] is INCLUSIVE of 'C' (returns A,B,C).

In [ ]:

๐Ÿงช Concept Checks: Loc / Iloc

Q1. Given df, use .iloc to select the first 3 rows and the first 2 columns.

In [ ]:

Q2. Use .loc to select all rows (:), but only the "Name" and "Salary" columns.

In [ ]:

Q3. What happens if you try df.iloc[0, "Name"]? Catch the TypeError (iloc requires integers).

In [ ]:

Q4. Use .loc to slice rows from "ID2" to "ID4" inclusive. Print the result.

In [ ]:

Q5. Extract the single scalar value in row 0, column 1 using .iloc[0, 1]. Print it.

In [ ]:

2. Boolean Filtering : Querying Data

๐Ÿ” What is it?

You can filter DataFrames exactly like NumPy arrays using Boolean Masks. Create a condition, and pass it into the DataFrame brackets: df[condition]. Pandas also provides the highly readable .query() method for SQL-like string queries.

# Standard Masking
rich_users = df[df['Salary'] > 75000]

# Multiple conditions (Requires parenthesis and bitwise &)
targets = df[(df['Age'] > 30) & (df['City'] == 'NY')]

# The elegant .query() alternative
targets = df.query("Age > 30 and City == 'NY'")

๐Ÿ’ผ Why Data Analysts Care

โ€ข Cohort Extraction: Filtering a massive dataset down to only active users from a specific country

โ€ข Date Filtering: df[df['Date'] >= '2024-01-01'] to extract year-to-date performance

๐Ÿง  Pro Tip

When using .query(), you can refer to external Python variables by prefixing them with an @ symbol: df.query('Age > @min_age').

In [ ]:

๐Ÿงช Concept Checks: Filtering

Q1. Create a mask for Age >= 30. Use it to filter df and print the result.

In [ ]:

Q2. Filter df for people who are in "IT" OR (|) "Sales". (Remember parentheses around conditions).

In [ ]:

Q3. Use the .isin() method to achieve the exact same result as Q2. Print it.

In [ ]:

Q4. Use the .query() method to find rows where Age < 40 and Dept == "HR". Print the result.

In [ ]:

Q5. Filter the DataFrame to keep rows where the Name starts with "A" using df[df["Name"].str.startswith("A")].

In [ ]:

3. Sorting and Ranking : Ordering Data

๐Ÿ” What is it?

Once data is filtered, we often need to sort it to find the top/bottom performers using .sort_values(). You can sort by multiple columns and specify ascending/descending order.

MethodPurposeExample
sort_values(by=)Sort rowsdf.sort_values(by='Age', ascending=False)
nlargest(n, col)Top N rowsdf.nlargest(5, 'Salary') (Faster than sorting!)
rank()Assign ranksdf['Salary'].rank(ascending=False)

๐Ÿ’ผ Why Data Analysts Care

โ€ข Leaderboards: Finding the top 10 highest revenue generating products

โ€ข Time Series: Ensuring financial data is strictly sorted chronologically before calculating moving averages

๐Ÿง  Pro Tip

If you just need the top 5 values, df.nlargest(5, 'Col') is computationally faster than sorting the entire million-row DataFrame and calling .head(5).

In [ ]:

๐Ÿงช Concept Checks: Sorting

Q1. Sort df by "Time" in ascending order (fastest to slowest). Print the result.

In [ ]:

Q2. Sort df by "Name" alphabetically. Print it.

In [ ]:

Q3. Use .nsmallest(2, "Time") to find the two fastest times. Print the result.

In [ ]:

Q4. Create a new column "Rank" using df["Score"].rank(ascending=False, method="min"). Print df.

In [ ]:

Q5. Sort the DataFrame by the index (if it was scrambled) using df.sort_index(). Print it.

In [ ]:

๐Ÿ› ๏ธ Professional Practice Tasks

Theory is useless without muscle memory. Complete these tasks to solidify your understanding.

Task 1 (Matrix Extraction): Create a 5x5 DataFrame of random numbers. Use .iloc to extract a 3x3 subset from the very middle of the DataFrame.

In [ ]:

Task 2 (Cohort Filter): Create a DataFrame of 10 users with Age and Country. Filter for users Age > 18 AND Country == 'USA'. Extract only their Name column using .loc[mask, 'Name'].

In [ ]:

Task 3 (Dynamic Query): Create a variable min_salary = 60000. Use df.query("Salary > @min_salary") to filter a DataFrame of employees. Print the result.

In [ ]:

Task 4 (Top Performers): Create a DataFrame of 100 random student scores. Find the top 5 scores using .nlargest(). Then sort the whole DataFrame and use .head(5). Verify they match.

In [ ]:

Task 5 (Text Filtering): Create a DataFrame of product names. Filter the DataFrame to only keep rows where the product name contains the word 'Pro' using df['Product'].str.contains('Pro').

In [ ]:

๐Ÿ’ป Pure Coding Interview Questions

Q1.

What is the difference between .loc and .iloc?

In [ ]:

Q2.

Explain why .iloc[0:2] returns 2 rows, but .loc[0:2] might return 3 rows (if the index is integers).

In [ ]:

Q3.

How do you filter a DataFrame based on values in a list? (Hint: .isin()).

In [ ]:

Q4.

Write a boolean mask to filter out rows where a specific column contains NaN. (Hint: .notna()).

In [ ]:

Q5.

What is the .query() method and why might you choose it over standard boolean masking?

In [ ]:

Q6.

How do you reference external variables inside a .query() string?

In [ ]:

Q7.

Explain what SettingWithCopyWarning is. How do you prevent it using .copy()?

In [ ]:

Q8.

Write code to sort a DataFrame by Column A descending, and then Column B ascending.

In [ ]:

Q9.

What is the difference between sort_values and sort_index?

In [ ]:

Q10.

Why is nlargest(5, 'A') generally preferred over sort_values('A').head(5)?

In [ ]:

Q11.

How do you select a single scalar value from a DataFrame extremely fast? (Hint: .at and .iat).

In [ ]:

Q12.

Write code to filter a DataFrame using a regular expression on a string column. (.str.contains(regex=True)).

In [ ]:

Q13.

How do you select columns based on their data type? (df.select_dtypes(include='number')).

In [ ]:

Q14.

Explain how the ~ operator is used in Pandas boolean masking.

In [ ]:

Q15.

Write code to invert a boolean mask.

In [ ]:

Q16.

How do you update the values of a column only for specific rows using .loc?

In [ ]:

Q17.

What happens if you assign a list of values to a new column, but the list length doesn't match the DataFrame length?

In [ ]:

Q18.

Explain the rank() method. How does it handle tied values by default?

In [ ]:

Q19.

How do you filter a DataFrame based on the length of a string in a column? (df['Col'].str.len() > 5).

In [ ]:

Q20.

Write code to drop all rows where ANY column has a missing value.

In [ ]:

Q21.

How do you reset the index of a DataFrame after filtering it? Why is drop=True important?

In [ ]:

Q22.

Explain how boolean masking leverages NumPy's vectorized operations under the hood.

In [ ]:

Q23.

What is a MultiIndex (Hierarchical Index)? How do you select data from it using .loc?

In [ ]:

Q24.

Write code to select every alternate row in a DataFrame using .iloc and step slicing.

In [ ]:

Q25.

How do you randomly shuffle the rows of a DataFrame using .sample()?

In [ ]:

๐Ÿ“Š Day 24 Executive Summary

#TopicKey Takeaway
1loc/ilociloc is integer position (exclusive end). loc is label name (inclusive end).
2Maskingdf[df['A'] > 5] filters rows. Combine with & and ``.
3Querydf.query("A > 5") is highly readable for complex logic.

โœ… Instructor's End-of-Day Checklist

โ€ข [ ] I can extract rows and columns using .loc and .iloc.

โ€ข [ ] I can filter a DataFrame using multiple conditions.

โ€ข [ ] I can sort data and find the top N records.