โณ Loading Python Engine...

๐Ÿ“Š Day 06 : Sets

๐ŸŽฏ Enterprise Objective

Sets are the secret weapon of efficient Python code. While lists are for sequences, sets are for membership and uniqueness. Today we master set theory operations and discover how switching from a list to a set can speed up your data processing pipelines by 1000x.

๐Ÿ“‹ Strategic Overview

#TopicKey ConceptCore Use Case
1CreationUnordered, unique, {} or set()Deduplication
2Operations`\, &, -, ^`Cohort Analysis
3Methodsadd(), discard()Dynamic updating
4RelationsA <= B, issubset()Validation
5Frozen Setsfrozenset(), immutableDict keys, sets of sets
6PerformanceO(1) vs O(N)Pipeline optimization

1. Set Creation & Basics : Unordered, Unique Elements

๐Ÿ” What is it?

A set is an unordered collection of unique, hashable elements. Sets are defined by curly braces {} (but without key-value pairs) or the set() constructor. Their superpower is fast O(1) membership testing and automatic deduplication.

Creation MethodExampleNote
Curly braces{1, 2, 3}Values must be hashable
set() constructorset([1, 2, 2, 3])Automatically deduplicates
Empty setset(){} creates an empty dict!
Comprehension{x**2 for x in range(5)}Set comprehension

๐Ÿ’ผ Why Data Analysts Care

โ€ข Data Deduplication: unique_users = set(user_ids) โ€” instant deduplication

โ€ข Membership Testing: Checking if an item exists is exponentially faster than lists

โ€ข Vocabulary Building: Finding unique words in a document corpus

โš ๏ธ Empty Set Trap

{} creates an empty dictionary, not an empty set. You must use set() to create an empty set.

๐Ÿง  Pro Tip

Sets can only contain hashable (immutable) items. You cannot put a list or a dictionary inside a set, but you can put a tuple.

In [ ]:

๐Ÿงช Concept Checks: Set Creation

Q1. Create a set of your 3 favorite colors. Then create an empty set. Print their types.

In [ ]:

Q2. Convert the string "abracadabra" to a set. Print it. Explain why the length is much shorter.

In [ ]:

Q3. Try to create a set containing a list: s = {1, 2, [3, 4]}. Catch the TypeError and print the error message (unhashable type).

In [ ]:

Q4. Create a set containing a tuple: s = {1, 2, (3, 4)}. Print the set. Why does this work but lists do not?

In [ ]:

Q5. Use a set comprehension to create a set of the lengths of these words: ["data", "science", "python", "data", "code"]. Print it.

In [ ]:

2. Set Theory Operations : Mathematical Relational Logic

๐Ÿ” What is it?

Python sets support mathematical operations like union, intersection, difference, and symmetric difference. These can be performed using operators (e.g., |, &) or methods (e.g., .union()). Operators require both operands to be sets, while methods accept any iterable.

OperationOperatorMethodMeaning
Union`A \B`A.union(B)Elements in A or B (or both)
IntersectionA & BA.intersection(B)Elements in both A and B
DifferenceA - BA.difference(B)Elements in A but not in B
Symmetric DiffA ^ BA.symmetric_difference(B)Elements in exactly one set

๐Ÿ’ผ Why Data Analysts Care

โ€ข Cohort Analysis: Find users who bought Product A AND Product B (Intersection)

โ€ข Churn Analysis: Users active last month MINUS users active this month (Difference)

โ€ข A/B Testing: Combine unique user IDs from two experiment arms (Union)

โš ๏ธ Order Matters in Difference

A - B is not the same as B - A. The difference operation is asymmetric.
In [ ]:

๐Ÿงช Concept Checks: Set Operations

Q1. Given engineers = {"Alice", "Bob", "Charlie"} and managers = {"Charlie", "David", "Eve"}, find who is both an engineer and a manager.

In [ ]:

Q2. Using the same sets, find all unique employees across both roles (Union). Print the result.

In [ ]:

Q3. Find employees who are engineers but NOT managers. Find employees who are managers but NOT engineers. Print both.

In [ ]:

Q4. Find employees who hold exactly ONE role (not both). Print the symmetric difference.

In [ ]:

Q5. Demonstrate that operators require sets: try {1, 2} | [2, 3]. Then show how to fix it using the .union() method.

In [ ]:

3. Set Methods & Modification : Dynamic Set Updates

๐Ÿ” What is it?

Sets are mutable. You can add or remove elements using methods. Unlike lists, sets have no concept of index or order, so you cannot use append() or pop(i).

MethodActionNote
.add(x)Add one itemDoes nothing if already exists
.update(iter)Add multiple itemsLike .extend() for lists
.remove(x)Remove item xRaises KeyError if x not found
.discard(x)Remove item xDoes nothing if x not found (safe)
.pop()Remove & return random itemRaises KeyError if empty
.clear()Remove all itemsLeaves set()

๐Ÿ’ผ Why Data Analysts Care

โ€ข Incremental Processing: Add seen IDs to a set to track processed records: seen.add(row_id)

โ€ข Safe Cleanup: Use .discard() to remove elements without needing if item in set: checks

โš ๏ธ remove() vs discard()

Always use .discard() unless you specifically want your program to crash when trying to remove an element that isn't there.

In [ ]:

๐Ÿงช Concept Checks: Set Methods

Q1. Start with s = set(). Add elements 1, 2, and 2 again. Print the set.

In [ ]:

Q2. Update the set with a list [3, 4, 5] and a string "hello". Print the result. Notice how the string is handled.

In [ ]:

Q3. Demonstrate the difference between .remove() and .discard() by trying to remove an element that does not exist. Use a try/except block for .remove().

In [ ]:

Q4. Use a while loop and .pop() to empty a set of 5 elements, printing each popped item. Print the final set to confirm it is empty.

In [ ]:

Q5. Write a function add_if_missing(s, item) that adds an item and returns True if it was added, or False if it was already there.

In [ ]:

4. Subset & Superset Relations : Comparative Logic

๐Ÿ” What is it?

Sets can be compared to check if one is fully contained within another. A set is a subset if all its elements exist in the other set. It is a superset if it contains all elements of the other set. Two sets are disjoint if they share no common elements.

OperationOperatorMethodTrue if...
SubsetA <= BA.issubset(B)All elements of A are in B
Proper SubsetA < BNoneSubset AND A != B
SupersetA >= BA.issuperset(B)A contains all elements of B
DisjointNoneA.isdisjoint(B)A & B is empty

๐Ÿ’ผ Why Data Analysts Care

โ€ข Permission Checking: if user_roles.issubset(allowed_roles): โ€” verify access rights

โ€ข Data Validation: if required_columns.issubset(df.columns): โ€” ensure data structure is correct

โ€ข Overlap Checking: if isdisjoint โ€” verify two user segments are truly mutually exclusive

๐Ÿง  Pro Tip

Use <= and >= for subsets/supersets. A < B checks for a proper subset (A is a subset of B, but A is not equal to B).

In [ ]:

๐Ÿงช Concept Checks: Subset & Superset

Q1. Given required = {"age", "name"}, check if it is a subset of columns = {"age", "name", "salary"} using both method and operator.

In [ ]:

Q2. Check if columns is a superset of required using both method and operator. Print results.

In [ ]:

Q3. Demonstrate the difference between a subset (<=) and a proper subset (<) using A = {1, 2} and B = {1, 2}.

In [ ]:

Q4. Create two sets of tags for two different articles. Check if they have zero overlap using .isdisjoint(). Print the result.

In [ ]:

Q5. Write a function has_all_permissions(user_perms, required_perms) that returns True only if the user has all required permissions. Test it.

In [ ]:

5. Frozen Sets : Immutable Sets

๐Ÿ” What is it?

A frozenset is an immutable version of a set. Once created, elements cannot be added or removed. Because it is immutable, a frozenset is hashable, meaning it can be used as a dictionary key or as an element inside another set.

Featuresetfrozenset
Mutable?YesNo
Hashable?NoYes
Use as dict key?NoYes
Put inside a set?NoYes
MethodsAll set methodsOnly non-modifying methods (union, etc.)

๐Ÿ’ผ Why Data Analysts Care

โ€ข Compound Keys: distances = {frozenset(['NY', 'LA']): 2444} โ€” order-independent keys

โ€ข Caching: Use frozenset as arguments to cached/memoized functions

โ€ข Immutable Constraints: Prevent accidental modification of global rule sets

โš ๏ธ Sets of Sets

You cannot create a set of sets like {{1, 2}, {3, 4}} because sets are unhashable. You MUST use frozensets: {frozenset([1, 2]), frozenset([3, 4])}.

In [ ]:

๐Ÿงช Concept Checks: Frozen Sets

Q1. Create a frozenset from a list [1, 2, 3]. Try to use .add(4) and catch the AttributeError. Print the error message.

In [ ]:

Q2. Demonstrate that standard sets cannot be dict keys (catch TypeError), but frozensets can.

In [ ]:

Q3. Create a dictionary where keys are frozensets of two users, and values are their relationship status. Retrieve a value using the reverse order of users.

In [ ]:

Q4. Show that frozenset([1, 2, 3]) | frozenset([3, 4, 5]) works and returns a frozenset. Print the type of the result.

In [ ]:

Q5. Create a set containing three frozensets. Print the set.

In [ ]:

6. Sets vs Lists : O(1) Lookups & Performance

๐Ÿ” What is it?

Sets use hash tables under the hood. This means checking if an item is in a set takes O(1) time (constant time), regardless of how large the set is. Checking if an item is in a list takes O(n) time, which gets slower as the list grows.

OperationList Time ComplexitySet Time ComplexitySpeedup
x in collectionO(n)O(1)Massive for large data
DeduplicateO(n^2) (naive)O(n)Huge
IterationO(n)O(n)Equal

๐Ÿ’ผ Why Data Analysts Care

โ€ข Filtering Pipelines: Convert blacklist of IDs to a set before filtering a large DataFrame

โ€ข Cross-referencing: Checking which users in Table A exist in Table B

โ€ข Performance Optimization: Replacing if item in list: with if item in set: is the easiest performance win in Python

๐Ÿง  Pro Tip

If you need to check membership (in) inside a loop over a large dataset, ALWAYS convert the lookup collection to a set first.

In [ ]:

๐Ÿงช Concept Checks: Performance

Q1. Recreate the performance benchmark: time looking up -1 (not in collection) in a list of 1,000,000 items vs a set. Print the times.

In [ ]:

Q2. Given a huge text, extracting unique words using list.append (if not in list) vs set.add. Explain why the set approach is O(n) and list approach is O(n^2).

In [ ]:

Q3. Write a function find_common(l1, l2) that finds common elements between two large lists. Convert them to sets and use intersection. Time the function.

In [ ]:

Q4. Write a loop that filters a list of 100,000 numbers, keeping only those present in a valid_ids list of 10,000 numbers. Show how converting valid_ids to a set speeds it up.

In [ ]:

Q5. When would you NOT use a set for lookups? (Hint: Order preservation, duplicate counting). Explain briefly.

In [ ]:

๐Ÿ› ๏ธ Professional Practice Tasks

Theory is useless without muscle memory. Complete these tasks to solidify your understanding.

Task 1 (Data Deduplicator): Write a function unique_sorted_chars(text) that takes a string, removes duplicates, removes spaces, and returns a sorted list of the unique characters. Test with 'hello world'.

In [ ]:

Task 2 (Cohort Overlap): Given cohort_A = [1,2,3,4,5,6] and cohort_B = [4,5,6,7,8], use sets to find: IDs in both, IDs only in A, IDs only in B, and IDs in exactly one cohort. Print all 4 lists.

In [ ]:

Task 3 (Vocabulary Analyzer): Write a function jaccard_similarity(doc1, doc2) that splits two strings into sets of words and calculates: len(intersection) / len(union). Test with two short sentences.

In [ ]:

Task 4 (Access Controller): Given user_roles = {'read', 'write'} and admin_roles = {'read', 'write', 'delete'}, write code that uses subset/superset methods to check if the user has all admin roles, and if not, which ones are missing.

In [ ]:

Task 5 (Fast Filter): Create a list of 100,000 random integers. Create a list of 5,000 'blacklist' integers. Filter the main list to remove blacklisted integers using a set for O(1) lookups. Print the final length.

In [ ]:

๐Ÿ’ป Pure Coding Interview Questions

Q1.

Write a function has_duplicates(lst) that returns True if a list contains duplicates, using sets. Do it in one line.

In [ ]:

Q2.

Write a function missing_numbers(lst, n) that finds all missing numbers from 1 to n in a list. Use sets.

In [ ]:

Q3.

Write a function intersection_multiple(*sets) that returns the intersection of an arbitrary number of sets.

In [ ]:

Q4.

Write a function symmetric_diff_multiple(*sets) that finds elements present in exactly one of the sets.

In [ ]:

Q5.

Write a function is_pangram(text) that checks if a string contains every letter of the alphabet using sets.

In [ ]:

Q6.

Given two lists of dictionaries, find the symmetric difference based on a specific key (e.g., 'id').

In [ ]:

Q7.

Write a function remove_duplicates_preserve_order(lst) without using external libraries, but using a set for O(1) lookups.

In [ ]:

Q8.

Implement the Jaccard similarity index for two lists of items. Handle empty sets gracefully.

In [ ]:

Q9.

Write a function that takes a string and returns the length of the longest substring without repeating characters (sliding window + set).

In [ ]:

Q10.

Write a function find_pairs_sum(lst, target) that finds all unique pairs that sum to target, using a set for O(n) time.

In [ ]:

Q11.

Write a function find_common_chars(words) that returns a list of characters present in all words in a list.

In [ ]:

Q12.

Write code to flatten a list of sets into a single set: [{1,2}, {2,3}, {3,4}] -> {1,2,3,4}.

In [ ]:

Q13.

Explain the difference between s.remove(x) and s.discard(x). Write a wrapper function safe_remove(s, x) that mimics discard using remove.

In [ ]:

Q14.

Write a function is_anagram_set_trap(s1, s2) that shows why set(s1) == set(s2) is NOT a valid way to check for anagrams.

In [ ]:

Q15.

Create a dictionary acting as an undirected graph, where edges are represented by frozensets. frozenset([A, B]): weight.

In [ ]:

Q16.

Given a list of lists, deduplicate the inner lists. [[1,2], [2,1], [3,4]] -> [[1,2], [3,4]]. Use frozensets.

In [ ]:

Q17.

Write a function group_identical(lst) that groups identical dictionaries in a list using frozensets of their items.

In [ ]:

Q18.

Write a function set_from_generator(n) that creates a set of the first n prime numbers using a generator expression inside set().

In [ ]:

Q19.

Explain why {[]} raises a TypeError but {()} does not. Write code demonstrating this hashability concept.

In [ ]:

Q20.

Write a function longest_consecutive_sequence(nums) that runs in O(n) time using a set.

In [ ]:

Q21.

Write a function to find the elements that appear exactly once in an array where every other element appears twice. (Can use XOR or sets).

In [ ]:

Q22.

Write a function word_break(s, word_dict) that checks if a string can be segmented into words from a dictionary. Convert dict to set.

In [ ]:

Q23.

Write a function is_bipartite(graph) using two sets to color nodes and detect odd-length cycles.

In [ ]:

Q24.

Implement a custom MySet class wrapping a dictionary to demonstrate how sets work under the hood.

In [ ]:

Q25.

Write a function count_unique_vowels(word) using set intersection with {'a','e','i','o','u'}.

In [ ]:

๐Ÿ“Š Day 6 Executive Summary

#TopicKey TakeawayProfessional Application
1Creationset(), {1, 2}. Mutable, unordered.Data deduplication
2OperationsUnion `, Intersection &, Difference -`Cohort and overlap analysis
3Methodsadd, discard (safe), remove (unsafe)Incremental processing
4Relationsissubset <=, issuperset >=Data validation, permissions
5Frozen SetsImmutable, hashable setsSets of sets, Dict keys
6PerformanceO(1) membership testing (in)Massive optimization over lists

โœ… Instructor's End-of-Day Checklist

โ€ข [ ] I understand that sets are unordered and cannot be accessed by index.

โ€ข [ ] I know how to deduplicate a list instantly using set().

โ€ข [ ] I can perform union, intersection, and difference operations.

โ€ข [ ] I understand the difference between .remove() and .discard().

โ€ข [ ] I know when to use a frozenset (dict keys, sets of sets).

โ€ข [ ] I understand why in set is exponentially faster than in list.

โ€ข [ ] I have completed all 5 practice tasks.

โ€ข [ ] I have reviewed all 25 interview questions.