Day 18 : Regex
🎯 Enterprise Objective
Regular Expressions (Regex) are a powerful tool for text manipulation. Although the syntax is notoriously cryptic, a single regex can often handle a data cleaning task that would otherwise take dozens of lines of manual string manipulation.
Strategic Overview
| # | Topic | Concept |
|---|---|---|
| 1 | Basics | \d, \w, re.findall |
| 2 | Quantifiers | +, *, ?, ^, $ |
| 3 | Groups | [abc], (capture) |
1. Regex Basics : Pattern Matching
Regular Expressions (Regex) are a mini-language for matching text patterns. Python's built-in re module lets you find, extract, and replace patterns that standard string methods like .replace() cannot express.
| Character | Matches | Example |
|---|---|---|
| \d | Any digit (0-9) | \d\d\d (matches 123) |
| \w | Word character (a-z, A-Z, 0-9, _) | \w+ (matches 'hello_1') |
| \s | Whitespace (space, tab, newline) | \s+ (matches one or more whitespace characters) |
| . | Any character except a newline | ... (matches any 3 chars) |
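The table above can be tried out directly. The following is a minimal sketch using made-up sample strings (they are not from the lesson) to show what each character class picks up:

```python
import re

text = "Order 66 shipped to Gate_7 on day 3"  # made-up sample text

# \d+ : every run of one or more digits
numbers = re.findall(r"\d+", text)

# \w+ : every run of word characters (letters, digits, underscore)
words = re.findall(r"\w+", text)

# \s+ : collapse any run of whitespace into a single space
tidy = re.sub(r"\s+", " ", "too    many \t spaces")

# ... : three "any" characters at a time, scanning left to right
triples = re.findall(r"...", "abcdefg")

print(numbers)  # ['66', '7', '3']
print(words)    # ['Order', '66', 'shipped', 'to', 'Gate_7', 'on', 'day', '3']
print(tidy)     # too many spaces
print(triples)  # ['abc', 'def']
```

Note that re.findall() always returns strings, so the numbers still need int() if you want to do arithmetic with them.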
💼 Why Data Analysts Care
• Data Cleaning: Extracting phone numbers or zip codes from messy text fields
• Validation: Ensuring a user's input strictly matches an email format
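Both use cases can be sketched in a few lines. The sample text, the helper name looks_like_email, and the email pattern below are illustrative assumptions; the pattern is deliberately simplified and would reject many real addresses (for example, ones with dots or hyphens before the @).

```python
import re

note = "Call 555-123-4567 or 555-987-6543 before noon."  # made-up sample text

# Data cleaning: pull every phone number shaped like ddd-ddd-dddd
phones = re.findall(r"\d\d\d-\d\d\d-\d\d\d\d", note)
print(phones)  # ['555-123-4567', '555-987-6543']


def looks_like_email(value):
    """Hypothetical helper: True only if the WHOLE string is word@word.word."""
    # re.fullmatch succeeds only when the pattern covers the entire string
    return re.fullmatch(r"\w+@\w+\.\w+", value) is not None


print(looks_like_email("ana@example.com"))          # True
print(looks_like_email("ana@example"))              # False
print(looks_like_email("mail ana@example.com now")) # False
```

Using re.fullmatch() rather than re.search() is what makes the validation strict: extra text before or after the address causes the check to fail.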
⚠️ Raw Strings
Always prefix regex patterns with an r (e.g., r'\d+'). This tells Python it's a 'raw string', preventing it from confusing regex backslashes with Python escape characters (like \n).
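A small sketch of what goes wrong without the r prefix. It uses \b, which Python reads as a backspace character in a normal string but which regex reads as a word boundary; the sample strings are made up for illustration.

```python
import re

plain = "\bcat\b"   # Python turns each \b into a backspace character (\x08)
raw = r"\bcat\b"    # backslashes survive, so regex sees \b as a word boundary

print(len(plain), len(raw))             # 5 7
print(re.findall(plain, "cat concat"))  # []
print(re.findall(raw, "cat concat"))    # ['cat']
```

The non-raw pattern silently searches for backspace characters and finds nothing, which is why this kind of bug is easy to miss.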
🧪 Concept Checks: Regex Basics
Q1. Import re. Use re.findall() with r"\d+" to extract all numbers from "I have 2 apples and 10 bananas".
Q2. Use re.sub() to replace all numbers in the string above with "X". Print the result.
Q3. Find all words in "The quick brown fox" using re.findall() and r"\w+".
Q4. Try matching the literal period in "www.google.com". Why does r"." match every character? How do you escape it so that only the period matches (r"\.")?
Q5. Why is the r prefix important in regex strings? What happens to print("\n") vs print(r"\n")?
2. Quantifiers & Anchors : Advanced Patterns
Quantifiers dictate how many times a character should occur. Anchors lock the match to the start or end of the string.
| Symbol | Meaning | Example |
|---|---|---|
| * | 0 or more | a* (matches '', 'a', 'aa') |
| + | 1 or more | a+ (matches 'a', 'aa') |
| ? | 0 or 1 (optional) | colou?r (matches 'color', 'colour') |
| {x,y} | Between x and y times | \d{3,4} (matches 123, 1234) |
| ^ / $ | Start / End of string | ^Hello / world$ |
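Each row of the table can be checked with a one-liner. This is a minimal sketch with made-up test strings; re.fullmatch() is used here only as a convenient way to ask "does the whole string match?".

```python
import re

# * allows zero occurrences, + requires at least one
star_on_empty = re.fullmatch(r"a*", "") is not None   # True
plus_on_empty = re.fullmatch(r"a+", "") is not None   # False

# ? makes the preceding character optional
spellings = re.findall(r"colou?r", "color colour colouur")

# {3,4} takes 3 or 4 digits; a 5-digit run yields only its first 4
runs = re.findall(r"\d{3,4}", "12 123 1234 12345")

# ^ and $ pin the match to the start or end of the string
starts = re.search(r"^Hello", "Hello world") is not None    # True
not_start = re.search(r"^Hello", "Say Hello") is not None   # False
ends = re.search(r"world$", "Hello world") is not None      # True

print(star_on_empty, plus_on_empty)  # True False
print(spellings)                     # ['color', 'colour']
print(runs)                          # ['123', '1234', '1234']
print(starts, not_start, ends)       # True False True
```

The \d{3,4} result is worth a second look: without anchors or boundaries, a quantifier happily matches part of a longer run.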
💼 Why Data Analysts Care
• Log Parsing: ^ERROR ensures we only match lines that start with ERROR, not lines that just contain the word
• Flexible Parsing: https?:// matches both http and https URLs
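The two bullets above can be sketched as follows. The log lines and URLs are invented sample data, and the trailing \S+ (one or more non-whitespace characters) is an addition of this sketch so that the whole URL is captured rather than just the scheme.

```python
import re

lines = [  # made-up log lines
    "ERROR disk full",
    "INFO retry after ERROR",
    "ERROR timeout",
]

# ^ERROR keeps only the lines that START with ERROR
error_lines = [line for line in lines if re.search(r"^ERROR", line)]
print(error_lines)  # ['ERROR disk full', 'ERROR timeout']

text = "Docs: http://example.com and https://example.org/guide"

# s? makes the 's' optional, so both schemes match
urls = re.findall(r"https?://\S+", text)
print(urls)  # ['http://example.com', 'https://example.org/guide']
```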
🧠 Pro Tip
By default, * and + are greedy (they match as much as possible). Append a ? to make them lazy: .*? matches as little as possible.
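A quick way to see the difference, sketched on a made-up string with two quoted words:

```python
import re

text = 'say "hi" and "bye"'  # made-up sample text

# Greedy: .* runs to the LAST closing quote it can reach
greedy = re.findall(r'".*"', text)

# Lazy: .*? stops at the FIRST closing quote it can reach
lazy = re.findall(r'".*?"', text)

print(greedy)  # ['"hi" and "bye"']
print(lazy)    # ['"hi"', '"bye"']
```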
🧪 Concept Checks: Quantifiers
Q1. Match strings that start with "ID-". Test on ["ID-123", "User ID-456"] using ^ID- with re.search().
Q2. Match words that end in "ing". Test on "running and jumping" using \w+ing.
Q3. Use the optional ? to match both "file.txt" and "files.txt".
Q4. Use the {n} quantifier to match exactly a 5-digit zip code in "My zip is 90210 or 1234".
Q5. Demonstrate greedy vs lazy: run re.findall() with r"<.*>" on a string that contains HTML tags (for example "<b>text</b>"), then repeat with r"<.*?>".
3. Groups & Sets : Extracting Structure
Character Sets `[]` allow you to match any ONE of the characters inside. Capture Groups `()` allow you to extract specific parts of a pattern.
| Syntax | Meaning | Example |
|---|---|---|
| [abc] | 'a' or 'b' or 'c' | b[aeiou]t (bat, bit, but) |
| [A-Z] | Range (capital letters) | [A-Za-z0-9] (Alphanumeric) |
| [^a] | Negation (NOT 'a') | [^0-9] (Non-digits) |
| (abc) | Capture group | (\d{3})-(\d{4}) (Extracts parts) |
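The table rows translate into code as shown below. This is a minimal sketch on made-up strings; it uses re.search() and the match object's group() and groups() methods to read back what each capture group caught.

```python
import re

# Character set: exactly one vowel between b and t
vowel_words = re.findall(r"b[aeiou]t", "bat bit but bxt")
print(vowel_words)  # ['bat', 'bit', 'but']

# Negated set: delete everything that is NOT a digit
digits_only = re.sub(r"[^0-9]", "", "(555) 123-4567")
print(digits_only)  # 5551234567

# Capture groups: the full match plus its two parts
match = re.search(r"(\d{3})-(\d{4})", "Call 555-1234 today")
print(match.group(0))  # 555-1234  (the whole match)
print(match.group(1))  # 555       (first group)
print(match.group(2))  # 1234      (second group)
print(match.groups())  # ('555', '1234')
```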
💼 Why Data Analysts Care
• ETL Pipelines: Extracting the Year, Month, and Day from a messy date string into separate variables
• Data Scrubbing: Removing all punctuation using re.sub(r'[^A-Za-z0-9\s]', '', text)
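Both bullets can be sketched on one messy string. The sample text and the assumed YYYY-MM-DD date layout are illustrative; real data may need a different pattern.

```python
import re

raw = "Invoice dated 2024-03-07, paid!!"  # made-up sample text

# ETL: three capture groups, unpacked into three variables
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", raw)
year, month, day = match.groups()
print(year, month, day)  # 2024 03 07

# Scrubbing: drop every character that is not a letter, digit, or whitespace
clean = re.sub(r"[^A-Za-z0-9\s]", "", raw)
print(clean)  # Invoice dated 20240307 paid
```

Notice that scrubbing also removes the hyphens inside the date, so extract structured values before stripping punctuation, not after.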
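🧠 Pro Tip
When a pattern contains two or more capture groups (), re.findall() returns a list of tuples, one per match, containing exactly the captured components. With a single capture group it returns a list of strings holding just that group, and with no groups it returns the full matches.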
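The three cases side by side, as a minimal sketch on a made-up string:

```python
import re

text = "Alice: 30, Bob: 25"  # made-up sample text

# No groups: each result is the full match
full = re.findall(r"\w+: \d+", text)

# One group: each result is just that group, as a string
ages = re.findall(r"\w+: (\d+)", text)

# Two groups: each result is a tuple of the captured parts
pairs = re.findall(r"(\w+): (\d+)", text)

print(full)   # ['Alice: 30', 'Bob: 25']
print(ages)   # ['30', '25']
print(pairs)  # [('Alice', '30'), ('Bob', '25')]
```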
🧪 Concept Checks: Groups
Q1. Match any vowel in "hello world" using re.findall() and a character set [aeiou].
Q2. Match any consonant in "hello world" using the negation set [^aeiou\s].
Q3. Extract the domain names from "user1@gmail.com" and "admin@yahoo.com" using capture groups r"@(\w+\.\w+)".
Q4. Use re.findall() with r"(\d+)\s(USD|EUR)" to extract amount and currency from "100 USD and 50 EUR".
Q5. Extract only words starting with a capital letter from "The Quick brown Fox" using [A-Z][a-z]+.
🛠️ Professional Practice Tasks
Theory is useless without muscle memory. Complete these tasks to solidify your understanding.
Task 1 (Email Extractor): Given a long messy string, write a regex to extract all valid email addresses. (Hint: [a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+).
Task 2 (Phone Normalizer): Given ["(555) 123-4567", "555-123-4567", "5551234567"], use re.sub() and capture groups to normalize them all to "555-123-4567" format.
Task 3 (HTML Tag Stripper): Write a function strip_tags(html) that removes all HTML tags.
Task 4 (Password Validator): Write a regex that matches a password ONLY if it has: 8+ chars, at least one digit, and at least one uppercase letter. (Use re.search and positive lookaheads if you dare, or just multiple standard regex checks).
Task 5 (Log Parser): Given log = "[2023-10-15 08:22:11] ERROR: Server crashed", write a regex with 3 capture groups to extract the Date, Time, and Log Level (ERROR).
💻 Pure Coding Interview Questions
Q1. What is the difference between re.match(), re.search(), and re.findall()?
Q2. Explain the difference between greedy and lazy matching. How do you make a quantifier lazy?
Q3. What is a raw string r"" in Python, and why is it almost mandatory for regex?
Q4. Write a regex to validate an IPv4 address (e.g., '192.168.1.1').
Q5. Explain what \b (word boundary) does. Give an example where it is necessary.
Q6. Write a regex to extract all hashtags from a tweet.
Q7. How do you compile a regex pattern using re.compile()? Why is this good for performance?
Q8. What is the re.IGNORECASE flag and how do you use it?
Q9. Write a regex to match a valid 24-hour time format (e.g., '23:59', but not '25:99').
Q10. Explain capture groups. How do you refer to a capture group in a re.sub() replacement string? (Hint: \1).
Q11. What is a non-capturing group (?:...) and why would you use it?
Q12. Explain positive lookahead (?=...) and negative lookahead (?!...).
Q13. Write a regex using negative lookahead to match 'foo' only if it is NOT followed by 'bar'.
Q14. Write a regex to extract the query parameters from a URL.
Q15. How do you match a string that contains exactly 5 letters, no more, no less? (Hint: anchors).
Q16. Write a regex to parse a CSV line, considering that some fields might be enclosed in quotes.
Q17. Explain the re.MULTILINE flag. How does it change the behavior of ^ and $?
Q18. Write a regex to find all duplicate words in a sentence (e.g., 'This is is a test'). (Hint: backreferences).
Q19. How do you split a string by multiple different delimiters (e.g., space, comma, semicolon) at once?
Q20. Write a regex to match a valid hex color code (e.g., '#FFF' or '#AABBCC').
Q21. Explain how re.finditer() differs from re.findall(). When should you use it?
Q22. Write a regex to check if a string contains only alphanumeric characters, without using str.isalnum().
Q23. How do you escape special regex characters (like *, ?, () dynamically if they are stored in a variable? (Hint: re.escape).
Q24. Write a regex to match Python single-line comments.
Q25. Why is regex generally a bad idea for parsing complex nested structures like HTML or JSON?
Day 18 Executive Summary
| # | Topic | Key Takeaway |
|---|---|---|
| 1 | Raw Strings | Always use r"pattern" |
| 2 | Anchors | ^ start, $ end |
| 3 | Extracting | () captures parts of a match; re.findall() returns tuples when there are several groups |
Instructor's End-of-Day Checklist
• [ ] I understand \d, \w, \s.
• [ ] I can use +, *, and ?.
• [ ] I can extract data using capture groups ().