Day 18: Regex — Manodemy

🎯 Enterprise Objective

Regular Expressions (Regex) are a compact, powerful tool for text manipulation. The syntax is notoriously cryptic, but once you know it, a one-line pattern can often replace dozens of lines of manual string handling in data cleaning tasks.

📋 Strategic Overview

#	Topic	Concept
1	Basics	`\d`, `\w`, `re.findall`
2	Quantifiers	`+`, `*`, `?`, `^`, `$`
3	Groups	`[abc]`, `(capture)`

1. Regex Basics : Pattern Matching

🔍 What is it?

Regular Expressions (Regex) are a mini-language for matching text patterns. Python's built-in re module lets you find, extract, and replace patterns that plain string methods like .replace() cannot express.

Character	Matches	Example
`\d`	Any digit (0-9)	`\d\d\d` (matches 123)
`\w`	Word character (a-z, A-Z, 0-9, _)	`\w+` (matches 'hello_1')
`\s`	Whitespace	`\s+` (matches multiple spaces)
`.`	Any character except a newline	`...` (matches any 3 chars)
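
A minimal sketch of the four character classes in the table, using re.findall(). The sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text (note the double space before "unit_7").
text = "Order 66 shipped to  unit_7 on day 3"

digits = re.findall(r"\d", text)          # one match per single digit
words = re.findall(r"\w+", text)          # runs of letters, digits, underscore
gaps = re.findall(r"\s+", text)           # runs of whitespace
triples = re.findall(r"...", "abcdefg")   # non-overlapping chunks of 3 chars

print(digits)   # ['6', '6', '7', '3']
print(words)    # ['Order', '66', 'shipped', 'to', 'unit_7', 'on', 'day', '3']
print(triples)  # ['abc', 'def']
```

Note that `...` leaves the trailing "g" unmatched because fewer than three characters remain.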

💼 Why Data Analysts Care

• Data Cleaning: Extracting phone numbers or zip codes from messy text fields

• Validation: Ensuring a user's input strictly matches an email format

⚠️ Raw Strings

Always prefix regex patterns with an r (e.g., r'\d+'). This tells Python it's a 'raw string', preventing it from confusing regex backslashes with Python escape characters (like \n).
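
A short sketch of what goes wrong without the r prefix. It uses \b, which means "word boundary" to the regex engine but "backspace" to Python when the string is not raw. The sample sentence is an assumption for illustration.

```python
import re

# "\n" is a single newline character; r"\n" is a backslash plus the letter n.
plain_len = len("\n")
raw_len = len(r"\n")

# Without the r prefix, Python converts "\b" into a backspace character
# before the regex engine sees it, so the pattern silently finds nothing.
sentence = "the cat sat"
without_raw = re.findall("\bcat\b", sentence)
with_raw = re.findall(r"\bcat\b", sentence)

print(plain_len, raw_len)  # 1 2
print(without_raw)         # []
print(with_raw)            # ['cat']
```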

In [ ]:

🧪 Concept Checks: `Regex Basics`

Q1. Import re. Use re.findall() with r"\d+" to extract all numbers from "I have 2 apples and 10 bananas".

In [ ]:

Q2. Use re.sub() to replace all numbers in the string above with "X". Print the result.

In [ ]:

Q3. Find all words in "The quick brown fox" using re.findall() and r"\w+".

In [ ]:

Q4. Try matching the literal period in "www.google.com". Why does r"." match everything? How do you escape it r"\."?

In [ ]:

Q5. Why is the r prefix important in regex strings? What happens to print("\n") vs print(r"\n")?

In [ ]:

2. Quantifiers & Anchors : Advanced Patterns

🔍 What is it?
Quantifiers dictate how many times a character should occur. Anchors lock the match to the start or end of the string.

Symbol	Meaning	Example
`*`	0 or more	`a*` (matches '', 'a', 'aa')
`+`	1 or more	`a+` (matches 'a', 'aa')
`?`	0 or 1 (optional)	`colou?r` (matches 'color', 'colour')
`{x,y}`	Between x and y times	`\d{3,4}` (matches 123, 1234)
`^` / `$`	Start / End of string	`^Hello` / `world$`
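
The table rows can be tried out with a sketch like the following. The test strings are invented for illustration.

```python
import re

# ? makes the preceding character optional.
samples = ["color", "colour", "colouur"]
optional_u = [s for s in samples if re.fullmatch(r"colou?r", s)]

# + needs at least one "a"; each run comes back as one match.
a_runs = re.findall(r"a+", "caaandy a")

# {3,4} takes 3 or 4 digits. With no anchors, the 5-digit run
# still yields a 4-digit partial match.
digit_runs = re.findall(r"\d{3,4}", "12 123 1234 12345")

# ^ and $ pin the match to the start or end of the string.
starts_with_hello = re.search(r"^Hello", "Hello world") is not None
ends_with_world = re.search(r"world$", "Hello world") is not None
starts_with_world = re.search(r"^world", "Hello world") is not None

print(optional_u)  # ['color', 'colour']
print(a_runs)      # ['aaa', 'a']
print(digit_runs)  # ['123', '1234', '1234']
```

The last result shows why quantifiers are often combined with anchors or boundaries: on their own they are happy to match part of a longer run.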

💼 Why Data Analysts Care

• Log Parsing: ^ERROR matches only lines that start with ERROR, not lines that merely contain the word (when scanning one multi-line string, add the re.MULTILINE flag so ^ applies to every line)

• Flexible Parsing: https?:// matches both http and https URLs
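
Both use cases can be sketched as follows. The log lines and URLs are hypothetical, and \S+ (a run of non-whitespace) is only an assumed, rough way to capture the rest of a URL.

```python
import re

# Hypothetical log lines, checked one at a time.
log_lines = [
    "ERROR disk full",
    "INFO retrying after ERROR",
    "ERROR timeout",
]
error_lines = [line for line in log_lines if re.search(r"^ERROR", line)]

# In one multi-line string, ^ matches only at the very start
# unless the re.MULTILINE flag is passed.
log_text = "\n".join(log_lines)
default_hits = re.findall(r"^ERROR", log_text)
multiline_hits = re.findall(r"^ERROR", log_text, flags=re.MULTILINE)

# s? makes the "s" optional, so both schemes match.
sentence = "See http://a.example and https://b.example/docs now"
urls = re.findall(r"https?://\S+", sentence)

print(error_lines)          # ['ERROR disk full', 'ERROR timeout']
print(len(default_hits))    # 1
print(len(multiline_hits))  # 2
print(urls)                 # ['http://a.example', 'https://b.example/docs']
```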

🧠 Pro Tip

By default, * and + are greedy (they match as much as possible). Append a ? to make them lazy: .*? matches as little as possible.
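
A small sketch of the difference, using quoted phrases in a made-up sentence. The greedy pattern runs from the first quote to the last one; the lazy pattern stops at the nearest closing quote.

```python
import re

quote_text = 'say "yes" or "no" today'

greedy = re.findall(r'".*"', quote_text)   # as much as possible
lazy = re.findall(r'".*?"', quote_text)    # as little as possible

print(greedy)  # ['"yes" or "no"']
print(lazy)    # ['"yes"', '"no"']
```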

In [ ]:

🧪 Concept Checks: `Quantifiers`

Q1. Match strings that start with "ID-". Test on ["ID-123", "User ID-456"] using ^ID- with re.search().

In [ ]:

Q2. Match words that end in "ing". Test on "running and jumping" using \w+ing.

In [ ]:

Q3. Use the optional ? to match both "file.txt" and "files.txt".

In [ ]:

Q4. Use the {n} quantifier to match exactly a 5-digit zip code in "My zip is 90210 or 1234".

In [ ]:

Q5. Demonstrate greedy vs lazy: run re.findall() with r"<.*>" and then with r"<.*?>" on a string that wraps the word "text" in tags, for example "<b>text</b>".

In [ ]:

3. Groups & Sets : Extracting Structure

🔍 What is it?
Character Sets `[]` allow you to match any ONE of the characters inside. Capture Groups `()` allow you to extract specific parts of a pattern.

Syntax	Meaning	Example
`[abc]`	'a' or 'b' or 'c'	`b[aeiou]t` (bat, bit, but)
`[A-Z]`	Range (capital letters)	`[A-Za-z0-9]` (Alphanumeric)
`[^a]`	Negation (NOT 'a')	`[^0-9]` (Non-digits)
`(abc)`	Capture group	`(\d{3})-(\d{4})` (Extracts parts)
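
The rows above can be sketched like this. The sample strings are invented for illustration.

```python
import re

# A set matches exactly one character from the list.
vowel_words = re.findall(r"b[aeiou]t", "bat bit bot bxt but")

# A negated set matches any one character NOT in the list;
# here every non-digit is replaced with nothing.
only_digits = re.sub(r"[^0-9]", "", "Tel: 555-1234")

# Capture groups pull out the pieces of a match.
match = re.search(r"(\d{3})-(\d{4})", "Tel: 555-1234")
whole = match.group(0)        # the full match
first, second = match.groups()  # just the captured parts

print(vowel_words)    # ['bat', 'bit', 'bot', 'but']
print(only_digits)    # 5551234
print(whole)          # 555-1234
print(first, second)  # 555 1234
```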

💼 Why Data Analysts Care

• ETL Pipelines: Extracting the Year, Month, and Day from a messy date string into separate variables

• Data Scrubbing: Removing all punctuation using re.sub(r'[^A-Za-z0-9\s]', '', text)
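
A minimal sketch of both tasks, assuming the date appears in YYYY-MM-DD form. The input string is hypothetical; real date fields usually need a more forgiving pattern.

```python
import re

raw = "Invoice dated 2024-03-15 (paid!)"

# Three capture groups: year, month, day (all returned as strings).
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", raw)
year, month, day = match.groups() if match else (None, None, None)

# Drop everything that is not a letter, digit, or whitespace.
clean = re.sub(r"[^A-Za-z0-9\s]", "", raw)

print(year, month, day)  # 2024 03 15
print(clean)             # Invoice dated 20240315 paid
```

The scrubbing step also removes the hyphens inside the date, so extract structured values before stripping punctuation.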

🧠 Pro Tip

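When re.findall() is used with a pattern that has two or more capture groups (), it returns a list of tuples (one tuple per match, holding exactly the captured components) instead of whole-match strings. With a single group, it returns a list of that group's strings.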
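
The return shape of re.findall() depends on how many groups the pattern has. A sketch with an invented sample string:

```python
import re

text = "3 kg, 12 lb"

no_groups = re.findall(r"\d+ \w+", text)       # whole matches
one_group = re.findall(r"(\d+) \w+", text)     # only the group's text
two_groups = re.findall(r"(\d+) (\w+)", text)  # one tuple per match

print(no_groups)   # ['3 kg', '12 lb']
print(one_group)   # ['3', '12']
print(two_groups)  # [('3', 'kg'), ('12', 'lb')]
```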

In [ ]:

🧪 Concept Checks: `Groups`

Q1. Match any vowel in "hello world" using re.findall() and a character set [aeiou].

In [ ]:

Q2. Match any consonant in "hello world" using the negation set [^aeiou\s].

In [ ]:

Q3. Extract the domain names from "user1@gmail.com" and "admin@yahoo.com" using capture groups r"@(\w+\.\w+)".

In [ ]:

Q4. Use re.findall() with r"(\d+)\s(USD|EUR)" to extract amount and currency from "100 USD and 50 EUR".

In [ ]:

Q5. Extract only words starting with a capital letter from "The Quick brown Fox" using [A-Z][a-z]+.

In [ ]:

🛠️ Professional Practice Tasks

Theory is useless without muscle memory. Complete these tasks to solidify your understanding.

Task 1 (Email Extractor): Given a long messy string, write a regex to extract all valid email addresses. (Hint: [a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+).

In [ ]:

Task 2 (Phone Normalizer): Given ["(555) 123-4567", "555-123-4567", "5551234567"], use re.sub() and capture groups to normalize them all to "555-123-4567" format.

In [ ]:

Task 3 (HTML Tag Stripper): Write a function strip_tags(html) that removes all HTML tags using a lazy regex r"<.*?>" and re.sub(). Test it on a sample string.

#	Topic	Key Takeaway
1	Raw Strings	Always use `r"pattern"`
2	Anchors	`^` start, `$` end
3	Extracting	`()` captures parts of a match; `re.findall()` returns tuples when there are two or more groups

💻 Pure Coding Interview Questions

What is the difference between re.match(), re.search(), and re.findall()?

Explain the difference between greedy and lazy matching. How do you make a quantifier lazy?

What is a raw string r"" in Python, and why is it almost mandatory for regex?

Write a regex to validate an IPv4 address (e.g., '192.168.1.1').

Explain what \b (word boundary) does. Give an example where it is necessary.

Write a regex to extract all hashtags from a tweet.

How do you compile a regex pattern using re.compile()? Why is this good for performance?

What is the re.IGNORECASE flag and how do you use it?

Write a regex to match a valid 24-hour time format (e.g., '23:59', but not '25:99').

Explain capture groups. How do you refer to a capture group in a re.sub() replacement string? (Hint: \1).

What is a non-capturing group (?:...) and why would you use it?

Explain positive lookahead (?=...) and negative lookahead (?!...).

Write a regex using negative lookahead to match 'foo' only if it is NOT followed by 'bar'.

Write a regex to extract the query parameters from a URL.

How do you match a string that contains exactly 5 letters, no more, no less? (Hint: anchors).

Write a regex to parse a CSV line, considering that some fields might be enclosed in quotes.

Explain the re.MULTILINE flag. How does it change the behavior of ^ and $?

Write a regex to find all duplicate words in a sentence (e.g., 'This is is a test'). (Hint: backreferences).

How do you split a string by multiple different delimiters (e.g., space, comma, semicolon) at once?

Write a regex to match a valid hex color code (e.g., '#FFF' or '#AABBCC').

Explain how re.finditer() differs from re.findall(). When should you use it?

Write a regex to check if a string contains only alphanumeric characters, without using str.isalnum().

How do you escape special regex characters (like *, ?, () dynamically if they are stored in a variable? (Hint: re.escape).

Write a regex to match Python single-line comments.

Why is regex generally a bad idea for parsing complex nested structures like HTML or JSON?

📊 Day 18 Executive Summary

✅ Instructor's End-of-Day Checklist
