โณ Loading Python Engine...

📊 Day 18 : Regex

🎯 Enterprise Objective

Regular Expressions (Regex) are a compact language for searching and transforming text. The syntax is notoriously cryptic, but once you know it, a single pattern can do data cleaning work that would otherwise take dozens of lines of manual string manipulation.

📋 Strategic Overview

| # | Topic | Concept |
|---|-------|---------|
| 1 | Basics | `\d`, `\w`, `re.findall` |
| 2 | Quantifiers | `+`, `*`, `?`, `^`, `$` |
| 3 | Groups | `[abc]`, `(capture)` |

1. Regex Basics : Pattern Matching

🔍 What is it?

Regular Expressions (Regex) are a mini-language for matching text patterns. The re module allows you to find, extract, and replace complex strings that standard methods like .replace() cannot handle.

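| Character | Matches | Example |
|-----------|---------|---------|
| `\d` | Any digit (0-9) | `\d\d\d` (matches 123) |
| `\w` | Word character (a-z, A-Z, 0-9, _) | `\w+` (matches 'hello_1') |
| `\s` | Whitespace | `\s+` (matches multiple spaces) |
| `.` | Any character except a newline | `...` (matches any 3 chars) |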
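
The character classes in the table can be tried out with re.findall() and re.sub(). This is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

text = "Order 66 shipped to zone_7 on 2024-01-15"

# \d+ : one or more digits in a row
numbers = re.findall(r"\d+", text)

# \w+ : runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", text)

# \s+ : runs of whitespace, collapsed here to a single space
collapsed = re.sub(r"\s+", " ", "too    many   spaces")

# ... : any three characters in a row (newlines excluded)
three_chars = re.findall(r"...", "abcdefg")

print(numbers)      # ['66', '7', '2024', '01', '15']
print(words)        # ['Order', '66', 'shipped', 'to', 'zone_7', 'on', '2024', '01', '15']
print(collapsed)    # too many spaces
print(three_chars)  # ['abc', 'def']
```

Note that the hyphens in the date are not word characters, so \w+ splits the date into three pieces.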

💼 Why Data Analysts Care

• Data Cleaning: Extracting phone numbers or zip codes from messy text fields

• Validation: Ensuring a user's input strictly matches an email format

⚠️ Raw Strings

Always prefix regex patterns with an r (e.g., r'\d+'). This makes the literal a 'raw string', so Python does not interpret backslashes as its own escape sequences (like \n) before the pattern reaches the regex engine.
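
One place where the missing r prefix silently breaks a pattern is \b, which means "backspace" to Python but "word boundary" to the regex engine. The sketch below uses made-up sample text to show the difference.

```python
import re

tab = "\t"       # normal string: Python turns \t into one tab character
raw_tab = r"\t"  # raw string: the backslash and the 't' stay as two characters
print(len(tab), len(raw_tab))  # 1 2

text = "cat concat cat"

# Without r, Python converts \b to a backspace character, so nothing matches
without_r = re.findall("\bcat\b", text)

# With r, the regex engine receives \b and treats it as a word boundary
with_r = re.findall(r"\bcat\b", text)

print(without_r)  # []
print(with_r)     # ['cat', 'cat']
```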

In [ ]:

🧪 Concept Checks: Regex Basics

Q1. Import re. Use re.findall() with r"\d+" to extract all numbers from "I have 2 apples and 10 bananas".

In [ ]:

Q2. Use re.sub() to replace all numbers in the string above with "X". Print the result.

In [ ]:

Q3. Find all words in "The quick brown fox" using re.findall() and r"\w+".

In [ ]:

Q4. Try matching the literal period in "www.google.com". Why does r"." match every character? How do you escape it so that it matches only a period (r"\.")?

In [ ]:

Q5. Why is the r prefix important in regex strings? What happens to print("\n") vs print(r"\n")?

In [ ]:

2. Quantifiers & Anchors : Advanced Patterns

🔍 What is it?

Quantifiers dictate how many times a character should occur. Anchors lock the match to the start or end of the string.

| Symbol | Meaning | Example |
|--------|---------|---------|
| `*` | 0 or more | `a*` (matches '', 'a', 'aa') |
| `+` | 1 or more | `a+` (matches 'a', 'aa') |
| `?` | 0 or 1 (optional) | `colou?r` (matches 'color', 'colour') |
| `{x,y}` | Between x and y times | `\d{3,4}` (matches 123, 1234) |
| `^` / `$` | Start / End of string | `^Hello` / `world$` |
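
The quantifiers and anchors from the table can be checked one by one. This is a small sketch with illustrative inputs; the \b word boundaries in the {3,4} example are an addition so that longer digit runs are not matched partially.

```python
import re

# ? makes the preceding character optional
optional = re.findall(r"colou?r", "color and colour")

# {3,4} matches three or four digits; \b rejects partial matches inside longer runs
counted = re.findall(r"\b\d{3,4}\b", "12 123 1234 12345")

# + needs at least one 'a', while * also accepts the empty string
plus_on_empty = re.fullmatch(r"a+", "") is not None
star_on_empty = re.fullmatch(r"a*", "") is not None

# ^ anchors to the start of the string, $ to the end
starts = bool(re.search(r"^Hello", "Hello world"))
not_at_start = bool(re.search(r"^Hello", "Say Hello"))
ends = bool(re.search(r"world$", "Hello world"))

print(optional)                      # ['color', 'colour']
print(counted)                       # ['123', '1234']
print(plus_on_empty, star_on_empty)  # False True
print(starts, not_at_start, ends)    # True False True
```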

💼 Why Data Analysts Care

• Log Parsing: ^ERROR ensures we only match lines that start with ERROR, not lines that just contain the word

• Flexible Parsing: https?:// matches both http and https URLs
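
Both use cases can be sketched in a few lines. The log lines and URLs below are invented for illustration, and \S+ (one or more non-whitespace characters) is an assumed, deliberately simple way to capture the rest of each URL.

```python
import re

log_lines = [
    "ERROR disk full",
    "INFO retrying after ERROR",
    "ERROR timeout",
]

# ^ERROR keeps only lines that begin with ERROR
error_lines = [line for line in log_lines if re.search(r"^ERROR", line)]

text = "Docs at http://example.com and https://example.org"

# s? makes the 's' optional, so both schemes match
urls = re.findall(r"https?://\S+", text)

print(error_lines)  # ['ERROR disk full', 'ERROR timeout']
print(urls)         # ['http://example.com', 'https://example.org']
```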

🧠 Pro Tip

By default, * and + are greedy (they match as much as possible). Append a ? to make them lazy: .*? matches as little as possible.
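
Greedy and lazy matching are easiest to compare on quoted text. The sample sentence here is an assumption chosen for the demonstration.

```python
import re

text = 'say "hello" and "goodbye"'

# Greedy: .* runs to the last quote in the string
greedy = re.findall(r'".*"', text)

# Lazy: .*? stops at the first closing quote it can reach
lazy = re.findall(r'".*?"', text)

print(greedy)  # ['"hello" and "goodbye"']
print(lazy)    # ['"hello"', '"goodbye"']
```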

In [ ]:

🧪 Concept Checks: Quantifiers

Q1. Match strings that start with "ID-". Test on ["ID-123", "User ID-456"] using ^ID- with re.search().

In [ ]:

Q2. Match words that end in "ing". Test on "running and jumping" using \w+ing.

In [ ]:

Q3. Use the optional ? to match both "file.txt" and "files.txt".

In [ ]:

Q4. Use the {n} quantifier to match exactly a 5-digit zip code in "My zip is 90210 or 1234".

In [ ]:

Q5. Demonstrate greedy vs lazy: run re.findall(r"<.*>", s) and then re.findall(r"<.*?>", s) on a string s that contains tags (for example, s = "<b>text</b>").

In [ ]:

3. Groups & Sets : Extracting Structure

🔍 What is it?

Character Sets `[]` allow you to match any ONE of the characters inside. Capture Groups `()` allow you to extract specific parts of a pattern.

| Syntax | Meaning | Example |
|--------|---------|---------|
| `[abc]` | 'a' or 'b' or 'c' | `b[aeiou]t` (bat, bit, but) |
| `[A-Z]` | Range (capital letters) | `[A-Za-z0-9]` (Alphanumeric) |
| `[^a]` | Negation (NOT 'a') | `[^0-9]` (Non-digits) |
| `(abc)` | Capture group | `(\d{3})-(\d{4})` (Extracts parts) |
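
The rows of the table translate into code as follows. This is a minimal sketch; the input strings and variable names are illustrative only.

```python
import re

# [aeiou] matches exactly one vowel between 'b' and 't'
vowel_words = re.findall(r"b[aeiou]t", "bat bit but bet bt boot")

# [^0-9] matches any single character that is not a digit
digits_only = re.sub(r"[^0-9]", "", "Tel: 555-0199")

# Parentheses capture parts of the match for later access
match = re.search(r"(\d{3})-(\d{4})", "Call 555-0199 today")
whole = match.group(0)   # the full match
first = match.group(1)   # first capture group
second = match.group(2)  # second capture group

print(vowel_words)           # ['bat', 'bit', 'but', 'bet']
print(digits_only)           # 5550199
print(whole, first, second)  # 555-0199 555 0199
```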

💼 Why Data Analysts Care

• ETL Pipelines: Extracting the Year, Month, and Day from a messy date string into separate variables

• Data Scrubbing: Removing all punctuation using re.sub(r'[^A-Za-z0-9\s]', '', text)
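
As a rough sketch of both bullets, the code below assumes the date appears in YYYY-MM-DD form inside a made-up invoice string. Real data would probably need a more tolerant pattern.

```python
import re

raw = "Invoice dated 2024-03-07 (paid!)"

# Three capture groups pull the date apart into separate variables
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", raw)
year, month, day = match.groups()

# Keep only letters, digits, and whitespace; everything else is removed
clean = re.sub(r"[^A-Za-z0-9\s]", "", raw)

print(year, month, day)  # 2024 03 07
print(clean)             # Invoice dated 20240307 paid
```

The scrubbing pattern also removes the hyphens inside the date, so extract structured fields before stripping punctuation.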

🧠 Pro Tip

When the pattern contains capture groups (), re.findall() returns the captured components instead of the full matches: a list of strings for one group, and a list of tuples for two or more groups.
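
The return type of re.findall() depends on how many capture groups the pattern has. The key=value sample string below is an assumption used only to show the three cases.

```python
import re

text = "alice=30, bob=25"

# No groups: a list of full matches
no_groups = re.findall(r"\w+=\d+", text)

# One group: a list of strings holding just that group
one_group = re.findall(r"\w+=(\d+)", text)

# Two or more groups: a list of tuples, one tuple per match
two_groups = re.findall(r"(\w+)=(\d+)", text)

print(no_groups)   # ['alice=30', 'bob=25']
print(one_group)   # ['30', '25']
print(two_groups)  # [('alice', '30'), ('bob', '25')]
```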

In [ ]:

🧪 Concept Checks: Groups

Q1. Match any vowel in "hello world" using re.findall() and a character set [aeiou].

In [ ]:

Q2. Match any consonant in "hello world" using the negation set [^aeiou\s].

In [ ]:

Q3. Extract the domain names from "user1@gmail.com" and "admin@yahoo.com" using capture groups r"@(\w+\.\w+)".

In [ ]:

Q4. Use re.findall() with r"(\d+)\s(USD|EUR)" to extract amount and currency from "100 USD and 50 EUR".

In [ ]:

Q5. Extract only words starting with a capital letter from "The Quick brown Fox" using [A-Z][a-z]+.

In [ ]:

🛠️ Professional Practice Tasks

Theory is useless without muscle memory. Complete these tasks to solidify your understanding.

Task 1 (Email Extractor): Given a long messy string, write a regex to extract all valid email addresses. (Hint: [a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+).

In [ ]:

Task 2 (Phone Normalizer): Given ["(555) 123-4567", "555-123-4567", "5551234567"], use re.sub() and capture groups to normalize them all to "555-123-4567" format.

In [ ]:

💻 Pure Coding Interview Questions

Q1.

What is the difference between re.match(), re.search(), and re.findall()?

In [ ]:

Q2.

Explain the difference between greedy and lazy matching. How do you make a quantifier lazy?

In [ ]:

Q3.

What is a raw string r"" in Python, and why is it almost mandatory for regex?

In [ ]:

Q4.

Write a regex to validate an IPv4 address (e.g., '192.168.1.1').

In [ ]:

Q5.

Explain what \b (word boundary) does. Give an example where it is necessary.

In [ ]:

Q6.

Write a regex to extract all hashtags from a tweet.

In [ ]:

Q7.

How do you compile a regex pattern using re.compile()? Why is this good for performance?

In [ ]:

Q8.

What is the re.IGNORECASE flag and how do you use it?

In [ ]:

Q9.

Write a regex to match a valid 24-hour time format (e.g., '23:59', but not '25:99').

In [ ]:

Q10.

Explain capture groups. How do you refer to a capture group in a re.sub() replacement string? (Hint: \1).

In [ ]:

Q11.

What is a non-capturing group (?:...) and why would you use it?

In [ ]:

Q12.

Explain positive lookahead (?=...) and negative lookahead (?!...).

In [ ]:

Q13.

Write a regex using negative lookahead to match 'foo' only if it is NOT followed by 'bar'.

In [ ]:

Q14.

Write a regex to extract the query parameters from a URL.

In [ ]:

Q15.

How do you match a string that contains exactly 5 letters, no more, no less? (Hint: anchors).

In [ ]:

Q16.

Write a regex to parse a CSV line, considering that some fields might be enclosed in quotes.

In [ ]:

Q17.

Explain the re.MULTILINE flag. How does it change the behavior of ^ and $?

In [ ]:

Q18.

Write a regex to find all duplicate words in a sentence (e.g., 'This is is a test'). (Hint: backreferences).

In [ ]:

Q19.

How do you split a string by multiple different delimiters (e.g., space, comma, semicolon) at once?

In [ ]:

Q20.

Write a regex to match a valid hex color code (e.g., '#FFF' or '#AABBCC').

In [ ]:

Q21.

Explain how re.finditer() differs from re.findall(). When should you use it?

In [ ]:

Q22.

Write a regex to check if a string contains only alphanumeric characters, without using str.isalnum().

In [ ]:

Q23.

How do you escape special regex characters (such as *, ?, or parentheses) dynamically if they are stored in a variable? (Hint: re.escape).

In [ ]:

Q24.

Write a regex to match Python single-line comments.

In [ ]:

Q25.

Why is regex generally a bad idea for parsing complex nested structures like HTML or JSON?

In [ ]: