Regex Beginner Tutorial: Master Regular Expressions from Scratch

What Are Regular Expressions

Regular expressions — commonly called regex or regexp — are patterns used to match character combinations in strings. Think of them as a powerful search language: instead of searching for an exact word, you describe a pattern of characters, and the regex engine finds all strings that fit that pattern.

Regex is deeply embedded in nearly every programming language and developer tool. You will find it in JavaScript, Python, Java, C#, PHP, Ruby, Go, shell scripts with grep and sed, text editors like VS Code and Sublime Text, databases, and even spreadsheet formulas. Once you learn regex, you carry a superpower that works everywhere.

The syntax might look intimidating at first — a typical regex like ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$ can seem like line noise. But regex is built from a small set of building blocks that fit together logically. This tutorial will walk you through each one.

Regex Fundamentals: The Building Blocks

Every regex is composed from three categories of tokens. Once you understand these, you can read and write any pattern.

1. Literal Characters

These are the simplest building blocks: they match exactly themselves. The regex cat matches the literal substring "cat" — nothing more, nothing less. It will match inside "catalog", "scatter", and "wildcat". Case matters: cat does not match "Cat" unless you enable case-insensitive mode (the i flag).
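To see literal matching in action, here is a minimal Python sketch. The word list is made up for illustration, and re.IGNORECASE is Python's spelling of the i flag.

```python
import re

words = ["catalog", "scatter", "wildcat", "Cat"]

# Case-sensitive: the literal pattern "cat" may match anywhere inside a word.
case_sensitive = [w for w in words if re.search(r"cat", w)]

# Case-insensitive: re.IGNORECASE also lets "cat" match "Cat".
case_insensitive = [w for w in words if re.search(r"cat", w, re.IGNORECASE)]

print(case_sensitive)    # ['catalog', 'scatter', 'wildcat']
print(case_insensitive)  # ['catalog', 'scatter', 'wildcat', 'Cat']
```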

2. Metacharacters

Metacharacters have special meaning in regex and do not match themselves unless escaped. The core metacharacters you need to know are:

Metacharacter	Name	Matches
`.`	Dot	Any single character except newline
`^`	Caret	Start of string (or start of line in multiline mode)
`$`	Dollar	End of string (or end of line in multiline mode)
`*`	Star	Zero or more of the preceding token
`+`	Plus	One or more of the preceding token
`?`	Question mark	Zero or one of the preceding token (makes it optional)
`{n,m}`	Curly braces	Between n and m occurrences of the preceding token
`|`	Pipe	Alternation (logical OR): matches left or right side
`( )`	Parentheses	Grouping and capturing
`[ ]`	Square brackets	Character class — matches any one character inside
`\`	Backslash	Escapes a metacharacter to match it literally

If you need to match a literal metacharacter, escape it with a backslash. For example, \. matches a literal period, and \+ matches a literal plus sign.
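The difference between an escaped and an unescaped dot is easy to check in Python. This is a small sketch with an invented sample string. It also uses re.escape, a standard-library helper that escapes metacharacters in arbitrary text for you.

```python
import re

text = "Total: 3.50 (was 3x50)"

# Unescaped dot: matches any character, so "3x50" matches too.
unescaped = re.findall(r"3.50", text)

# Escaped dot: matches only a literal period.
escaped = re.findall(r"3\.50", text)

print(unescaped)  # ['3.50', '3x50']
print(escaped)    # ['3.50']

# re.escape() builds a pattern that matches the given text literally.
literal_pattern = re.escape("1+1=2")
found = re.search(literal_pattern, "We know 1+1=2.") is not None
print(found)      # True
```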

3. Shorthand Character Classes

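Shorthand classes save typing for common character groupings. The bracket forms below are the ASCII definitions; some engines, such as Python 3 on text strings, also match Unicode digits, letters, and whitespace by default: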

\d  →  [0-9]                Any digit
\D  →  [^0-9]               Any non-digit
\w  →  [a-zA-Z0-9_]         Word character (letters, digits, underscore)
\W  →  [^a-zA-Z0-9_]        Non-word character
\s  →  [ \t\n\r\f\v]        Whitespace (space, tab, newline, etc.)
\S  →  [^ \t\n\r\f\v]       Non-whitespace

These are the backbone of real-world patterns. For example, \d{3}-\d{3}-\d{4} matches a US phone number format like "555-123-4567".
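Here is a short Python sketch of the shorthand classes at work. The sample strings are invented. The last two lines illustrate that Python 3 treats \w as Unicode-aware on text strings unless you pass re.ASCII.

```python
import re

text = "Call 555-123-4567 or 555-987-6543 before 5pm."

# \d{3}-\d{3}-\d{4}: three digits, hyphen, three digits, hyphen, four digits.
numbers = re.findall(r"\d{3}-\d{3}-\d{4}", text)
print(numbers)  # ['555-123-4567', '555-987-6543']

# \s+ matches any run of whitespace: spaces, tabs, newlines.
fields = re.split(r"\s+", "alpha  beta\tgamma\ndelta")
print(fields)   # ['alpha', 'beta', 'gamma', 'delta']

# "caf\u00e9" is "cafe" with an accented final letter.
unicode_word = re.fullmatch(r"\w+", "caf\u00e9") is not None
ascii_word = re.fullmatch(r"\w+", "caf\u00e9", re.ASCII) is not None
print(unicode_word, ascii_word)  # True False
```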

Practical Regex Patterns You Can Use Today

Learning regex theory is important, but nothing beats seeing how patterns solve real problems. Here are widely used regex patterns for common validation tasks, with explanations of how each one works. Each is a pragmatic approximation rather than a complete validator.

Email Validation

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

Breakdown:

^ — Start of the string.
[a-zA-Z0-9._%+-]+ — The local part (before the @): one or more letters, digits, dots, underscores, percent signs, plus signs, or hyphens.
@ — The literal at sign.
[a-zA-Z0-9.-]+ — The domain name: one or more letters, digits, dots, or hyphens.
\. — A literal dot separating the domain from the TLD.
[a-zA-Z]{2,} — The top-level domain: two or more letters (e.g., com, org, io, online).
$ — End of the string.
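The breakdown above can be exercised with a few sample addresses. This is a sketch, not a complete email validator: the helper name is_plausible_email and the test addresses are assumptions made for illustration. The last check shows a known gap, since the domain class allows consecutive dots.

```python
import re

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def is_plausible_email(value: str) -> bool:
    """Return True if value has the shape local@domain.tld."""
    # fullmatch requires the entire string to match the pattern.
    return EMAIL_RE.fullmatch(value) is not None

print(is_plausible_email("user@example.com"))                # True
print(is_plausible_email("first.last+tag@mail.example.org")) # True
print(is_plausible_email("user@localhost"))                  # False: no dot + TLD
print(is_plausible_email("user@example.c"))                  # False: TLD too short
print(is_plausible_email("user@example..com"))               # True: pattern is permissive
```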

URL Matching

^https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$

Breakdown:

^https?:\/\/ — Matches http:// or https://. The s? makes the "s" optional.
(www\.)? — Optionally matches www. at the start of the domain.
[-a-zA-Z0-9@:%._\+~#=]{1,256} — The domain name: 1 to 256 allowed characters.
\.[a-zA-Z0-9()]{1,6} — A dot followed by the TLD (1-6 characters).
\b — A word boundary to ensure the TLD ends cleanly.
([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$ — The optional path, query string, and fragment.
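As a quick check of the URL pattern, the sketch below runs it against a handful of made-up URLs in Python. The helper name looks_like_url is an assumption. The pattern checks shape only and does not confirm that a host exists.

```python
import re

URL_RE = re.compile(
    r"^https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}"
    r"\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$"
)

def looks_like_url(value: str) -> bool:
    """Return True if value is shaped like an http(s) URL with a dotted host."""
    return URL_RE.match(value) is not None

print(looks_like_url("https://www.example.com/path?x=1&y=2#frag"))  # True
print(looks_like_url("http://example.org"))                         # True
print(looks_like_url("ftp://example.com"))                          # False: wrong scheme
print(looks_like_url("https://example"))                            # False: no dot + TLD
print(looks_like_url("example.com"))                                # False: no scheme
```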

Phone Number Matching (International)

^\+?[\d\s\-().]{7,15}$

Breakdown:

^\+? — Optional leading plus sign for international prefix.
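[\d\s\-().]{7,15} — Between 7 and 15 characters consisting of digits, whitespace, hyphens, parentheses, or dots. This covers formats like 555.123.4567 and +44-20-7946-0958. A format such as +1 (555) 123-4567 has 16 characters after the plus sign, so it needs a higher upper bound (for example {7,20}).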
$ — End of string.
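Because this pattern counts characters rather than digits, it is worth testing against the formats you expect. The Python sketch below uses an assumed helper name, is_phone_like, and a few sample inputs. It shows the length limit in action and one false positive: a string of seven hyphens passes because no digit is required.

```python
import re

PHONE_RE = re.compile(r"^\+?[\d\s\-().]{7,15}$")

def is_phone_like(value: str) -> bool:
    """Return True if value fits the loose phone-number shape."""
    return PHONE_RE.fullmatch(value) is not None

print(is_phone_like("555.123.4567"))       # True  (12 characters)
print(is_phone_like("+44-20-7946-0958"))   # True  (15 characters after "+")
print(is_phone_like("+1 (555) 123-4567"))  # False (16 characters after "+")
print(is_phone_like("12345"))              # False (too short)
print(is_phone_like("-------"))            # True  (no digit is required)
```

If you need stricter results, one option is to strip the punctuation first and then count the remaining digits.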

Strong Password Validation

^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*()_+\-=\[\]{};':\"\\|,.<>\/?]).{8,}$

Breakdown:

^ — Start of string.
(?=.*[a-z]) — Positive lookahead: at least one lowercase letter exists somewhere ahead.
(?=.*[A-Z]) — Positive lookahead: at least one uppercase letter.
(?=.*\d) — Positive lookahead: at least one digit.
(?=.*[!@#$%^&*()_+\-=\[\]{};':\"\\|,.<>\/?]) — Positive lookahead: at least one special character.
.{8,} — Total length of 8 or more characters.
$ — End of string.
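Each lookahead acts as an independent requirement, which is easiest to see by failing them one at a time. Below is a Python sketch with an assumed helper name, is_strong_password, and invented sample passwords. The pattern sits in a triple-quoted raw string because it contains both quote characters.

```python
import re

PASSWORD_RE = re.compile(
    r"""^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*()_+\-=\[\]{};':\"\\|,.<>\/?]).{8,}$"""
)

def is_strong_password(value: str) -> bool:
    """Return True if value satisfies all four lookaheads and the length rule."""
    return PASSWORD_RE.fullmatch(value) is not None

print(is_strong_password("Passw0rd!"))   # True
print(is_strong_password("password1!"))  # False: no uppercase letter
print(is_strong_password("PASSWORD1!"))  # False: no lowercase letter
print(is_strong_password("Password!"))   # False: no digit
print(is_strong_password("Password1"))   # False: no special character
print(is_strong_password("Pw0!"))        # False: shorter than 8 characters
```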

Date Format Extraction (YYYY-MM-DD)

\b\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\b

Breakdown:

\b — Word boundary.
\d{4} — Exactly four digits for the year.
- — Literal hyphen.
(0[1-9]|1[0-2]) — Month: 01-09 or 10-12.
- — Literal hyphen.
(0[1-9]|[12]\d|3[01]) — Day: 01-09, 10-29, or 30-31.
\b — Word boundary.
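The pattern checks the shape of a date, not the calendar: it accepts 2024-02-31 because 31 is a valid day number in general. The Python sketch below uses invented sample text. It extracts candidates with finditer (findall would return only the two captured groups) and then filters them with the standard library's date.fromisoformat.

```python
import re
from datetime import date

DATE_RE = re.compile(r"\b\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\b")

text = "Released 2024-01-15, patched 2024-02-31, retired 2024-13-01."

# group(0) is the full match; groups 1 and 2 hold the month and day.
candidates = [m.group(0) for m in DATE_RE.finditer(text)]

def is_real_date(value: str) -> bool:
    """Return True if value is a date that exists on the calendar."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False

valid = [c for c in candidates if is_real_date(c)]

print(candidates)  # ['2024-01-15', '2024-02-31']
print(valid)       # ['2024-01-15']
```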

Beyond the Basics: Advanced Techniques

Once you are comfortable with fundamental patterns, these techniques unlock significantly more expressive power.

Lookaheads and Lookbehinds

Lookaround assertions let you check for a pattern without consuming characters — the regex engine peeks ahead or behind without moving its current position in the string.

# Positive lookahead: match "q" only if followed by "u"
q(?=u)  →  matches "q" in "queen", not "q" in "Iraq"

# Negative lookahead: match "q" only if NOT followed by "u"
q(?!u)  →  matches "q" in "Iraq", not "q" in "queen"

# Positive lookbehind: match digits only if preceded by "$"
(?<=\$)\d+  →  matches "100" in "$100", not "100" in "abc100"

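# Negative lookbehind: match digits NOT preceded by "$"
(?<!\$)\d+  →  matches "100" in "abc100"; in "$100" it skips "100" but still matches "00"
# To reject the whole number, also exclude a preceding digit: (?<![$\d])\d+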
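These four assertions are easy to verify in Python. The sketch below uses invented sample strings and also exposes a subtlety of the negative lookbehind. In "$100" the engine cannot start at "1", but it can start at the first "0", which is preceded by "1" rather than "$". Adding \d to the lookbehind closes that gap.

```python
import re

# Positive lookahead: "q" matches only when "u" follows; "u" is not consumed.
m = re.search(r"q(?=u)", "queen")
print(m.group(), m.span())                  # q (0, 1)
print(re.search(r"q(?=u)", "Iraq"))         # None

# Negative lookahead: the final "q" in "Iraq" has no "u" after it.
print(re.search(r"q(?!u)", "Iraq").span())  # (3, 4)

prices = "$100 and abc200"

# Positive lookbehind: digits directly after "$".
after_dollar = re.findall(r"(?<=\$)\d+", prices)
print(after_dollar)       # ['100']

# Negative lookbehind alone still finds "00" inside "$100".
not_after_dollar = re.findall(r"(?<!\$)\d+", prices)
print(not_after_dollar)   # ['00', '200']

# Also forbidding a preceding digit rejects the whole "$100" number.
strict = re.findall(r"(?<![$\d])\d+", prices)
print(strict)             # ['200']
```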

Non-Capturing Groups

By default, parentheses create capturing groups that store matched text for later use (accessible via $1, \1, or match.groups()). When you only need grouping for structural purposes — such as alternation — use (?: ) to avoid the overhead of capturing:

# Capturing group: stores "dog" or "cat" as $1
^(dog|cat) food$

# Non-capturing group: groups without storing
^(?:dog|cat) food$
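In Python the difference shows up in match.groups() and, more visibly, in re.findall, which returns the captured group instead of the full match whenever a pattern contains one. This is a minimal sketch with made-up input.

```python
import re

capturing = re.match(r"^(dog|cat) food$", "cat food")
non_capturing = re.match(r"^(?:dog|cat) food$", "cat food")

print(capturing.groups())      # ('cat',)
print(non_capturing.groups())  # ()

text = "dog food, cat food"

# With a capturing group, findall returns only what the group captured.
with_capture = re.findall(r"(dog|cat) food", text)
# With a non-capturing group, findall returns the full matches.
without_capture = re.findall(r"(?:dog|cat) food", text)

print(with_capture)     # ['dog', 'cat']
print(without_capture)  # ['dog food', 'cat food']
```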

Greedy vs Lazy Quantifiers

By default, the quantifiers *, +, ?, and {n,m} are greedy: they match as many characters as possible. Appending ? makes them lazy (*?, +?, ??, {n,m}?), matching as few characters as possible. The distinction matters most when a delimiter appears more than once in the string:

# Greedy: matches from first <p> to last </p>
<p>.*<\/p>

# Lazy: matches each <p>...</p> pair individually
<p>.*?<\/p>

# Input: "<p>First</p><p>Second</p>"
# Greedy matches: "<p>First</p><p>Second</p>" (one match)
# Lazy matches:   "<p>First</p>" and "<p>Second</p>" (two matches)
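The same comparison can be run in Python, as sketched below. The forward slash needs no escaping in Python, so the patterns are written without the backslash. The last line adds a capturing group to pull out only the text between the tags.

```python
import re

html = "<p>First</p><p>Second</p>"

# Greedy: .* runs to the end, then backtracks to the last </p>.
greedy = re.findall(r"<p>.*</p>", html)

# Lazy: .*? stops at the first </p> it can reach.
lazy = re.findall(r"<p>.*?</p>", html)

# Lazy with a capturing group: only the inner text is returned.
inner = re.findall(r"<p>(.*?)</p>", html)

print(greedy)  # ['<p>First</p><p>Second</p>']
print(lazy)    # ['<p>First</p>', '<p>Second</p>']
print(inner)   # ['First', 'Second']
```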

Regex in Your Favorite Languages

Regex syntax is largely portable, but each language has its own API for applying patterns. Here is how to use the email pattern from this tutorial in JavaScript, Python, Java, and on the command line with grep:

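// JavaScript
const emailRegex = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;
console.log(emailRegex.test("user@example.com"));  // true
// The anchored pattern only matches a whole string. To find an address
// inside longer text, drop ^ and $ and add the g flag.
const emailInText = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;
console.log("Contact: user@example.com".match(emailInText));  // ["user@example.com"]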

# Python
import re
email_regex = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
print(bool(re.match(email_regex, "user@example.com")))  # True
# Extract all emails from text (unanchored, since ^ and $ only match a whole string):
find_emails = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
text = "Email alice@example.com and bob@example.org"
print(re.findall(find_emails, text))  # ['alice@example.com', 'bob@example.org']

// Java
import java.util.regex.*;
Pattern emailRegex = Pattern.compile(
    "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
);
Matcher m = emailRegex.matcher("user@example.com");
System.out.println(m.matches());  // true

# Shell (grep -E)
echo "user@example.com" | grep -E '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

Conclusion

Regular expressions are one of the most valuable skills a developer can learn. A handful of constructs (character classes, quantifiers, anchors, and groups) cover the vast majority of real-world use cases. Start with simple patterns and build complexity incrementally, testing each step as you go. The patterns in this tutorial are practical starting points for form validation, log parsing, data extraction, and search-and-replace workflows, and you can tighten or loosen them to fit your data.
