
Regex Beginner Tutorial: Master Regular Expressions from Scratch

UseEasyTool Team Developer Tools Experts
March 15, 2024 8 min read

What Are Regular Expressions

Regular expressions — commonly called regex or regexp — are patterns used to match character combinations in strings. Think of them as a powerful search language: instead of searching for an exact word, you describe a pattern of characters, and the regex engine finds all strings that fit that pattern.

Regex is deeply embedded in nearly every programming language and developer tool. You will find it in JavaScript, Python, Java, C#, PHP, Ruby, Go, shell scripts with grep and sed, text editors like VS Code and Sublime Text, databases, and even spreadsheet formulas. Once you learn regex, you carry a superpower that works everywhere.

The syntax might look intimidating at first — a typical regex like ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$ can seem like line noise. But regex is built from a small set of building blocks that fit together logically. This tutorial will walk you through each one.

Regex Fundamentals: The Building Blocks

Every regex is composed from three categories of tokens. Once you understand these, you can read and write any pattern.

1. Literal Characters

These are the simplest building blocks: they match exactly themselves. The regex cat matches the literal substring "cat" — nothing more, nothing less. It will match inside "catalog", "scatter", and "wildcat". Case matters: cat does not match "Cat" unless you enable case-insensitive mode (the i flag).
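As a minimal sketch of the behavior described above, the following Python snippet searches a few sample words for the literal pattern cat, with and without case-insensitive mode. The word list is made up for illustration; re.IGNORECASE is Python's spelling of the i flag.

```python
import re

words = ["catalog", "scatter", "wildcat", "Cat", "dog"]

# re.search scans the whole string for the literal substring "cat".
case_sensitive = [w for w in words if re.search(r"cat", w)]

# With re.IGNORECASE, "Cat" matches as well.
case_insensitive = [w for w in words if re.search(r"cat", w, re.IGNORECASE)]

print(case_sensitive)    # ['catalog', 'scatter', 'wildcat']
print(case_insensitive)  # ['catalog', 'scatter', 'wildcat', 'Cat']
```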

2. Metacharacters

Metacharacters have special meaning in regex and do not match themselves unless escaped. The core metacharacters you need to know are:

Metacharacter   Name              Matches
.               Dot               Any single character except newline
^               Caret             Start of string (or start of line in multiline mode)
$               Dollar            End of string (or end of line in multiline mode)
*               Star              Zero or more of the preceding token
+               Plus              One or more of the preceding token
?               Question mark     Zero or one of the preceding token (makes it optional)
{n,m}           Curly braces      Between n and m occurrences of the preceding token
|               Pipe              Alternation (logical OR): matches left or right side
( )             Parentheses       Grouping and capturing
[ ]             Square brackets   Character class: matches any one character inside
\               Backslash         Escapes a metacharacter to match it literally

If you need to match a literal metacharacter, escape it with a backslash. For example, \. matches a literal period, and \+ matches a literal plus sign.
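To illustrate why escaping matters, here is a short Python sketch comparing an unescaped dot with an escaped one. The sample text is invented for the example; re.escape is the standard-library helper that escapes a literal string for you.

```python
import re

text = "Price: 3.50 (was 3x50)"

# An unescaped dot matches any character, so both "3.50" and "3x50" match.
unescaped = re.findall(r"3.50", text)

# An escaped dot matches only a literal period.
escaped = re.findall(r"3\.50", text)

# re.escape builds the escaped form of an arbitrary literal string.
literal_plus = re.escape("1+1")

print(unescaped)     # ['3.50', '3x50']
print(escaped)       # ['3.50']
print(literal_plus)  # 1\+1
```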

3. Shorthand Character Classes

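Shorthand classes save typing for common character groupings. The bracket equivalents shown are the ASCII definitions; some engines, such as Python 3 with str patterns, also match Unicode digits, letters, and whitespace by default: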

\d  →  [0-9]                Any digit
\D  →  [^0-9]               Any non-digit
\w  →  [a-zA-Z0-9_]         Word character (letters, digits, underscore)
\W  →  [^a-zA-Z0-9_]        Non-word character
\s  →  [ \t\n\r\f\v]        Whitespace (space, tab, newline, etc.)
\S  →  [^ \t\n\r\f\v]       Non-whitespace

These are the backbone of real-world patterns. For example, \d{3}-\d{3}-\d{4} matches a US phone number format like "555-123-4567".
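The shorthand classes can be tried out with a few lines of Python. This is only a sketch; the sample strings are made up for illustration.

```python
import re

text = "Call 555-123-4567 or 555-987-6543 before 5pm."

# \d with quantifiers: the US phone format from the text above.
phones = re.findall(r"\d{3}-\d{3}-\d{4}", text)

# \w+ grabs runs of letters, digits, and underscores.
words = re.findall(r"\w+", "snake_case and CamelCase2")

# \s+ splits on any run of whitespace (spaces, tabs, newlines).
parts = re.split(r"\s+", "tabs\tand   spaces\nand newlines")

print(phones)  # ['555-123-4567', '555-987-6543']
print(words)   # ['snake_case', 'and', 'CamelCase2']
print(parts)   # ['tabs', 'and', 'spaces', 'and', 'newlines']
```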

Practical Regex Patterns You Can Use Today

Learning regex theory is important, but nothing beats seeing how patterns solve real problems. Here are battle-tested regex patterns for common validation tasks, with explanations of how each one works.

Email Validation

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

Breakdown:

  • ^ — Start of the string.
  • [a-zA-Z0-9._%+-]+ — The local part (before the @): one or more letters, digits, dots, underscores, percent signs, plus signs, or hyphens.
  • @ — The literal at sign.
  • [a-zA-Z0-9.-]+ — The domain name: one or more letters, digits, dots, or hyphens.
  • \. — A literal dot separating the domain from the TLD.
  • [a-zA-Z]{2,} — The top-level domain: two or more letters (e.g., com, org, io, online).
  • $ — End of the string.
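One way to apply this pattern is sketched below in Python. The helper name is_valid_email and the sample addresses are assumptions made for this example. Note that the pattern is a practical approximation: it rejects obviously malformed input but still accepts some questionable addresses, such as ones with consecutive dots.

```python
import re

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def is_valid_email(address: str) -> bool:
    """Hypothetical helper: True if the whole string fits the email pattern."""
    return EMAIL_RE.fullmatch(address) is not None

print(is_valid_email("user@example.com"))                 # True
print(is_valid_email("first.last+tag@mail.example.org"))  # True
print(is_valid_email("no-at-sign.example.com"))           # False
print(is_valid_email("user@localhost"))                   # False (no dot before a TLD)
print(is_valid_email("user@example.c"))                   # False (TLD needs 2+ letters)
print(is_valid_email("a..b@example.com"))                 # True  (a limitation of the pattern)
```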

URL Matching

^https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$

Breakdown:

  • ^https?:\/\/ — Matches http:// or https://. The s? makes the "s" optional.
  • (www\.)? — Optionally matches www. at the start of the domain.
  • [-a-zA-Z0-9@:%._\+~#=]{1,256} — The domain name: 1 to 256 allowed characters.
  • \.[a-zA-Z0-9()]{1,6} — A dot followed by the TLD (1-6 characters).
  • \b — A word boundary to ensure the TLD ends cleanly.
  • ([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$ — The optional path, query string, and fragment.
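The following Python sketch runs the URL pattern against a few sample inputs. The helper name looks_like_url and the URLs are assumptions for illustration. The last case shows a side effect of the 1-6 character TLD limit: longer TLDs are rejected.

```python
import re

URL_RE = re.compile(
    r"^https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}"
    r"\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$"
)

def looks_like_url(candidate: str) -> bool:
    """Hypothetical helper: True if the string fits the URL pattern."""
    return URL_RE.match(candidate) is not None

print(looks_like_url("https://www.example.com/path?query=1#frag"))  # True
print(looks_like_url("http://example.org"))                         # True
print(looks_like_url("ftp://example.com"))                          # False (scheme)
print(looks_like_url("https://example"))                            # False (no TLD)
print(looks_like_url("https://example.com/a b"))                    # False (space)
print(looks_like_url("https://example.technology"))                 # False (TLD > 6 chars)
```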

Phone Number Matching (International)

^\+?[\d\s\-().]{7,15}$

Breakdown:

  • ^\+? — Optional leading plus sign for international prefix.
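  • [\d\s\-().]{7,15} — Between 7 and 15 characters consisting of digits, spaces, hyphens, parentheses, or dots. The count includes separators, not just digits. This covers formats like (555) 123-4567, 555.123.4567, and +44-20-7946-0958, but a heavily punctuated number such as +1 (555) 123-4567 has 16 characters after the plus sign and is rejected.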
  • $ — End of string.
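A quick Python check of the phone pattern is sketched below; the helper name looks_like_phone and the sample inputs are assumptions for this example. Because the character class only restricts which characters appear, the pattern cannot guarantee that any digits are present, as the last case shows.

```python
import re

PHONE_RE = re.compile(r"^\+?[\d\s\-().]{7,15}$")

def looks_like_phone(candidate: str) -> bool:
    """Hypothetical helper: True if the whole string fits the phone pattern."""
    return PHONE_RE.fullmatch(candidate) is not None

print(looks_like_phone("555.123.4567"))      # True
print(looks_like_phone("+44-20-7946-0958"))  # True
print(looks_like_phone("(555) 123-4567"))    # True
print(looks_like_phone("12345"))             # False (too short)
print(looks_like_phone("555-CALL-NOW"))      # False (letters not allowed)
print(looks_like_phone("......."))           # True  (limitation: no digits required)
```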

Strong Password Validation

^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*()_+\-=\[\]{};':\"\\|,.<>\/?]).{8,}$

Breakdown:

  • ^ — Start of string.
  • (?=.*[a-z]) — Positive lookahead: at least one lowercase letter exists somewhere ahead.
  • (?=.*[A-Z]) — Positive lookahead: at least one uppercase letter.
  • (?=.*\d) — Positive lookahead: at least one digit.
  • (?=.*[!@#$%^&*()_+\-=\[\]{};':\"\\|,.<>\/?]) — Positive lookahead: at least one special character.
  • .{8,} — Total length of 8 or more characters.
  • $ — End of string.
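The stacked lookaheads can be exercised with the Python sketch below. The helper name is_strong_password and the sample passwords are made up for illustration. A triple-quoted raw string is used only because the pattern contains both quote characters.

```python
import re

PASSWORD_RE = re.compile(
    r'''^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*()_+\-=\[\]{};':\"\\|,.<>\/?]).{8,}$'''
)

def is_strong_password(candidate: str) -> bool:
    """Hypothetical helper: True if every lookahead and the length check pass."""
    return PASSWORD_RE.fullmatch(candidate) is not None

print(is_strong_password("Passw0rd!"))     # True
print(is_strong_password("password"))      # False (no uppercase, digit, or special)
print(is_strong_password("Sh0rt!"))        # False (fewer than 8 characters)
print(is_strong_password("NoDigits!!"))    # False (no digit)
print(is_strong_password("NoSpecial123"))  # False (no special character)
```

Each lookahead starts from the beginning of the string, so the order of the required characters in the password does not matter.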

Date Format Extraction (YYYY-MM-DD)

\b\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\b

Breakdown:

  • \b — Word boundary.
  • \d{4} — Exactly four digits for the year.
  • - — Literal hyphen.
  • (0[1-9]|1[0-2]) — Month: 01-09 or 10-12.
  • - — Literal hyphen.
  • (0[1-9]|[12]\d|3[01]) — Day: 01-09, 10-29, or 30-31.
  • \b — Word boundary.
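Here is a minimal Python sketch of extracting dates with this pattern. The log line and the helper is_real_date are assumptions for the example. The pattern checks the shape of the date only, so it accepts impossible dates such as February 30; a follow-up check with the standard datetime module is one way to filter those out.

```python
import re
from datetime import date

DATE_RE = re.compile(r"\b\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\b")
log = "Deployed 2024-03-15, rolled back 2024-13-01, retried 2024-02-30."

# findall returns only the captured (month, day) groups,
# so use finditer and group(0) to get the full matched date.
dates = [m.group(0) for m in DATE_RE.finditer(log)]
pairs = DATE_RE.findall(log)

def is_real_date(text: str) -> bool:
    """Hypothetical helper: True if the calendar date actually exists."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False

real_dates = [d for d in dates if is_real_date(d)]

print(dates)       # ['2024-03-15', '2024-02-30']
print(pairs)       # [('03', '15'), ('02', '30')]
print(real_dates)  # ['2024-03-15']
```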

Beyond the Basics: Advanced Techniques

Once you are comfortable with fundamental patterns, these techniques unlock significantly more expressive power.

Lookaheads and Lookbehinds

Lookaround assertions let you check for a pattern without consuming characters — the regex engine peeks ahead or behind without moving its current position in the string.

# Positive lookahead: match "q" only if followed by "u"
q(?=u)  →  matches "q" in "queen", not "q" in "Iraq"

# Negative lookahead: match "q" only if NOT followed by "u"
q(?!u)  →  matches "q" in "Iraq", not "q" in "queen"

# Positive lookbehind: match digits only if preceded by "$"
(?<=\$)\d+  →  matches "100" in "$100", not "100" in "abc100"

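# Negative lookbehind: match digits NOT preceded by "$"
(?<!\$)\d+  →  matches "100" in "abc100"; in "$100" it skips the "1" but still matches "00"
# To reject the whole number, also exclude a preceding digit:
(?<![$\d])\d+  →  matches "100" in "abc100", nothing in "$100"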
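The four assertions can be verified with a short Python sketch; the sample strings are taken from the examples above. One subtlety worth seeing in action: against "$100", the negative lookbehind only blocks a match that starts right after the dollar sign, so the engine still finds "00". Adding \d to the lookbehind is one possible way to avoid that.

```python
import re

text = "queen Iraq"

# Lookaheads: the "u" is checked but not included in the match.
followed_by_u = re.search(r"q(?=u)", text).start()      # index of the q in "queen"
not_followed_by_u = re.search(r"q(?!u)", text).start()  # index of the q in "Iraq"

# Lookbehinds: the "$" is checked but not included in the match.
after_dollar = re.findall(r"(?<=\$)\d+", "$100 abc100")
not_after_dollar = re.findall(r"(?<!\$)\d+", "abc100")
partial_match = re.findall(r"(?<!\$)\d+", "$100")
stricter = re.findall(r"(?<![$\d])\d+", "$100 abc100")

print(followed_by_u)      # 0
print(not_followed_by_u)  # 9
print(after_dollar)       # ['100']
print(not_after_dollar)   # ['100']
print(partial_match)      # ['00']
print(stricter)           # ['100']
```

Lookbehind support varies between engines. Python requires the lookbehind to have a fixed width, which both patterns here satisfy.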

Non-Capturing Groups

By default, parentheses create capturing groups that store matched text for later use (accessible via $1, \1, or match.groups()). When you only need grouping for structural purposes — such as alternation — use (?: ) to avoid the overhead of capturing:

# Capturing group: stores "dog" or "cat" as $1
^(dog|cat) food$

# Non-capturing group: groups without storing
^(?:dog|cat) food$
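The difference is easy to observe in Python, as this sketch shows. The sample sentence is made up for illustration. Besides groups(), the choice also changes what re.findall returns: with a capturing group it returns only the captured text, and without one it returns the full match.

```python
import re

capturing = re.match(r"^(dog|cat) food$", "cat food")
non_capturing = re.match(r"^(?:dog|cat) food$", "cat food")

print(capturing.groups())      # ('cat',)
print(non_capturing.groups())  # ()

sentence = "dog food, cat food"
captured_only = re.findall(r"(dog|cat) food", sentence)
full_matches = re.findall(r"(?:dog|cat) food", sentence)

print(captured_only)  # ['dog', 'cat']
print(full_matches)   # ['dog food', 'cat food']
```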

Greedy vs Lazy Quantifiers

By default, *, +, and {n,m} are greedy — they match as many characters as possible. Appending ? makes them lazy, matching as few characters as possible. This distinction is critical when the pattern appears multiple times in a string:

# Greedy: matches from first <p> to last </p>
<p>.*<\/p>

# Lazy: matches each <p>...</p> pair individually
<p>.*?<\/p>

# Input: "<p>First</p><p>Second</p>"
# Greedy matches: "<p>First</p><p>Second</p>" (one match)
# Lazy matches:   "<p>First</p>" and "<p>Second</p>" (two matches)
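The same comparison can be reproduced in Python with the sketch below. The forward slash needs no escaping in a Python pattern, so it is written plainly here.

```python
import re

html = "<p>First</p><p>Second</p>"

# Greedy: .* runs to the end, then backtracks to the last </p>.
greedy = re.findall(r"<p>.*</p>", html)

# Lazy: .*? stops at the first </p> it can reach.
lazy = re.findall(r"<p>.*?</p>", html)

print(greedy)  # ['<p>First</p><p>Second</p>']
print(lazy)    # ['<p>First</p>', '<p>Second</p>']
```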

Regex in Your Favorite Languages

Regex syntax is largely portable, but each language has its own API for applying patterns. Here is how to use the email pattern from this tutorial in JavaScript, Python, Java, and the shell:

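// JavaScript
const emailRegex = /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;
console.log(emailRegex.test("user@example.com"));  // true
// The anchored pattern only matches whole strings; drop ^ and $ to search inside text:
const emailFinder = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;
console.log("Contact: user@example.com".match(emailFinder));  // ["user@example.com"]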

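# Python
import re
email_regex = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
print(bool(re.match(email_regex, "user@example.com")))  # True
# Extract all emails from text. Use an unanchored pattern here:
# with ^ and $ the whole text would have to be a single address.
email_finder = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
text = "Email alice@example.com and bob@example.org"
print(re.findall(email_finder, text))  # ['alice@example.com', 'bob@example.org']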

// Java
import java.util.regex.*;
Pattern emailRegex = Pattern.compile(
    "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
);
Matcher m = emailRegex.matcher("user@example.com");
System.out.println(m.matches());  // true

# Shell (grep)
echo "user@example.com" | grep -E '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

Conclusion

Regular expressions are one of the most valuable skills a developer can learn. A handful of building blocks (character classes, quantifiers, anchors, and groups) cover the vast majority of real-world use cases. Start with simple patterns and build complexity incrementally, testing each step as you go. The patterns in this tutorial are practical starting points for form validation, log parsing, data extraction, and search-and-replace workflows; test them against your own data before relying on them. Ready to test your own regex? Use our free online regex tester to write, test, and debug your regular expressions with real-time matching and explanation.
