Introduction to Regular Expressions

Regular expressions are a mysterious but extremely useful programming tool. The syntax of Regular Expressions (RegEx) can be daunting for those starting out. Additionally, I've found a lot of RegEx intro guides to be vague, which personally turned me off from learning RegEx for many months. For a while, my "knowledge base" of RegEx consisted of copying and pasting Regular Expression snippets as needed. This guide, however, will introduce you to the basics of Regular Expressions in a hopefully accessible manner.

What are Regular Expressions Good For?

Regular Expressions, by definition, are symbolic patterns that describe text. Regular Expressions serve many useful purposes, including:
  • Formatting text
  • Extracting substrings from a string
  • Finding and replacing characters that are not strictly formatted
  • Validating form data
  • Matching text patterns
When it comes to web development, form validation tends to be the primary reason a developer turns to Regular Expressions. Hopefully this tutorial, and an upcoming tutorial discussing the use of Regular Expressions in your workflow, will shake off the notion that Regex == Form Validation.

Conventions in this Article

For this tutorial, we'll use a generic syntax. Note that languages deal with escape characters and back references differently, so keep that in mind whenever you apply this knowledge to your language of choice. Regular expression patterns will be denoted with forward slashes (//). For your reference:
  • PHP Regex Syntax
  • JavaScript Regex Syntax

    Thinking Like a RegEx Engine

    Using Regular Expressions requires a great deal of abstract thinking. The most important concept to understand is how a RegEx engine works. You'll find it easier to think about Regular Expressions if you, as silly as it sounds, pretend you're a robot! For example, "cat" to us means a furry, lovable creature. But instead of seeing "cat" as a word, think of cat as the character 'c', followed immediately by the character 'a', followed immediately by the character 't'. In Regular Expressions, the position of each character is extremely important.
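
To make the robot mindset concrete, here is a minimal sketch in Python of how a literal pattern could be checked one character at a time. The helper name find_literal is made up for this illustration, and real RegEx engines are far more sophisticated than this loop.

```python
def find_literal(pattern: str, text: str) -> int:
    """Return the index where the literal pattern first occurs, or -1."""
    # Try every possible starting position, left to right.
    for start in range(len(text) - len(pattern) + 1):
        # At this position, compare the pattern one character at a time:
        # 'c', followed immediately by 'a', followed immediately by 't'.
        if all(text[start + i] == ch for i, ch in enumerate(pattern)):
            return start
    return -1


print(find_literal("cat", "concatenate"))  # 3
print(find_literal("cat", "act"))          # -1
```

Note that "act" contains all three letters, but not in the right positions, so there is no match.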

    Regular Expression Syntax

    Now that we have the hypothetical-philosophical-BS out of the way, let's make some regular expressions!

    Literal Characters

    Literal characters are simply characters without any modifications. Unless you specify otherwise, the RegEx engine will treat all characters as case-sensitive. Thus the pattern /a/ matches the last character in "Agenda", but not the first character, 'A'. Let's say we were parsing some HTML, and we wanted to know if there was an image tag anywhere in the file. We could use the pattern /<img/ since we know that there shouldn't be a space between the open bracket and the element name in HTML. See img tag example
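
If you'd like to try the two literal patterns above in code, here is a small sketch using Python's built-in re module. The HTML snippet is made up for the example, and Python writes patterns as strings rather than between forward slashes.

```python
import re

# Matching is case-sensitive by default, so only the lowercase 'a' matches.
match = re.search(r"a", "Agenda")
print(match.start())  # 5

# A hypothetical scrap of HTML to search for an image tag.
html = '<p>Logo: <img src="logo.png" alt="Logo"></p>'
has_img = re.search(r"<img", html) is not None
print(has_img)  # True

# The re.IGNORECASE flag is one way to "specify otherwise".
print(re.search(r"a", "Agenda", re.IGNORECASE).start())  # 0
```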

    Escaping Special Characters

    There are numerous characters set aside for special operations in Regular Expressions. These characters include:
    • [
    • ]
    • (
    • )
    • ?
    • *
    • +
    • ^
    • $
    • \
    • .
    • {
    • }
    If you want to express the literal version of these characters, precede the character with a backslash \. That means to represent a backslash itself, you will need two backslashes. For example, let's say you were looking for hotwire.com in a text file. The pattern /hotwire.com/ does indeed match the string "hotwire.com", but also matches "hotwirehcom", "hotwireLcom", etc., because the unescaped . symbol can represent ANY character. The correct pattern would be /hotwire\.com/. See hotwire.com example
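
Here is a quick sketch of the hotwire.com example in Python, comparing the unescaped and escaped patterns side by side. The variable names are just for illustration.

```python
import re

unescaped = r"hotwire.com"   # the . matches any character
escaped = r"hotwire\.com"    # the \. matches only a literal dot

for text in ["hotwire.com", "hotwirehcom", "hotwireLcom"]:
    print(text, bool(re.search(unescaped, text)), bool(re.search(escaped, text)))

# re.escape builds an escaped pattern from a literal string for you.
print(re.escape("hotwire.com"))  # hotwire\.com
```

The unescaped pattern reports a match for all three strings, while the escaped one only accepts the real domain.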

    Character Classes

    Literal characters are hard-coded and stoic. Consequently, literal characters on their own usually don't provide much utility to the developer. However, the dynamic nature of character classes opens up a world of possibilities. Character classes are characters denoted between square brackets that represent a number of options for a particular character slot. For example, the pattern /h[ao]t/ matches "hat" and "hot". The characters 'a' and 'o' are said to be within a character class. No matter how many characters are in the character class, only one of them will be matched. The RegEx engine iterates through every character in the character class to see if there is a potential match with the string being tested. Thus, /h[abcdefghijklmnopqrstuvwxyz]t/ will still match "hat" and "hot", but will now also match "hbt", "hct", etc. Character classes only take up one character "slot". Thus, it's quite common to see one character class after another character class. As another example, take the pattern /h[ao][pt]/. This pattern matches:
    • hap
    • hat
    • hot
    • hop
    See example
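
A small Python sketch of the /h[ao][pt]/ pattern follows. The word list is made up, and fullmatch is used so that each word must match the pattern from start to finish.

```python
import re

pattern = re.compile(r"h[ao][pt]")

# "hit" fails the first class; "haot" has too many characters for two slots.
words = ["hap", "hat", "hot", "hop", "hit", "haot"]
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)  # ['hap', 'hat', 'hot', 'hop']
```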

    Ranges in Character Classes

    Within a character class, you can define a range of characters for the RegEx engine to match using the - operator. For example, the pattern /[a-c][0-2]/ matches:
    • a0
    • a2
    • b2
    • c1
    And more. Additionally, multiple ranges can be specified within one character class. It's important to remind yourself that no matter what, only one character will be matched by the character class. The pattern /[a-zA-Z0-9]/ matches a single alphanumeric character. The pattern is equivalent to /[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]/. Since - indicates a range of characters, it is a special character within character classes. There are two ways of using the - as a literal character in the event you need it as a possible match within the character class.
    1. Escape the - with a backslash. (\-)
    2. Place the - at the very end of the character class.
    So a pattern for simple 1 digit mathematical expressions would be /[0-9][*/+-][0-9]/. Note that the * and + need no escaping here, since character classes have their own set of special characters. The pattern would match:
    • 1+1
    • 3-5
    • 5*9
    And more. See math expression example
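
Here is the math expression pattern as a Python sketch. Since Python patterns are plain strings rather than slash-delimited, the / inside the class needs no special treatment; other languages may differ. The test strings are made up.

```python
import re

# The hyphen comes last in the class, so it is a literal '-' and not a range.
math_expr = re.compile(r"[0-9][*/+-][0-9]")
for text in ["1+1", "3-5", "5*9", "8/2", "1^1", "a+b"]:
    print(text, bool(math_expr.fullmatch(text)))

# Multiple ranges in one class still match exactly one character.
alnum = re.compile(r"[a-zA-Z0-9]")
print(bool(alnum.fullmatch("q")), bool(alnum.fullmatch("_")))  # True False
```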

    Special Characters in Character Classes

    Character classes have their own distinct set of special characters, separate from the special characters listed above. The special characters for character classes are:
    • \ (escape character)
    • ^ (negation)
    • - (range)
    • ] (character class end)
    To have the character class match any of these as literal characters, they must be preceded with a backslash (\). See example for escaping special characters
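
As a rough illustration, this Python sketch builds a character class out of nothing but the four special characters, each escaped with a backslash. The sample string is made up, and raw strings (the r prefix) keep Python itself from interpreting the backslashes.

```python
import re

# One class that matches a backslash, a caret, a hyphen, or a closing bracket.
specials = re.compile(r"[\\\^\-\]]")

found = specials.findall(r"a\b^c-d]e")
print(found)  # ['\\', '^', '-', ']']
```

The doubled backslash in the printed list is just how Python displays a single backslash character.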

    Character Class Negation

    At times you may want to select all characters except a specified few. In that case, placing a caret (^) immediately after the opening bracket of the character class means the character class will match any single character except the ones listed in the rest of the character class. For example, the pattern /hi[^a-zA-Z0-9]/ matches:
    • hi!
    • hi$
    • hi,
    See negated character class example
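
A short Python sketch of the negated class is below, with a few made-up strings that should fail. One thing worth noticing: a negated class still has to consume a character, so plain "hi" does not match.

```python
import re

pattern = re.compile(r"hi[^a-zA-Z0-9]")

for text in ["hi!", "hi$", "hi,", "him", "hi5", "hi"]:
    # fullmatch requires the entire string to fit the pattern.
    print(repr(text), bool(pattern.fullmatch(text)))
```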

    Predefined Character Classes

    Most languages come with many useful predefined character classes.
    Predefined character class    Matches
    \d                            A digit
    \w                            An alphanumeric character or underscore
    \s                            A whitespace character
    \D                            A non-digit character
    \W                            A character that is neither alphanumeric nor an underscore
    \S                            A non-whitespace character
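
To see the predefined classes in action, here is a small Python sketch run against a made-up string. Keep in mind that the exact definitions vary a little between languages; in Python 3, for instance, \d and \w also match non-ASCII digits and letters by default.

```python
import re

text = "Room 42, floor_3"

print(re.findall(r"\d", text))  # ['4', '2', '3']
print(re.findall(r"\s", text))  # [' ', ' ']
# The underscore counts as a \w character, so \W skips it.
print(re.findall(r"\W", text))  # [' ', ',', ' ']
```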

    Quantifiers

    So far we've discussed literals and character classes, both of which only match a single character. What if we want to have the same character class repeated? That's where quantifiers come in.
    Quantifier    Description
    ?             0 or 1 occurrences (optional)
    *             0 or more occurrences
    +             1 or more occurrences
    {x}           Exactly x occurrences
    {x,y}         Between x and y occurrences (no space after the comma)
    {x,}          At least x occurrences
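
Here is a quick Python sketch comparing a few of these quantifiers on the same made-up string. In each pattern the quantifier applies only to the character right before it, the 'b'.

```python
import re

text = "a ab abbb"

print(re.findall(r"ab*", text))     # ['a', 'ab', 'abbb']
print(re.findall(r"ab+", text))     # ['ab', 'abbb']
print(re.findall(r"ab?", text))     # ['a', 'ab', 'ab']
print(re.findall(r"ab{2,}", text))  # ['abbb']
```
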
    If you want to quantify a group of literals, you will need to wrap them in parentheses. For example, the pattern /(piggy ?){3}/ matches:
    • piggy piggy piggy
    • piggypiggypiggy
    • piggypiggy piggy
    The question mark indicates that the space after the y is optional. However, the pattern of a p followed by i, followed by g, followed by g, followed by y, followed by an optional space, must be repeated 3 times. So how would we match the format of the following US phone number: 903-555-5555? We observe that the phone number is 3 digits, followed by a dash, followed by 3 digits, followed by a dash, followed by 4 digits. Thus, the pattern would be /\d{3}-\d{3}-\d{4}/ See phone number example
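
Both patterns from this section can be tried out with a short Python sketch. The non-matching test strings are made up to show where the patterns draw the line.

```python
import re

# The group (piggy ?) must repeat exactly 3 times; each space is optional.
piggy = re.compile(r"(piggy ?){3}")
for text in ["piggy piggy piggy", "piggypiggypiggy", "piggypiggy piggy", "piggy piggy"]:
    print(repr(text), bool(piggy.fullmatch(text)))

# 3 digits, a dash, 3 digits, a dash, 4 digits.
phone = re.compile(r"\d{3}-\d{3}-\d{4}")
print(bool(phone.fullmatch("903-555-5555")))  # True
print(bool(phone.fullmatch("903-555-555")))   # False
```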

    Bar operator

    Additionally, if you wish to choose between different sequences of multiple characters, the bar operator (|) lets the pattern match any one of the sequences. For example, the pattern /Do not (eat|hit|fight|scratch) the cat/ matches:
    • Do not eat the cat
    • Do not hit the cat
    • Do not fight the cat
    • Do not scratch the cat
    See the cat example
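
Here is the cat pattern as a Python sketch. As a bonus, the parentheses also capture which alternative was matched, which group(1) retrieves. The "pet" sentence is made up to show a non-match.

```python
import re

pattern = re.compile(r"Do not (eat|hit|fight|scratch) the cat")

m = pattern.fullmatch("Do not fight the cat")
print(m.group(1))  # fight

# "pet" is not one of the listed alternatives, so there is no match.
print(pattern.fullmatch("Do not pet the cat"))  # None
```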

    Resources

    There are some great resources for learning regular expressions on the web. As you've seen throughout the article, I've referenced RegexPal many times. RegexPal is an invaluable tool for testing your regular expressions. As for books, I highly recommend the e-book available from Regular-Expressions.info. For only $5, you can get one of the most comprehensive publications on Regular Expressions available.

    Next Article

    My next article on RegEx will cover back-references and anchors, two critical concepts related to Regular Expressions. Be sure to subscribe so you don't miss it!

    January 05, 2011
About the Author:

Joseph is the lead developer of Vert Studios.
Follow Joseph on Twitter: @Joe_Query
Subscribe to the blog: RSS
Visit Joseph's site: joequery.me