Back References, Quantifiers, and Anchors in RegEx

This article covers some intermediate concepts in regular expressions. If you're not acquainted with the basics of regex, check out our article, Introduction to Regular Expressions. As you delve further into regular expressions, you'll find just how useful they are. The combination of back references, quantifiers, and anchors makes RegEx one of the most powerful tools in your arsenal.

Conventions

Throughout this article, all patterns will be delimited by slashes (/pattern/). Note that the slashes are not part of the pattern, but merely indicate where the pattern begins and ends. That's why you will not see the slashes in the RegexPal examples provided throughout the article. For convenience, in some examples involving anchors, ^ and $ match at line breaks. In other words, each new line in those examples simulates a different string. Be sure to keep this in mind while learning the concepts.

Quantifiers (Revisited)

Our introduction to regular expressions article discussed the basics of quantifiers. For your convenience, here is the table from that article.

Quantifier	Description
?	0 or 1 occurrences (optional)
*	0 or more occurrences
+	1 or more occurrences
{x}	Exactly x number of occurrences
{x,y}	Between x and y number of occurrences (inclusive; no space after the comma)
{x,}	At least x number of occurrences
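As a quick refresher, here is a minimal Python sketch that exercises each row of the table. The sample patterns and strings are made up for illustration, and re.fullmatch is used so that each string is tested as a whole rather than searched for a partial match.

```python
import re

# Each entry: (pattern, strings that should fit in full, strings that should not).
# The sample strings are illustrative only.
QUANTIFIER_EXAMPLES = [
    (r"colou?r", ["color", "colour"], ["colouur"]),        # ?     : 0 or 1
    (r"ab*", ["a", "ab", "abbb"], ["b"]),                  # *     : 0 or more
    (r"ab+", ["ab", "abbb"], ["a"]),                       # +     : 1 or more
    (r"a{3}", ["aaa"], ["aa", "aaaa"]),                    # {x}   : exactly x
    (r"a{2,4}", ["aa", "aaa", "aaaa"], ["a", "aaaaa"]),    # {x,y} : between x and y
    (r"a{2,}", ["aa", "aaaaaa"], ["a"]),                   # {x,}  : at least x
]


def fully_matches(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` fits `pattern`."""
    return re.fullmatch(pattern, text) is not None


for pattern, good, bad in QUANTIFIER_EXAMPLES:
    assert all(fully_matches(pattern, s) for s in good)
    assert not any(fully_matches(pattern, s) for s in bad)
```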

Also recall that in order to apply a quantifier or bar operator to a sequence of characters, you'll need to use parentheses. The pattern /face(b|cr)ook/ matches

facebook
facecrook

See the facebook example
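To see why the parentheses matter, here is a small Python sketch comparing the grouped pattern with an ungrouped variant. The ungrouped pattern and the word list are our own additions for contrast, not part of the original example.

```python
import re

# With parentheses, the bar only chooses between "b" and "cr".
GROUPED = re.compile(r"face(b|cr)ook")
# Without them, the bar splits the whole pattern: "faceb" OR "crook".
UNGROUPED = re.compile(r"faceb|crook")

words = ["facebook", "facecrook", "crook"]

# fullmatch requires the entire word to fit the pattern.
grouped_hits = [w for w in words if GROUPED.fullmatch(w)]
ungrouped_hits = [w for w in words if UNGROUPED.fullmatch(w)]

print(grouped_hits)
print(ungrouped_hits)
```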

Example: Constructing an Email Pattern

Quantifiers are extremely useful when combined with character classes. Consider the most basic email address: Name1@domain.com. We'll define an email address as:

2 or more alphanumeric characters, followed by the @ symbol, followed by 2 or more alphanumeric characters, followed by a period, followed by at least 2 alphabetic characters.

(This definition of an email pattern should not be used in a real application. This simple and strict definition is just easier to learn with.) So valid email addresses would be joseph@vertstudios.com and myEmail9090@haha1.com. The following would be invalid email addresses: j0$eph@@vertstudios.com and justin@vertstudios. Let's build this pattern piece by piece from our definition.

2 or more alphanumeric characters

An alphanumeric character is any letter of the alphabet (both lowercase and uppercase) or any digit from 0 to 9. This can be represented by [a-zA-Z0-9]. Refer to the introduction article if you're unfamiliar with ranges in character classes. We need an alphanumeric character to repeat at least twice. Referring to the quantifier table above, we see that the last row contains the quantifier that suits this need. So, in order to specify at least two alphanumeric characters, the pattern would be

/[a-zA-Z0-9]{2,}/

followed by the @ symbol

Since the @ symbol is not a special character, adding the symbol immediately after our previous pattern will require the @ character to be placed after the alphanumeric characters.

/[a-zA-Z0-9]{2,}@/

followed by 2 or more alphanumeric characters

Notice we already defined the pattern for 2 or more alphanumeric characters. Simply append that to what we have so far.

/[a-zA-Z0-9]{2,}@[a-zA-Z0-9]{2,}/

followed by a period

Since a period is a special character (unescaped, it matches almost any character), we must escape it with a backslash.

/[a-zA-Z0-9]{2,}@[a-zA-Z0-9]{2,}\./

followed by at least 2 alphabetic characters

Now we want at least 2 alphabetic characters. These represent the com, edu, etc. This pattern will be similar to our alphanumeric pattern used earlier, but without the digits in the character class. After including this pattern, we arrive at our final product.

/[a-zA-Z0-9]{2,}@[a-zA-Z0-9]{2,}\.[a-zA-Z]{2,}/

View our email example
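The finished pattern can be tried out with a short Python sketch like the one below. The helper name looks_like_email is hypothetical, and re.search is used because the pattern has no anchors yet, so it only checks that an email-shaped run appears somewhere in the string.

```python
import re

# The final pattern from the walkthrough above (slash delimiters dropped).
EMAIL = re.compile(r"[a-zA-Z0-9]{2,}@[a-zA-Z0-9]{2,}\.[a-zA-Z]{2,}")


def looks_like_email(text: str) -> bool:
    """Hypothetical helper: True if the pattern is found anywhere in `text`."""
    return EMAIL.search(text) is not None


samples = [
    "joseph@vertstudios.com",     # valid by our definition
    "myEmail9090@haha1.com",      # valid by our definition
    "j0$eph@@vertstudios.com",    # invalid: "$" and a doubled "@"
    "justin@vertstudios",         # invalid: no period and ending
]

results = {s: looks_like_email(s) for s in samples}
for sample, ok in results.items():
    print(sample, ok)
```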

Negated Character Classes and Quantifiers

One of the most amazing 1-2 punches you can throw as a developer using RegEx is the union of negated character classes and quantifiers.

Example: Matching HTML Elements

Say you wanted to match all the opening and closing tags of some HTML. There are a TON of characters that are allowed to be in an HTML document, so attempting to match the tags through a normal character class would be extremely cumbersome. However, we can easily avoid this issue by logically analyzing an HTML tag.

What is an HTML tag?

HTML tags begin with < and end with >. One example is the line break tag, <br />. Since > denotes the end of the tag, it makes sense to say that everything in between < and > is a character that is not >. Thus, in order to match an HTML tag, the following pattern is used:

/<[^>]+>/

This pattern says: "Match a < followed by one or more of any character that is not a >, followed by >." See Match all HTML Tags Example. If you wanted to be a bit more specific and find XHTML non-paired tags (such as <link /> and <img />), you could modify the pattern to include the slash before the closing >. Because slashes delimit our patterns, the literal slash is escaped with a backslash.

/<[^>]+\/>/

See Self-closing XHTML tags Example. Using the logic of these examples, you can turn a negated character class and a quantifier into a compact pattern for almost any delimited piece of text.
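Both tag patterns can be sketched in Python as follows. The HTML snippet is invented for illustration, and since Python raw strings are not slash-delimited, the slash needs no escaping here.

```python
import re

# A made-up fragment containing paired and self-closing tags.
html = '<p class="intro">Line one<br />Line two</p><img src="a.png" />'

# "<", then one or more characters that are not ">", then ">".
all_tags = re.findall(r"<[^>]+>", html)

# Same idea, but the tag must end in "/>".
self_closing = re.findall(r"<[^>]+/>", html)

print(all_tags)
print(self_closing)
```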

Anchors

Sometimes where you want to begin and end the search for a pattern is as important as the content of the pattern itself. This is especially true concerning validation of user input. Consider our regex for an email that we created earlier:

/[a-zA-Z0-9]{2,}@[a-zA-Z0-9]{2,}\.[a-zA-Z]{2,}/

If we were processing user input from a form through JavaScript or PHP, all of the following strings would match the pattern:

joseph@vertstudios.com
joseph@vertstudios.com IS A LOSER
(XSS Here!) joseph@vertstudios.com

Without implementing an anchor, our pattern can be applied anywhere in the string we're testing. See email regex without anchors

Start of String Anchor

When used at the beginning of the pattern and outside of a character class, the caret (^) "matches" the position before the first character in the string being tested. In other words, the pattern will find a match only if whatever follows ^ in the pattern matches at the very start of the string. Unlike /[a-z]/, which matches a lowercase letter, the caret never matches a character itself; it only asserts a position. (When multi-line mode is enabled, it also matches the position just after each line break.) For example, /^a/ matches the first character in the string "about" since the first character in the string "about" is the lowercase letter a. /^c/ does NOT match anywhere in the string "acute" since c is not the first character of the string. As another example, say you want to match strings that begin with a letter, followed by any number of alphanumeric characters. The pattern would be

/^[a-zA-Z][a-zA-Z0-9]*/

See beginning anchor example. Note: ^ and $ match at line breaks to simulate multiple strings.
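Here is a minimal Python sketch of the start of string anchor. The test strings other than "about" and "acute" are our own, and re.MULTILINE is Python's name for multi-line mode.

```python
import re

# ^ pins the match to the start of the string.
about_hit = re.search(r"^a", "about")    # "a" is the first character
acute_hit = re.search(r"^c", "acute")    # "c" exists, but not at the start

# A letter first, then any number of alphanumeric characters.
IDENTIFIER_START = re.compile(r"^[a-zA-Z][a-zA-Z0-9]*")
leading_run = IDENTIFIER_START.search("user42!")   # match stops before "!"
digit_first = IDENTIFIER_START.search("42user")    # begins with a digit

# Multi-line mode: ^ also matches right after each line break,
# so every line behaves like its own string.
per_line = re.findall(r"^[a-z]+", "one\n2two\nthree", re.MULTILINE)

print(about_hit, acute_hit, leading_run, digit_first, per_line)
```

Notice that "user42!" still produces a match: with only a start anchor, the pattern is satisfied by the leading run of characters and ignores whatever comes after it.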

End of String Anchor

Similar to the start of string anchor, when the dollar sign ($) is placed at the end of the pattern and outside of a character class, it matches the position after the last character in the string tested. In other words, the pattern will find a match only if whatever precedes $ in the pattern matches at the very end of the string. Like the caret, $ never matches a character itself; when multi-line mode is enabled, it also matches the position just before each line break. For example, /[0-9]$/ matches "h20" since the last character is a digit. However, that pattern will not match the string "You have 5 dollars" since it does not end in a digit. See ending anchor example. Note: ^ and $ match at line breaks to simulate multiple strings.
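The end of string anchor can be sketched the same way in Python. The multi-line sample text is invented for illustration.

```python
import re

ENDS_WITH_DIGIT = re.compile(r"[0-9]$")

h20_hit = ENDS_WITH_DIGIT.search("h20")                     # ends in "0"
dollars_hit = ENDS_WITH_DIGIT.search("You have 5 dollars")  # ends in "s"

# Three lines in one string; only the first and third end in a digit.
lines = "room 101\nno number here\nexit 7"

# Default mode: $ only looks at the end of the whole string.
single_string = re.findall(r"[0-9]$", lines)

# Multi-line mode: $ also matches just before each line break.
per_line = re.findall(r"[0-9]$", lines, re.MULTILINE)

print(h20_hit, dollars_hit, single_string, per_line)
```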

Combining the Anchors

By placing the ^ at the beginning of the Regex pattern and $ at the end, you're essentially saying that the string being tested against the pattern must not only contain the pattern, it can't contain any more than what the pattern specifies. It has to match the pattern exactly: Nothing more, nothing less. As an example, let's examine the following pattern:

/^[0-9]{3,} +[a-zA-Z .]+$/

This pattern reads: A string that begins with at least 3 digits, followed by 1 or more spaces, ending with 1 or more alphabetic characters, spaces, or periods. Look familiar? Consider the string "1725 North Haynie Avenue". It starts with at least 3 digits, followed by a space, ending with a sequence of alphabetic characters and spaces. We see that the regex pattern can be interpreted as a very loose definition for a US street address.
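A short Python sketch of the combined anchors follows. Apart from "1725 North Haynie Avenue", the candidate strings are made up to show what the pattern rejects.

```python
import re

# Both anchors: the whole string must fit the loose "street address" shape.
ADDRESS = re.compile(r"^[0-9]{3,} +[a-zA-Z .]+$")

candidates = [
    "1725 North Haynie Avenue",      # fits the pattern
    "500 N. Main St.",               # periods are allowed by the class
    "12 Main St.",                   # only 2 leading digits
    "1725 North Haynie Avenue #4",   # "#" and "4" are not in the class
]

address_results = {c: ADDRESS.search(c) is not None for c in candidates}
for candidate, ok in address_results.items():
    print(candidate, ok)
```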

Combining Anchors is Imperative for Validation

If you are validating user input from a form, you MUST use both anchors. Let's consider our Email example again. Recall how an email address contained anywhere in our test string would trigger a match. Thus, a bot or mischievous developer can embed more than the characters of an email address into the form field. To prevent this, we adjust our regex to include anchors. Before:

/[a-zA-Z0-9]{2,}@[a-zA-Z0-9]{2,}\.[a-zA-Z]{2,}/

See email validation without anchors

After:

/^[a-zA-Z0-9]{2,}@[a-zA-Z0-9]{2,}\.[a-zA-Z]{2,}$/

See email validation with anchors. Note the differences in the two examples: before the anchors were in place, every string matched the pattern. After the anchors were placed in the pattern, only the top string matched as an email address. Make sure to mind your anchors; otherwise the usefulness of the rest of your pattern can easily be nullified.
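The before-and-after behavior can be sketched in Python like this, using the three form inputs from earlier in the article. The variable names are our own.

```python
import re

UNANCHORED = re.compile(r"[a-zA-Z0-9]{2,}@[a-zA-Z0-9]{2,}\.[a-zA-Z]{2,}")
ANCHORED = re.compile(r"^[a-zA-Z0-9]{2,}@[a-zA-Z0-9]{2,}\.[a-zA-Z]{2,}$")

form_inputs = [
    "joseph@vertstudios.com",
    "joseph@vertstudios.com IS A LOSER",
    "(XSS Here!) joseph@vertstudios.com",
]

# Without anchors, an email anywhere in the input counts as a match.
before = [UNANCHORED.search(s) is not None for s in form_inputs]

# With anchors, the input must be an email and nothing else.
after = [ANCHORED.search(s) is not None for s in form_inputs]

print(before)
print(after)
```

One Python-specific wrinkle worth knowing: in Python, $ also matches just before a single trailing newline, so re.fullmatch (or \Z in place of $) is the stricter choice when validating input there. Other languages have their own rules, so check the documentation for yours.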

Back References

Now that we've discussed quantifiers and anchors, let's combine our current knowledge with the concept of back references. Back references allow you to refer to the characters matched earlier in the regex pattern definition. Back references are always defined using parentheses: wrapping part of a pattern in parentheses captures whatever that part matched. For this tutorial, the backslash (\) followed by a number will be used to call back references. Check the syntax of your development language to see how it calls back references. For example, consider the regex pattern

/^([0-9])\s*=\s*\1$/

The pattern reads: Match a string that begins with a single digit, followed by 0 or more whitespace characters, followed by the equals sign, followed by 0 or more whitespace characters, ending with the digit matched at the beginning of the string. The following strings match the pattern:

1=1
2= 2
3 = 3

The following strings do not match the pattern:

10 = 10
5=3
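The matching and non-matching strings above can be checked with a small Python sketch. Python happens to use the same \1 syntax for back references inside a pattern; the helper name is_reflexive is hypothetical.

```python
import re

# ([0-9]) captures one digit; \1 demands that same digit again.
EQUALITY = re.compile(r"^([0-9])\s*=\s*\1$")


def is_reflexive(text: str) -> bool:
    """Hypothetical helper: True for strings shaped like 'd = d'."""
    return EQUALITY.search(text) is not None


should_match = ["1=1", "2= 2", "3 = 3"]
should_not_match = ["10 = 10", "5=3"]

equality_results = {s: is_reflexive(s) for s in should_match + should_not_match}
for text, ok in equality_results.items():
    print(repr(text), ok)
```

"10 = 10" fails because the group captures exactly one digit, so the pattern expects whitespace or an equals sign where the 0 appears.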

See equality back-reference example. Note that we wrapped what we wanted to refer back to in parentheses. To refer to the matched characters later in the pattern, we use \ and the number that corresponds to the reference. Reference numbers begin with 1, and usually go up to 9. Groups are numbered by the position of their opening parenthesis, counting from the left: the first set is referred to with a 1, the next set with a 2, and so on. To demonstrate the usage of multiple back references, consider this pattern that describes the commutative property.

/^([0-9])\+([0-9])=\2\+\1$/

The pattern reads: Match a string that begins with a single digit, followed by a plus sign, followed by a single digit, followed by an equal sign, followed by the second digit matched, followed by a plus sign, ending with the first digit matched. This pattern will match

5+3=3+5
2+1=1+2
9+7=7+9

This pattern will not match

9+3=2+1
9=9
3+3=3+3+0

See commutative property example
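As a rough Python sketch of the commutative pattern, the snippet below checks the strings listed above and also reads back what each group captured. The variable names are our own.

```python
import re

# Group 1 is the first digit, group 2 the second; the right-hand side
# must repeat them in swapped order.
COMMUTATIVE = re.compile(r"^([0-9])\+([0-9])=\2\+\1$")

should_match = ["5+3=3+5", "2+1=1+2", "9+7=7+9"]
should_not_match = ["9+3=2+1", "9=9", "3+3=3+3+0"]

matched = [s for s in should_match + should_not_match if COMMUTATIVE.search(s)]

# The captured digits are available by group number after a match.
m = COMMUTATIVE.search("5+3=3+5")
first_digit, second_digit = m.group(1), m.group(2)

print(matched)
print(first_digit, second_digit)
```

"3+3=3+3+0" is rejected only because of the $ anchor; without it, the pattern would happily match the first seven characters.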

My next article on Regex will describe the use of Regular Expressions in IDEs. It will provide plenty of great examples of combining all the concepts discussed in this article. Be sure to Subscribe so you don't miss it!

January 07, 2011

About the Author:

Joseph is the lead developer of Vert Studios
Follow Joseph on Twitter: @Joe_Query
Subscribe to the blog: RSS
Visit Joseph's site: joequery.me
