    Regular Expressions

    Regular expressions, often abbreviated as regex, are a powerful tool used to search and manipulate text (string) data. They provide a concise way to define patterns within strings, allowing for efficient matching and extraction of specific information. While the underlying syntax might appear complex, understanding the core concepts empowers even non-technical users to leverage this valuable technique.

    Regex basics

    Regular expressions are too complex and varied to cover every nuance here, so this documentation focuses on the most fundamental regex tokens and their core functionality. This gives you a solid foundation for basic pattern matching and a starting point for exploring further regex capabilities.

    If you encounter specific use cases beyond the covered tokens, we recommend referring to advanced online regex resources, e.g. https://regex101.com/. Since our regex implementation is Java-based, make sure to use the Java-based regex flavor to ensure desired results.

    Regex searches are case-sensitive by default, and a pattern can also match part of a longer word. Bear this in mind and write your pattern so that it covers exactly the cases you want to match. If letter case does not matter, add the (?i) flag at the start of the pattern to make the search case-insensitive.
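    The behavior described above can be tried out with the Java standard library. This is a minimal sketch, assuming the search behaves like Matcher.find() (a match anywhere in the text); the class and method names are illustrative only and not part of the product.

```java
import java.util.regex.Pattern;

final class CaseSensitivityDemo {
    // Returns true if the regex matches anywhere in the text (substring search).
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // The pattern also matches inside a longer word.
        System.out.println(found("product", "productive"));  // true
        // Matching is case-sensitive by default.
        System.out.println(found("product", "Product"));     // false
        // The (?i) flag switches to case-insensitive matching.
        System.out.println(found("(?i)product", "PRODUCT")); // true
    }
}
```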

    Each entry below lists the regex token (where applicable), its purpose, a description, an example, and an explanation of the example.

    Find exact match

    Matches the exact form of the word.

    product

    It will match product, products, or productive but not Product or PRODUCT.

    ?

    0 or 1

    A question mark (?) specifies that the preceding element can appear 0 or 1 time.

    colou?r

    It will match both color and colour. It will not match coloor or colouur.

    *

    0 or more

    An asterisk (*) specifies that the preceding element can appear 0 or more times.

    colou*r

    It will match color, colour, or colouur.

    +

    1 or more

    A plus sign (+) specifies that the preceding element can appear 1 or more times.

    colou+r

    It will match colour or colouur.

    .

    Wildcard dot

    A dot (.) matches any single character.

    b.t

    It will match bit, bat, but, but also b-t, b t, or b*t.

    |

    Alternative

    The pipe (|) functions as the OR operator.

    USA|UK

    It will match USA or UK.

    ()

    Grouping

    Parentheses () act like brackets in math. They group a set of characters or elements into a single unit.

    col(our|or)

    col(our|or) will match colour or color.

    Without parentheses, colour|or will match colour or or.

    []

    Character classes

    Square brackets [] enclose a group of characters, indicating a match with any character within the set.

    They can also be used to exclude a group of characters when a caret (^) is the first character inside the brackets.

    • [0-9]

    • [a-z]

    • [A-Z]

    • [a-zA-Z]

    • [aeiou]

    • [^aeiou]

    • [0-9] will match any digit.

    • [a-z] and [A-Z] will match any lower-case or upper-case letter, respectively.

    • [a-zA-Z] will match any lower-case or upper-case letter.

    • [aeiou] will match any instance of a, e, i, o, u.

    • [^aeiou] will match any character except for a, e, i, o, u.

    {}

    Repeating characters

    Curly braces {} specify the number of repetitions of the preceding element. Combine them with parentheses to apply the repetition to a sequence of characters.

    • {number} - exact number of repetitions

    • {number,} - minimum number of repetitions

    • {number,number} - range of repetitions

    • who{1}p

    • who{1,}p

    • who{1,2}p

    • (mur){1,2}

    • who{1}p will match whop only.

    • who{1,}p will match whop, whoop, or whooop.

    • who{1,2}p will match whop or whoop, but not whooop.

    • (mur){1,2} will match both mur and murmur.

    \s and \S

    Whitespace-related tokens

    • \s - any whitespace character

    • \S - any non-whitespace character

    " Hello "

    • \s will match 2 whitespace characters (two spaces).

    • \S will match 5 characters (letters in the word Hello).

    \d and \D

    Digit-related tokens

    • \d - matches any digit

    • \D - matches any non-digit character, including whitespace characters

    123 Main Street, Anytown, CA 94108

    • \d will match 8 characters: 123 and 94108.

    • \D will match 26 characters, including all whitespace characters and commas:  Main Street, Anytown, CA .

    \w and \W

    Word-related tokens

    • \w - matches any word character (equivalent to [a-zA-Z0-9_]).

    • \W - matches any non-word character (equivalent to [^a-zA-Z0-9_]).

    123 Main Street, Anytown, CA 94108

    • \w will match 27 characters: 123MainStreetAnytownCA94108.

    • \W will match 7 characters (all whitespace characters and commas).

    ^ and $

    Start and end of strings

    A caret (^) or a dollar sign ($) limits your search to the start or end of a string, respectively.

    • ^SH7890

    • SH7890$

    • ^SH7890$

    • ^SH7890 will match product codes beginning with SH7890, e.g., SH7890456 or SH7890BA1.

    • SH7890$ will match product codes ending in SH7890, e.g., 415YSH7890 or 464SH7890.

    • ^SH7890$ will match SH7890 only.
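    Several of the entries above can be checked with java.util.regex. The following is a sketch only: the helper names count and matchesWhole are hypothetical, and the sample strings are taken from the examples above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenTableDemo {
    // Counts the non-overlapping matches of the regex in the text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Returns true only if the regex matches the entire text.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String address = "123 Main Street, Anytown, CA 94108";
        System.out.println(count("\\d", address));   // 8
        System.out.println(count("\\D", address));   // 26
        System.out.println(count("\\w", address));   // 27
        System.out.println(count("\\W", address));   // 7
        System.out.println(count("\\s", " Hello ")); // 2
        System.out.println(count("\\S", " Hello ")); // 5

        System.out.println(matchesWhole("colou?r", "color"));     // true
        System.out.println(matchesWhole("colou?r", "colouur"));   // false
        System.out.println(matchesWhole("colou*r", "colouur"));   // true
        System.out.println(matchesWhole("colou+r", "color"));     // false
        System.out.println(matchesWhole("who{1,2}p", "whooop"));  // false
        System.out.println(matchesWhole("(mur){1,2}", "murmur")); // true
        System.out.println(matchesWhole("col(our|or)", "color")); // true

        // Anchors: the code must start with, or end in, SH7890.
        System.out.println(count("^SH7890", "SH7890456")); // 1
        System.out.println(count("SH7890$", "SH7890456")); // 0
    }
}
```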

    Regular expression usage examples
    Example 92. Matching dates

    Create a basic formula that would find dates in the format DD/MM/YYYY (format matching 2 digits/2 digits/4 digits).

    The regex formula would be: \d{2}/\d{2}/\d{4}.

    Explanation:

    • \d{2}: This matches exactly two digits (\d represents any single digit, and {2} specifies it must occur twice consecutively).

    • /: This matches a literal forward slash character ("/").

    • \d{2}: Similar to the first part, this matches exactly two digits again.

    • /: Another literal forward slash match.

    • \d{4}: This matches exactly four digits. This captures the year (e.g., 2024).
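    The date pattern explained above could be applied in Java roughly as follows. This is a sketch with a hypothetical helper, findDates; note that the pattern checks only the shape of the text, not whether the day and month are valid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DateFinderDemo {
    // Two digits, a slash, two digits, a slash, four digits.
    private static final Pattern DATE = Pattern.compile("\\d{2}/\\d{2}/\\d{4}");

    // Collects every substring of the text that has the DD/MM/YYYY shape.
    static List<String> findDates(String text) {
        List<String> dates = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            dates.add(m.group());
        }
        return dates;
    }

    public static void main(String[] args) {
        // Prints [05/03/2024, 12/03/2024]
        System.out.println(findDates("Ordered 05/03/2024, shipped 12/03/2024."));
        // Prints [99/99/9999]: the digits are not range-checked.
        System.out.println(findDates("Invalid date 99/99/9999"));
    }
}
```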

    Example 94. Validating US Postal Codes

    Postal codes need to be in the format ##### (basic 5-digit zip code) or #####-#### (ZIP+4 code).

    The regex formula would be: ^\d{5}(-\d{4})?$.

    Explanation:

    • ^: Matches the beginning of the string.

    • \d{5}: Matches exactly five digits.

    • (-\d{4})?: Matches an optional group consisting of a literal hyphen (-) followed by exactly four digits (0-9). The question mark (?) makes the entire group optional, so the hyphen and the four digits can both be absent.

    • $: Matches the end of the string.
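    As an illustration, the postal code pattern can be exercised in Java like this. The sketch assumes a substring-style search (Matcher.find()), so the ^ and $ anchors are what force the whole value to match; isValidZip is a hypothetical helper name.

```java
import java.util.regex.Pattern;

final class ZipCodeDemo {
    private static final Pattern ZIP = Pattern.compile("^\\d{5}(-\\d{4})?$");

    // Returns true for ##### or #####-#### and false for anything else.
    static boolean isValidZip(String value) {
        return ZIP.matcher(value).find();
    }

    public static void main(String[] args) {
        System.out.println(isValidZip("94108"));      // true
        System.out.println(isValidZip("94108-1234")); // true
        System.out.println(isValidZip("9410"));       // false: too short
        System.out.println(isValidZip("94108-12"));   // false: incomplete ZIP+4 part
        System.out.println(isValidZip("A94108"));     // false: ^ requires a digit first
    }
}
```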

    Example 96. Validating shipping address format

    Create a basic validation that would match the following format: 123 Main Street, Anytown, USA 12345, i.e., it:

    • Starts with a house number (one or more digits).

    • Includes a street name containing letters and spaces.

    • Has a city name containing letters (and optionally also spaces).

    • Specifies the country as "USA" (case-sensitive).

    • Ends with a five-digit zip code.

    The regex formula would be: ^(\d+)\s+([A-Za-z\s]+),\s+([A-Za-z\s]+),\s+USA\s+(\d{5})$

    Explanation:

    • ^: Matches the beginning of the string.

    • (\d+): Captures one or more digits for the house number.

    • \s+: Matches one or more whitespace characters (space, tab, etc.).

    • ([A-Za-z\s]+): Captures the street name, allowing for one or more words separated by spaces.

    • ,: Matches a comma.

    • \s+: Matches one or more whitespace characters again.

    • ([A-Za-z\s]+): Captures the city name, allowing for one or more words separated by spaces.

    • ,: Matches another comma.

    • \s+: Matches one or more whitespace characters again.

    • USA: Matches the literal string "USA" (case-sensitive).

    • \s+: Matches one or more whitespace characters again.

    • (\d{5}): Captures exactly five digits for the ZIP code.

    • $: Matches the end of the string.
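    The parentheses in this pattern are capturing groups, so the matched parts can be read back individually. A minimal Java sketch, assuming a hypothetical parse helper that returns the four captured parts or an empty list when the format does not match:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AddressDemo {
    private static final Pattern ADDRESS = Pattern.compile(
            "^(\\d+)\\s+([A-Za-z\\s]+),\\s+([A-Za-z\\s]+),\\s+USA\\s+(\\d{5})$");

    // Returns [house number, street, city, ZIP code], or an empty list if the
    // address does not follow the expected format.
    static List<String> parse(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return Collections.emptyList();
        }
        return Arrays.asList(m.group(1), m.group(2), m.group(3), m.group(4));
    }

    public static void main(String[] args) {
        // Prints [123, Main Street, Anytown, 12345]
        System.out.println(parse("123 Main Street, Anytown, USA 12345"));
        // Prints []: "usa" is lower case and the pattern is case-sensitive.
        System.out.println(parse("123 Main Street, Anytown, usa 12345"));
    }
}
```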

    Example 98. Validating email addresses

    Identify email addresses that adhere to the following rules:

    • The local part can contain letters, numbers, underscores, hyphens, and dots.

    • The domain name can have one or more subdomains.

    • Subdomains can contain letters, numbers, and hyphens.

    • The top-level domain (TLD) must be 2 to 4 characters long and can contain letters only.

    The regex formula would be:

    ^[\w-\.]+@([a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*)\.[a-zA-Z]{2,4}$

    Explanation:

    • ^: Matches the beginning of the string (entire email address).

    • [\w-\.]+: Matches one or more occurrences of the following characters in the local part:

      • \w: Word characters (letters, numbers, and underscores).

      • -: Hyphen.

      • \.: Dot (period).

    • @: Matches the "@" symbol, separating the local part from the domain name.

    • ([a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*): Matches one or more repetitions of the subdomain pattern:

      • [a-zA-Z0-9\-]+: Matches one or more letters (a-z, A-Z), numbers (0-9), and hyphens (-) for a subdomain (excluding underscores).

      • (?:\.[a-zA-Z0-9\-]+)*: Matches zero or more repetitions of a literal dot (.) followed by another subdomain following the same pattern.

    • \.: Matches a literal dot (.) separating the subdomains from the TLD.

    • [a-zA-Z]{2,4}: Matches the Top-Level Domain (TLD) containing:

      • [a-zA-Z]: Letters (a-z, A-Z).

      • {2,4}: Quantifier specifying a length of 2 to 4 characters.

    • $: Matches the end of the string (entire email address).
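    To see how the rules above play out, the pattern can be tested against a few sample addresses. This is a sketch with a hypothetical isValidEmail helper; the sample addresses are made up for illustration. Because the TLD is limited to 2 to 4 letters, longer TLDs are rejected by design.

```java
import java.util.regex.Pattern;

final class EmailDemo {
    private static final Pattern EMAIL = Pattern.compile(
            "^[\\w-\\.]+@([a-zA-Z0-9\\-]+(?:\\.[a-zA-Z0-9\\-]+)*)\\.[a-zA-Z]{2,4}$");

    // Returns true if the whole value follows the email rules listed above.
    static boolean isValidEmail(String value) {
        return EMAIL.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidEmail("jane.doe@example.com"));          // true
        System.out.println(isValidEmail("jane_doe-1@mail.example.co.uk")); // true: several subdomains
        System.out.println(isValidEmail("jane@example"));                  // false: no TLD
        System.out.println(isValidEmail("jane@example.museum"));           // false: TLD longer than 4 letters
        System.out.println(isValidEmail("jane doe@example.com"));          // false: space in the local part
    }
}
```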

    Regex for advanced users

    Since the implementation of regular expressions comes from the Java standard library, the syntax of expressions is the same as in Java: see http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html.

    For a more detailed explanation of how to use regular expressions, see the Java documentation for java.util.regex.Pattern.

    The meaning of regular expressions can be modified using embedded flag expressions. The expressions include the following:

    (?i) Pattern.CASE_INSENSITIVE

    Enables case-insensitive matching.

    (?s) Pattern.DOTALL

    In dotall mode, the dot . matches any character, including line terminators.

    (?m) Pattern.MULTILINE

    In multiline mode, ^ and $ match at the beginning and end of each line, respectively, in addition to the beginning and end of the entire input.
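    The effect of the three embedded flags can be observed on a short two-line input. The sketch below assumes substring-style searching with Matcher.find(); the class name, the found helper, and the sample text are illustrative only.

```java
import java.util.regex.Pattern;

final class EmbeddedFlagsDemo {
    // Returns true if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "first line\nSecond line";

        // (?i): case-insensitive matching.
        System.out.println(found("second", text));     // false
        System.out.println(found("(?i)second", text)); // true

        // (?s): the dot also matches line terminators.
        System.out.println(found("first.+Second", text));     // false
        System.out.println(found("(?s)first.+Second", text)); // true

        // (?m): ^ also matches at the start of each line.
        System.out.println(found("^Second", text));     // false
        System.out.println(found("(?m)^Second", text)); // true

        // Flags can be combined, and each has an equivalent Pattern constant.
        System.out.println(found("(?is)FIRST.+second", text)); // true
        System.out.println(Pattern.compile("second", Pattern.CASE_INSENSITIVE)
                .matcher(text).find());                        // true
    }
}
```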

    Further reading and description of other flags can be found at http://docs.oracle.com/javase/tutorial/essential/regex/pattern.html.