Regular Expressions

Regular expressions, often abbreviated as regex, are a powerful tool used to search and manipulate text (string) data. They provide a concise way to define patterns within strings, allowing for efficient matching and extraction of specific information. While the underlying syntax might appear complex, understanding the core concepts empowers even non-technical users to leverage this valuable technique.

Regex basics

Due to the complexity and variability of regular expressions, it is not possible to cover every single nuance within our documentation. For this reason, this documentation will focus on the most fundamental regex tokens and their core functionalities. This ensures a solid foundation for understanding basic pattern matching and provides a springboard for further exploration of regex’s capabilities.

If you encounter specific use cases beyond the covered tokens, we recommend referring to advanced online regex resources, e.g. https://regex101.com/. Since our regex implementation is Java-based, make sure to use the Java-based regex flavor to ensure desired results.

Regex searches are case-sensitive by default. A pattern can also match part of a longer word. Bear this in mind and write your pattern so that it matches exactly what you intend, covering all the needed scenarios. If letter case does not matter, use the (?i) flag to make the search case-insensitive.
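
The following is a minimal Java sketch of the two points above (partial matches and case sensitivity), written directly against java.util.regex. The class and method names are illustrative assumptions and are not part of CloverDX. Note that a backslash in a Java string literal must be doubled, although none is needed here.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class CaseSensitivitySketch {

    // Returns true when the pattern occurs anywhere in the input (partial match).
    static boolean contains(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(contains("product", "productive"));   // true: matches part of a longer word
        System.out.println(contains("product", "Product"));      // false: case-sensitive by default
        System.out.println(contains("(?i)product", "PRODUCT"));  // true: (?i) ignores letter case
    }
}
```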

Regex token	Purpose	Description	Example	Explanation
	Find exact match	Matches the exact form of the word.	`product`	It will match `product`, `products`, or `productive` but not `Product` or `PRODUCT`.
`?`	0 or 1	A question mark (`?`) specifies that the preceding element can appear 0 or 1 time.	`colou?r`	It will match both `color` and `colour`. It will not match `coloor` or `colouur`.
`*`	0 or more	An asterisk (`*`) specifies that the preceding element can appear 0 or more times.	`colou*r`	It will match `color`, `colour`, or `colouur`.
`+`	1 or more	A plus sign (`+`) specifies that the preceding element can appear 1 or more times.	`colou+r`	It will match `colour` or `colouur`.
`.`	Wildcard dot	A dot (`.`) matches any single character except a line terminator (unless the `(?s)` flag is set).	`b.t`	It will match `bit`, `bat`, `but`, but also `b-t`, `b t`, or `b*t`.
`\|`	Alternative	The pipe (`\|`) functions as the OR operator.	`USA\|UK`	It will match `USA` or `UK`.
`()`	Grouping	Parentheses `()` act like brackets in math. They group a set of characters or elements into a single unit.	`col(our\|or)`	`col(our\|or)` will match `colour` or `color`. Without parentheses, `colour\|or` will match `colour` or `or`.
`[]`	Character classes	Square brackets `[]` enclose a set of characters and match any single character within the set. They can also be used to exclude a set of characters when the set is preceded by a caret (`^`).	`[0-9]` `[a-z]` `[A-Z]` `[a-zA-Z]` `[aeiou]` `[^aeiou]`	`[0-9]` will match any digit. `[a-z]` and `[A-Z]` will match any lower-case or upper-case letter, respectively. `[a-zA-Z]` will match any lower-case or upper-case letter. `[aeiou]` will match any instance of `a, e, i, o, u`. `[^aeiou]` will match any character except for `a, e, i, o, u`.
`{}`	Repeating characters	Curly braces `{}` specify the number of repetitions of the preceding element. Combine it with parentheses to apply to a string of characters. `{number}` - exact number of repetitions `{number,}` - minimum number of repetitions `{number,number}` - range of repetitions	`who{1}p` `who{1,}p` `who{1,2}p` `(mur){1,2}`	`who{1}p` will match `whop` only. `who{1,}p` will match `whop`, `whoop`, or `whooop`. `who{1,2}p` will match `whop` or `whoop`, but not `whooop`. `(mur){1,2}` will match both `mur` and `murmur`.
`\s` and `\S`	Whitespace-related tokens	`\s` - any whitespace character `\S` - any non-whitespace character	`" Hello "`	`\s` will match 2 whitespace characters (two spaces). `\S` will match 5 characters (letters in the word `Hello`).
`\d` and `\D`	Digit-related tokens	`\d` - matches any digit `\D` - matches any non-digit character, including whitespace characters	`123 Main Street, Anytown, CA 94108`	`\d` will match 8 characters: `123` and `94108`. `\D` will match 26 characters, including all whitespace characters and commas: `Main Street, Anytown, CA`.
`\w` and `\W`	Word-related tokens	`\w` - matches any word character (equivalent to `[a-zA-Z0-9_]`). `\W` - matches any non-word character (equivalent to `[^a-zA-Z0-9_]`).	`123 Main Street, Anytown, CA 94108`	`\w` will match 27 characters: `123MainStreetAnytownCA94108`. `\W` will match 7 characters (all whitespace characters and commas).
`^` and `$`	Start and end of strings	A caret (`^`) or a dollar sign (`$`) limits your search to the start or end of a string, respectively.	`^SH7890` `SH7890$` `^SH7890$`	`^SH7890` will match product codes beginning with `SH7890`, e.g., `SH7890456` or `SH7890BA1`. `SH7890$` will match product codes ending in `SH7890`, e.g., `415YSH7890` or `464SH7890`. `^SH7890$` will match `SH7890` only.
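
Several rows of the table can be checked with a short Java sketch such as the one below. It assumes plain java.util.regex; the class and method names are hypothetical. Because the patterns are written as Java string literals, each backslash is doubled (`\\d` in the source stands for the regex `\d`).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class TokenCountSketch {

    // Counts how many separate matches of the pattern occur in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Returns true only when the pattern matches the entire input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String address = "123 Main Street, Anytown, CA 94108";
        System.out.println(countMatches("\\d", address)); // 8
        System.out.println(countMatches("\\D", address)); // 26
        System.out.println(countMatches("\\w", address)); // 27
        System.out.println(countMatches("\\W", address)); // 7

        System.out.println(matchesWhole("colou?r", "color"));      // true
        System.out.println(matchesWhole("colou?r", "colouur"));    // false
        System.out.println(matchesWhole("who{1,2}p", "whooop"));   // false
        System.out.println(matchesWhole("(mur){1,2}", "murmur"));  // true
    }
}
```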

Regular expression usage examples

Example 92. Matching dates

Create a basic formula that would find dates in the format DD/MM/YYYY (format matching 2 digits/2 digits/4 digits).

The regex formula would be: \d{2}/\d{2}/\d{4}.

Explanation:

\d{2}: This matches exactly two digits (\d represents any single digit, and {2} specifies it must occur twice consecutively).
/: This matches a literal forward slash character ("/").
\d{2}: Similar to the first part, this matches exactly two digits again.
/: Another literal forward slash match.
\d{4}: This matches exactly four digits. This captures the year (e.g., 2024).
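
As a rough illustration, the formula above could be applied in Java as follows. This is a sketch under the assumption that plain java.util.regex is used; the class and method names are made up for the example. Note that the pattern checks the format only, so a string such as 99/99/9999 is also found.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class DateFinderSketch {

    // 2 digits / 2 digits / 4 digits; backslashes are doubled in the Java literal.
    private static final Pattern DATE = Pattern.compile("\\d{2}/\\d{2}/\\d{4}");

    // Collects every substring that has the DD/MM/YYYY shape.
    static List<String> findDates(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [03/11/2024, 07/11/2024]
        System.out.println(findDates("Ordered 03/11/2024, shipped 07/11/2024."));
        // prints [] because single-digit day and month do not fit the format
        System.out.println(findDates("Due on 1/2/2024"));
    }
}
```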

Example 94. Validating US Postal Codes

Postal codes need to be in the format ##### (basic 5-digit zip code) or #####-#### (ZIP+4 code).

The regex formula would be: ^\d{5}(-\d{4})?$.

Explanation:

^: Matches the beginning of the string.
\d{5}: Matches exactly five digits.
(-\d{4})?: Matches an optional group consisting of a literal hyphen (-) followed by exactly four digits. The question mark ? makes the entire group optional, so the hyphen and the four digits may be absent together.
$: Matches the end of the string.
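
A minimal Java sketch of this validation might look as follows, assuming plain java.util.regex; the class and method names are hypothetical. It uses find() so that the ^ and $ anchors in the formula do the work of pinning the match to the whole value.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class ZipCodeSketch {

    // Five digits, optionally followed by a hyphen and four more digits.
    private static final Pattern ZIP = Pattern.compile("^\\d{5}(-\\d{4})?$");

    static boolean isValidZip(String value) {
        return ZIP.matcher(value).find();
    }

    public static void main(String[] args) {
        System.out.println(isValidZip("94108"));       // true
        System.out.println(isValidZip("94108-1234"));  // true
        System.out.println(isValidZip("9410"));        // false: too short
        System.out.println(isValidZip("94108-123"));   // false: incomplete ZIP+4 part
        System.out.println(isValidZip("941081"));      // false: six digits
    }
}
```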

Example 96. Validating shipping address format

Create a basic validation that would match the following format: 123 Main Street, Anytown, USA 12345, i.e., it:

Starts with a house number (one or more digits).
Includes a street name containing letters and spaces.
Has a city name containing letters (and optionally also spaces).
Specifies the country as "USA" (case-sensitive).
Ends with a five-digit zip code.

The regex formula would be: ^(\d+)\s+([A-Za-z\s]+),\s+([A-Za-z\s]+),\s+USA\s+(\d{5})$

Explanation:

^: Matches the beginning of the string.
(\d+): Captures one or more digits for the house number.
\s+: Matches one or more whitespace characters (space, tab, etc.).
([A-Za-z\s]+): Captures the street name, allowing for one or more words separated by spaces.
,: Matches a comma.
\s+: Matches one or more whitespace characters again.
([A-Za-z\s]+): Captures the city name, allowing for one or more words separated by spaces.
,: Matches another comma.
\s+: Matches one or more whitespace characters again.
USA: Matches the literal string "USA" (case-sensitive).
\s+: Matches one or more whitespace characters again.
(\d{5}): Captures exactly five digits for the ZIP code.
$: Matches the end of the string.
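
The parentheses in this formula are capturing groups, so the matched parts can be read back individually. The Java sketch below shows one way to do that, assuming plain java.util.regex; the class and method names are illustrative only.

```java
import java.util.Arrays;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class AddressSketch {

    private static final Pattern ADDRESS = Pattern.compile(
            "^(\\d+)\\s+([A-Za-z\\s]+),\\s+([A-Za-z\\s]+),\\s+USA\\s+(\\d{5})$");

    // Returns {house number, street, city, zip} or an empty Optional if the format does not match.
    static Optional<String[]> parse(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group(1), m.group(2), m.group(3), m.group(4) });
    }

    public static void main(String[] args) {
        // prints [123, Main Street, Anytown, 12345]
        System.out.println(Arrays.toString(parse("123 Main Street, Anytown, USA 12345").get()));
        // prints false because "usa" is lower-case and the match is case-sensitive
        System.out.println(parse("123 Main Street, Anytown, usa 12345").isPresent());
    }
}
```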

Example 98. Validating email addresses

Identify email addresses that adhere to the following rules:

The local part can contain letters, numbers, underscores, hyphens, and dots.
The domain name can have one or more subdomains.
Subdomains can contain letters, numbers, and hyphens.
The top-level domain (TLD) must be 2 to 4 characters long and can contain letters only.

The regex formula would be:

^[\w-\.]+@([a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*)\.[a-zA-Z]{2,4}$

Explanation:

^: Matches the beginning of the string (entire email address).
[\w-\.]+: Matches one or more occurrences of the following characters in the local part:
- \w: Word characters (letters, numbers, and underscores).
- -: Hyphen.
- \.: Dot (period).
@: Matches the "@" symbol, separating the local part from the domain name.
([a-zA-Z0-9\-]+(?:\.[a-zA-Z0-9\-]+)*): Matches one or more repetitions of the subdomain pattern:
- [a-zA-Z0-9\-]+: Matches one or more letters (a-z, A-Z), numbers (0-9), and hyphens (-) for a subdomain (excluding underscores).
- (?:\.[a-zA-Z0-9\-]+)*: Matches zero or more repetitions of a literal dot (.) followed by another subdomain following the same pattern.
\.: Matches a literal dot (.) separating the subdomains from the TLD.
[a-zA-Z]{2,4}: Matches the Top-Level Domain (TLD) containing:
- [a-zA-Z]: Letters (a-z, A-Z).
- {2,4}: Quantifier specifying a length of 2 to 4 characters.
$: Matches the end of the string (entire email address).
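
The formula can be tried out with a short Java sketch like the one below. It assumes plain java.util.regex, and the class and method names are hypothetical. The sample addresses are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class EmailSketch {

    // Same formula as above; every backslash is doubled in the Java string literal.
    private static final Pattern EMAIL = Pattern.compile(
            "^[\\w-\\.]+@([a-zA-Z0-9\\-]+(?:\\.[a-zA-Z0-9\\-]+)*)\\.[a-zA-Z]{2,4}$");

    static boolean isValidEmail(String value) {
        return EMAIL.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidEmail("jane.doe@mail.example.com")); // true
        System.out.println(isValidEmail("user@example"));              // false: no TLD
        System.out.println(isValidEmail("user@example.museum"));       // false: TLD longer than 4 letters
        System.out.println(isValidEmail("user@exa_mple.com"));         // false: underscore in the domain
    }
}
```

The rules above are deliberately simple, so some real-world addresses (for example, those with longer top-level domains) are rejected by this formula.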

Regex for advanced users

Since the implementation of regular expressions comes from the Java standard library, the syntax of expressions is the same as in Java. For a more detailed explanation of how to use regular expressions, see the Java documentation for java.util.regex.Pattern: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html.

The meaning of regular expressions can be modified using embedded flag expressions. The expressions include the following:

(?i) – Pattern.CASE_INSENSITIVE: Enables case-insensitive matching.
(?s) – Pattern.DOTALL: In dotall mode, the dot . matches any character, including line terminators.
(?m) – Pattern.MULTILINE: In multiline mode, ^ and $ match at the beginning and end of each line, respectively, in addition to the beginning and end of the entire input.
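
The effect of the three flags can be seen in a small Java sketch such as the following. It assumes plain java.util.regex; the class and method names are illustrative and not part of CloverDX.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class FlagSketch {

    // Returns true when the pattern occurs anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String twoLines = "first\nsecond";

        System.out.println(found("FIRST", twoLines));             // false: case-sensitive
        System.out.println(found("(?i)FIRST", twoLines));         // true: case-insensitive

        System.out.println(found("first.second", twoLines));      // false: dot does not match the line break
        System.out.println(found("(?s)first.second", twoLines));  // true: dotall mode

        System.out.println(found("^second", twoLines));           // false: ^ matches only at the start of the input
        System.out.println(found("(?m)^second", twoLines));       // true: ^ also matches at the start of a line
    }
}
```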

Further reading and description of other flags can be found at http://docs.oracle.com/javase/tutorial/essential/regex/pattern.html.