Version

    Data anonymization steps

    • Add noise to date: replace the date with new random date near the original value to anonymize your date values.

    • Add noise to number: replace the original value with newly generated number that is near the original to anonymize numeric data.

    • Mask text: mask parts of text to hide personal or private data.

    Add noise to date

    The Add noise to date step can be used to anonymize date values. This step produces a new date that is within the specified range around the original value. See more details in How are the new values calculated.

    Parameters
    • Input column: required, select a date column.

    • Noise amount: required, enter the amount and unit by which to change the new date that will be randomly created from the original value. See below for more details about how the new values are generated.

      • The default value is 10 days.

      • Time units can be: days, weeks, months, years, hours, minutes, seconds, or milliseconds.

    • Truncate to date: selected by default. When selected, time units are disregarded and always output as 0:00:00 in the result values.

    • Target column: required, configure the column which will receive the output.

      • Write result to the current column: outputs data into the Input column.

      • Create new column with name: create a new column with the specified name. Name of the new column can contain spaces or special characters - the technical column name will be created automatically. The new column will be placed right next to the Input column.

    How are the new values calculated

    The step performs an operation that can be imagined as randomly shifting the original value by an amount that is smaller than the provided maximum.

    It works by calculating lower and upper bounds for the new value. These are both based on the Noise amount parameter:

    • Lower bound is calculated as $originalValue - noiseAmount

    • Upper bound is calculated as $originalValue + noiseAmount

    The calculation properly accounts for complexities of date math - short/long months, DST "jumps", leap years, etc.

    Once the lower and upper bound is computed, a new random date is picked in the interval between those two values. This effectively means that you get a new random date and time value that is never more than specified amount of time away from the original value.

    Examples

    The example date below is in the yyyy-MM-dd format.

    Original date Noise amount Truncate to date only Calculated range Result amount

    2023-03-01

    10 days

    Any

    ±10 days

    The result date will be between February 19th, 2023 and March 11th, 2023.

    2023-03-01

    10 weeks

    Any

    ±10 weeks

    The result date will be between December 21st, 2022 and May 10th, 2023.

    2023-03-01 08:00:00

    10 weeks

    Yes (checked)

    ±10 weeks

    The result date will be between December 21st, 2022, 0:00:00 and May 10th, 2023, 0:00:00.

    2023-03-01 08:00:00

    10 weeks

    No (unchecked)

    ±10 weeks

    The result date and time will be between December 21st, 2022, 0:00:00 and May 10th, 2023, 23:59:59.

    2023-03-01 08:00:00

    12 hours

    Yes (checked)

    ±12 hours

    The result date and time will be between February 28th, 2023, 0:00:00 and March 1st, 2023, 0:00:00.

    2023-03-01 08:00:00

    12 hours

    No (unchecked)

    ±12 hours

    The result date and time will be between February 28th, 2023, 20:00:00 and March 1st, 2023, 20:00:00.

    No value

    12 hours

    Any

    Not calculated

    No value (the result is an empty value).

    Remarks
    • The dates are randomized again with every job run - you will get new values every time.

    • The step does not modify empty values - they will remain empty even after the step. If you wish to handle these in some other way, please see Working with empty values for more information.

    Add noise to number

    The Add noise to number step can be used to anonymize data in integer or decimal columns. This step generates a new random value that is within specified distance from the original value. See more details in How are the new values calculated.

    Parameters
    • Input column: required, select a decimal or integer column.

    • Noise amount: required, enter a *number* or *percentage* that will be used to calculate the range from which new random numbers is generated.

      • Default value is 10%.

    • Target column: required, configure the column which will receive the output.

      • Write result to the current column: outputs data into the Input column.

      • Create new column with name: create a new column with the specified name. Name of the new column can contain spaces or special characters - the technical column name will be created automatically. The new column will be placed right next to the Input column.

    How are the new values calculated

    The step works by randomly generating a new value that is within the distance specified by the Noise amount parameter from the original value.

    It works by calculating lower and upper bounds for the new value. These are both based on the Noise amount parameter:

    • Lower bound is calculated as $originalValue - noiseAmount if the Noise amount is specified as absolute number or as $originalValue * (100 - noiseAmount) / 100 if the Noise amount is specified as percentage value.

    • Upper bound is calculated as $originalValue + noiseAmount if the Noise amount is specified as absolute number or as $originalValue * (100 + noiseAmount) / 100 if the Noise amount is specified as percentage value.

    Note that in both cases the algorithm properly accounts for values that either too large or too small and will not cause overflows.

    Once the lower and upper bound is computed, a new random number is picked in the interval between those two values. This effectively means that you get a new random number that is never more than specified amount of time away from the original value.

    Examples
    Original amount Noise amount Display format Calculated range Result amount

    2000

    10%

    Any

    ±200 (10% of 2000)

    The result amount will be between 1800 and 2200.

    2,000

    400

    Any

    ±400

    The result amount will be between 1600 and 2400.

    99.99

    10%

    Not set

    ±9.99

    The result amount will be between 90.0000000000 and 109.9800000000.

    99.99

    10%

    #.##

    ±9.99

    The result amount will be between 90.00 and 109.98.

    0

    10%

    Any

    0

    The result amount will be 0.

    0

    10

    Any

    ±10

    The result amount will be between -10 and 10.

    No value

    Any amount or percentage

    Any

    Not calculated

    No value (the result is an empty value).

    Remarks
    • New random values are generated with every job run.

    • When adding noise to decimal values, the number of decimal places in the randomized values depends on the display format of the original column. See Working with decimals for more information on how to change the display format. When no display format is specified, the randomized numbers can have up to 10 decimal places.

    • This step does not affect empty values (see Working with empty values for more information on empty values in Wrangler).

    Mask text

    Mask text step implements data masking algorithms that allow you to hide private data in simple way. For example, you can use it to mask out parts of phone numbers, credit cards, personal names, and so on.

    Parameters
    • Input column: required, a string column containing input text to mask.

    • Characters to mask: required, determines what kind of characters to mask. Following options are available:

      • Letters and digits: default, only mask letters and digits, whitespaces and other characters will remain as they were.

      • Letters only: only mask letters, all other characters will remain as they were.

      • Digits only: only mask digits, all other characters will remain as they were.

      • All characters: mask all characters regardless of their type (this is the most generic option).

    • Masking character: required, a single character that will be used as a replacement for each character selected in the Characters to mask parameter.

    • Mask suffix: configure how many characters from the input value to mask in each group of characters of the same type. Value must be an integer greater than or equal to 0. Two options are available:

      • All: all characters are considered for masking.

      • Number of characters from a string group to preserve unmasked: allows you to specify the number of characters from the beginning of each group of characters that will not be masked. Value must be at least 1.

    • Target column: required, configure the column which will receive the output. Output will always be of string type.

      • Write result to the current column: overwrite the input column with the result.

      • Create new column with name: create a new column with specified name. Name of the new column can contain spaces or special characters - technical column name will be created automatically. The new column will be placed right after the input column.

    Examples
    Input value Characters to mask Masking character Mask suffix Result Description

    "123-456-7890"

    Digits only

    "x"

    All

    "xxx-xxx-xxxx"

    Mask all digits in each group.

    "123-456-7890"

    Letters only

    "x"

    All

    "123-456-7890"

    No change - input value does not contain any letters.

    "123-456-7890"

    All

    "x"

    All

    "xxxxxxxxxxxx"

    Mask all characters - the result will be the same length as input and will consist entirely of the mask characters.

    "123-456-7890"

    Digits only

    "x"

    2

    "12x-45x-78xx"

    Find all digits and from each group of digits keep the first two (based on prefix length) and replace the remaining digits with "x".

    "" (empty string)

    Any

    Any

    Any

    "" (empty string)

    Calling the step on an empty string will return empty string.

    No value

    Any

    Any

    Any

    No value

    Calling the step on No value input will return No value.

    Error

    Any

    Any

    Any

    Error

    Calling the step on an Error will return an Error.

    Remarks
    • Calling the step on a No value string results in No value.

    • Calling the step on a cell with Error will result in an Error.