Version

    Data set manipulation steps

    Add column

    Add column step allows you to add a new empty column of given data type to the specified position in your data set. All rows will have No value in the new column (the value of the column in each row will be null). To populate the column with data, use other steps.

    Parameters
    • New column name: required, the name of your desired column. Name can contain special characters (like spaces) - new technical name for the column will be derived by replacing those with underscore character "_".

    • New column type: required, the data type the new column (string, date, integer, decimal, boolean).

    • Position of the new column: defines where the new column is added:

      • As the last column: insert the column after the last column in the data set.

      • As the first column: insert the column before the first column in the data set.

      • After column: insert the column after selected column in the data set.

    Examples

    To add a decimal column called Late fee after Invoice amount column, use the following settings:

    add column
    Duplicate column

    Duplicate column step can be used to create a copy of an existing column - values and formatting are both copied.

    Parameters
    • Source column: required, the column you wish to duplicate.

    • Name of the target column: required, the name you wish to give the new column. Name can contain special characters (like spaces) - new technical name for the column will be derived by replacing those with underscore character "_". Note that the new column will be placed immediately to the right of the source column - there is no way to control the location of the added column in this step. To do so, you can use Reorder columns step.

    When a condition is applied to this step, it takes on a copy function and copies rows that meet the condition from the source column to the target column. For more information on step and group conditions, see Step and group conditions.
    Examples

    To create a copy of Due date column, configure the step like this:

    duplicate column
    Remarks
    • The new column will be placed immediately to the right of the source column - there is no way to control the location of the added column in this step. To do so, you can use Reorder columns step.

    • New column will have the same type as the input column. It will also inherit the display format. If you change the display format of the input column even after the step was added, the format will be copied to the duplicate column. If you change display format of the duplicate column, it will not be copied to input column.

    Delete column(s)

    Delete column(s) step allows you to delete columns from the data set.

    Parameters
    • Select column(s) to be deleted: required, select the column(s) to be removed or kept (depending on the Mode of operation) from the data set.

      • Display the list of columns by clicking in the Select column(s) field.

      • To quickly find the desired column, start typing the column name.

      • Select all the columns to be deleted or kept by clicking on them.

      • To unselect a column either click on the column name in the column list or click on X on the right side of the column name in the Select column(s) to be deleted field.

      • To quickly select or unselect all the columns, click on the check box next to the column name lookup field.

    • Mode of operation: by default, the selected columns are deleted.

      • Remove selected columns, keep the others (default option)

      • Keep selected columnns, remove all others

    Examples

    To delete the Type and Contact columns, configure the step like this:

    delete column

    To delete all columns but the Type and Contact columns, configure the step like this:

    delete column keep
    Rename column

    Rename column step allows you to change the name of a column in your data set. After the step, you will have to use the new name to refer the column.

    Parameters
    • Original Name: required, a column to rename (chosen from the dropdown list).

    • New Name: required, the name you wish to give to your column.

    Examples

    To change name of the Invoice no column to Invoice number use the following settings:

    rename column
    Remarks
    • New name of the column can contain special characters (like spaces). New technical name for the column will be derived by replacing those with underscore character "_".

    • If a column you are renaming no longer exists in your data set (this can happen either if you changed the steps before Rename column step or if your data source changed), the step will fail. You will get an error similar to Column 'Column_to_rename' doesn’t exist.

    Reorder columns

    Reorder columns step allows you to change ordering of columns within the data set. It does not modify the data in way, except for changing the relative positions of the columns within the data set.

    Parameters
    • Columns: required, a widget that allows you to reorder columns using the arrows to the left of the column list. Following actions are available in the list:

      Icon Action

      reorder icon up up

      Move the selected column(s) to the beginning of the data set.

      reorder icon up

      Move the selected column(s) by one towards the beginning of the data set ("up" in the list of columns).

      reorder icon down

      Move the selected column(s) by one towards the end of the data set ("down" in the list of columns).

      reorder icon down down

      Move the selected column(s) to the end of the data set.

      You can select multiple columns at once with Ctrl+click (or Control+Click on Mac). You can unselect one of the selected columns in the same way - with Ctrl+Click.

    Examples

    To change the order of columns so that Invoice amount column is the second column, use the widget to select Invoice amount and move it using arrows to the second place in the list:

    reorder columns
    Remarks
    • Reorder columns will work also if some of the columns you moved are no longer in the data set (this can happen either if you changed the steps before Reorder columns step or if your data source changed). In such case, the missing columns are ignored by the step and other column movements are performed as configured.

    Merge columns

    The Merge columns step can be used to merge data from multiple columns into a single column.

    Parameters
    • Columns to merge: required, select the columns to be merged in the sequence you want the values combined. You can change the column sequence by drag and dropping by the grip on the left of a column field.

    • Delimiter: optional, specify a delimiter to be used between merged values.

      • Both single characters (including a space) and multiple characters can be used as delimiters.

      • You can also select newline or tab characters as delimiters from the dropdown list.

    • Ignore blanks: checked by default. When checked, empty values are disregarded and not included in the merged values. This parameter is ignored when no delimiter is specified.

    • Output column name: required, enter the name of the new column. The name defaults to New column.

    • Position of the output column:

      • As the last column

      • As the first column

      • After column

    Examples
    Street City Postal Code Delimiter Ignore blanks Result

    Baker Street 221b

    NW1 6XE

    London

    (No delimiter specified)

    Any

    Baker Street 221bNW1 6XELondon

    Baker Street 221b

    No Value

    London

    (No delimiter specified)

    Any

    Baker Street 221bLondon

    Baker Street 221b

    NW1 6XE

    London

    ,

    Any

    Baker Street 221b,NW1 6XE,London

    Baker Street 221b

    No Value

    London

    ,

    Yes (checked)

    Baker Street 221b,London

    Baker Street 221b

    No Value

    London

    ,

    No (unchecked)

    Baker Street 221b,,London

    Baker Street 221b

    "    " (4 spaces) [1]

    London

    ,

    Any

    Baker Street 221b,    ,London

    Baker Street 221b

    NW1 6XE

    London

    \n [2]

    Any

    Baker Street 221b
    NW1 6XE
    London

    Baker Street 221b

    NW1 6XE

    London

    \\ [3]

    Any

    Baker Street\221b\NW1 6XE\London

    1

    Spaces are not considered an empty value.

    2

    You can use escape sequences (\n, \t, etc.) as delimiters.

    3

    To use a backslash itself as a separator, enter two backslashes.

    Split column

    The Split column step can be used to split data from one string column into multiple columns based on a delimiter.

    Parameters
    • Input column: required, select a string column to split.

    • Delimiter: required, specify a delimiter to be used to split the data.

      • Both single characters (including a space) and multiple characters can be used as delimiters.

      • You can also select newline or tab characters as delimiters from the dropdown list.

    • Number of columns: required, enter the number of columns the data should be split into.

      • The newly created columns keep the Input column name and include part 1, part 2, etc., in the column names.

      • If the result is an empty value, No value is displayed. See Working with empty values for more information.

      • If the number of columns to be created is lower than the number of the resulting split values, the last column will include the remaining values in the unsplit form.

    Examples

    Example1: The Phone number column sometimes includes multiple phone numbers, separated by commas.

    • Result of spliting into 2 columns:

    Phone number Phone number part 1 Phone number part 2

    +17755784197,4063492232,504.626.7424

    +17755784197

    4063492232,504.626.7424

    4063492232,504.626.7424

    4063492232

    504.626.7424

    6710488141

    6710488141

    No value

    No value

    No value

    No value

    • Result of splitting into 3 columns:

    Phone number Phone number part 1 Phone number part 2 Phone number part 3

    +17755784197,4063492232,504.626.7424

    +17755784197

    4063492232

    504.626.7424

    4063492232,504.626.7424

    4063492232

    504.626.7424

    No value

    6710488141

    6710488141

    No value

    No value

    No value

    No value

    No value

    No value

    Example 2: The Customer name column needs to be split into 2 columns, with space as the delimiter.

    Customer name Customer name part 1 Customer name part 2

    Joe Michael Smith

    Joe

    Michael Smith

    Jane Smith

    Jane

    Smith

    Anne

    Anne

    No value

    Smith

    Smith

    No value

    Example 3: Address is on multiple lines and it needs to be split into three individual records. You can use \n as the delimiter here. See Remarks for more information.

    Address Address part 1 Address part 2 Address part 3

    Baker Street 221b

    NW1 6XE

    London

    Baker Street 221b

    NW1 6XE

    London

    Remarks
    • The original column is kept in the dataset by default. If you want to remove it, use the Delete column(s) step.

    • You can use escape sequences (\n, \t, etc.) as delimiters.

    • To use a backslash itself as a delimiter, enter two backslashes.

    Filter rows based on formula

    The Filter rows based on formula step allows you to filter your data set. It can keep or remove rows based on the result of the specified formula.

    filter rows explanation
    Parameters
    • Filter action: select what to do with rows for which the formula returns true:

      • Keep rows: default, keep the rows for which the formula returns true. All other rows are removed.

      • Remove rows: remove the rows for which the formula returns true. All other rows are removed.

    • Formula: the formula (condition) which decides whether to keep or remove the rows. It must be a formula that returns true or false.

    Usage

    Filter works by evaluating the formula for each row in your data set. Based on the result (whether the formula returns true or false) and configured Filter action it will decide whether to keep the row or remove it from further processing.

    Filter step never changes the data in any way - it simply removes subset of the data from further processing based on the condition provided by the user.

    Note that when you filter, the preview will show you only subset of the records that are available in the sample. This means that the preview output might be much more limited compared to what you’ll get by running the job in full. See examples below to see this.

    Simple filtering

    To only keep rows in the contacts data set where the contact address is in Alaska, we can use filter configured like this:

    filter basic example

    The result after applying the above step may look like this:

    filter basic example result

    Notice that only 5 rows are shown in the preview. When you run the job, you will get 62 rows in the output:

    filter basic example full run

    This is because when working with interactive job editor, you are only working with a sample that has 1000 rows by default. However, when you run the job, it works with complete data set which in this case has 10000 rows and therefore also contains more than just 5 contacts from Alaska.

    Removing rows with empty values

    To remove rows that have empty value in certain column from your data set, you can configure the step like this:

    filter rows remove empty dob example

    Note that since $dateOfBirth is a date column, we compare with null to determine if the value is present or not.

    The same approach will work for decimal, integer and boolean columns. For string columns, you can use formula isBlank($stringColumn) instead.

    Alternatively, you can also write $stringColumn == null || $stringColumn == "" - this is less generic than using isBlank function as it does not catch values that are composed entirely of spaces.

    Filtering with more complex conditions

    In this example, we will remove all contacts who are younger than 21 years and do not have a phone number.

    filter complex example settings

    In this case we set the Filter action to Remove so that rows matching the condition are removed rather than kept in the data set.

    The formula used in the step is the following:

    dateDiff(today(), $dateOfBirth, year) < 21
    AND
    isBlank($phone)

    In the first line of the formula, we calculate the age of the customer by using a dateDiff function to calculate the difference between two dates combined with today() function to get today’s date. The result is compared with 21 to see if the customer is younger than 21.

    In the third line above we test if the $phone column is empty by using an isBlank function.

    Both parts (age test and phone test) are combined with AND operator so both conditions must be true for the formula to also return true. You can often see this operator written as &&.

    With settings like this, the step will remove small number of customers from the data. To see the full effect of the step, we can run the job to get result like this:

    filter complex example result
    Remarks
    • When working in an interactive job editor, the step works with sample. Full data set may contain more records that satisfy filtering conditions. To get full picture of your data after it has been filtered, run the job via Run job button.

    Sort

    Sort allows you to sort your data based on selected columns and direction (ascending or descending).

    Parameters

    Sort step configuration is defined as a list of sort key settings. Each sorting key is represented by a column and sorting direction. This allows you to easily configure sorting by multiple columns. Each sort key is configured by using two parameters:

    • Sort by: a key column the data set will be sorted by. All column data types are supported for sorting. Each column can appear only once.

    • Sort order: configure the sorting direction. The choices will depend on the data type:

      • For numeric columns (integer and decimal) select either Smallest to largest (ascending) or Largest to smallest (descending order).

      • For string columns select either A-Z (ascending) or Z-A (descending).

      • For date columns select Oldest to newest (ascending) or Newest to oldest (descending).

      • For boolean columns select either False first or True first.

    Examples
    Sorting by single column

    To find most expensive products, we can sort our product data set by product Unit price column in descending order (i.e., Largest to smallest):

    sort single column settings

    The output will look like this:

    sort single column output
    Sorting by multiple columns

    If we wanted to see youngest customers in each state, we can do this by sorting our contacts data set by State column, then by Date of birth (in descending order) and finally we can also sort by Last name and First name to make the output nicer.

    sort multiple columns settings

    The output may look like this:

    sort multiple columns output
    Remarks
    • Each Sort step can have multiple sorting columns. You can add them in any order and then use drag handles to the left of each sorting key column to rearrange the columns.

    • Sorting can take long time to finish on very large data sets.

    • Sorting of string column uses case-sensitive comparison. This means that lower-case letters are considered to be before upper-case letters when sorting in ascending order.

    • Note that if you have numbers stored in a string column, it will not be sorted properly by number value but rather lexicographically. For example, consider the following picture where we have data sample with numbers in string column sorted in "ascending" order (from smallest to largest):

      sort numbers in string column

      To sort these numbers properly, convert the column to integer (or decimal) first:

      sort numbers in string after conversion

      Notice the changed header icon - it shows the column is now integer instead of a string.

    • Error is treated as the lowest value for any data type. Next lowest value is No value (null). For example, the below data set has been sorted based on a date column in ascending (Oldest to newest) order:

      sort errors and nulls

      Notice how both errors are at the top (they are first) regardless of what is written in them, followed by No value and finally the valid data is sorted in ascending order.

    Lookup

    Lookup step performs a lookup based on key columns from your data set. Lookup step allows you to add columns from your lookup data set to your current data set based on a set of key columns.

    Key columns are columns from your data set that will be matched against selected lookup columns. When the values match, columns you selected from lookup data set will be added to your current dataset.

    Lookup step principle

    Lookup step works by searching for values of key columns in lookup data set. If the match is found, selected columns from lookup data set are added to the rows in the main data set. The result of the lookup therefore always has more columns than the main data set. If a match for lookup key column values is not found in the lookup data set, empty values are used for the columns in the output (these will show as No value in the data preview). In more technical terms, this process is called left outer join.

    The principle of a lookup step can be illustrated on the following diagram:

    lookup step diagram
    Figure 61. Lookup step where main data set contains line items, lookup data set contains products and the output data set contains extended line item table with product details added at the end.

    The above diagram uses following lookup settings:

    • Key column is Product code in both data sets.

    • Columns to add from lookup is set to Product name and Product code.

    Notice how the last row of the output does not have any value in Product name and Unit price columns. This is because that line item uses Product code 99 which does not appear in lookup data set.

    Parameters
    • Lookup data source: required, lookup source data set. A data set that is added to the main data set. The lookup data set must be available in My sources before it can be used here. See Example below for more details.

    • Lookup key mapping: required, defines pairs of columns from both data sets whose values must match.

      • Data column represents a column in the main data set.

      • Lookup column represents a column in the lookup data set.

    • Columns to add from lookup: required, configures list of columns from the lookup data set to be added to the current data set. Two options are available:

      • All columns will add all lookup columns including the key columns (which will therefore be duplicated in the result).

      • Selected columns will add only columns selected in the dropdown.

    Example
    Objective

    Add a Exchange rate column to your invoice data and use that exchange rate to calculate invoice amount in USD. The new column should contain exchange rate of a currency respective to USD.

    Exchange rates are provided by a file lookup-example-exch-rates.csv with two columns:

    • Currency code: 3-letter currency code.

    • Exchange rate: exchange rate of 1 USD to the currency defined by Currency code (i.e. how many of the other currency for 1 USD).

    Solution
    1. Create lookup data source via My sources table. The easiest way os to drag & drop the lookup-example-exch-rates.csv file to Drop a file box in upper left corner of of My sources screen.

    2. Preview the currency rates in My sources to ensure the file was parsed properly.

      lookup example exch rates
    3. Create a new job called Invoices in USD. Load the invoices data as the job’s data source.

    4. In the transform editor, add Lookup step (either from toolbar or by searching via Add step button) and configure it as shown on the screenshot.

      lookup example lookup settings
    5. Once the lookup is added, you should see an output like the following screenshot. Notice the No value cells in the column we added - these are there because currency pair USD-USD with exchange rate of 1.0 is not in the exchange rates file.

      lookup example lookup applied
    6. To fix the No value cells, add Replace empty values step and set it to replace empty values with value 1.

    7. Then, to compute the final invoice amount in USD, add a Calculate formula step with the following formula: $Invoice_amount / $Exchange_rate. Finally, to round the data to 2 decimal places, you can add a Round step. Once all the steps are added, you should see an output like this:

      lookup example done
    Remarks
    • Lookup data set must be available in Data Catalog in My sources before it can be used.

    • Lookup data set should not contain duplicate keys. If that is the case, the last value will be used.

    • When there are values in your dataset for which there are no matches in the lookup data set, the output column values for those keys will be set to No value.

    • Lookups are case-sensitive, and the data used as keys in the lookup must match exactly with the data in your data set in order to be matched.


    Calculate formula

    The Calculate formula step calculates the value of a cell using a formula. Formulas can return a value of any type. Formulas can reference other columns using $Column_Name syntax and many built-in functions.

    Learn the basics of writing formulas

    Visit Formulas and calculations to learn how to work with formulas.

    formulas calculations explained

    Parameters
    • Formula: required, the formula to apply to create the target column values. Formula can return a value of any type. Formulas can reference other columns using $Column_Name syntax and many built-in functions.

    • Target column: required, configure the column which will receive the output. Output type will be determined automatically depending on the output type of the formula.

      • Write result to the current column: overwrite the input column with the result.

      • Create new column with name: create a new column with specified name. Name of the new column can contain spaces or special characters - technical column name will be created automatically. The new column will be placed right after the input column.

    Examples
    Objective

    Calculate the late fees for invoices. Late fees are derived from invoice due date and when it was paid. Late fees are added to invoices which are paid more than 15 days after their due date. Late fee is then 0.5% of the amount for each date the invoice is not paid past the due date.

    Solution
    1. Load the invoices data set. it has two columns - Due date and Paid date - which provide information about invoice payments. Paid date is empty if the invoice was not yet paid. You data may look like this:

      formula example1 initial data
    2. You can use dateDiff function in Calculate formula step to calculate how many days are between two dates.

      formula example1 paid after days settings
    3. Once you add the step, you will get output that can look like the following screenshot. Notice the error cells highlighted in red - they show rows where invoices have not yet been paid so it is not possible to calculate the number of days. We’ll clean this in next steps.

      formula example1 paid after days
    4. To clean the errors from previous step, use Clear error cells step. This step will remove the errors and use null (No value) instead. You will get an output like this:

      formula example1 clear errors
    5. To calculate the late fees, we can use formula like this:

      if($Paid_after_days > 15,
         0.005 * $Invoice_amount * ($Paid_after_days - 15),
         0)

      The formula uses if function to only run when Paid after days is greater than 15. If invoice is paid in 15 days or less, we do not add any fees. Otherwise we calculate the fee as 0.5% of the Invoice amount for each day the payment is late.

      formula example1 calc fees
    6. The output will look like this. Notice again that we have errors in the output - this time caused by No value cells in the Paid_after_days column. No value cells cannot be used in calculation and hence an error is raised. We can solve this in two ways:

      • Improve our formula from the previous step with more complex condition to not calculate anything for those rows. This can be done for example like this:

        if($Paid_after_days == null,
           0,
           if($Paid_after_days > 15,
              0.005 * $Invoice_amount * ($Paid_after_days - 15),
              0
           )
        )
      • Use Replace errors step to clean the result after the calculation. This is the approach we’ll use here.

    7. Finally, we can round the output to 2 decimal places to make it nicer. Do not forget to set the display format for the column to get nicer output.

      formula example1 late fees done
    Remarks
    • The target column data type is automatically determined based on the type returned by the formula. If the target column is of a different type, the type of the target column will be changed to match the formula’s return type.