Data set manipulation steps
-
Add column: add new column to your data set.
-
Duplicate column: copy column into a new column.
-
Delete column(s): remove column fro your data set.
-
Rename column: change name of a column.
-
Reorder columns: change the ordering of columns within the data set.
-
Merge columns: merge values from multiple columns into a single column.
-
Split column: split column based on a delimiter.
-
Filter rows based on formula: keep or remove rows based on result of a formula.
-
Sort: sort data based on selected key columns.
-
Lookup: merge your dataset with another based on key columns.
-
Calculate formula: calculate formula and write result to selected column.
Add column
Add column step allows you to add a new empty column of given data type to the specified position in your data set. All rows will have No value in the new column (the value of the column in each row will be null
). To populate the column with data, use other steps.
Parameters
-
New column name: required, the name of your desired column. Name can contain special characters (like spaces) - new technical name for the column will be derived by replacing those with underscore character "_".
-
New column type: required, the data type the new column (string, date, integer, decimal, boolean).
-
Position of the new column: defines where the new column is added:
-
As the last column: insert the column after the last column in the data set.
-
As the first column: insert the column before the first column in the data set.
-
After column: insert the column after selected column in the data set.
-
Examples
To add a decimal column called Late fee
after Invoice amount
column, use the following settings:
Duplicate column
Duplicate column step can be used to create a copy of an existing column - values and formatting are both copied.
Parameters
-
Source column: required, the column you wish to duplicate.
-
Name of the target column: required, the name you wish to give the new column. Name can contain special characters (like spaces) - new technical name for the column will be derived by replacing those with underscore character "_". Note that the new column will be placed immediately to the right of the source column - there is no way to control the location of the added column in this step. To do so, you can use Reorder columns step.
When a condition is applied to this step, it takes on a copy function and copies rows that meet the condition from the source column to the target column. For more information on step and group conditions, see Step and group conditions. |
Examples
To create a copy of Due date
column, configure the step like this:
Remarks
-
The new column will be placed immediately to the right of the source column - there is no way to control the location of the added column in this step. To do so, you can use Reorder columns step.
-
New column will have the same type as the input column. It will also inherit the display format. If you change the display format of the input column even after the step was added, the format will be copied to the duplicate column. If you change display format of the duplicate column, it will not be copied to input column.
Delete column(s)
Delete column(s) step allows you to delete columns from the data set.
Parameters
-
Select column(s) to be deleted: required, select the column(s) to be removed or kept (depending on the Mode of operation) from the data set.
-
Display the list of columns by clicking in the Select column(s) field.
-
To quickly find the desired column, start typing the column name.
-
Select all the columns to be deleted or kept by clicking on them.
-
To unselect a column either click on the column name in the column list or click on X on the right side of the column name in the Select column(s) to be deleted field.
-
To quickly select or unselect all the columns, click on the check box next to the column name lookup field.
-
-
Mode of operation: by default, the selected columns are deleted.
-
Remove selected columns, keep the others (default option)
-
Keep selected columnns, remove all others
-
Examples
To delete the Type
and Contact
columns, configure the step like this:
To delete all columns but the Type
and Contact
columns, configure the step like this:
Rename column
Rename column step allows you to change the name of a column in your data set. After the step, you will have to use the new name to refer the column.
Parameters
-
Original Name: required, a column to rename (chosen from the dropdown list).
-
New Name: required, the name you wish to give to your column.
Examples
To change name of the Invoice no
column to Invoice number
use the following settings:
Remarks
-
New name of the column can contain special characters (like spaces). New technical name for the column will be derived by replacing those with underscore character "_".
-
If a column you are renaming no longer exists in your data set (this can happen either if you changed the steps before Rename column step or if your data source changed), the step will fail. You will get an error similar to Column 'Column_to_rename' doesn’t exist.
Reorder columns
Reorder columns step allows you to change ordering of columns within the data set. It does not modify the data in way, except for changing the relative positions of the columns within the data set.
Parameters
-
Columns: required, a widget that allows you to reorder columns using the arrows to the left of the column list. Following actions are available in the list:
Icon Action Move the selected column(s) to the beginning of the data set.
Move the selected column(s) by one towards the beginning of the data set ("up" in the list of columns).
Move the selected column(s) by one towards the end of the data set ("down" in the list of columns).
Move the selected column(s) to the end of the data set.
You can select multiple columns at once with Ctrl+click (or Control+Click on Mac). You can unselect one of the selected columns in the same way - with Ctrl+Click.
Examples
To change the order of columns so that Invoice amount
column is the second column, use the widget to select Invoice amount
and move it using arrows to the second place in the list:
Remarks
-
Reorder columns will work also if some of the columns you moved are no longer in the data set (this can happen either if you changed the steps before Reorder columns step or if your data source changed). In such case, the missing columns are ignored by the step and other column movements are performed as configured.
Merge columns
The Merge columns step can be used to merge data from multiple columns into a single column.
Parameters
-
Columns to merge: required, select the columns to be merged in the sequence you want the values combined. You can change the column sequence by drag and dropping by the grip on the left of a column field.
-
Delimiter: optional, specify a delimiter to be used between merged values.
-
Both single characters (including a space) and multiple characters can be used as delimiters.
-
You can also select newline or tab characters as delimiters from the dropdown list.
-
-
Ignore blanks: checked by default. When checked, empty values are disregarded and not included in the merged values. This parameter is ignored when no delimiter is specified.
-
Output column name: required, enter the name of the new column. The name defaults to New column.
-
Position of the output column:
-
As the last column
-
As the first column
-
After column
-
Examples
Street | City | Postal Code | Delimiter | Ignore blanks | Result |
---|---|---|---|---|---|
Baker Street 221b |
NW1 6XE |
London |
(No delimiter specified) |
Any |
Baker Street 221bNW1 6XELondon |
Baker Street 221b |
No Value |
London |
(No delimiter specified) |
Any |
Baker Street 221bLondon |
Baker Street 221b |
NW1 6XE |
London |
, |
Any |
Baker Street 221b,NW1 6XE,London |
Baker Street 221b |
No Value |
London |
, |
Yes (checked) |
Baker Street 221b,London |
Baker Street 221b |
No Value |
London |
, |
No (unchecked) |
Baker Street 221b,,London |
Baker Street 221b |
" " (4 spaces) [1] |
London |
, |
Any |
Baker Street 221b, ,London |
Baker Street 221b |
NW1 6XE |
London |
\n [2] |
Any |
Baker Street 221b NW1 6XE London |
Baker Street 221b |
NW1 6XE |
London |
\\ [3] |
Any |
Baker Street\221b\NW1 6XE\London |
Split column
The Split column step can be used to split data from one string column into multiple columns based on a delimiter.
Parameters
-
Input column: required, select a string column to split.
-
Delimiter: required, specify a delimiter to be used to split the data.
-
Both single characters (including a space) and multiple characters can be used as delimiters.
-
You can also select newline or tab characters as delimiters from the dropdown list.
-
-
Number of columns: required, enter the number of columns the data should be split into.
-
The newly created columns keep the Input column name and include part 1, part 2, etc., in the column names.
-
If the result is an empty value, No value is displayed. See Working with empty values for more information.
-
If the number of columns to be created is lower than the number of the resulting split values, the last column will include the remaining values in the unsplit form.
-
Examples
Example1: The Phone number column sometimes includes multiple phone numbers, separated by commas.
-
Result of spliting into 2 columns:
Phone number | Phone number part 1 | Phone number part 2 |
---|---|---|
+17755784197,4063492232,504.626.7424 |
+17755784197 |
4063492232,504.626.7424 |
4063492232,504.626.7424 |
4063492232 |
504.626.7424 |
6710488141 |
6710488141 |
No value |
No value |
No value |
No value |
-
Result of splitting into 3 columns:
Phone number | Phone number part 1 | Phone number part 2 | Phone number part 3 |
---|---|---|---|
+17755784197,4063492232,504.626.7424 |
+17755784197 |
4063492232 |
504.626.7424 |
4063492232,504.626.7424 |
4063492232 |
504.626.7424 |
No value |
6710488141 |
6710488141 |
No value |
No value |
No value |
No value |
No value |
No value |
Example 2: The Customer name column needs to be split into 2 columns, with space as the delimiter.
Customer name | Customer name part 1 | Customer name part 2 |
---|---|---|
Joe Michael Smith |
Joe |
Michael Smith |
Jane Smith |
Jane |
Smith |
Anne |
Anne |
No value |
Smith |
Smith |
No value |
Example 3: Address is on multiple lines and it needs to be split into three individual records. You can use \n
as the delimiter here. See Remarks for more information.
Address | Address part 1 | Address part 2 | Address part 3 |
---|---|---|---|
Baker Street 221b NW1 6XE London |
Baker Street 221b |
NW1 6XE |
London |
Remarks
-
The original column is kept in the dataset by default. If you want to remove it, use the Delete column(s) step.
-
You can use escape sequences (
\n
,\t
, etc.) as delimiters. -
To use a backslash itself as a delimiter, enter two backslashes.
Filter rows based on formula
The Filter rows based on formula step allows you to filter your data set. It can keep or remove rows based on the result of the specified formula.
Parameters
-
Filter action: select what to do with rows for which the formula returns true:
-
Keep rows: default, keep the rows for which the formula returns true. All other rows are removed.
-
Remove rows: remove the rows for which the formula returns true. All other rows are removed.
-
-
Formula: the formula (condition) which decides whether to keep or remove the rows. It must be a formula that returns true or false.
Usage
Filter works by evaluating the formula for each row in your data set. Based on the result (whether the formula returns true
or false
) and configured Filter action it will decide whether to keep the row or remove it from further processing.
Filter step never changes the data in any way - it simply removes subset of the data from further processing based on the condition provided by the user.
Note that when you filter, the preview will show you only subset of the records that are available in the sample. This means that the preview output might be much more limited compared to what you’ll get by running the job in full. See examples below to see this. |
To only keep rows in the contacts data set where the contact address is in Alaska, we can use filter configured like this:
The result after applying the above step may look like this:
Notice that only 5 rows are shown in the preview. When you run the job, you will get 62 rows in the output:
This is because when working with interactive job editor, you are only working with a sample that has 1000 rows by default. However, when you run the job, it works with complete data set which in this case has 10000 rows and therefore also contains more than just 5 contacts from Alaska.
To remove rows that have empty value in certain column from your data set, you can configure the step like this:
Note that since $dateOfBirth
is a date column, we compare with null
to determine if the value is present or not.
The same approach will work for decimal, integer and boolean columns. For string columns, you can use formula isBlank($stringColumn)
instead.
Alternatively, you can also write $stringColumn == null || $stringColumn == ""
- this is less generic than using isBlank
function as it does not catch values that are composed entirely of spaces.
In this example, we will remove all contacts who are younger than 21 years and do not have a phone number.
In this case we set the Filter action to Remove so that rows matching the condition are removed rather than kept in the data set.
The formula used in the step is the following:
dateDiff(today(), $dateOfBirth, year) < 21 AND isBlank($phone)
In the first line of the formula, we calculate the age of the customer by using a dateDiff
function to calculate the difference between two dates combined with today()
function to get today’s date. The result is compared with 21 to see if the customer is younger than 21.
In the third line above we test if the $phone
column is empty by using an isBlank
function.
Both parts (age test and phone test) are combined with AND
operator so both conditions must be true
for the formula to also return true
. You can often see this operator written as &&
.
With settings like this, the step will remove small number of customers from the data. To see the full effect of the step, we can run the job to get result like this:
Remarks
-
When working in an interactive job editor, the step works with sample. Full data set may contain more records that satisfy filtering conditions. To get full picture of your data after it has been filtered, run the job via Run job button.
Sort
Sort allows you to sort your data based on selected columns and direction (ascending or descending).
Parameters
Sort step configuration is defined as a list of sort key settings. Each sorting key is represented by a column and sorting direction. This allows you to easily configure sorting by multiple columns. Each sort key is configured by using two parameters:
-
Sort by: a key column the data set will be sorted by. All column data types are supported for sorting. Each column can appear only once.
-
Sort order: configure the sorting direction. The choices will depend on the data type:
-
For numeric columns (integer and decimal) select either Smallest to largest (ascending) or Largest to smallest (descending order).
-
For string columns select either A-Z (ascending) or Z-A (descending).
-
For date columns select Oldest to newest (ascending) or Newest to oldest (descending).
-
For boolean columns select either False first or True first.
-
Examples
To find most expensive products, we can sort our product data set by product Unit price
column in descending order (i.e., Largest to smallest):
The output will look like this:
If we wanted to see youngest customers in each state, we can do this by sorting our contacts data set by State
column, then by Date of birth
(in descending order) and finally we can also sort by Last name
and First name
to make the output nicer.
The output may look like this:
Remarks
-
Each Sort step can have multiple sorting columns. You can add them in any order and then use drag handles to the left of each sorting key column to rearrange the columns.
-
Sorting can take long time to finish on very large data sets.
-
Sorting of string column uses case-sensitive comparison. This means that lower-case letters are considered to be before upper-case letters when sorting in ascending order.
-
Note that if you have numbers stored in a string column, it will not be sorted properly by number value but rather lexicographically. For example, consider the following picture where we have data sample with numbers in string column sorted in "ascending" order (from smallest to largest):
To sort these numbers properly, convert the column to integer (or decimal) first:
Notice the changed header icon - it shows the column is now integer instead of a string.
-
Error is treated as the lowest value for any data type. Next lowest value is No value (
null
). For example, the below data set has been sorted based on a date column in ascending (Oldest to newest) order:Notice how both errors are at the top (they are first) regardless of what is written in them, followed by No value and finally the valid data is sorted in ascending order.
Lookup
Lookup step performs a lookup based on key columns from your data set. Lookup step allows you to add columns from your lookup data set to your current data set based on a set of key columns.
Key columns are columns from your data set that will be matched against selected lookup columns. When the values match, columns you selected from lookup data set will be added to your current dataset.
Lookup step principle
Lookup step works by searching for values of key columns in lookup data set. If the match is found, selected columns from lookup data set are added to the rows in the main data set. The result of the lookup therefore always has more columns than the main data set. If a match for lookup key column values is not found in the lookup data set, empty values are used for the columns in the output (these will show as No value in the data preview). In more technical terms, this process is called left outer join.
The principle of a lookup step can be illustrated on the following diagram:
The above diagram uses following lookup settings:
-
Key column is Product code in both data sets.
-
Columns to add from lookup is set to Product name and Product code.
Notice how the last row of the output does not have any value in Product name and Unit price columns. This is because that line item uses Product code 99 which does not appear in lookup data set.
Parameters
-
Lookup data source: required, lookup source data set. A data set that is added to the main data set. The lookup data set must be available in My sources before it can be used here. See Example below for more details.
-
Lookup key mapping: required, defines pairs of columns from both data sets whose values must match.
-
Data column represents a column in the main data set.
-
Lookup column represents a column in the lookup data set.
-
-
Columns to add from lookup: required, configures list of columns from the lookup data set to be added to the current data set. Two options are available:
-
All columns will add all lookup columns including the key columns (which will therefore be duplicated in the result).
-
Selected columns will add only columns selected in the dropdown.
-
Example
Add a Exchange rate column to your invoice data and use that exchange rate to calculate invoice amount in USD. The new column should contain exchange rate of a currency respective to USD.
Exchange rates are provided by a file lookup-example-exch-rates.csv
with two columns:
-
Currency code: 3-letter currency code.
-
Exchange rate: exchange rate of 1 USD to the currency defined by Currency code (i.e. how many of the other currency for 1 USD).
-
Create lookup data source via My sources table. The easiest way os to drag & drop the
lookup-example-exch-rates.csv
file to Drop a file box in upper left corner of of My sources screen. -
Preview the currency rates in My sources to ensure the file was parsed properly.
-
Create a new job called
Invoices in USD
. Load the invoices data as the job’s data source. -
In the transform editor, add Lookup step (either from toolbar or by searching via Add step button) and configure it as shown on the screenshot.
-
Once the lookup is added, you should see an output like the following screenshot. Notice the No value cells in the column we added - these are there because currency pair USD-USD with exchange rate of 1.0 is not in the exchange rates file.
-
To fix the No value cells, add Replace empty values step and set it to replace empty values with value 1.
-
Then, to compute the final invoice amount in USD, add a Calculate formula step with the following formula:
$Invoice_amount / $Exchange_rate
. Finally, to round the data to 2 decimal places, you can add a Round step. Once all the steps are added, you should see an output like this:
Remarks
-
Lookup data set must be available in Data Catalog in My sources before it can be used.
-
Lookup data set should not contain duplicate keys. If that is the case, the last value will be used.
-
When there are values in your dataset for which there are no matches in the lookup data set, the output column values for those keys will be set to No value.
-
Lookups are case-sensitive, and the data used as keys in the lookup must match exactly with the data in your data set in order to be matched.
Calculate formula
The Calculate formula step calculates the value of a cell using a formula. Formulas can return a value of any type. Formulas can reference other columns using $Column_Name
syntax and many built-in functions.
Learn the basics of writing formulas Visit Formulas and calculations to learn how to work with formulas. |
Parameters
-
Formula: required, the formula to apply to create the target column values. Formula can return a value of any type. Formulas can reference other columns using
$Column_Name
syntax and many built-in functions. -
Target column: required, configure the column which will receive the output. Output type will be determined automatically depending on the output type of the formula.
-
Write result to the current column: overwrite the input column with the result.
-
Create new column with name: create a new column with specified name. Name of the new column can contain spaces or special characters - technical column name will be created automatically. The new column will be placed right after the input column.
-
Examples
Calculate the late fees for invoices. Late fees are derived from invoice due date and when it was paid. Late fees are added to invoices which are paid more than 15 days after their due date. Late fee is then 0.5% of the amount for each date the invoice is not paid past the due date.
-
Load the invoices data set. it has two columns -
Due date
andPaid date
- which provide information about invoice payments.Paid date
is empty if the invoice was not yet paid. You data may look like this: -
You can use dateDiff function in Calculate formula step to calculate how many days are between two dates.
-
Once you add the step, you will get output that can look like the following screenshot. Notice the error cells highlighted in red - they show rows where invoices have not yet been paid so it is not possible to calculate the number of days. We’ll clean this in next steps.
-
To clean the errors from previous step, use Clear error cells step. This step will remove the errors and use
null
(No value) instead. You will get an output like this: -
To calculate the late fees, we can use formula like this:
if($Paid_after_days > 15, 0.005 * $Invoice_amount * ($Paid_after_days - 15), 0)
The formula uses
if
function to only run whenPaid after days
is greater than 15. If invoice is paid in 15 days or less, we do not add any fees. Otherwise we calculate the fee as 0.5% of theInvoice amount
for each day the payment is late. -
The output will look like this. Notice again that we have errors in the output - this time caused by No value cells in the
Paid_after_days
column. No value cells cannot be used in calculation and hence an error is raised. We can solve this in two ways:-
Improve our formula from the previous step with more complex condition to not calculate anything for those rows. This can be done for example like this:
if($Paid_after_days == null, 0, if($Paid_after_days > 15, 0.005 * $Invoice_amount * ($Paid_after_days - 15), 0 ) )
-
Use Replace errors step to clean the result after the calculation. This is the approach we’ll use here.
-
-
Finally, we can round the output to 2 decimal places to make it nicer. Do not forget to set the display format for the column to get nicer output.
Remarks
-
The target column data type is automatically determined based on the type returned by the formula. If the target column is of a different type, the type of the target column will be changed to match the formula’s return type.