Data set manipulation steps

End user’s guide > Wrangler user guide > Transformation steps > Data set manipulation steps

Add column: add new column to your data set.
Duplicate column: copy column into a new column.
Delete column(s): remove column fro your data set.
Rename column: change name of a column.
Reorder columns: change the ordering of columns within the data set.
Merge columns: merge values from multiple columns into a single column.
Split column: split column based on a delimiter.
Filter rows based on formula: keep or remove rows based on result of a formula.
Sort: sort data based on selected key columns.
Lookup: merge your dataset with another based on key columns.
Calculate formula: calculate formula and write result to selected column.

Add column

Add column step allows you to add a new empty column of given data type to the specified position in your data set. All rows will have No value in the new column (the value of the column in each row will be null). To populate the column with data, use other steps.

Parameters

New column name: required, the name of your desired column. Name can contain special characters (like spaces) - new technical name for the column will be derived by replacing those with underscore character "_".
New column type: required, the data type the new column (string, date, integer, decimal, boolean).
Position of the new column: defines where the new column is added:
- As the last column: insert the column after the last column in the data set.
- As the first column: insert the column before the first column in the data set.
- After column: insert the column after selected column in the data set.

Examples

To add a decimal column called Late fee after Invoice amount column, use the following settings:

Duplicate column

Duplicate column step can be used to create a copy of an existing column - values and formatting are both copied.

Parameters

Source column: required, the column you wish to duplicate.
Name of the target column: required, the name you wish to give the new column. Name can contain special characters (like spaces) - new technical name for the column will be derived by replacing those with underscore character "_". Note that the new column will be placed immediately to the right of the source column - there is no way to control the location of the added column in this step. To do so, you can use Reorder columns step.

When a condition is applied to this step, it takes on a copy function and copies rows that meet the condition from the source column to the target column. For more information on step and group conditions, see Step and group conditions.

Examples

To create a copy of Due date column, configure the step like this:

Remarks

The new column will be placed immediately to the right of the source column - there is no way to control the location of the added column in this step. To do so, you can use Reorder columns step.
New column will have the same type as the input column. It will also inherit the display format. If you change the display format of the input column even after the step was added, the format will be copied to the duplicate column. If you change display format of the duplicate column, it will not be copied to input column.

Delete column(s)

Delete column(s) step allows you to delete columns from the data set.

Parameters

Select column(s) to be deleted: required, select the column(s) to be removed or kept (depending on the Mode of operation) from the data set.
- Display the list of columns by clicking in the Select column(s) field.
- To quickly find the desired column, start typing the column name.
- Select all the columns to be deleted or kept by clicking on them.
- To unselect a column either click on the column name in the column list or click on X on the right side of the column name in the Select column(s) to be deleted field.
- To quickly select or unselect all the columns, click on the check box next to the column name lookup field.
Mode of operation: by default, the selected columns are deleted.
- Remove selected columns, keep the others (default option)
- Keep selected columnns, remove all others

Examples

To delete the Type and Contact columns, configure the step like this:

To delete all columns but the Type and Contact columns, configure the step like this:

Rename column

Rename column step allows you to change the name of a column in your data set. After the step, you will have to use the new name to refer the column.

Parameters

Original Name: required, a column to rename (chosen from the dropdown list).
New Name: required, the name you wish to give to your column.

Examples

To change name of the Invoice no column to Invoice number use the following settings:

Remarks

New name of the column can contain special characters (like spaces). New technical name for the column will be derived by replacing those with underscore character "_".
If a column you are renaming no longer exists in your data set (this can happen either if you changed the steps before Rename column step or if your data source changed), the step will fail. You will get an error similar to Column 'Column_to_rename' doesn’t exist.

Reorder columns

Reorder columns step allows you to change ordering of columns within the data set. It does not modify the data in way, except for changing the relative positions of the columns within the data set.

Parameters

Columns: required, a widget that allows you to reorder columns using the arrows to the left of the column list. Following actions are available in the list:

Icon	Action
	Move the selected column(s) to the beginning of the data set.
	Move the selected column(s) by one towards the beginning of the data set ("up" in the list of columns).
	Move the selected column(s) by one towards the end of the data set ("down" in the list of columns).
	Move the selected column(s) to the end of the data set.

Icon

Action

Move the selected column(s) to the beginning of the data set.

Move the selected column(s) by one towards the beginning of the data set ("up" in the list of columns).

Move the selected column(s) by one towards the end of the data set ("down" in the list of columns).

Move the selected column(s) to the end of the data set.

You can select multiple columns at once with Ctrl+click (or Control+Click on Mac). You can unselect one of the selected columns in the same way - with Ctrl+Click.

Examples

To change the order of columns so that Invoice amount column is the second column, use the widget to select Invoice amount and move it using arrows to the second place in the list:

Remarks

Reorder columns will work also if some of the columns you moved are no longer in the data set (this can happen either if you changed the steps before Reorder columns step or if your data source changed). In such case, the missing columns are ignored by the step and other column movements are performed as configured.

Merge columns

The Merge columns step can be used to merge data from multiple columns into a single column.

Parameters

Columns to merge: required, select the columns to be merged in the sequence you want the values combined. You can change the column sequence by drag and dropping by the grip on the left of a column field.
Delimiter: optional, specify a delimiter to be used between merged values.
- Both single characters (including a space) and multiple characters can be used as delimiters.
- You can also select newline or tab characters as delimiters from the dropdown list.
Ignore blanks: checked by default. When checked, empty values are disregarded and not included in the merged values. This parameter is ignored when no delimiter is specified.
Output column name: required, enter the name of the new column. The name defaults to New column.
Position of the output column:
- As the last column
- As the first column
- After column

Examples

Street	City	Postal Code	Delimiter	Ignore blanks	Result
Baker Street 221b	NW1 6XE	London	(No delimiter specified)	Any	Baker Street 221bNW1 6XELondon
Baker Street 221b	No Value	London	(No delimiter specified)	Any	Baker Street 221bLondon
Baker Street 221b	NW1 6XE	London	,	Any	Baker Street 221b,NW1 6XE,London
Baker Street 221b	No Value	London	,	Yes (checked)	Baker Street 221b,London
Baker Street 221b	No Value	London	,	No (unchecked)	Baker Street 221b,,London
Baker Street 221b	" " (4 spaces) ^[1]	London	,	Any	Baker Street 221b, ,London
Baker Street 221b	NW1 6XE	London	\n ^[2]	Any	Baker Street 221b NW1 6XE London
Baker Street 221b	NW1 6XE	London	\\ ^[3]	Any	Baker Street\221b\NW1 6XE\London

Street

City

Postal Code

Delimiter

Ignore blanks

Result

Baker Street 221b

NW1 6XE

London

(No delimiter specified)

Any

Baker Street 221bNW1 6XELondon

Baker Street 221b

No Value

London

(No delimiter specified)

Any

Baker Street 221bLondon

Baker Street 221b

NW1 6XE

London

Any

Baker Street 221b,NW1 6XE,London

Baker Street 221b

No Value

London

Yes (checked)

Baker Street 221b,London

Baker Street 221b

No Value

London

No (unchecked)

Baker Street 221b,,London

Baker Street 221b

" " (4 spaces) ^[1]

London

Any

Baker Street 221b, ,London

Baker Street 221b

NW1 6XE

London

\n ^[2]

Any

Baker Street 221b
NW1 6XE
London

Baker Street 221b

NW1 6XE

London

\\ ^[3]

Any

Baker Street\221b\NW1 6XE\London

¹	Spaces are not considered an empty value.

²	You can use escape sequences (`\n`, `\t`, etc.) as delimiters.

³	To use a backslash itself as a separator, enter two backslashes.

Split column

The Split column step can be used to split data from one string column into multiple columns based on a delimiter.

Parameters

Input column: required, select a string column to split.
Delimiter: required, specify a delimiter to be used to split the data.
- Both single characters (including a space) and multiple characters can be used as delimiters.
- You can also select newline or tab characters as delimiters from the dropdown list.
Number of columns: required, enter the number of columns the data should be split into.
- The newly created columns keep the Input column name and include part 1, part 2, etc., in the column names.
- If the result is an empty value, No value is displayed. See Working with empty values for more information.
- If the number of columns to be created is lower than the number of the resulting split values, the last column will include the remaining values in the unsplit form.

Examples

Example1: The Phone number column sometimes includes multiple phone numbers, separated by commas.

Result of spliting into 2 columns:

Phone number	Phone number part 1	Phone number part 2
+17755784197,4063492232,504.626.7424	+17755784197	4063492232,504.626.7424
4063492232,504.626.7424	4063492232	504.626.7424
6710488141	6710488141	No value
No value	No value	No value

Result of splitting into 3 columns:

Phone number	Phone number part 1	Phone number part 2	Phone number part 3
+17755784197,4063492232,504.626.7424	+17755784197	4063492232	504.626.7424
4063492232,504.626.7424	4063492232	504.626.7424	No value
6710488141	6710488141	No value	No value
No value	No value	No value	No value

Example 2: The Customer name column needs to be split into 2 columns, with space as the delimiter.

Customer name	Customer name part 1	Customer name part 2
Joe Michael Smith	Joe	Michael Smith
Jane Smith	Jane	Smith
Anne	Anne	No value
Smith	Smith	No value

Customer name

Customer name part 1

Customer name part 2

Joe Michael Smith

Joe

Michael Smith

Jane Smith

Jane

Smith

Anne

No value

Smith

No value

Example 3: Address is on multiple lines and it needs to be split into three individual records. You can use \n as the delimiter here. See Remarks for more information.

Address	Address part 1	Address part 2	Address part 3
Baker Street 221b NW1 6XE London	Baker Street 221b	NW1 6XE	London

Address

Address part 1

Address part 2

Address part 3

Baker Street 221b

NW1 6XE

London

Baker Street 221b

NW1 6XE

London

Remarks

The original column is kept in the dataset by default. If you want to remove it, use the Delete column(s) step.
You can use escape sequences (\n, \t, etc.) as delimiters.
To use a backslash itself as a delimiter, enter two backslashes.

Filter rows based on formula

The Filter rows based on formula step allows you to filter your data set. It can keep or remove rows based on the result of the specified formula.

Parameters

Filter action: select what to do with rows for which the formula returns true:
- Keep rows: default, keep the rows for which the formula returns true. All other rows are removed.
- Remove rows: remove the rows for which the formula returns true. All other rows are removed.
Formula: the formula (condition) which decides whether to keep or remove the rows. It must be a formula that returns true or false.

Usage

Filter works by evaluating the formula for each row in your data set. Based on the result (whether the formula returns true or false) and configured Filter action it will decide whether to keep the row or remove it from further processing.

Filter step never changes the data in any way - it simply removes subset of the data from further processing based on the condition provided by the user.

Note that when you filter, the preview will show you only subset of the records that are available in the sample. This means that the preview output might be much more limited compared to what you’ll get by running the job in full. See examples below to see this.

Simple filtering

To only keep rows in the contacts data set where the contact address is in Alaska, we can use filter configured like this:

The result after applying the above step may look like this:

Notice that only 5 rows are shown in the preview. When you run the job, you will get 62 rows in the output:

This is because when working with interactive job editor, you are only working with a sample that has 1000 rows by default. However, when you run the job, it works with complete data set which in this case has 10000 rows and therefore also contains more than just 5 contacts from Alaska.

Removing rows with empty values

To remove rows that have empty value in certain column from your data set, you can configure the step like this:

Note that since $dateOfBirth is a date column, we compare with null to determine if the value is present or not.

The same approach will work for decimal, integer and boolean columns. For string columns, you can use formula isBlank($stringColumn) instead.

Alternatively, you can also write $stringColumn == null || $stringColumn == "" - this is less generic than using isBlank function as it does not catch values that are composed entirely of spaces.

Filtering with more complex conditions

In this example, we will remove all contacts who are younger than 21 years and do not have a phone number.

In this case we set the Filter action to Remove so that rows matching the condition are removed rather than kept in the data set.

The formula used in the step is the following:

dateDiff(today(), $dateOfBirth, year) < 21
AND
isBlank($phone)

In the first line of the formula, we calculate the age of the customer by using a dateDiff function to calculate the difference between two dates combined with today() function to get today’s date. The result is compared with 21 to see if the customer is younger than 21.

In the third line above we test if the $phone column is empty by using an isBlank function.

Both parts (age test and phone test) are combined with AND operator so both conditions must be true for the formula to also return true. You can often see this operator written as &&.

With settings like this, the step will remove small number of customers from the data. To see the full effect of the step, we can run the job to get result like this:

Remarks

When working in an interactive job editor, the step works with sample. Full data set may contain more records that satisfy filtering conditions. To get full picture of your data after it has been filtered, run the job via Run job button.

Sort

Sort allows you to sort your data based on selected columns and direction (ascending or descending).

Parameters

Sort step configuration is defined as a list of sort key settings. Each sorting key is represented by a column and sorting direction. This allows you to easily configure sorting by multiple columns. Each sort key is configured by using two parameters:

Sort by: a key column the data set will be sorted by. All column data types are supported for sorting. Each column can appear only once.
Sort order: configure the sorting direction. The choices will depend on the data type:
- For numeric columns (integer and decimal) select either Smallest to largest (ascending) or Largest to smallest (descending order).
- For string columns select either A-Z (ascending) or Z-A (descending).
- For date columns select Oldest to newest (ascending) or Newest to oldest (descending).
- For boolean columns select either False first or True first.

Examples

Sorting by single column

To find most expensive products, we can sort our product data set by product Unit price column in descending order (i.e., Largest to smallest):

The output will look like this:

Sorting by multiple columns

If we wanted to see youngest customers in each state, we can do this by sorting our contacts data set by State column, then by Date of birth (in descending order) and finally we can also sort by Last name and First name to make the output nicer.

The output may look like this:

Remarks

Each Sort step can have multiple sorting columns. You can add them in any order and then use drag handles to the left of each sorting key column to rearrange the columns.
Sorting can take long time to finish on very large data sets.
Sorting of string column uses case-sensitive comparison. This means that lower-case letters are considered to be before upper-case letters when sorting in ascending order.
Note that if you have numbers stored in a string column, it will not be sorted properly by number value but rather lexicographically. For example, consider the following picture where we have data sample with numbers in string column sorted in "ascending" order (from smallest to largest):

To sort these numbers properly, convert the column to integer (or decimal) first:

Notice the changed header icon - it shows the column is now integer instead of a string.
Error is treated as the lowest value for any data type. Next lowest value is No value (null). For example, the below data set has been sorted based on a date column in ascending (Oldest to newest) order:

Notice how both errors are at the top (they are first) regardless of what is written in them, followed by No value and finally the valid data is sorted in ascending order.

Remarks

Filter rows based on formula

Lookup

Lookup step performs a lookup based on key columns from your data set. Lookup step allows you to add columns from your lookup data set to your current data set based on a set of key columns.

Key columns are columns from your data set that will be matched against selected lookup columns. When the values match, columns you selected from lookup data set will be added to your current dataset.

Lookup step principle

Lookup step works by searching for values of key columns in lookup data set. If the match is found, selected columns from lookup data set are added to the rows in the main data set. The result of the lookup therefore always has more columns than the main data set. If a match for lookup key column values is not found in the lookup data set, empty values are used for the columns in the output (these will show as No value in the data preview). In more technical terms, this process is called left outer join.

The principle of a lookup step can be illustrated on the following diagram:

Figure 61. Lookup step where main data set contains line items, lookup data set contains products and the output data set contains extended line item table with product details added at the end.

The above diagram uses following lookup settings:

Key column is Product code in both data sets.
Columns to add from lookup is set to Product name and Product code.

Notice how the last row of the output does not have any value in Product name and Unit price columns. This is because that line item uses Product code 99 which does not appear in lookup data set.

Parameters

Lookup data source: required, lookup source data set. A data set that is added to the main data set. The lookup data set must be available in My sources before it can be used here. See Example below for more details.
Lookup key mapping: required, defines pairs of columns from both data sets whose values must match.
- Data column represents a column in the main data set.
- Lookup column represents a column in the lookup data set.
Columns to add from lookup: required, configures list of columns from the lookup data set to be added to the current data set. Two options are available:
- All columns will add all lookup columns including the key columns (which will therefore be duplicated in the result).
- Selected columns will add only columns selected in the dropdown.

Example

Objective

Add a Exchange rate column to your invoice data and use that exchange rate to calculate invoice amount in USD. The new column should contain exchange rate of a currency respective to USD.

Exchange rates are provided by a file lookup-example-exch-rates.csv with two columns:

Currency code: 3-letter currency code.
Exchange rate: exchange rate of 1 USD to the currency defined by Currency code (i.e. how many of the other currency for 1 USD).

Solution

Create lookup data source via My sources table. The easiest way os to drag & drop the lookup-example-exch-rates.csv file to Drop a file box in upper left corner of of My sources screen.
Preview the currency rates in My sources to ensure the file was parsed properly.
Create a new job called Invoices in USD. Load the invoices data as the job’s data source.
In the transform editor, add Lookup step (either from toolbar or by searching via Add step button) and configure it as shown on the screenshot.
Once the lookup is added, you should see an output like the following screenshot. Notice the No value cells in the column we added - these are there because currency pair USD-USD with exchange rate of 1.0 is not in the exchange rates file.
To fix the No value cells, add Replace empty values step and set it to replace empty values with value 1.
Then, to compute the final invoice amount in USD, add a Calculate formula step with the following formula: $Invoice_amount / $Exchange_rate. Finally, to round the data to 2 decimal places, you can add a Round step. Once all the steps are added, you should see an output like this:

Remarks

Lookup data set must be available in Data Catalog in My sources before it can be used.
Lookup data set should not contain duplicate keys. If that is the case, the last value will be used.
When there are values in your dataset for which there are no matches in the lookup data set, the output column values for those keys will be set to No value.
Lookups are case-sensitive, and the data used as keys in the lookup must match exactly with the data in your data set in order to be matched.

Calculate formula

The Calculate formula step calculates the value of a cell using a formula. Formulas can return a value of any type. Formulas can reference other columns using $Column_Name syntax and many built-in functions.

Learn the basics of writing formulas

Visit Formulas and calculations to learn how to work with formulas.

Parameters

Formula: required, the formula to apply to create the target column values. Formula can return a value of any type. Formulas can reference other columns using $Column_Name syntax and many built-in functions.
Target column: required, configure the column which will receive the output. Output type will be determined automatically depending on the output type of the formula.
- Write result to the current column: overwrite the input column with the result.
- Create new column with name: create a new column with specified name. Name of the new column can contain spaces or special characters - technical column name will be created automatically. The new column will be placed right after the input column.

Examples

Objective

Calculate the late fees for invoices. Late fees are derived from invoice due date and when it was paid. Late fees are added to invoices which are paid more than 15 days after their due date. Late fee is then 0.5% of the amount for each date the invoice is not paid past the due date.

Solution

Load the invoices data set. it has two columns - Due date and Paid date - which provide information about invoice payments. Paid date is empty if the invoice was not yet paid. You data may look like this:
You can use dateDiff function in Calculate formula step to calculate how many days are between two dates.
Once you add the step, you will get output that can look like the following screenshot. Notice the error cells highlighted in red - they show rows where invoices have not yet been paid so it is not possible to calculate the number of days. We’ll clean this in next steps.
To clean the errors from previous step, use Clear error cells step. This step will remove the errors and use null (No value) instead. You will get an output like this:
To calculate the late fees, we can use formula like this:
```
if($Paid_after_days > 15,
   0.005 * $Invoice_amount * ($Paid_after_days - 15),
   0)
```
The formula uses if function to only run when Paid after days is greater than 15. If invoice is paid in 15 days or less, we do not add any fees. Otherwise we calculate the fee as 0.5% of the Invoice amount for each day the payment is late.
The output will look like this. Notice again that we have errors in the output - this time caused by No value cells in the Paid_after_days column. No value cells cannot be used in calculation and hence an error is raised. We can solve this in two ways:
- Improve our formula from the previous step with more complex condition to not calculate anything for those rows. This can be done for example like this:
  if($Paid_after_days == null, 0, if($Paid_after_days > 15, 0.005 * $Invoice_amount * ($Paid_after_days - 15), 0 ) )
- Use Replace errors step to clean the result after the calculation. This is the approach we’ll use here.
Finally, we can round the output to 2 decimal places to make it nicer. Do not forget to set the display format for the column to get nicer output.

Remarks

The target column data type is automatically determined based on the type returned by the formula. If the target column is of a different type, the type of the target column will be changed to match the formula’s return type.

{{{ highlightedName }}}

Data set manipulation steps

Add column

Parameters

Examples

See also

Duplicate column

Parameters

Examples

Remarks

See also

Delete column(s)

Parameters

Examples

See also

Rename column

Parameters

Examples

Remarks

See also

Reorder columns

Parameters

Examples

Remarks

See also

Merge columns

Parameters

Examples

See also

Split column

Parameters

Examples

Remarks

See also

Filter rows based on formula

Parameters

Usage

Remarks

See also

Sort

Parameters

Examples

Remarks

Remarks

Lookup

Lookup step principle

Parameters

Example

Remarks

Calculate formula

Parameters

Examples

Remarks

See also