Error handling steps
-
Remove rows with errors: remove rows that contain errors in selected columns
-
Replace errors: replace cells which contain errors with provided value or formula calculation result
-
Clear error cells: clear cells that contain errors (set their value to null)
-
Replace empty values: replace empty values in column with provided value
-
Fill down: replace empty values with previous non-empty values
-
Calculate formula: fix errors in your data using custom formulas
Remove rows with errors
The Remove rows with errors step removes all rows that have error in the specified column or columns.
Parameters
-
Error position: required, select where to look for errors:
-
in any column: consider all columns from the whole data set and remove rows that contain errors in any column.
-
in selected column: only consider selected column and remove rows that contain error in that column only.
-
Example
Consider a data set like the one below where we have payments data and we were unable to convert $orderId
column to integer. This means that we would not be able to join the data with order data and then further with customer data.
To avoid transforming rows that cannot be used, we can easily remove them all from the process by using Remove rows with errors configured to look into $orderId
column. The output will then look like this:
Replace errors
The Replace Errors step replaces all errors in the selected column with the specified constant value or the result of the specified formula. Rows that do not have any error in the selected columns will not be changed.
When replacing with a constant, all errors in the selected column will be replaced with the same value.
When replacing with a formula, the formula will be calculated only for rows where there is an error in the selected column.
Parameters
-
Input Column: required, column in which you’d like to replace the errors.
-
Replace errors with: required, configures what is the replacement value for errors in the column:
-
Value: allows you to configure constant value to replace all errors with. The replacement value type must match the type of your column. The type then determines the formatting you have to use. See more in the Remarks section.
-
Formula: allows you to write a formula to calculate for each row with an error. The formula can access the original value of the cell in case the error is a data error. See more details in examples below.
-
Example
In our example invoices data set we could get data that looks like this:
Notice the Date paid
column - it contains a variety of errors that we can fix with the Replace errors step. We’ll be able to recover data from the cells that contain incorrectly formatted dates and we’ll be able to remove n/a
values from cells that contain it.
To fix the cells with the wrong format, we can use a formula that attempts to parse the data in a different format:
Once we apply this step, we will see our data quality improve:
We can fix the Invoice amount
column in a similar way - some of the values in that column look like (9426.13)
. This is a number format sometimes used in accounting to represent negative values. To parse this format, we can use the same approach as for fixing the date, but with a slightly different step formula in the step:
str2decimal($Invoice_amount, "0.00;(0.00)")
This formula relies on the fact that we can specify two different patterns for numbers - the part of the pattern before the semicolon specifies how to read positive numbers while the second part specifies how to read negative numbers.
To clean the rest of the errors we can use the Clear error cells step on the Data paid
column.
Remarks
-
When entering a value of the Replace errors with parameter, follow these rules:
-
Integer columns: replacement value must be an integer written without any spaces or other symbols used for grouping (i.e.,
123456
is ok,123,456
is not). -
Decimal columns: replacement value must be a decimal written with a period as the decimal symbol and without any digit grouping (i.e.,
123456.05
is ok,123,456.05
is not). -
String columns: replacement value must be a string and must be written without enclosing double quotes (i.e., write
example
instead of"example"
otherwise quotes will be used in your data). -
Date columns: replacement value must be a date or date and time using one of the following formats:
-
yyyy-MM-dd HH:mm:ss
(e.g.,2023-08-01 08:09:07
) -
yyyy-MM-dd HH:mm
(e.g.,2023-08-01 08:09
) -
yyyy-M-d H:m:s
(e.g.,2023-8-1 8:9:7
) -
yyyy-M-d H:m
(e.g.,2023-8-1 8:9
) -
yyyy-M-d
(e.g.,2023-8-1
) - in this case the date is considered as2023-08-01 0:00:00
-
-
Boolean columns: you must use either
true
orfalse
.
-
Replace empty values
Replace empty values allows you to replace No value or empty strings with a constant value. The step is useful, for example, to provide default values when the input does not provide a value.
Parameters
-
Input column: required, a column of any type in which you’d like replacements to be made.
-
Replacement: required, value to use when an empty string or No value is encountered in specified input column. The data type and formatting of the replacement value depends on the data type of the input column:
-
boolean: use either
true
orfalse
. -
integer: use numbers without any grouping, e.g., use
123456
instead of123 456
. -
decimal: write decimal values without any grouping and with a period as the decimal character, e.g.,
123456.789
instead of123,456.789
. -
string: write a string without enclosing quotes, e.g.
replacement text
. -
date: write date or date and time using one of the following formats:
-
yyyy-MM-dd HH:mm:ss
(e.g.,2023-08-01 08:09:07
) -
yyyy-MM-dd HH:mm
(e.g.,2023-08-01 08:09
) -
yyyy-M-d H:m:s
(e.g.,2023-8-1 8:9:7
) -
yyyy-M-d H:m
(e.g.,2023-8-1 8:9
) -
yyyy-M-d
(e.g.,2023-8-1
) - in this case the date is considered as2023-08-01 0:00:00
-
-
-
Target column: required, configure the column which will receive the output. The data type of the output column will remain without change.
-
Write result to the current column: overwrite the input column with the result.
-
Create new column with name: create a new column with the specified name. The name of the new column can contain spaces or special characters - the technical column name will be created automatically. The new column will be placed right after the input column.)
-
Examples
In this example, we read data from the Color
column and write the result into the Color 2
column. Notice that the empty string in the second row was also replaced with n/a
.
When specifying the date replacement, the formatting of the Replacement parameter is always yyyy-MM-dd HH:mm:ss.SSS
regardless of the formatting of the input column. In the following screenshot, we show two replacements:
-
The Transaction datetime column uses replacement
2023-05-01
without the time part. Midnight on the given date is used. The result is in theTransaction time 2
column. -
The Due date uses replacement
2023-12-24
even though the column formatting is configured asd MMMM yyyy
. The result is in theDue date 2
column.
We read from the Seats
column and write the output into the Seats 2
column.
We read from the Paid amount
column and write the output into the Paid amount 2
column.
Replacement written as false
is used. Note that you cannot use yes
or no
etc. in the replacement value, only true
and false
are allowed.
Clear error cells
The Clear error cells step replaces all errors in the selected column with null
value so they will show up as No value in the data preview. Cells that do not have errors are left without any changes.
Parameters
-
Input Column: required, column in which to clear all errors. All data types are allowed.
Example
In the example below we are trying to calculate the how many days past the due date is an invoice. This is done in step #6 on the screenshot with Calculate formula step with dateDiff($Date_paid, $Due_date, day)
formula.
This formula does not take into consideration No value cells in the Date paid
column - these are there for invoices which have not been paid yet. This means that for those rows, the formula results in an error:
Paid after days
caused by not handling No value cells in the Date paid
column.To solve this, we can change the formula and only run it if we also have value in Paid date
. This makes the formula harder to understand and especially in cases where the formula is more complex with more parameters it can be quite hard to see what it really does.
A simpler solution is to leave the formula as is and simply clear the error values in Paid after days
columns after the formula step. This can be done with Clear error cells step. Note that in this case using Replace errors step does not make sense as there is no default value to use for unpaid invoices.
Paid after days
column.Remarks
-
The step works with every data type.
-
The step does not distinguish between different kinds of errors and will replace the error for both data errors as well as step execution errors.
Fill down
The Fill down step replaces empty values in a column with the last preceding available non-empty value from the same column. This helps maintain consistency in datasets by ensuring that gaps in data are filled with relevant preceding values, making it useful for structured and sequential data processing.
Parameters
-
Input column: required, select the column where the step will be applied to replace empty values.
Example
Imagine you have a sales report that includes the Region for each sale, but only the first sale in each region is labeled, and the following rows are left empty. To organize the data better, you want to fill down the region name for all sales in the same region.
Region | Sales | Region | Sales |
---|---|---|---|
Before the Fill down step is applied |
After the Fill down step is applied |
||
North |
1000 |
North |
1000 |
No value |
1500 |
North |
1500 |
No value |
2000 |
North |
2000 |
South |
1200 |
South |
1200 |
No value |
1300 |
South |
1300 |
East |
1700 |
East |
1700 |
No value |
1800 |
East |
1800 |