Appendix A. List of Metrics
Metric | Return type | Description | Note |
---|---|---|---|
Statistics | |||
Minimum value | number / string | Returns the lowest value of all numbers. The metric works with strings, too. Example: given the two strings "zoo" and "city-zoo", the second one is the Minimum value because it comes first alphabetically. | |
Average value | number | Calculates the average value of all numbers. | |
Median | number | Calculates the median of your data. Be advised that profiling huge data sets with this metric is slower, because the necessary values are continually stored in the database until the median is finally calculated. | |
Mode (modus) | string | Returns the value that is repeated most frequently in the data set. Be advised that profiling huge data sets with this metric is slower, because the necessary values are continually stored in memory until the mode is finally calculated. | |
Maximum value | number / string | Returns the highest value of all numbers. The metric works with strings, too. Example: given the two strings "zoo" and "city-zoo", the first one is the Maximum value because it comes last alphabetically. | |
Length | |||
Minimal length | integer | Examines the data set and returns the length of the shortest string, i.e. the one with the fewest characters. | |
Average length | number | Calculates the average length of all field values. | |
Maximal length | integer | Examines the data set and returns the length of the longest string, i.e. the one with the most characters. | |
Shortest string | string | Returns the shortest string found in the data set. If several strings share that length, the alphabetically first one is returned. | |
Longest string | string | Returns the longest string found in the data set. If several strings share that length, the alphabetically last one is returned. | |
Null Handling | |||
Null count | integer | Returns the count of values that are null, i.e. they carry no data at all. | |
Not null count | integer | Counts all fields that carry some data, i.e. they are not null. | |
First not null value | string | Returns the first value found which is not null. | |
String format | |||
Most frequent patterns | string | The metric examines all strings and returns masks that describe what those strings most commonly look like. You can configure the number of returned patterns via Patterns count. Example result: "27,1%: A99 9AA 9,2%: AA9 5,8%: 999". It tells you that 27.1% of strings looked like "A99 9AA", 9.2% like "AA9" and 5.8% like "999", where A stands for an arbitrary character and 9 is a digit. One such "A99 9AA" string could be e.g. "M64 1se". Profiling huge data sets with this metric may be slower if they contain "ugly" strings consisting of brackets, quotes, commas and other non-alphanumeric characters. | When used within the ProfilerProbe component, this metric returns a map with patterns as keys and their occurrences as values. |
Least frequent patterns | string | The metric examines all strings and returns masks that describe what those strings least commonly look like. You can configure the number of returned patterns via Patterns count. Example result: "5,8%: 999 9,2%: AA9 27,1%: A99 9AA". It tells you that 5.8% of strings looked like "999", 9.2% like "AA9" and 27.1% like "A99 9AA", where A stands for an arbitrary character and 9 is a digit. One such "A99 9AA" string could be e.g. "M64 1se". Profiling huge data sets with this metric may be slower if they contain "ugly" strings consisting of brackets, quotes, commas and other non-alphanumeric characters. | When used within the ProfilerProbe component, this metric returns a map with patterns as keys and their occurrences as values. |
Convertible to date | integer | The metric counts how many records can be converted to a date. The format you are looking for has to be specified in Mask. Setting the appropriate Locale allows the metric to recognize, for example, month names written as text. | |
Convertible to number | integer | The metric finds out how many records can be converted to a number. Optionally, type in the Format pattern you are looking for. The pattern uses the same syntax as CloverDX Designer (e.g. '0' for digit, '.' for decimal separator) - please refer to its documentation, section Numeric Format. | |
Non-ASCII records | integer | Returns the number of records containing non-ASCII characters. | |
Non printable ASCII | string | Gets all ASCII characters that cannot be printed (e.g. Delete or Escape). | When used within the ProfilerProbe component, this metric returns a list of the non-printable ASCII characters. |
Frequency | |||
Uniques count | integer | Returns the number of values that are unique within the data set. This makes it easy, for example, to pick a field that could serve as the primary key. Profiling huge data sets with this metric is slower because unique values are gradually stored in memory. | |
Interval chart | chart | A histogram chart showing values divided into interval bins, with the number of occurrences in each bin. It can be used for numerical (integer, long, double, decimal) and date fields. This metric can either work in an automatic mode (no configuration) or you can set its properties yourself. The automatic mode always sets the histogram so that all your data can be displayed, and an appropriate number of buckets is chosen. Thus, the automatic mode gives you a good overview of how your data is spread. Afterwards, you may want to focus on a certain part of the data. In that case, set the histogram properties. Be advised that you have to set all of them, otherwise the job will fail when run. The properties differ for integer fields and date fields. As for integer fields, you set: | |
Frequency chart | chart | A simple chart showing the number of occurrences of each value. It can be used for strings and numerical fields (integer, long). Use it to analyze long listings, e.g. (postal) codes. You define the maximum number of unique values the metric will work with via Maximum unique values. If this threshold is exceeded during the computation, the histogram will not be shown. If you are profiling a file with many unique values, the histogram allows you to switch between the Most common and Least common ones. | |
Time unit chart | chart | A special chart for date fields. Dates are classified into buckets according to what you set in Category (e.g. second, day of month, week of year). That is, choosing month, you will have 12 buckets, while for minute there will be 60 buckets, etc. For instance, let us have two dates: 1900-01-02 12:34:56 and 2011-06-22 22:34:11. These would fit into the same bucket only if you chose Minute as Category. For all other categories, they would fall into different buckets. | |
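
To make two of the descriptions above more concrete, the following Python sketch mimics the pattern masks (Most frequent patterns) and the bucketing by Category (Time unit chart). It is an illustration only, not the profiler's implementation: the names `to_mask`, `most_frequent_patterns`, `CATEGORIES` and `time_unit_bucket` are hypothetical, and the sketch assumes that A replaces any letter, 9 replaces any ASCII digit and all other characters are kept unchanged.

```python
from collections import Counter
from datetime import datetime


def to_mask(value: str) -> str:
    """Build a mask: ASCII digits become '9', letters become 'A' (assumption)."""
    out = []
    for ch in value:
        if "0" <= ch <= "9":
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # spaces, brackets, quotes etc. are kept as-is
    return "".join(out)


def most_frequent_patterns(values, patterns_count=3):
    """Return (mask, share in percent) pairs, most common mask first."""
    masks = Counter(to_mask(v) for v in values)
    total = sum(masks.values())
    return [(mask, 100.0 * count / total)
            for mask, count in masks.most_common(patterns_count)]


# Hypothetical subset of categories, each mapped to a bucket key.
CATEGORIES = {
    "second": lambda d: d.second,
    "minute": lambda d: d.minute,
    "hour": lambda d: d.hour,
    "day of month": lambda d: d.day,
    "month": lambda d: d.month,
    "year": lambda d: d.year,
}


def time_unit_bucket(date: datetime, category: str) -> int:
    """Return the bucket a date falls into for the given category."""
    return CATEGORIES[category](date)


if __name__ == "__main__":
    print(to_mask("M64 1se"))
    print(most_frequent_patterns(["M64 1se", "M11 2ab", "AB1", "123"], 1))

    first = datetime(1900, 1, 2, 12, 34, 56)
    second = datetime(2011, 6, 22, 22, 34, 11)
    shared = [c for c in CATEGORIES
              if time_unit_bucket(first, c) == time_unit_bucket(second, c)]
    print(shared)
```

With the two dates from the Time unit chart row, only the minute category yields the same bucket, which matches the description in the table.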