Version

    ProfilerProbe

    Licensed under CloverDX Data Quality package.

    Short Description
    Ports
    ProfilerProbe Attributes
    Details
    Compatibility
    Troubleshooting
    See also

    Short Description

    ProfilerProbe analyses (profiles) input data. The big advantage of the component is the combined power of CloverDX solutions with data profiling features. So it makes profiling accessible in very complex workflows such as data integration, data cleansing and other processing tasks.

    ProfilerProbe is not limited to only profiling isolated data sources; instead, it can be used for profiling data from various sources (including popular DBs, flat files, spreadsheets etc.). ProfilerProbe is capable of handling all data sources supported by CloverDX's Readers.

    [Note]Note

    To be able to use this component, you need a separate Data Quality license.

    ComponentSame input metadataSorted inputsInputsOutputsEach to all outputsJavaCTLAuto-propagated metadata
    ProfilerProbe-
    no
    11-n
    yes
    no
    no
    yes

    Ports

    Port typeNumberRequiredDescriptionMetadata
    Input0
    yes
    Input data records to be analyzed by metrics.Any
    Output0
    no
    A copy of input data records.Input port 0
    1-n
    no
    Results of data profiling per individual field.Any

    Metadata

    ProfilerProbe propagates metadata from the first input port to the first output port and from the first output port to the first input port.

    ProfilerProbe does not change the priority of propagated metadata.

    If any metric is set up in the component, the component has a template ProfilerProbe_RunResults on its second output port. The field names and data types depend on used metrics.

    ProfilerProbe Attributes

    AttributeReqDescriptionPossible values
    Basic
    Metrics[1]

    Statistics you want to be calculated on metadata fields. You can apply all metrics as in standalone Profiler jobs. Learn more about metrics in the Metrics section of the Data Profiler documentation.

    See Appendix A. List of Metrics in Data Profiler documentation.

    Output mapping[2]

    Maps profiling results to output ports, starting from port number 1. See Details.

     
    Advanced
    Metrics URL[1]Profiler job file containing the Metrics settings.*.cpj
    Output mapping URL[2]External XML file containing the Output mapping definition. 
    Processing mode 

    Always active (default) - default mode to execute the ProfilerProbe component locally and remotely (if executed on the server).

    Debug mode only - select this mode to capture execution data for debugging purpose, similar to debug mode on component edges. Note that when executing a graph with this mode selected for ProfilerProbe:

    • runs as expected when server debug_mode = true (see Job Config Properties).

    • when server debug_mode = false, the input data would continue through the first output port, but it does not send profiling of data to subsequent output ports.

    Always active (default) | Debug mode only
    Persist results 

    In Server environment, the profiling results will also be stored in the profiling results database. This can be switched off, by setting this attribute to false.

    When executing a graph with ProfileProbe on Worker, persisting results is currently not supported. If you need to persist Probe profiling results, force the graph execution on CloverDX Server Core using the worker_execution property. For more information, see the Job Execution section.

    true (default) | false
    Job UUID 

    Set up this field if you need to have results re-sorted in a reporting console under this UUID. This field is useful in the case of moving the ProfilerProbe component to other graph. If you set up this attribute to the value of original UUID, the results in reporting console would continue using the same UUID. Otherwise new UUID would be generated.

     

    [1]  Specify only one of these attributes. (If both are set, Metrics URL has a higher priority.)

    [2]  Specify only one of these attributes. (If both are set, Output mapping URL has a higher priority.)

    Details

    ProfilerProbe calculates metrics of the data that is coming through its first input port. You can choose which metrics you want to apply on each field of the input metadata. You can use this component as a 'probe on an edge' to get a more detailed (statistical) view of data that is flowing in your graph.

    The component sends an exact copy of the input data to output port 0 (behaves as SimpleCopy). That means you can use ProfilerProbe in your graphs to examine data flowing in it, without affecting the graph's business logic itself.

    The remaining output ports contain results of profiling, i.e. metric values for individual fields.

    Output mapping

    Editing the Output mapping attribute opens the Transform Editor where you can decide which metrics to send to output ports.

    Transform Editor in ProfilerProbe

    Figure 62.1. Transform Editor in ProfilerProbe


    The dialog provides you with all the power and features known from Transform Editor and CTL. In addition, notice metadata on the left hand side has a special format. It is a tree of input fields AND metrics you assigned to them via the Metrics attribute. Fields and metrics are grouped under the RunResults record. Each field in RunResults record has a special name: fieldName__metric_name (note the underscore is doubled as a separator), e.g. firstName__avg_length.

    Additionally there is another special record containing three fields - JobUid, inputRecordCount and profilerRunId. After you run your graph, the field will store the total number of records which were profiled by the component. You can right-click a field/metric and Expand All, or Collapse All metrics.

    To do the mapping in a few basic steps, follow these instructions:

    1. Provided you already have some output metadata, just left-click a metric in the left-hand pane and drag it onto an output field. This will send profiling results of that particular metric to the output.

    2. If you do not have any output metadata:

      1. Drag a Field from the left hand side pane and drop it into the right hand pane (an empty space).

      2. This produces a new field in the output metadata. Its format is: fieldName__metric_name (note the underscore is doubled as a separator), e.g. firstName__avg_length.

      3. You can map metrics to fields of any output port, except for port 0. That port is reserved for input data (which just flows through the component without being affected in the process).

    [Note]Note

    The output mapping uses CTL (you can switch to the Source tab). All kinds of functions are available to help you learn even more about your data, for example:

    double uniques = $out.0.firstName__uniques; // conversion from integer
    double uniqInAll = (uniques / $in.0.recordCount) * 100;
                    

    calculates the per cent of unique first names in all records.

    If you do not define the output mapping, the default output mapping is used:

    $out.0.* = $in.0.*;

    The default output mapping is available since version 4.1.0.

    Importing and Externalizing metrics

    In the Metrics dialog, you can have your settings of fields and their metrics externalized to a Profiler job (*.cpj) file, or imported from a Profiler job (*.cpj) file into this attribute. There are two buttons at the bottom of the dialog for this purpose: Import from .cpj and Externalize to .cpj. The externalized .cpj file can be used in the Metrics URL attribute. The Externalize to .cpj action fills in this attribute automatically.

    Import/Externalize metrics buttons

    Figure 62.2. Import/Externalize metrics buttons


    ProfilerProbe Notes & Limitations

    This short section describes the main differences between using the ProfilerProbe component and profiling data via *.cpj jobs.

    • It performs analyses just on the data which comes through its input edge. Profiling results are sent to output ports. Please note that you do not need any results database. In Server environment, the component will send the results also to the profiling results database. Such results can further be viewed using the CloverDX Data Profiler Reporting Console.

    • It is able to use data profiling jobs (*.cpj) via the Metrics URL attribute.

    • If you want to use sampling of the input data, connect the DataSampler (or other filter) component to your graph. There is no built-in sampling in ProfilerProbe.

    • In Cluster environment, the component will profile data from each node where it is running. Therefore, the results are only applicable to the portions of data processed on given node. If you need to compute metrics for data from all nodes, first gather the data to single node where this component will run (e.g. by using ParallelSimpleGather). Note: in case the component is running on multiple nodes, it will also produce multiple run results in the profiling results database, each of them applicable only to the portion of data processed on each single node. Typically, for Cluster environment, you may therefore wish to turn off the persist results feature.

    Compatibility

    VersionCompatibility Notice
    4.1.0Default mapping is now available.

    Troubleshooting

    The ProfilerProbe component can report an error similar to:

    CTL code compilation finished with 1 errors
    Error: Line 5 column 23 - Line 5 column 39: Field 'field1__avg_length' does not exist in record 'RunResults'!

    This means that you're accessing a disabled metric in output mapping - in this example the Average length is not enabled on the field field1.