
    CloverDX Data Profiler essentials

    This section describes the very basic concepts that must be understood before using the profiler.

    CloverDX Data Profiler is a part of CloverDX Designer. When the application starts, you are asked to choose the location of your Workspace. The workspace is the directory where all working data and settings produced by the application are stored. Choose a location where the application has write access (for example, on Windows Vista or later, do not choose the Program Files directory).

    Figure 4.2. Selecting workspace


    You can use CloverDX Designer Projects to organize all your data profiling work. Each Project contains work related to one or more Jobs. A Project may also contain additional files and folders with any resources you need, such as input data, output data, external database connections, external metadata, and project documentation. Your profiling and ETL resources can share the same Projects.

    A Project may contain one or more Data Profiler Jobs. Each job is linked to a specific data source. There are three possible kinds of data sources you can profile:

    • Flat file - either the delimited (CSV) or fixed-length format

    • Database table

    • Excel sheet - XLS(X) file

    Each job has Metadata assigned to it. The metadata defines the structure of the data source and the list of columns that can be analyzed.

    For each column of the data source, you define Metrics, which perform the actual analysis of your data. A metric can range from a single statistical value (average, minimum, count of null values) to an elaborate frequency analysis that is displayed as a chart.
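    To make the idea of a metric concrete, here is a minimal Python sketch of the kinds of values mentioned above (average, minimum, null count, and a frequency analysis) computed for one column. It is an illustration only and does not use or represent the CloverDX API; the function name profile_column and the sample data are hypothetical.

```python
from collections import Counter
from statistics import mean


def profile_column(values):
    """Compute a few simple metrics for one column.

    Hypothetical helper for illustration; None marks a null value.
    """
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "minimum": min(non_null) if non_null else None,
        "average": mean(non_null) if non_null else None,
        # Frequency analysis: distinct values ordered from most to least common.
        "frequencies": Counter(non_null).most_common(),
    }


# Made-up sample column with two null values.
ages = [34, 27, None, 34, 41, None, 27, 34]
metrics = profile_column(ages)
```

    In the profiler itself, the frequency result would be rendered as a chart; here it is simply a list of (value, count) pairs.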

    You do not have to profile the whole data source. Instead, you can select one of the Sampling methods to analyze only a representative portion of your data.
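    The following Python sketch shows two generic ways of sampling records: taking the first N records and drawing a random fraction. These are common techniques chosen for illustration; they are an assumption and not necessarily the Sampling methods the profiler offers. The function names first_n and random_sample are hypothetical.

```python
import random


def first_n(records, n):
    """Keep only the first n records (cheap, but may not be representative)."""
    return records[:n]


def random_sample(records, fraction, seed=0):
    """Draw a random subset containing roughly `fraction` of the records.

    A fixed seed makes the sample reproducible between runs.
    """
    if not records:
        return []
    rng = random.Random(seed)
    # Keep at least one record and never ask for more than are available.
    k = min(len(records), max(1, round(len(records) * fraction)))
    return rng.sample(records, k)


records = list(range(1000))       # stand-in for rows of a data source
head = first_n(records, 50)
subset = random_sample(records, 0.1)
```

    A random sample is usually more representative than the first N records when the data source is sorted or grouped.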

    Executing a job produces a Run. You will often run a job repeatedly, for example to check whether your data changes over time. The results of your runs can be viewed, exported, or printed. You can also go back to the job, modify it, and run it again with the new settings.
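    Comparing the results of repeated runs can be pictured with a short Python sketch. It assumes that each run's results are available as a plain dictionary of metric names and values, for example after an export; that representation, the function name changed_metrics, and the sample values are hypothetical and not part of CloverDX.

```python
def changed_metrics(previous_run, current_run):
    """Return {metric: (old_value, new_value)} for metrics that differ.

    A metric missing from one of the runs is reported with None on that side.
    """
    names = sorted(previous_run.keys() | current_run.keys())
    return {
        name: (previous_run.get(name), current_run.get(name))
        for name in names
        if previous_run.get(name) != current_run.get(name)
    }


# Made-up results of two runs of the same job.
run_monday = {"null_count": 2, "minimum": 27, "average": 32.8}
run_friday = {"null_count": 5, "minimum": 27, "average": 31.2}

changes = changed_metrics(run_monday, run_friday)
```

    Here changes contains only null_count and average, because minimum is the same in both runs.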