    Chapter 61. Data Partitioning

    Common Properties of Data Partitioning Components

    Components in this category are primarily intended for data flow management when using Data Partitioning or running in a CloverDX Cluster environment, which enables massive parallelization of data transformation processing. Each component in a transformation graph running with data partitioning enabled or in a Cluster environment can be executed in multiple instances; this is called component allocation. Component allocation specifies how many instances are executed and where (on which Cluster nodes) they run. For more information, see Chapter 41, Data Partitioning (Parallel Running) or Chapter 42, Data Partitioning in Cluster.

    In general, data partitioning components can be divided into two sub-categories - partitioners and gatherers.

    Parallel partitioners distribute data records from a single worker among various Cluster workers. Parallel partitioners are used to change a single-worker allocation to multiple-worker allocation.
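
    As an illustration of the idea only, the following Python sketch distributes records from a single list among several workers by hashing a key field. It is not CloverDX code: the function name `partition_by_key`, the record layout, and the CRC32-modulo routing rule are assumptions made for this example.

```python
import zlib


def partition_by_key(records, key, worker_count):
    """Distribute records from one worker among worker_count workers.

    Hypothetical helper: records with the same key value always land
    in the same partition.
    """
    partitions = [[] for _ in range(worker_count)]
    for record in records:
        # crc32 is stable across runs, unlike the built-in hash() on strings.
        digest = zlib.crc32(str(record[key]).encode("utf-8"))
        partitions[digest % worker_count].append(record)
    return partitions


records = [
    {"customer": "alice", "amount": 10},
    {"customer": "bob", "amount": 20},
    {"customer": "alice", "amount": 30},
    {"customer": "carol", "amount": 40},
]
parts = partition_by_key(records, "customer", 3)
```

    Hashing the key keeps all records with the same key value on the same worker, which matters when a downstream component groups or joins by that key.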

    Parallel gatherers collect data records from various Cluster workers to a single worker. Parallel gatherers are used to change a multiple-worker allocation to a single-worker allocation.

    • ParallelSimpleGather gathers data records from various Cluster workers; the component's algorithm is based on the SimpleGather component.

    • ParallelMerge gathers data records from various Cluster workers; the component's algorithm is based on the Merge component.
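
    The difference between the two gathering styles can be sketched in Python. This is a conceptual illustration, not the components' implementation: `simple_gather` and `merge_gather` are hypothetical names, the simple variant just concatenates the worker outputs, and the merge variant assumes each worker's output is already sorted by the same key.

```python
import heapq
from itertools import chain


def simple_gather(worker_outputs):
    """Collect records from all workers; no overall ordering is promised."""
    return list(chain.from_iterable(worker_outputs))


def merge_gather(worker_outputs, key):
    """Collect records from workers whose outputs are each sorted by key.

    The result is sorted by the same key.
    """
    return list(heapq.merge(*worker_outputs, key=key))


worker_a = [1, 4, 7]
worker_b = [2, 5]
worker_c = [3, 6]

gathered = simple_gather([worker_a, worker_b, worker_c])
merged = merge_gather([worker_a, worker_b, worker_c], key=lambda r: r)
```

    A merge-style gather preserves sort order at the cost of requiring sorted inputs; a simple gather has no such requirement but gives no ordering guarantee.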

    The ParallelRepartition component falls outside both of these basic groups.

    • ParallelRepartition changes the partitioning of already partitioned data; that is, the data is re-partitioned. For example, if your data is already partitioned by a key using the ParallelPartition component and you want to change the key or the number of partitions, this component can do it in one step. There is no need to first gather all partitioned data to a single worker with a parallel gatherer (which would create a bottleneck) and then partition the data again according to the new rules with a parallel partitioner.
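
    A minimal Python sketch of the re-partitioning idea follows. It is an assumption-laden illustration rather than the component's actual algorithm: `stable_bucket` and `repartition` are hypothetical names, and the routing rule (CRC32 of the key value modulo the partition count) is chosen only for the example. The point is that each existing partition routes its own records directly to the new partitions, so the data is never collected into one combined list first.

```python
import zlib


def stable_bucket(value, bucket_count):
    """Map a key value to a partition index, stable across runs."""
    return zlib.crc32(str(value).encode("utf-8")) % bucket_count


def repartition(partitions, new_key, new_count):
    """Re-split already partitioned records by new_key into new_count partitions."""
    targets = [[] for _ in range(new_count)]
    # In a cluster, each source partition would be processed on its own worker.
    for source in partitions:
        for record in source:
            targets[stable_bucket(record[new_key], new_count)].append(record)
    return targets


# Data previously partitioned by "customer" into two partitions.
by_customer = [
    [{"customer": "alice", "region": "EU"}, {"customer": "alice", "region": "US"}],
    [{"customer": "bob", "region": "EU"}, {"customer": "carol", "region": "APAC"}],
]
by_region = repartition(by_customer, "region", 3)
```

    Here both the key (from `customer` to `region`) and the number of partitions (from 2 to 3) change in a single pass.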