Chapter 61. Data Partitioning
Common Properties of Data Partitioning Components
Components in this category are primarily dedicated to data flow management when using Data Partitioning or running in a CloverDX Cluster environment, which enables massive parallelization of data transformation processing. Each component in a transformation graph running with data partitioning enabled or in a Cluster environment can be executed in multiple instances; this is called component allocation. Component allocation specifies how many instances will be executed and where (on which Cluster nodes) they will run. See the documentation for Data Partitioning or CloverDX Cluster for more details.
In general, data partitioning components can be divided into two sub-categories: partitioners and gatherers.
Parallel partitioners distribute data records from a single worker among multiple Cluster workers. They are used to change a single-worker allocation to a multiple-worker allocation.
Parallel gatherers collect data records from multiple Cluster workers into a single worker. They are used to change a multiple-worker allocation to a single-worker allocation.
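The two roles can be illustrated with a minimal conceptual sketch in Python. This is not CloverDX code or its API: the function names `worker_for`, `partition` and `gather` are hypothetical, workers are modeled as plain lists, and hashing the key with CRC32 is only one assumed way of assigning records to workers.

```python
import zlib
from typing import Dict, List

Record = Dict[str, str]


def worker_for(key: str, worker_count: int) -> int:
    """Map a key value to a worker index (assumption: CRC32 hash modulo worker count)."""
    return zlib.crc32(key.encode("utf-8")) % worker_count


def partition(records: List[Record], key_field: str, worker_count: int) -> List[List[Record]]:
    """Partitioner role: spread records from one worker over several workers."""
    partitions: List[List[Record]] = [[] for _ in range(worker_count)]
    for record in records:
        partitions[worker_for(record[key_field], worker_count)].append(record)
    return partitions


def gather(partitions: List[List[Record]]) -> List[Record]:
    """Gatherer role: collect records from several workers back into one worker."""
    return [record for part in partitions for record in part]
```

In this sketch, records with the same key value always land on the same worker, and gathering returns every record exactly once, although not necessarily in the original order.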
The ParallelRepartition component stands apart from both of these basic groups.
ParallelRepartition changes the partitioning of data that is already partitioned; in other words, the data is re-partitioned. For example, if your data is already partitioned according to a key by the ParallelPartition component and you would like to change the key or the number of partitions, this component can do it in one step. There is no need to first gather all partitioned data to a single worker with a parallel gatherer and then partition it again according to the new rules with a parallel partitioner, so the single-worker bottleneck is avoided.
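The idea of re-partitioning in one step can be sketched conceptually in Python as follows. This is an illustration only, not the ParallelRepartition implementation: the names `worker_for` and `repartition` are hypothetical, partitions are modeled as in-memory lists, and CRC32 hashing of the key is an assumed assignment rule.

```python
import zlib
from typing import Dict, List

Record = Dict[str, str]


def worker_for(key: str, worker_count: int) -> int:
    """Map a key value to a worker index (assumption: CRC32 hash modulo worker count)."""
    return zlib.crc32(key.encode("utf-8")) % worker_count


def repartition(
    partitions: List[List[Record]], new_key_field: str, new_worker_count: int
) -> List[List[Record]]:
    """Route records from existing partitions straight to their new partitions."""
    new_partitions: List[List[Record]] = [[] for _ in range(new_worker_count)]
    # Each source partition sends its own records directly to the target
    # partition. The records are never merged into one intermediate list,
    # which is what a gather followed by a new partition step would require.
    for source in partitions:
        for record in source:
            target = worker_for(record[new_key_field], new_worker_count)
            new_partitions[target].append(record)
    return new_partitions
```

In a real Cluster the outer loop would correspond to source workers running in parallel; here it runs sequentially only to keep the sketch short.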