    ParquetWriter

    Incubation

    Short Description
    Ports
    Metadata
    ParquetWriter Attributes
    Details
    Limitations
    Compatibility
    See also

    Short Description

    ParquetWriter writes data into Parquet files.

    The component supports file compression and can write to local files, remote files, or an output port.

    Component | Data output | Input ports | Output ports | Transformation | Transf. req. | Java | CTL | Auto-propagated metadata
    ParquetWriter | Parquet file | 1 | 0-1 | | | | |

    Ports

    Port type | Number | Required | Description | Metadata
    Input | 0 | | For records to be written to a Parquet file |
    Output | 0 | | For output port writing | ParquetWriter_Output

    Output port writing supports both discrete and stream write modes.

    Metadata

    ParquetWriter does not propagate metadata.

    The component has no metadata template.

    The component auto-generates a Parquet schema from the metadata on the input port.

    ParquetWriter Attributes

    Attribute | Req | Description | Possible values
    Basic
    File URL | yes | URL to a Parquet file to be written. Output port writing is supported. |
    Parquet schema | no | User customization of CloverDX to Parquet data type conversion. By default, it is auto-generated from input port metadata. |
    Advanced
    Create empty files | | If set to false, prevents the component from creating an empty output file when there are no input records. | true (default), false
    Create directories | | If set to true, non-existing directories in the File URL attribute path are created. | false (default), true
    Compression type | | Type of compression used for the output file. | snappy (default), gzip, uncompressed
    Row group size | | Row group size of the Parquet file format. See Apache Parquet documentation for details. | 512MB (default)
    Page size | | Data page size of the Parquet file format. See Apache Parquet documentation for details. | 8KB (default)

    Details

    The Parquet format attributes Row group size and Page size influence how the data is organized inside the output Parquet file. These attributes can be fine-tuned to optimize output file size or write performance; see the Apache Parquet documentation for details.

    The component buffers up to one row group of data (as configured by Row group size) in heap memory at once.
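    The memory note above can be turned into a rough sizing estimate. The following is a minimal sketch, not part of CloverDX: the helper names parse_size and peak_buffer_bytes are hypothetical, the size suffixes are assumed to be binary units (1MB = 1024 * 1024 bytes), and the assumption that several ParquetWriter components running at the same time each buffer their own row group is an inference from the statement above, not a documented guarantee.

```python
# Illustrative helpers only; these names are hypothetical and not a CloverDX API.
UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}  # assumption: binary units


def parse_size(text):
    """Convert a size such as '512MB' or '8KB' to a number of bytes."""
    text = text.strip().upper()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)  # no suffix: treat the value as plain bytes


def peak_buffer_bytes(row_group_size, writers=1):
    """Upper bound on buffered data: one row group per running writer (assumed)."""
    return parse_size(row_group_size) * writers


default_bytes = peak_buffer_bytes("512MB")             # default Row group size
three_writers = peak_buffer_bytes("512MB", writers=3)  # three writers at once
print(default_bytes)   # 536870912
print(three_writers)   # 1610612736
```

    Under these assumptions, lowering Row group size is the most direct way to reduce the heap needed by the component, at the cost of smaller row groups in the output file.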

    The component always overwrites existing target files. Appending to an existing file is not supported.

    Parquet Schema

    The component attribute Parquet schema can be used for customization of CloverDX to Parquet data type conversion. The customization is done in a Parquet schema mapping dialog, shown in Figure 65.1, Parquet file schema dialog.

    In this dialog, you can see all the metadata fields, their types, and the target Parquet type of each field. The Parquet type is an abstraction over Parquet primitive and logical types (as described in the Apache Parquet documentation). The mapping of Parquet types to primitive and logical types is described in Table 65.6, Parquet Types. The Parquet type selection offers only the types compatible with the CloverDX data type of the given field.

    Figure 65.1. Parquet file schema dialog


    The implicit auto-mapped types are shown in grey italic font, in contrast to manually mapped types, which are shown in black. If the metadata contains a field with an unsupported data type (e.g. a map), the field is also shown as Unsupported in the mapping dialog, and its Parquet type selection is disabled.

    The Parquet schema is stored as JSON in the job XML. This allows for better readability of the mapping and tracking of changes.

    Table 65.6. Parquet Types

    Parquet Type | Primitive Type | Logical Type | Properties
    String | BYTE_ARRAY | STRING |
    Enum | BYTE_ARRAY | ENUM |
    Integer | INT32 (width <= 32); INT64 (width == 64) | INT(width, signed) | width (8/16/32/64)
    Decimal | INT32 (1 <= precision <= 9); INT64 (10 <= precision <= 18); BYTE_ARRAY (19 <= precision) | DECIMAL | precision (>=0); scale (>=0)
    Date | INT32 | DATE |
    Time | INT32 (precision == millis); INT64 (precision == micros/nanos) | TIME | precision (millis/micros/nanos); utcAdjustment (true/false)
    Timestamp | INT64 | TIMESTAMP | precision (millis/micros/nanos); utcAdjustment (true/false)
    Interval | FIXED_LEN_BYTE_ARRAY | INTERVAL |
    BSON | BYTE_ARRAY | BSON |
    JSON | BYTE_ARRAY | JSON |
    Double | DOUBLE | |
    Boolean | BOOLEAN | |
    Binary | BYTE_ARRAY | |
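    For the Parquet types whose primitive type depends on a property (Integer, Decimal, and Time), the rules in Table 65.6 can be read as simple range checks. The sketch below only restates those table rows in Python; the function names are hypothetical and are not a CloverDX or Parquet API. The table lists no primitive type for a Decimal precision of 0, so the sketch rejects that value.

```python
# Restates the Integer, Decimal and Time rows of Table 65.6.
# Function names are hypothetical; this is not a CloverDX or Parquet API.

def integer_primitive(width):
    """Primitive type for the Integer Parquet type with the given bit width."""
    if width not in (8, 16, 32, 64):
        raise ValueError("width must be 8, 16, 32 or 64")
    return "INT32" if width <= 32 else "INT64"


def decimal_primitive(precision):
    """Primitive type for the Decimal Parquet type with the given precision."""
    if precision < 1:
        # The table gives no primitive type for precision 0.
        raise ValueError("no primitive type listed for precision < 1")
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return "BYTE_ARRAY"


def time_primitive(precision):
    """Primitive type for the Time Parquet type with the given precision."""
    if precision == "millis":
        return "INT32"
    if precision in ("micros", "nanos"):
        return "INT64"
    raise ValueError("precision must be millis, micros or nanos")


print(integer_primitive(16))     # INT32
print(decimal_primitive(12))     # INT64
print(decimal_primitive(38))     # BYTE_ARRAY
print(time_primitive("micros"))  # INT64
```

    The Timestamp type needs no such helper, because the table maps it to INT64 for every precision.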

    Limitations

    Writing of nested types (lists and maps) is not supported.

    The unsigned INT types and the INT96 primitive type are not supported.

    The UUID logical type is not supported.

    Partitioning (by value, or by record count) is not supported.

    Compatibility

    Version | Compatibility Notice
    5.10.0 | ParquetWriter is available since 5.10.0 in incubation mode.