ParquetWriter

Port type	Number	Required	Description	Metadata
Input	0	✓	For records to be written to a Parquet file
Output	0	⨯	For output port writing	ParquetWriter_Output
Output	1	⨯	For successfully written records	ParquetWriter_Success
Output	2	⨯	For failed records	ParquetWriter_Error

Output port writing supports both discrete and stream write modes.

Metadata

ParquetWriter propagates metadata from the input port to output ports 1 and 2.

Metadata on output port 1 contains the URL of the file the record was written to.

Metadata on output port 2 contains fields for troubleshooting the failure.

The component has no metadata template.

The component auto-generates a Parquet schema from the metadata on the input port.

Table 52. Fields for troubleshooting failed records
Field name	Data type	Description
recordNumber	long	Index of the failed record
fieldName	string	Field that caused the failure
errorMessage	string	Text of the message related to the failure
stacktrace	string	The whole stacktrace of the failure

ParquetWriter Attributes

Attribute	Req	Description	Possible values
Basic
File URL	yes	URL to a Parquet file to be written. Output port writing is supported.
Parquet schema	no	User customization of CloverDX to Parquet data type conversion. By default it is auto-generated from input port metadata.
Output mapping	no	Defines the mapping for metadata fields on output port 1.
Error mapping	no	Defines the mapping for metadata fields on output port 2.
Advanced
Create empty files		If set to `false`, prevents the component from creating an empty output file when there are no input records.	true (default) | false
Create directories		If set to `true`, non-existing directories in the File URL attribute path are created.	false (default) | true
Compression type		Type of compression used for the output file.	snappy (default) | gzip | uncompressed
Row group size		Row group size of the Parquet file format. See Apache Parquet documentation for details.	512MB (default)
Page size		Data page size of the Parquet file format. See Apache Parquet documentation for details.	8KB (default)
Format version		Parquet data page format version. Format version `v1` offers better compatibility with other Parquet implementations (where the `v2` format might not be implemented). Format version `v2` produces smaller output files.	v1 (default) | v2

Details

The Parquet format attributes Row group size and Page size influence how the data is organized inside the output Parquet file. These attributes can be fine-tuned to optimize output file size or write performance. More details can be found in the Apache Parquet documentation.

The component holds up to one row group of data (as set by Row group size) in heap memory at once.
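
As a rough illustration of how the two attributes relate, the following Python sketch converts the default values to bytes and estimates how many data pages fit into one row group. The helper name `parse_size` and the use of binary units (1 KB = 1024 bytes) are assumptions made for this example, not something taken from the component. Real page counts also depend on the number of columns, the encoding, and the compression.

```python
import re

# Hypothetical helper: convert a size string such as "512MB" to bytes.
# Binary units (1 KB = 1024 bytes) are an assumption made for this sketch.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text: str) -> int:
    match = re.fullmatch(r"\s*(\d+)\s*(B|KB|MB|GB)\s*", text.upper())
    if match is None:
        raise ValueError(f"Unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


row_group_bytes = parse_size("512MB")  # default Row group size
page_bytes = parse_size("8KB")         # default Page size

# Upper bound on the data buffered in heap for one row group.
print(row_group_bytes)                 # 536870912
# Rough number of data pages per row group, ignoring compression and encoding.
print(row_group_bytes // page_bytes)   # 65536
```

Under these assumptions, lowering Row group size reduces the heap needed by the writer, at the cost of more row groups per output file.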

The component always overwrites existing target files. Appending to an existing file is not supported.

Parquet Schema

The Parquet schema attribute can be used to customize the conversion of CloverDX data types to Parquet data types. The customization is done in the Parquet schema mapping dialog, shown in Figure 372 (Parquet file schema dialog).

In this dialog, you can see all metadata fields, their types, and the target Parquet type of each field. The Parquet type is an abstraction above the Parquet primitive and logical types (as described in the Apache Parquet documentation). The mapping of Parquet types to primitive and logical types is described in Table 53 (Parquet Types). The Parquet type selection offers only the types compatible with the CloverDX data type of the given field.

Figure 372. Parquet file schema dialog

Implicit, auto-mapped types are shown in grey italic font; manually mapped types are shown in black. If the metadata contains a field with an unsupported data type (e.g. a map), the field is also shown as Unsupported in the mapping dialog and its Parquet type selection is disabled. For the Integer Parquet type, the dialog also shows the bit width in the main table for a quick overview.

The Parquet schema is stored as JSON in the job XML, which makes the mapping easier to read and changes easier to track.

Table 53. Parquet Types
Parquet Type	Primitive Type	Logical Type	Properties
String	BYTE_ARRAY	STRING
Enum	BYTE_ARRAY	ENUM
Integer	INT32 (width <= 32); INT64 (width == 64)	INT(width, signed)	Bit width (8/16/32/64)
Decimal	INT32 (1 <= precision <= 9); INT64 (10 <= precision <= 18); BYTE_ARRAY (19 <= precision)	DECIMAL	Precision (>=0); Scale (>=0)
Date	INT32	DATE
Time	INT32 (precision == millis); INT64 (precision == micros/nanos)	TIME	Precision (millis/micros/nanos); UTC adjustment (true/false)
Timestamp	INT64	TIMESTAMP	Precision (millis/micros/nanos); UTC adjustment (true/false)
Interval	FIXED_LEN_BYTE_ARRAY	INTERVAL
BSON	BYTE_ARRAY	BSON
JSON	BYTE_ARRAY	JSON
Double	DOUBLE
Float	FLOAT
Boolean	BOOLEAN
Binary	BYTE_ARRAY (length is empty); FIXED_LEN_BYTE_ARRAY (length is set)		Length (empty or >0)
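
The property-dependent rows of the table can be read as simple selection rules. The Python sketch below restates them for the Integer, Decimal, Time, and Binary types. The function names are hypothetical and the code only mirrors the table; it is not the component's implementation.

```python
from typing import Optional


def integer_primitive(bit_width: int) -> str:
    """Primitive type for the Integer Parquet type, by bit width."""
    if bit_width not in (8, 16, 32, 64):
        raise ValueError("Bit width must be 8, 16, 32 or 64")
    return "INT64" if bit_width == 64 else "INT32"


def decimal_primitive(precision: int) -> str:
    """Primitive type for the Decimal Parquet type, by precision."""
    if precision < 1:
        # The table lists no primitive type for a precision below 1.
        raise ValueError("No primitive type listed for this precision")
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return "BYTE_ARRAY"


def time_primitive(precision: str) -> str:
    """Primitive type for the Time Parquet type, by precision unit."""
    if precision == "millis":
        return "INT32"
    if precision in ("micros", "nanos"):
        return "INT64"
    raise ValueError("Precision must be millis, micros or nanos")


def binary_primitive(length: Optional[int]) -> str:
    """Primitive type for the Binary Parquet type; None means no length set."""
    if length is None:
        return "BYTE_ARRAY"
    if length > 0:
        return "FIXED_LEN_BYTE_ARRAY"
    raise ValueError("Length must be empty or greater than 0")


print(integer_primitive(16))   # INT32
print(decimal_primitive(19))   # BYTE_ARRAY
print(time_primitive("nanos")) # INT64
print(binary_primitive(None))  # BYTE_ARRAY
```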

Limitations

Writing of nested types (lists and maps) is not supported.

The unsigned INT types and the INT96 primitive type are not supported.

The UUID logical type is not supported.

Partitioning (by value, or by record count) is not supported.
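
When preparing a target schema outside the component, the type-related limitations above can be checked up front. The following Python sketch is one possible way to do so; the function name and the way a type is described (primitive name, logical name, signed flag) are assumptions made for illustration and are not part of any CloverDX API.

```python
from typing import Optional

# Logical types that denote nested structures (illustrative naming).
NESTED_LOGICAL_TYPES = {"LIST", "MAP"}


def unsupported_reason(primitive: Optional[str],
                       logical: Optional[str] = None,
                       signed: bool = True) -> Optional[str]:
    """Return the limitation a type runs into, or None if none applies."""
    if logical in NESTED_LOGICAL_TYPES:
        return "nested types (lists and maps) are not supported"
    if primitive == "INT96":
        return "the INT96 primitive type is not supported"
    if logical == "INT" and not signed:
        return "unsigned INT types are not supported"
    if logical == "UUID":
        return "the UUID logical type is not supported"
    return None


print(unsupported_reason("INT32", "INT", signed=False))
print(unsupported_reason("BYTE_ARRAY", "STRING"))
```

The first call reports the unsigned INT limitation; the second returns `None` because a string column is not affected by any of the listed limitations.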

Compatibility

Version	Compatibility Notice
5.10.0	ParquetWriter is available since 5.10.0.
