ParquetReader reads data stored in Apache Parquet files.
|Component||Data source||Input ports||Output ports||Each to all outputs||Different to different outputs||Transformation||Transf. req.||Java||CTL||Auto-propagated metadata|
|Input||0||⨯||For Input Port Reading|
|Output||0||✓||Successfully read records|
When the error port is disconnected, the component fails on the first error. When it is connected, the component produces an error record for every error and continues to run. A common type of error is, for example, a failure to convert a Parquet data type to a CloverDX data type.
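The two error-handling modes can be illustrated with a minimal Java sketch. It is not CloverDX code: the class `ErrorPortSketch`, the method `read`, and the use of a plain list as the "error port" are hypothetical names chosen for illustration, and string-to-integer parsing stands in for a Parquet type conversion.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "fail on first error" versus "emit error records and continue".
class ErrorPortSketch {

    /**
     * Converts raw values to integers.
     * errorPort == null models a disconnected error port: the first failure aborts the read.
     * A non-null errorPort models a connected port: each failure becomes one error record.
     */
    static List<Integer> read(List<String> rawValues, List<String> errorPort) {
        List<Integer> output = new ArrayList<>();
        for (int i = 0; i < rawValues.size(); i++) {
            try {
                output.add(Integer.parseInt(rawValues.get(i)));
            } catch (NumberFormatException e) {
                if (errorPort == null) {
                    throw new IllegalStateException("Record " + i + " failed: " + e.getMessage(), e);
                }
                errorPort.add("record " + i + ": " + e.getMessage());
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> errors = new ArrayList<>();
        System.out.println(read(List.of("1", "x", "3"), errors)); // [1, 3]
        System.out.println(errors.size());                        // 1
    }
}
```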
ParquetReader propagates metadata when possible, that is, when using a static local file URL.
The schema is automatically read from the specified Parquet file and converted to CloverDX metadata.
Metadata propagation is not available when using port reading or remote protocols.
Metadata on output port 0 can use Autofilling Functions.
URL to a Parquet file to be read.
Can be a single static file URL or a wildcard. Port reading is supported.
The ParquetReader component supports reading from a single Parquet file or multiple Parquet files specified using a file URL wildcard. Supported file URL formats are described in Supported File URL Formats for Readers.
The component also supports Input Port Reading.
The component automatically propagates metadata from the Parquet file schema. If you manually assign metadata, the Parquet columns are mapped to metadata fields by label.
|ParquetReader reads only the columns (fields) present in the output metadata. Reducing the output metadata to contain only the desired fields can have a significant positive impact on read performance (by reducing unnecessary disk or network I/O).|
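The effect of reducing the output metadata can be pictured as a column projection. The following Java sketch is only an illustration of the idea, not the component's implementation: `ColumnProjectionSketch` and `columnsToRead` are hypothetical names, and matching by exact label equality is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: only columns whose name matches a metadata field label are read.
class ColumnProjectionSketch {

    /** Returns the Parquet columns that have a matching label in the output metadata. */
    static List<String> columnsToRead(List<String> parquetColumns, Set<String> metadataLabels) {
        List<String> selected = new ArrayList<>();
        for (String column : parquetColumns) {
            if (metadataLabels.contains(column)) {
                selected.add(column);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> fileColumns = List.of("id", "name", "payload", "created");
        Set<String> labels = Set.of("id", "created");
        // The wide "payload" column is skipped, so its data never has to be fetched.
        System.out.println(columnsToRead(fileColumns, labels)); // [id, created]
    }
}
```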
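The component's check config reports warnings when you try to read a Parquet type into an incompatible CloverDX data type, for example when a Parquet BINARY column is mapped to an incompatible CloverDX field type.
Warnings are also produced when a type is read with loss of precision, for example a Parquet TIMESTAMP in nanosecond precision read into a CloverDX date.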
The ParquetReader component also supports "raw reads", i.e. the ability to read a Parquet column not only to its "natural" type based on logical annotation, but also read the underlying "raw" value into a corresponding CloverDX data type.
Examples of such cases are:
Parquet logical type TIMESTAMP uses the primitive type INT64. By default, it is read into a date type, but it can also be read into long.
Parquet logical type DECIMAL uses primitive types including BYTE_ARRAY (depending on precision). By default, it is read into a decimal type, but it can also be read into the CloverDX type that corresponds to the underlying primitive type.
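To make the relationship between a DECIMAL value and its underlying raw value concrete, here is a minimal Java sketch. It assumes the encoding described in the Parquet format specification (the unscaled value stored as a big-endian two's-complement integer, with the scale kept in the schema). The class `RawDecimalSketch` and its methods are hypothetical and are not part of CloverDX.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical illustration of how a raw DECIMAL value maps to a decimal number.
class RawDecimalSketch {

    /** Interprets big-endian two's-complement bytes as the unscaled value of a decimal. */
    static BigDecimal fromUnscaledBytes(byte[] bigEndianTwosComplement, int scale) {
        return new BigDecimal(new BigInteger(bigEndianTwosComplement), scale);
    }

    /** Same idea for an unscaled value held in an integer primitive. */
    static BigDecimal fromUnscaledLong(long unscaled, int scale) {
        return BigDecimal.valueOf(unscaled, scale);
    }

    public static void main(String[] args) {
        byte[] positive = {0x30, 0x39};              // 0x3039 = 12345
        byte[] negative = {(byte) 0xFF, (byte) 0x85}; // two's complement of -123
        System.out.println(fromUnscaledBytes(positive, 2)); // 123.45
        System.out.println(fromUnscaledBytes(negative, 2)); // -1.23
        System.out.println(fromUnscaledLong(12345L, 2));    // 123.45
    }
}
```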
This raw reading can be used to avoid loss of precision when reading, for example, TIMESTAMP in nanosecond precision.
When reading into a CloverDX date, there is a precision loss because the date has millisecond precision.
Reading into long gives you the full internal value without precision loss.
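The precision difference can be demonstrated in plain Java. This sketch assumes the raw long holds nanoseconds since the Unix epoch, as the Parquet format defines for a nanosecond TIMESTAMP. The class `RawTimestampSketch` is hypothetical, and `java.util.Date` is used only as a stand-in for a millisecond-precision date type.

```java
import java.time.Instant;
import java.util.Date;

// Hypothetical illustration of millisecond truncation versus keeping the raw nanosecond value.
class RawTimestampSketch {

    /** Converts raw nanoseconds to a millisecond-precision Date; sub-millisecond digits are dropped. */
    static Date toDate(long nanosSinceEpoch) {
        return new Date(Math.floorDiv(nanosSinceEpoch, 1_000_000L));
    }

    /** Rebuilds a full-precision Instant from the raw nanosecond value. */
    static Instant toInstant(long nanosSinceEpoch) {
        return Instant.ofEpochSecond(
                Math.floorDiv(nanosSinceEpoch, 1_000_000_000L),
                Math.floorMod(nanosSinceEpoch, 1_000_000_000L));
    }

    public static void main(String[] args) {
        long raw = 1_700_000_000_123_456_789L;
        System.out.println(toDate(raw).getTime());    // 1700000000123 (the trailing 456789 ns are lost)
        System.out.println(toInstant(raw).getNano()); // 123456789 (full precision kept)
    }
}
```

Keeping the value as a long and converting it only where needed leaves the choice of precision to the downstream logic.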
In addition to uncompressed files, the following types of compression are officially supported:
The nested Parquet types (lists and maps) are not supported.
INT types and the INT96 primitive type are not supported.
UUID logical type is read as
ParquetReader is available since 5.10.0 in incubation mode.