ParquetReader
Short Description
ParquetReader reads data stored in Apache Parquet files.
Component | Data source | Input ports | Output ports | Each to all outputs | Different to different outputs | Transformation | Transf. req. | Java | CTL | Auto-propagated metadata |
---|---|---|---|---|---|---|---|---|---|---|
ParquetReader | Parquet files | 0-1 | 1-2 | ⨯ | ⨯ | ⨯ | ⨯ | ⨯ | ⨯ | ✓ |
Ports
Port type | Number | Required | Description | Metadata |
---|---|---|---|---|
Input | 0 | ⨯ | | |
Output | 0 | ✓ | Successfully read records | |
Output | 1 | ⨯ | Read errors | |
When the error port is disconnected, the component fails on the first error. When it is connected, the component produces an error record for every error and continues to run. A common error is a failure to convert a Parquet data type to a CloverDX data type.
Metadata
ParquetReader propagates metadata when possible, that is, when the File URL is a static local file URL.
Schema is automatically read from the specified Parquet file and converted to CloverDX metadata.
Metadata propagation is not available when using port reading or remote protocols.
Metadata on output port 0 can use Autofilling Functions.
ParquetReader Attributes
Attribute | Req | Description | Possible values |
---|---|---|---|
Basic | | | |
File URL | yes | URL to a Parquet file to be read. Can be a single static file URL or a wildcard. Port reading is supported. | |
Details
The ParquetReader component supports reading from a single Parquet file or multiple Parquet files specified using a file URL wildcard. Supported file URL formats are described in Supported File URL Formats for Readers.
The component also supports Input Port Reading.
The component automatically propagates metadata from the Parquet file schema. If you manually assign metadata, the Parquet columns are mapped to metadata fields by label.
ParquetReader reads only the columns (fields) present in the output metadata. Reducing the output metadata to only the desired fields can significantly improve read performance by avoiding unnecessary disk or network I/O.
Check Config
The component's check config reports warnings when a Parquet type would be read into an incompatible CloverDX data type (e.g. Parquet BINARY into CloverDX integer).
Warnings are also produced when reading a type would lose precision (e.g. Parquet dates with MICROS or NANOS precision).
Raw Reading
The ParquetReader component also supports "raw reads", i.e. the ability to read a Parquet column not only into its "natural" type based on its logical annotation, but also to read the underlying "raw" value into a corresponding CloverDX data type.
Examples of such cases are:
- Parquet logical type TIMESTAMP uses the primitive type INT64. By default, this is read into the date type, but it can also be read into long.
- Parquet logical type DECIMAL uses primitive types INT32, INT64 or BYTE_ARRAY (depending on precision). By default, this is read into the decimal type, but it can also be read into integer, long, or byte.
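To illustrate what a raw DECIMAL read yields, the following is a minimal Python sketch, not CloverDX code. It assumes the Parquet convention that a DECIMAL is stored as an unscaled integer (for BYTE_ARRAY, as big-endian two's-complement bytes) with the scale declared in the file schema. The function names and sample values are hypothetical.

```python
from decimal import Decimal


def unscaled_from_bytes(raw: bytes) -> int:
    """Decode a big-endian two's-complement byte array into the unscaled integer."""
    return int.from_bytes(raw, byteorder="big", signed=True)


def unscaled_to_decimal(unscaled: int, scale: int) -> Decimal:
    """Apply the schema's scale to an unscaled integer without any rounding."""
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(ch) for ch in str(abs(unscaled)))
    return Decimal((sign, digits, -scale))


# Hypothetical raw values, as they might arrive in an integer/long or byte field.
print(unscaled_to_decimal(12345, 2))                              # 123.45
print(unscaled_to_decimal(unscaled_from_bytes(b"\xff\x85"), 2))   # -1.23
```

The scale is not part of the raw value, so a consumer of raw DECIMAL data has to obtain it separately, for example from the Parquet schema.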
Raw reading can be used to avoid loss of precision when reading, for example, a TIMESTAMP in nanosecond precision.
When reading into a CloverDX date, precision is lost because date has millisecond precision.
Reading into long gives you the full internal value without precision loss.
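The precision loss described above can be shown with plain arithmetic. This is a minimal Python sketch, not CloverDX code, assuming the raw long holds nanoseconds since the Unix epoch (the unit is declared by the column's TIMESTAMP annotation). The sample value and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

NANOS_PER_MILLI = 1_000_000
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def split_raw_nanos(raw_nanos: int) -> Tuple[int, int]:
    """Split raw epoch nanoseconds into whole milliseconds and the remainder."""
    return divmod(raw_nanos, NANOS_PER_MILLI)


def millis_to_datetime(millis: int) -> datetime:
    """Build a UTC timestamp with millisecond precision using integer arithmetic."""
    return EPOCH + timedelta(milliseconds=millis)


# Hypothetical raw value, as it might arrive in a long field.
raw = 1_700_000_000_123_456_789
millis, dropped_nanos = split_raw_nanos(raw)

print(millis)          # 1700000000123  (what a millisecond-precision type can keep)
print(dropped_nanos)   # 456789         (sub-millisecond part that would be lost)
print(millis_to_datetime(millis).isoformat())
```

Keeping the value as a long preserves the `dropped_nanos` part, which can then be handled explicitly downstream.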
Compression
In addition to uncompressed files, the following types of compression are officially supported:
- gzip
- snappy
Limitations
Nested Parquet types (lists and maps) are not supported.
The unsigned INT types and the INT96 primitive type are not supported.
The UUID logical type is read as byte.
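Because a UUID arrives as raw bytes, downstream code may need to convert it to the usual textual form. The following minimal Python sketch, not CloverDX code, assumes the Parquet convention of a 16-byte big-endian UUID value. The function name and sample bytes are hypothetical.

```python
import uuid


def bytes_to_uuid(raw: bytes) -> uuid.UUID:
    """Convert a 16-byte big-endian value into a UUID object."""
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    return uuid.UUID(bytes=raw)


# Hypothetical value, as it might arrive in a byte field.
raw = bytes.fromhex("123e4567e89b12d3a456426614174000")
print(bytes_to_uuid(raw))   # 123e4567-e89b-12d3-a456-426614174000
```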
Compatibility
Version | Compatibility Notice |
---|---|
5.10.0 | ParquetReader is available since 5.10.0. |