    HadoopWriter

    Short Description
    Ports
    Metadata
    HadoopWriter Attributes
    Details
    Troubleshooting
    See also

    Short Description

    HadoopWriter writes data into Hadoop sequence files.

    Component                 HadoopWriter
    Data output               Hadoop sequence file
    Input ports               1
    Output ports              0
    Transformation            no
    Transf. required          no
    Java                      no
    CTL                       no
    Auto-propagated metadata  no

    Ports

    Port type  Number  Required  Description             Metadata
    Input      0       yes       For input data records  Any

    Metadata

    HadoopWriter does not propagate metadata.

    HadoopWriter has no metadata template.

    HadoopWriter Attributes

    Basic

    Hadoop connection

    A Hadoop connection whose Hadoop libraries contain the Hadoop sequence file writer implementation. If the Hadoop connection ID is specified in an hdfs:// URL in the File URL attribute, the value of this attribute is ignored.

    Possible values: Hadoop connection ID

    File URL (required)

    A URL of the output file on HDFS or the local file system.

    URLs without a protocol (i.e., an absolute or relative path) or with the file:// protocol refer to files on the local file system.

    If the output file is to be located on HDFS, use a URL of the form hdfs://ConnID/path/to/myfile, where ConnID is the ID of a Hadoop connection (the Hadoop connection attribute is then ignored) and /path/to/myfile is the absolute path on the corresponding HDFS to the file named myfile.

     
    Key field (required)

    The name of the input record field carrying the key of each written key-value pair.

    Value field (required)

    The name of the input record field carrying the value of each written key-value pair.

    Advanced

    Create empty files

    If set to false, prevents the component from creating an empty output file when there are no input records.

    Possible values: true (default) | false
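    As an illustration of the File URL attribute described above, assuming a Hadoop connection whose ID is HadoopConn0 (a made-up ID used only for this example), typical values might look like:

    ```text
    hdfs://HadoopConn0/user/clover/out.seq   (output on HDFS; the Hadoop connection attribute is ignored)
    file:///tmp/out.seq                      (local file system)
    /tmp/out.seq                             (no protocol: local file system, absolute path)
    ```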

    Details

    HadoopWriter writes data into a special Hadoop sequence file (org.apache.hadoop.io.SequenceFile). These files contain key-value pairs and serve as input and output file formats for MapReduce jobs. The component can write a single file as well as a partitioned file, which must be located on HDFS or the local file system.

    The exact version of the file format created by HadoopWriter depends on the Hadoop libraries you supply in the Hadoop connection referenced from the File URL attribute. In general, sequence files created by one version of Hadoop may not be readable by a different version.

    When writing to a local file system, additional .crc files are created if a Hadoop connection with default settings is used. This is because, by default, Hadoop accesses the local file system through org.apache.hadoop.fs.LocalFileSystem, which creates a checksum file for each written file and verifies the checksum when such files are read. You can disable checksum creation and verification by adding the following key-value pair to the Hadoop Parameters of the Hadoop connection:

    fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem

    For technical details about Hadoop sequence files, see Apache Hadoop Wiki.
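    Outside the component, a quick way to sanity-check an output file is to inspect its magic header: a Hadoop sequence file starts with the 3 bytes "SEQ" followed by a single format-version byte. A minimal Python sketch (the helper name is ours, not part of any Hadoop API):

    ```python
    def is_sequence_file(path):
        """Return True if the file at `path` starts with the Hadoop
        SequenceFile magic header: b"SEQ" plus one version byte."""
        with open(path, "rb") as f:
            header = f.read(4)
        # Short or empty files cannot be valid sequence files.
        return len(header) == 4 and header[:3] == b"SEQ"
    ```

    This only checks the header; it does not validate the rest of the file.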

    Notes and Limitations

    Currently, writing compressed data is not supported.

    HadoopWriter cannot write lists and maps.

    Troubleshooting

    If you write data to a sequence file on a local file system, you may encounter the following error message in the error log:

    Cannot run program "chmod": CreateProcess error=2, The system cannot find the file specified

    or

    Cannot run program "cygpath": CreateProcess error=2, The system cannot find the file specified

    To solve this problem, disable checksum creation and verification by adding the fs.file.impl=org.apache.hadoop.fs.RawLocalFileSystem parameter to the Hadoop Parameters of the Hadoop connection configuration.

    This issue occurs only on non-POSIX operating systems (i.e., MS Windows).