This is an Apache Kafka Connect sink connector that stores Kafka messages in a Google Cloud Storage (GCS) bucket.

Prerequisites

You must have a GCP project in order to use GCS.

Quick Start

  1. Set up the kcctl client (see the kcctl documentation).
  2. Create a GCS bucket in your GCP project.
  3. Create a JSON file like the following:
    {
        "name": "gcs-sink",
        "config": {
            "connector.class": "io.aiven.kafka.connect.gcs.GcsSinkConnector",
            "tasks.max": "1",
            "topics": "kafka-gcs-input",
            "format.output.type": "json",
            "gcs.bucket.name": "${GCS_BUCKET_NAME}"
        }
    }
    
  4. Run the following command to create the connector:
    kcctl create -f <filename>.json
    

Configuration

The GCS Kafka Connect Sink connector is configured using the following properties:
| Parameter | Required | Description | Default |
|---|---|---|---|
| `connector.class` | Yes | The Java class for the GCS Sink connector. | |
| `tasks.max` | Yes | The maximum number of tasks that should be created for this connector. | |
| `topics` | No | A comma-separated list of Kafka topics to consume from. Only one of `topics` or `topics.regex` should be specified. | |
| `topics.regex` | No | Regular expression giving topics to consume. Under the hood, the regex is compiled to a `java.util.regex.Pattern`. Only one of `topics` or `topics.regex` should be specified. | |
| `gcs.bucket.name` | Yes | The name of the GCS bucket where the data will be stored. | |
| `gcs.credentials.json` | No | The GCP credentials in JSON format. If not provided, the connector will use the default application credentials. | |
| `gcs.credentials.path` | No | The path to a GCP credentials file. Cannot be set together with `gcs.credentials.json` or `gcs.credentials.default`. | |
| `gcs.credentials.default` | No | Whether to connect using the GCP SDK's default credential discovery. When set to `null` (the default) or `false`, the connector falls back to connecting with no credentials. Cannot be set together with `gcs.credentials.json` or `gcs.credentials.path`. | |
| `gcs.object.content.encoding` | No | The GCS object metadata value of `Content-Encoding`. | |
| `gcs.endpoint` | No | Explicit GCS endpoint address, mainly for testing. | |
| `gcs.retry.backoff.initial.delay.ms` | No | Initial retry delay in milliseconds. | `1000` |
| `gcs.retry.backoff.max.delay.ms` | No | Maximum retry delay in milliseconds. | `32000` |
| `gcs.retry.backoff.delay.multiplier` | No | Retry delay multiplier. | `2.0` |
| `gcs.retry.backoff.max.attempts` | No | Maximum number of retry attempts. | `6` |
| `gcs.retry.backoff.total.timeout.ms` | No | Total retry timeout in milliseconds. | `50000` |
| `gcs.user.agent` | No | A custom user agent used while contacting Google. | `Google GCS Sink/3.4.1 (GPN: Aiven;)` |
| `file.name.prefix` | No | The prefix to be added to the name of each file put on GCS. | |
| `file.name.template` | No | The template for file names on GCS. Supports `{{ variable }}` placeholders for substituting variables. Currently supported variables are `topic`, `partition`, and `start_offset` (the offset of the first record in the file). | `{{topic}}-{{partition:padding=false}}-{{start_offset:padding=false}}` |
| `file.compression.type` | No | The compression type used for files put on GCS. The supported values are: `none`, `gzip`, `snappy`, `zstd`. | `none` |
| `file.max.records` | No | The maximum number of records to put in a single file. Must be a non-negative integer; `0` is interpreted as "unlimited". | `0` |
| `file.name.timestamp.timezone` | No | The timezone in which the dates and times for the `timestamp` variable are treated. Use standard short and long names. | `UTC` |
| `file.name.timestamp.source` | No | The source of the `timestamp` variable. | `WALLCLOCK` |
| `format.output.type` | No | The format type of the output content. The supported values are: `avro`, `csv`, `json`, `jsonl`, `parquet`. | `csv` |
| `format.output.fields` | No | Fields to put into output files. The supported values are: `key`, `value`, `offset`, `timestamp`, `headers`. | `value` |
| `format.output.fields.value.encoding` | No | The type of encoding for the value field. The supported values are: `none`, `base64`. | `base64` |
| `format.output.envelope` | No | Whether to enable the envelope for entries with a single field. | `true` |
| `errors.deadletterqueue.topic.name` | No | The name of the topic to be used as the dead letter queue (DLQ) for messages that result in an error when processed by this sink connector, or by its transformations or converters. The topic name is blank by default, which means that no messages are recorded in the DLQ. | |
| `errors.deadletterqueue.topic.replication.factor` | No | Replication factor used to create the dead letter queue topic when it doesn't already exist. | `3` |
| `errors.deadletterqueue.context.headers.enable` | No | If `true`, headers containing error context are added to the messages written to the dead letter queue. To avoid clashing with headers from the original record, all error context header keys start with `__connect.errors.` | `false` |
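As an illustration of how several of these options combine, a fuller connector config might look like the following. The bucket name, topic names, and credentials path are placeholders; this sketch writes gzip-compressed JSONL files of at most 1000 records, keyed by topic/partition/offset, and routes failed records to a dead letter queue topic.

```json
{
    "name": "gcs-sink",
    "config": {
        "connector.class": "io.aiven.kafka.connect.gcs.GcsSinkConnector",
        "tasks.max": "1",
        "topics": "kafka-gcs-input",
        "gcs.bucket.name": "my-gcs-sink-bucket",
        "gcs.credentials.path": "/etc/kafka-connect/gcs-credentials.json",
        "file.name.prefix": "connect/",
        "file.compression.type": "gzip",
        "file.max.records": "1000",
        "format.output.type": "jsonl",
        "format.output.fields": "key,value,timestamp",
        "errors.deadletterqueue.topic.name": "gcs-sink-dlq",
        "errors.deadletterqueue.topic.replication.factor": "1",
        "errors.deadletterqueue.context.headers.enable": "true"
    }
}
```

Note that `gcs.credentials.path` is mutually exclusive with `gcs.credentials.json` and `gcs.credentials.default`, per the table above.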