The BigQuery connector integrates Apache Pulsar with Google BigQuery.
Data can only be synchronized to standard BigQuery tables. External tables and views are not supported.
builtin connector. If you want to create a non-builtin connector, you need to replace --sink-type bigquery with --archive /path/to/pulsar-io-bigquery.nar. You can find the button to download the nar package at the beginning of the document.
--sink-config is the minimum necessary configuration for starting this connector, and it is a JSON string. You need to substitute the relevant parameters with your own.
If you want to configure more parameters, see Configuration Properties for reference.
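As a rough illustration of the paragraph above, the following Python sketch assembles a minimal --sink-config JSON string from the required properties and builds the argument list for both the builtin and the non-builtin case. It only constructs and prints the command and runs nothing. The project, dataset, table, sink, and topic names are placeholders, and the --name, --tenant, --namespace, and --inputs flags are assumptions about a typical pulsarctl sinks create call rather than something this page states.

```python
import json
import shlex

# Required properties from the configuration table; all values are placeholders.
sink_config = {
    "projectId": "my-project",
    "datasetName": "my_dataset",
    "tableName": "my_table",
    # Empty string: the connector falls back to GOOGLE_APPLICATION_CREDENTIALS.
    "credentialJsonString": "",
}


def build_create_command(config, archive=None):
    """Return the argv for creating the sink; nothing is executed here."""
    argv = ["pulsarctl", "sinks", "create"]
    # Builtin connectors use --sink-type; non-builtin ones point at the NAR file.
    if archive is None:
        argv += ["--sink-type", "bigquery"]
    else:
        argv += ["--archive", archive]
    argv += [
        "--name", "bigquery-sink",   # hypothetical sink name
        "--tenant", "public",        # assumed tenant
        "--namespace", "default",    # assumed namespace
        "--inputs", "my-topic",      # hypothetical input topic
        "--sink-config", json.dumps(config),
    ]
    return argv


builtin_cmd = build_create_command(sink_config)
nar_cmd = build_create_command(sink_config, "/path/to/pulsar-io-bigquery.nar")

# shlex.join quotes the JSON string so the line can be pasted into a shell.
print(shlex.join(builtin_cmd))
print(shlex.join(nar_cmd))
```

Serializing the configuration with json.dumps avoids hand-escaping quotes inside the JSON string when it is passed on the command line.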
pulsar-admin are similar to those of pulsarctl. You can find an example in the StreamNative Cloud documentation.

Name | Type | Required | Sensitive | Default | Description |
---|---|---|---|---|---|
projectId | String | Yes | false | "" (empty string) | The Google BigQuery project ID. |
datasetName | String | Yes | false | "" (empty string) | The Google BigQuery dataset name. |
tableName | String | Yes | false | "" (empty string) | The Google BigQuery table name. |
credentialJsonString | String | Yes | true | "" (empty string) | The authentication JSON key. Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the JSON file that contains your service account key when the credentialJsonString is set to an empty string. For details, see the Google documentation. |
visibleModel | String | No | false | "Committed" | The mode that controls when data written to the stream becomes visible in BigQuery for reading. For details, see the Google documentation. Available options are Committed and Pending. |
pendingMaxSize | int | No | false | 10000 | The maximum number of messages waiting to be committed in Pending mode. |
batchMaxSize | int | No | false | 20 | The maximum number of messages in a batch. The batch size in bytes cannot exceed 10 MB; a batch that would exceed this limit is flushed first. For details, see https://cloud.google.com/bigquery/quotas. |
batchMaxTime | long | No | false | 5000 | The maximum batch waiting time (in units of milliseconds). |
batchFlushIntervalTime | long | No | false | 2000 | The batch flush interval (in units of milliseconds). |
failedMaxRetryNum | int | No | false | 20 | The maximum number of retries when an append fails. By default, each retry is delayed by 2 seconds. |
autoCreateTable | boolean | No | false | true | Automatically create a table if no table is available. |
autoUpdateTable | boolean | No | false | true | Automatically update the table schema if the BigQuery table schema is incompatible with the Pulsar schema. |
partitionedTables | boolean | No | false | true | Create a partitioned table when the table is automatically created. It will use the __event_time__ as the partition key. |
partitionedTableIntervalDay | int | No | false | 7 | The number of days between partitioning of the partitioned table. |
clusteredTables | boolean | No | false | true | Create a clustered table when the table is automatically created. It will use the __message_id__ as the cluster key. |
defaultSystemField | String | No | false | "" (empty string) | The system fields to create when the table is automatically created. Separate multiple fields with commas. The supported system fields are: __schema_version__, __partition__, __event_time__, __publish_time__, __message_id__, __sequence_id__, __producer_name__, and __properties__. The __properties__ field is a repeated struct in BigQuery, with key and value both stored as strings. |
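
The defaults and allowed values in the table can be checked before a configuration is submitted. The Python sketch below is one possible pre-flight check, not part of the connector: resolve_config is a hypothetical helper that merges user values over the documented defaults and reports missing required properties, an unknown visibleModel, and unsupported names in defaultSystemField. It deliberately does not require a non-empty credentialJsonString, since the table allows an empty string together with GOOGLE_APPLICATION_CREDENTIALS.

```python
# Defaults copied from the configuration table.
DEFAULTS = {
    "credentialJsonString": "",
    "visibleModel": "Committed",
    "pendingMaxSize": 10000,
    "batchMaxSize": 20,
    "batchMaxTime": 5000,
    "batchFlushIntervalTime": 2000,
    "failedMaxRetryNum": 20,
    "autoCreateTable": True,
    "autoUpdateTable": True,
    "partitionedTables": True,
    "partitionedTableIntervalDay": 7,
    "clusteredTables": True,
    "defaultSystemField": "",
}
# Properties that must be set to a non-empty value.
REQUIRED = ("projectId", "datasetName", "tableName")
VISIBLE_MODELS = ("Committed", "Pending")
SYSTEM_FIELDS = {
    "__schema_version__", "__partition__", "__event_time__", "__publish_time__",
    "__message_id__", "__sequence_id__", "__producer_name__", "__properties__",
}


def resolve_config(user_config):
    """Merge user values over the defaults and return (config, problems)."""
    config = {**DEFAULTS, **user_config}
    problems = []
    for key in REQUIRED:
        if not config.get(key):
            problems.append(f"missing required property: {key}")
    if config["visibleModel"] not in VISIBLE_MODELS:
        problems.append(f"unknown visibleModel: {config['visibleModel']}")
    # defaultSystemField is a comma-separated list of system field names.
    names = [n.strip() for n in config["defaultSystemField"].split(",") if n.strip()]
    for name in names:
        if name not in SYSTEM_FIELDS:
            problems.append(f"unsupported system field: {name}")
    return config, problems


config, problems = resolve_config({
    "projectId": "my-project",       # placeholder values
    "datasetName": "my_dataset",
    "tableName": "my_table",
    "defaultSystemField": "__event_time__, __message_id__",
})
print(problems)
```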
at-most-once, at-least-once, and effectively-once.
Currently, the Google Cloud BigQuery sink connector only provides the at-least-once delivery guarantee.
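
In practice, at-least-once means a message can be written more than once, for example when a write succeeds but its acknowledgement is lost and the message is sent again. The Python sketch below is a generic simulation of that behavior, not the connector's implementation. The flaky_append function, the message contents, and the de-duplication step are all hypothetical, and de-duplicating by __message_id__ assumes that system field was included in the table.

```python
def deliver_at_least_once(messages, append, max_retries=20):
    """Re-send each message until the append is acknowledged."""
    for msg in messages:
        for _attempt in range(max_retries + 1):
            if append(msg):
                break
        else:
            raise RuntimeError(f"gave up on {msg['__message_id__']}")


rows = []            # stands in for the destination table
lost_ack = {"m2"}    # message whose first acknowledgement is lost


def flaky_append(msg):
    """Always writes the row, but drops the first acknowledgement for m2."""
    rows.append(msg)
    if msg["__message_id__"] in lost_ack:
        lost_ack.discard(msg["__message_id__"])
        return False  # row was written, but the sender never learns it
    return True


def dedupe(table_rows):
    """Keep the first row seen for each __message_id__."""
    first_seen = {}
    for row in table_rows:
        first_seen.setdefault(row["__message_id__"], row)
    return list(first_seen.values())


messages = [
    {"__message_id__": "m1", "value": 1},
    {"__message_id__": "m2", "value": 2},
]
deliver_at_least_once(messages, flaky_append)

print(len(rows))          # m2 was written twice
print(len(dedupe(rows)))  # one row per message after de-duplication
```

No message is lost in this model, but readers that need exactly one row per message have to tolerate or remove duplicates themselves.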
Schema | Supported |
---|---|
AVRO | Yes |
PRIMITIVE | Yes |
PROTOBUF_NATIVE | Yes |
PROTOBUF | No |
JSON | No |
KEY_VALUE | No |
autoCreateTable is set to true. If you create a table manually, you need to manually specify the partition key.

autoCreateTable is set to true. If you create a table manually, you need to manually specify the cluster key.