GPSS load configuration file for a Kafka data source (version 3).
Synopsis
version: v3
targets:
- gpdb:
    host: <host>
    port: <greenplum_port>
    user: <user_name>
    password: <password>
    database: <db_name>
    work_schema: <work_schema_name>
    error_limit: <num_errors> | <percentage_errors>
    filter_expression: <filter_string>
    tables:
    - table: <table_name>
      schema: <schema_name>
      mode:
        # specify a single mode property block (described below)
        insert: {}
        update:
          <mode_specific_property>: <value>
          ...
        merge:
          <mode_specific_property>: <value>
          ...
      transformer:
        transform: <udf_transform_udf_name>
        properties:
          <udf_transform_property_name>: <property_value>
          ...
        columns:
        - <udf_transform_column_name>
        ...
      mapping:
        <target_column_name>: <source_column_name> | <expression>
        ...
      filter: <output_filter_string>
    ...
sources:
- kafka:
    topic: <kafka_topic>
    brokers: <kafka_broker_host:broker_port>, ...
    partitions: (<partition_numbers>)
    key_content:
      <data_format>:
        <column_spec>
        <other_props>
    value_content:
      <data_format>:
        <column_spec>
        <other_props>
    meta:
      json:
        column:
          name: meta
          type: json
    encoding: <char_set>
    transformer:
      path: <path_to_plugin_transform_library>
      on_init: <plugin_transform_init_name>
      transform: <plugin_transform_name>
      properties:
        <plugin_transform_property_name>: <property_value>
        ...
    rdkafka_prop:
      <kafka_property_name>: <kafka_property_value>
      ...
    task:
      batch_size:
        max_count: <number_of_rows>
        interval_ms: <wait_time>
      idle_duration_ms: <idle_time>
      window_size: <num_batches>
      window_statement: <udf_or_sql_to_run>
      prepare_statement: <udf_or_sql_to_run>
      teardown_statement: <udf_or_sql_to_run>
      save_failing_batch: <boolean>
      recover_failing_batch: <boolean> (Beta)
      consistency: strong | at-least | at-most | none
      fallback_offset: earliest | latest
option:
  schedule:
    max_retries: <num_retries>
    retry_interval: <retry_time>
    running_duration: <run_time>
    auto_stop_restart_interval: <restart_time>
    max_restart_times: <num_restarts>
    quit_at_eof_after: <clock_time>
  alert:
    command: <command_to_run>
    workdir: <directory>
    timeout: <alert_time>
Where the mode-specific properties that you can specify for update and merge mode are as follows:
update:
  match_columns: [<match_column_names>]
  order_columns: [<order_column_names>]
  update_columns: [<update_column_names>]
  update_condition: <update_condition>

merge:
  match_columns: [<match_column_names>]
  update_columns: [<update_column_names>]
  order_columns: [<order_column_names>]
  update_condition: <update_condition>
  delete_condition: <delete_condition>
Where data_format, column_spec, and other_props are one of the following blocks:
avro:
  source_column_name: <column_name>
  schema_url: <http://schemareg_host:schemareg_port>, ...
  bytes_to_base64: <boolean>
  schema_ca_on_gpdb: <sr_ca_file_path>
  schema_cert_on_gpdb: <sr_cert_file_path>
  schema_key_on_gpdb: <sr_key_file_path>
  schema_min_tls_version: <minimum_version>
  schema_path_on_gpdb: <path_to_file>

binary:
  source_column_name: <column_name>

csv:
  columns:
    - name: <column_name>
      type: <column_data_type>
    ...
  delimiter: <delim_char>
  quote: <quote_char>
  null_string: <nullstr_val>
  escape: <escape_char>
  force_not_null: <columns>
  fill_missing_fields: <boolean>

custom:
  columns:
    - name: <column_name>
      type: <column_data_type>
    ...
  name: <formatter_name>
  options:
    - <optname>=<optvalue>
    ...

delimited:
  columns:
    - name: <column_name>
      type: <column_data_type>
    ...
  delimiter: <delimiter_string>
  eol_prefix: <prefix_string>
  quote: <quote_char>
  escape: <escape_char>

json:
  column:
    name: <column_name>
    type: json | jsonb
  is_jsonl: <boolean>
  newline: <newline_str>
And where you may specify any property value with a template variable that GPSS substitutes at runtime using the following syntax:
<property>: {{<template_var>}}
Description
Version 3 of the GPSS load configuration file differs in both content and format from previous versions of the file. Certain symbols used in the GPSS version 1 and 2 configuration file reference page syntax have different meanings in version 3 syntax:
- Brackets [] are literal and are used to specify a list in version 3. They are no longer used to signify the optionality of a property.
- Curly braces {} are literal and are used to specify YAML mappings in version 3 syntax. They are no longer used with the pipe symbol (|) to identify a list of choices.
You specify load configuration properties for a Greenplum Streaming Server (GPSS) Kafka load job in a YAML-formatted configuration file. (This reference page uses the name gpkafka-v3.yaml when referring to this file; you may choose your own name for the file.) Load properties include Greenplum Database connection and data import properties, Kafka broker, topic, and message format information, and properties specific to the GPSS job.
The gpsscli and gpkafka load utilities process the YAML configuration file in order, using indentation (spaces) to determine the document hierarchy and the relationships between the sections. The use of white space in the file is significant. Keywords are not case-sensitive.
Keywords and Values
version Property
- version: v3
- The version of the configuration file. You must specify version: v3.
targets:gpdb: Properties
- host: host
- The host name or IP address of the Greenplum Database coordinator host.
- port: greenplum_port
- The port number of the Greenplum Database server on the coordinator host.
- user: user_name
- The name of the Greenplum Database user/role. This user_name must have the permissions described in Configuring Greenplum Database Role Privileges.
- password: password
- The password for the Greenplum Database user/role.
- database: db_name
- The name of the Greenplum database.
- work_schema: work_schema_name
- The name of the Greenplum Database schema in which GPSS creates internal tables. The default work_schema_name is public.
- error_limit: num_errors | percentage_errors
- The error threshold, specified as either an absolute number or a percentage. GPSS stops running the job when this limit is reached.
- filter_expression: filter_string
- The filter to apply to the input data before GPSS loads the data into Greenplum Database. If the filter evaluates to true, GPSS loads the message. If the filter evaluates to false, the message is dropped. filter_string must be a valid SQL conditional expression and may reference one or more source value, key, or meta column names.
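For example, assuming the source data includes hypothetical columns named c1 (integer) and c2 (text), the following filter loads only messages that satisfy both conditions:

filter_expression: (c1 > 100) AND (c2 IS NOT NULL)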
- tables:
- The Greenplum Database tables, and the data that GPSS will load into each.
- table: table_name
- The name of the Greenplum Database table into which GPSS loads the data.
- schema: schema_name
- The name of the Greenplum Database schema in which table_name resides. Optional; the default schema is the public schema.
- mode:
- The table load mode; insert, merge, or update. The default mode is insert.
update and merge are not supported if the target table column name is a reserved keyword, has capital letters, or includes any character that requires quotes (" ") to identify the column.
- insert:
- Inserts source data into Greenplum.
- update:
- Updates the target table columns that are listed in update_columns when the input columns identified in match_columns match the named target table columns and the optional update_condition is true.
- merge:
- Inserts new rows and updates existing rows when:
  - columns are listed in update_columns,
  - the match_columns target table column values are equal to the input data, and
  - an optional update_condition is specified and met.
Deletes rows when:
  - the match_columns target table column values are equal to the input data, and
  - an optional delete_condition is specified and met.
New rows are identified when the match_columns value in the source data does not have a corresponding value in the existing data of the target table. In those cases, the entire row from the source data is inserted, not only the match_columns and update_columns. If there are multiple new match_columns values in the input data that are the same, GPSS inserts or updates the target table using a random matching input row. When you specify order_columns, GPSS sorts the input data on the specified column(s) and inserts or updates from the input row with the largest value.
- mode_property_name: value
- The name to value mapping for a mode property. Each mode supports one or more of the following properties as specified in the Synopsis.
- match_columns: [match_column_names]
- A comma-separated list that specifies the column(s) to use as the join condition for the update. The attribute value in the specified target column(s) must be equal to that of the corresponding source data column(s) in order for the row to be updated in the target table.
Required when mode is merge or update.
- order_columns: [order_column_names]
- A comma-separated list that specifies the column(s) by which GPSS sorts the rows. When multiple matching rows exist in a batch, order_columns is used with match_columns to determine the input row with the largest value; GPSS uses that row to write/update the target.
Optional. May be specified in merge mode to sort the input data rows.
- update_columns: [update_column_names]
- A comma-separated list that specifies the column(s) to update for the rows that meet the match_columns criteria and the optional update_condition.
Required when mode is merge or update.
- update_condition: update_condition
- Specifies a boolean condition, similar to that which you would declare in a WHERE clause, that must be met in order for a row in the target table to be updated (or inserted, in the case of a merge). Optional.
- delete_condition: delete_condition
- In merge mode, specifies a boolean condition, similar to that which you would declare in a WHERE clause, that must be met for GPSS to delete rows in the target table that meet the match_columns criteria. Optional.
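For example, a merge mode block that combines these properties might look like the following sketch (all column names and condition values here are hypothetical):

mode:
  merge:
    match_columns: [pk]
    order_columns: [ts]
    update_columns: [description, ts]
    update_condition: ts > '2023-01-01'
    delete_condition: description = 'OBSOLETE'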
- transformer:
- Optional. Output data transform block. An output data transformer is a user-defined function (UDF) that transforms the data before it is loaded into Greenplum Database. The semantics of the UDF are transform-specific.
GPSS currently supports specifying only one of the mapping or (UDF) transformer blocks in the load configuration file, not both.
- transform: udf_transform_udf_name
- The name of the output transform UDF. GPSS invokes this function for every batch of data it writes to Greenplum Database.
- properties: udf_transform_property_name: property_value
- One or more property name and value pairs that GPSS passes to udf_transform_udf_name.
- columns: udf_transform_column_name
- The name of one or more columns involved in the transform.
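For example, a transformer block that invokes a hypothetical output transform UDF named my_scale_udf on a data column might look like the following sketch:

transformer:
  transform: my_scale_udf
  properties:
    scale_factor: '10'
  columns:
    - data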
- mapping:
- Optional. Overrides the default source-to-target column mapping.
GPSS currently supports specifying only one of the mapping or (UDF) transformer blocks in the load configuration file, not both.
When you specify a mapping, ensure that you provide a mapping for all source data elements of interest. GPSS does not automatically match column names when you provide a mapping block.
- target_column_name: source_column_name | expression
- target_column_name specifies the target Greenplum Database table column name. GPSS maps this column name to the source column name specified in source_column_name, or to an expression. When you specify an expression, you may provide a value expression that you would specify in the SELECT list of a query, such as a constant value, a column reference, an operator invocation, a built-in or user-defined function call, and so on.
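For example, assuming hypothetical target columns named id, amount, and load_time, a mapping block may mix direct column references and expressions:

mapping:
  id: customer_id
  amount: (price * quantity)::decimal
  load_time: now()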
- filter: output_filter_string
- The filter to apply to the output data before GPSS loads the data into Greenplum Database. If the filter evaluates to true, GPSS loads the message. If the filter evaluates to false, the message is dropped. output_filter_string must be a valid SQL conditional expression and may reference one or more META or VALUE column names.
sources:kafka: Options
- topic: kafka_topic
- The name of the Kafka topic from which to load data. The topic must exist.
- brokers: kafka_broker_host:broker_port
- A host and port number for each of one or more Kafka brokers.
- partitions: (partition_numbers)
- A single partition number, a comma-separated list of partition numbers, and/or a range of partition numbers from which GPSS reads messages from the Kafka topic. A range that you specify with the M...N syntax includes both the range start and end values. By default, GPSS reads messages from all partitions of the Kafka topic.
Ensure that you do not configure multiple jobs that specify overlapping partition numbers in the same topic; GPSS behavior is undefined.
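For example, the following (hypothetical) setting instructs GPSS to read from partition 0, partition 3, and partitions 5 through 7, inclusive:

partitions: (0, 3, 5...7)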
- key_content:
- The Kafka message key data type, field names, and type-specific properties. You must specify all Kafka key elements in the order in which they appear in the Kafka message. Optional when you specify a value_content block; GPSS ignores the Kafka message key in this circumstance.
- value_content:
- The Kafka message value data type, field names, and type-specific properties. You must specify all Kafka data elements in the order in which they appear in the Kafka message. Optional when you specify a key_content block; GPSS ignores the Kafka message value in this circumstance.
You must not provide a value_content block when you specify csv format for the key_content block. Similarly, you must not provide a key_content block when you specify csv format for a value_content block.
- column_spec
- The source to Greenplum column mapping. The supported column specification differs for different data formats as described below.
- The default source-to-target data mapping behavior of GPSS is to match a column name as defined in source_column_name, column:name, or columns:name with a column name in the target Greenplum Database table. You can override the default mapping by specifying a mapping: block.
- data_format
- The format of the key or value data. You may specify a data_format of avro, binary, csv, custom, delimited, or json for the key and value, with some restrictions.
- avro
- When you specify the avro data format for a key or value, GPSS reads the data into a single json-type column. You may specify a schema registry location and optional SSL certificates and keys, and whether or not you want GPSS to convert bytes fields into base64-encoded strings.
- source_column_name: column_name
- The name of the single json-type column into which GPSS reads the key or value data.
- schema_url: schemareg_host:schemareg_port
- When you specify the avro format and the Avro schema of the JSON data that you want to load is registered in the Confluent Schema Registry, you must identify the host name and port number of each Confluent Schema Registry server in your Kafka cluster. You may specify more than one address, and at least one of the addresses must be valid.
- bytes_to_base64: boolean
- When true, GPSS converts Avro bytes fields into base64-encoded strings. The default value is false; GPSS does not perform the conversion.
- schema_ca_on_gpdb: sr_ca_file_path
- The file system path to the CA certificate that GPSS uses to verify the peer. This file must reside in sr_ca_file_path on all Greenplum Database segment hosts.
- schema_cert_on_gpdb: sr_cert_file_path
- The file system path to the client certificate that GPSS uses to connect to the HTTPS schema registry. This file must reside in sr_cert_file_path on all Greenplum Database segment hosts.
- schema_key_on_gpdb: sr_key_file_path
- The file system path to the private key file that GPSS uses to connect to the HTTPS schema registry. This file must reside in sr_key_file_path on all Greenplum Database segment hosts.
- schema_min_tls_version: minimum_version
- The minimum transport layer security (TLS) version that GPSS requests on the connection to the schema registry. Supported versions are 1.0, 1.1, 1.2, or 1.3. The default minimum TLS version is 1.0.
- schema_path_on_gpdb: path_to_file
- When you specify the avro format and the Avro schema of the JSON key or value data that you want to load is specified in a separate .avsc file, you must identify the file system location in path_to_file, and the file must reside in this location on every Greenplum Database segment host.
GPSS does not cache the schema. GPSS must reload the schema for every batch of Kafka data. Also, GPSS supports providing the schema for either the key or the value, but not both.
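For example, a value_content block for Avro data whose schema is registered in a hypothetical schema registry listening on localhost:8081 might look like the following sketch:

value_content:
  avro:
    source_column_name: value
    schema_url: http://localhost:8081
    bytes_to_base64: true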
- binary
- When you specify the binary data format, GPSS reads the data into a single bytea-type column.
- source_column_name: column_name
- The name of the single bytea-type column into which GPSS reads the key or value data.
- csv
- When you specify the csv data format, GPSS reads the data into the list of columns that you specify. The message content cannot contain line ending characters (CR and LF).
- columns:
- A set of column name/type mappings. The value [] specifies all columns.
- name: column_name
- The name of a key or value column. column_name must match the column name of the target Greenplum Database table.
- type: column_data_type
- The data type of the column. You must specify an equivalent data type for each data element and the associated Greenplum Database table column.
- delimiter: delim_char
- Specifies a single ASCII character that separates columns within each message or row of data. The default delimiter is a comma (,).
- quote: quote_char
- Specifies the quotation character. Because GPSS does not provide a default value for this property, you must specify a value.
- null_string: nullstr_val
- Specifies the string that represents the null value. Because GPSS does not provide a default value for this property, you must specify a value.
- escape: escape_char
- Specifies the single character that is used for escaping data characters in the content that might otherwise be interpreted as row or column delimiters. Make sure to choose an escape character that is not used anywhere in your actual column data. Because GPSS does not provide a default value for this property, you must specify a value.
- force_not_null: columns
- Specifies a comma-separated list of column names to process as though each column were quoted and hence not a NULL value. For the default null_string (nothing between two delimiters), missing values are evaluated as zero-length strings.
- fill_missing_fields: boolean
- Specifies the action of GPSS when it reads a row of data that has missing trailing field values (the row has missing data fields at the end of a line or row). The default value is false; GPSS returns an error when it encounters a row with missing trailing field values.
If set to true, GPSS sets missing trailing field values to NULL. Blank rows, fields with a NOT NULL constraint, and trailing delimiters on a line will still generate an error.
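For example, a value_content block for CSV data with two hypothetical columns follows; note that quote, null_string, and escape have no defaults and must be specified:

value_content:
  csv:
    columns:
      - name: id
        type: int
      - name: msg
        type: text
    delimiter: ','
    quote: '"'
    null_string: 'NULL'
    escape: '\'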
- custom
- When you specify the custom data format, GPSS uses the custom formatter that you specify to process the input data before writing it to Greenplum Database.
- columns:
- A set of column name/type mappings. The value [] specifies all columns.
- name: column_name
- The name of a key or value column. column_name must match the column name of the target Greenplum Database table.
- type: column_data_type
- The data type of the column. You must specify an equivalent data type for each data element and the associated Greenplum Database table column.
- name: formatter_name
- When you specify the custom data format, formatter_name is required and must identify the name of the formatter user-defined function that GPSS should use when loading the data.
- options:
- A set of function argument name=value pairs.
- optname=optvalue
- The name and value of the set of arguments to pass into the formatter_name UDF.
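For example, a value_content block that references a hypothetical formatter UDF named my_formatter might look like the following sketch:

value_content:
  custom:
    columns:
      - name: id
        type: int
    name: my_formatter
    options:
      - compression=none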
- delimited
- When you specify the delimited data format, GPSS reads the data into the list of columns that you specify. You must specify the data delimiter.
- columns:
- A set of column name/type mappings. The value [] specifies all columns.
- name: column_name
- The name of a key or value column. column_name must match the column name of the target Greenplum Database table.
- type: column_data_type
- The data type of the column. You must specify an equivalent data type for each data element and the associated Greenplum Database table column.
- delimiter: delimiter_string
- When you specify the delimited data format, delimiter_string is required and must identify the data element delimiter. delimiter_string may be a multi-byte value, and up to 32 bytes in length. It may not contain quote and escape characters.
- eol_prefix: prefix_string
- Specifies the prefix before the end of line character (\n) that indicates the end of a row. The default prefix is empty.
- quote: quote_char
- Specifies the single ASCII quotation character. The default quote character is empty.
- If you do not specify a quotation character, GPSS assumes that all columns are unquoted. If you do not specify a quotation character and do specify an escape character, GPSS assumes that all columns are unquoted and escapes the delimiter, end-of-line prefix, and escape itself.
- When you specify a quotation character, you must specify an escape character. GPSS reads any content between quote characters as-is, except for escaped characters.
- escape: escape_char
- Specifies the single ASCII character used to escape special characters (for example, the delimiter, eol_prefix, quote, or escape itself). The default escape character is empty.
- When you specify an escape character and do not specify a quotation character, GPSS escapes only the delimiter, end-of-line prefix, and escape itself.
- When you specify both an escape character and a quotation character, GPSS escapes only these characters.
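For example, a value_content block for pipe-delimited data with hypothetical columns might look like:

value_content:
  delimited:
    columns:
      - name: id
        type: int
      - name: msg
        type: text
    delimiter: '|'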
- json
- When you specify the json data format, GPSS can read the data as a single JSON object or as a single JSON record per line.
- column:
- A single column name/type mapping.
- name: column_name
- The name of the key or value column. column_name must match the column name of the target Greenplum Database table.
- type: json | jsonb | gp_jsonb (Beta) | gp_json (Beta)
- The data type of the column.
- is_jsonl: boolean
- Identifies whether GPSS reads the JSON data as a single object or as a single record per line. The default is false; GPSS reads the JSON data as a single object.
- newline: newline_str
- A string that specifies the new line character(s) that end each JSON record. The default newline is "\n".
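For example, a value_content block that reads each Kafka message as JSON Lines (one JSON record per line) into a single jsonb-type column might look like:

value_content:
  json:
    column:
      name: value
      type: jsonb
    is_jsonl: true
    newline: "\n"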
- meta:
- The data type and field name of the Kafka meta data. meta: must specify the json or jsonb (Greenplum 6 only) data format, and a single json-type column. The available Kafka meta data properties include:
  - topic (text) - the Kafka topic name
  - partition (int) - the partition number
  - offset (bigint) - the record location within the partition
  - timestamp (bigint) - the time that the message was appended to the Kafka log
You can load any of these properties into the target table with a mapping, or use a property in the update or merge criteria for a load operation.
- encoding: char_set
- The source data encoding. You can specify an encoding character set when the source data is of the csv, custom, delimited, or json format. GPSS supports the character sets identified in Character Set Support in the Tanzu Greenplum documentation.
- transformer:
- Input data transform block. An input data transformer is a plugin: a set of Go functions that transform the data after it is read from the source. The semantics of the transform are function-specific. You specify the library and function names in this block, as well as the properties that GPSS passes to these functions:
- path: path_to_plugin_transform_library
- The file system location of the plugin transformer library on the Greenplum Streaming Server server host.
- on_init: plugin_transform_init_name
- The name of an initialization function that GPSS calls when it loads the transform library.
- transform: plugin_transform_name
- The name of the transform function. GPSS invokes this function for every message it reads.
- properties: plugin_transform_property_name: property_value
- One or more property name and value pairs that GPSS passes to plugin_transform_init_name and plugin_transform_name.
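For example, a transformer block that references a hypothetical plugin library and function names might look like the following sketch:

transformer:
  path: /usr/local/gpss/plugins/my_transform.so
  on_init: InitTransform
  transform: TransformMessage
  properties:
    mode: strict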
- rdkafka_prop:
-
Kafka consumer configuration property names and values.
- kafka_property_name
- The name of a Kafka property.
- kafka_property_value
- The Kafka property value.
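For example, the following sketch sets two standard librdkafka consumer properties (the values shown are illustrative):

rdkafka_prop:
  group.id: gpss_consumer_group
  session.timeout.ms: '30000'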
- task:
- The batch size and commit window.
- batch_size:
- Controls how GPSS commits data to Greenplum Database. You may specify both max_count and interval_ms as long as both values are not zero (0). Try setting and tuning interval_ms to your environment; introduce a max_count setting only if you encounter high memory usage associated with message buffering.
- max_count: number_of_rows
- The number of rows to batch before triggering an INSERT operation on the Greenplum Database table. The default value of max_count is 0, which instructs GPSS to ignore this commit trigger condition.
- interval_ms: wait_time
- The minimum amount of time to wait (milliseconds) between each INSERT operation on the table. The default value is 5000.
- idle_duration_ms: idle_time
- The maximum amount of time to wait (milliseconds) for the first batch of Kafka data. When you use this property to enable lazy load, GPSS waits until Kafka data is available before locking the target Greenplum table. You can specify:
  - 0 (lazy load is deactivated)
  - -1 (lazy load is activated, the job never stops), or
  - a positive value (lazy load is activated, the job stops after idle_time duration of no data in the Kafka topic)
The default value is 0.
- window_size: num_batches
- The number of batches to read before running window_statement. The default value is 0.
- window_statement: udf_or_sql_to_run
- A user-defined function or SQL command(s) that you want to run after GPSS reads window_size number of batches. The default is null, no command to run.
- prepare_statement: udf_or_sql_to_run
- A user-defined function or SQL command(s) that you want GPSS to run before it executes the job. The default is null, no command to run.
- teardown_statement: udf_or_sql_to_run
- A user-defined function or SQL command(s) that you want GPSS to run after the job stops. GPSS runs the function or command(s) on job success and job failure. The default is null, no command to run.
- save_failing_batch: boolean
- Determines whether or not GPSS saves data into a backup table before it writes the data to Greenplum Database. Saving the data in this manner aids recovery when GPSS encounters errors during the evaluation of expressions. The default is false; GPSS does not use a backup table, and returns immediately when it encounters an expression error. When you set this property to true, GPSS writes both the good and the bad data in the batch to a backup table named gpssbackup_<jobhash>, and continues to process incoming data. You must then manually load the good data from the backup table into Greenplum or set recover_failing_batch (Beta) to true to have GPSS automatically reload the good data.
Using a backup table to hedge against mapping errors may impact performance, especially when the data that you are loading has not been cleaned.
- recover_failing_batch: boolean (Beta)
- When set to true and save_failing_batch is also true, GPSS automatically reloads the good data in the batch and retains only the error data in the backup table. The default value is false; GPSS does not process the backup table.
Enabling this property requires that GPSS has the Greenplum Database privileges to create a function.
- consistency: strong | at-least | at-most | none
- Specify how GPSS should manage message offsets when it acts as a high-level Kafka consumer. Valid values are strong, at-least, at-most, and none. The default value is strong. Refer to Understanding Kafka Message Offset Management for more detailed information.
- fallback_offset: earliest | latest
- Specifies the behavior of GPSS when it detects a Kafka message offset gap. When set to earliest, GPSS automatically resumes a load operation from the earliest available published message. When set to latest, GPSS loads only new messages to the Kafka topic. If this property is not set, GPSS returns an error.
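For example, a task block that commits a batch every 10 seconds or every 5000 rows, whichever occurs first, and activates lazy load indefinitely (illustrative values), might look like:

task:
  batch_size:
    max_count: 5000
    interval_ms: 10000
  idle_duration_ms: -1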
option: Properties
- schedule:
- Controls the frequency and interval of restarting jobs.
- retry_interval: retry_time
- The period of time that GPSS waits before retrying a failed job. You can specify the time interval in day (d), hour (h), minute (m), second (s), or millisecond (ms) integer units; do not mix units. The default retry interval is 5m (5 minutes).
- max_retries: num_retries
- The maximum number of times that GPSS attempts to retry a failed job. The default is 0, do not retry. If you specify a negative value, GPSS retries the job indefinitely.
- running_duration: run_time
- The amount of time after which GPSS automatically stops a job. GPSS does not automatically stop a job by default.
- auto_stop_restart_interval: restart_time
- The amount of time after which GPSS restarts a job that it stopped due to reaching running_duration.
- max_restart_times: num_restarts
- The maximum number of times that GPSS restarts a job that it stopped due to reaching running_duration. The default is 0, do not restart the job.
- quit_at_eof_after: clock_time
- The clock time after which GPSS stops a job every day when it encounters an EOF. By default, GPSS does not automatically stop a job that reaches EOF. GPSS never stops a job when the current time is before clock_time, even when GPSS encounters an EOF.
- alert:
- Controls notification when a job is stopped for any reason (success, completion, error, user-initiated stop).
- command: command_to_run
- The program that the GPSS server runs on the GPSS server host, including arguments. The command must be executable by GPSS.
- command_to_run has access to job-related environment variables that GPSS sets, including: $GPSSJOB_NAME, $GPSSJOB_STATUS, and $GPSSJOB_DETAIL.
- workdir: directory
- The working directory for command_to_run. The default working directory is the directory from which you started the GPSS server process. If you specify a relative path, it is relative to the directory from which you started the GPSS server process.
- timeout: alert_time
- The amount of time after a job stops, prompting GPSS to trigger the alert (and run command_to_run). You can specify the time interval in day (d), hour (h), minute (m), or second (s) integer units; do not mix units. The default alert timeout is -1s (no timeout).
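For example, an option block that retries a failed job three times at one-minute intervals, and runs a hypothetical alert script when the job stops, might look like the following sketch:

option:
  schedule:
    max_retries: 3
    retry_interval: 1m
  alert:
    command: ./alert_email.sh
    workdir: /home/gpadmin
    timeout: 30s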
Template Variables
GPSS supports using template variables to specify property values in the load configuration file.
You specify a template variable value in the load configuration file as follows:
<property>: {{<template_var>}}
For example:
max_retries: {{numretries}}
GPSS substitutes the template variable with a value that you specify via the -p | --property <template_var=value> option to the gpsscli dryrun, gpsscli submit, gpsscli load, or gpkafka load command.
For example, if the command line specifies:
--property numretries=10
GPSS substitutes occurrences of {{numretries}} in the load configuration file with the value 10 before submitting the job, and uses that value while the job is running.
Notes
If you created a database object name using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes in the load configuration file. For example, if you create a table as follows:
CREATE TABLE "MyTable" (c1 text);
Your YAML configuration file would refer to the table name as:
targets:
- gpdb:
    tables:
    - table: '"MyTable"'
You can specify backslash escape sequences in the CSV delimiter, quote, and escape options. GPSS supports the standard backslash escape sequences for backspace, form feed, newline, carriage return, and tab, as well as escape sequences that you specify in hexadecimal format (prefaced with \x). Refer to Backslash Escape Sequences in the PostgreSQL documentation for more information.
Kafka Properties
GPSS requires Kafka version 0.11 or newer for exactly-once delivery assurance. You can run with an older version of Kafka (but lose the exactly-once guarantee) by adding the following rdkafka_prop block to your gpkafka-v3.yaml load configuration file:
rdkafka_prop:
  api.version.request: false
  broker.version.fallback: 0.8.2.1
Examples
Load data from Kafka as defined in the Version 3 configuration file named loadfromkafka_v3.yaml:
gpkafka load loadfromkafka_v3.yaml
Example loadfromkafka_v3.yaml configuration file:
version: v3
targets:
- gpdb:
    host: mdw-1
    port: 15432
    user: gpadmin
    password: changeme
    database: testdb
    work_schema: public
    error_limit: 25
    tables:
    - table: tbl_order_merge
      schema: public
      mode:
        insert: {}
      mapping:
        data: (value->>'data')::text
        o: (meta->>'offset')::bigint
        p: (meta->>'partition')::int
        pk: (value->>'pk')::int
        ts: (meta->>'timestamp')::bigint
sources:
- kafka:
    topic: daily_orders
    brokers: localhost:9092
    key_content:
      binary:
        source_column_name: key
    value_content:
      json:
        column:
          name: value
          type: json
    meta:
      json:
        column:
          name: meta
          type: json
    task:
      batch_size:
        interval_ms: 5000
        max_count: 1
      window_size: 5
option:
  schedule:
    running_duration: 2s
    auto_stop_restart_interval: 2s
    max_restart_times: 1