GPSS load configuration file for an s3 data source (version 3).
Synopsis
version: v3
targets:
- gpdb:
host: <host>
port: <greenplum_port>
user: <user_name>
password: <password>
database: <db_name>
work_schema: <work_schema_name>
error_limit: <num_errors> | <percentage_errors>
filter_expression: <filter_string>
tables:
- table: <table_name>
schema: <schema_name>
mode:
# specify a single mode property block (described below)
insert: {}
update:
<mode_specific_property>: <value>
...
merge:
<mode_specific_property>: <value>
...
mapping:
<target_column_name> : <source_column_name> | <expression>
...
filter: <output_filter_string>
...
sources:
- s3:
uri:
- <s3_file_path>
...
content:
csv:
<column_spec>
<other_props>
encoding: <char_set>
s3param:
version: <cfg_version>
accessid: <s3_access_id>
secret: <s3_secret>
chunksize: <seg_buf_size>
threadnum: <max_num>
gpcheckcloud_newline: <newline_char>
autocompress: <boolean>
encryption: <boolean>
proxy: <url>
verifycert: <boolean>
server_side_encryption: <boolean>
low_speed_limit: <bps_limit>
low_speed_time: <wait_secs>
option:
schedule:
max_retries: <num_retries>
retry_interval: <retry_time>
running_duration: <run_time>
auto_stop_restart_interval: <restart_time>
max_restart_times: <num_restarts>
quit_at_eof_after: <clock_time>
alert:
command: <command_to_run>
workdir: <directory>
timeout: <alert_time>
Where the mode_specific_propertys that you can specify for update
and merge
mode follow:
update:
match_columns: [<match_column_names>]
order_columns: [<order_column_names>]
update_columns: [<update_column_names>]
update_condition: <update_condition>
merge:
match_columns: [<match_column_names>]
update_columns: [<update_column_names>]
order_columns: [<order_column_names>]
update_condition: <update_condition>
delete_condition: <delete_condition>
And where you may specify any property value with a template variable that GPSS substitutes at runtime using the following syntax:
<property:> {{<template_var>}}
Description
Version 3 of the GPSS load configuration file is different in both content and format than previous versions of the file. Certain symbols used in the GPSS version 1 and 2 configuration file reference page syntax have different meanings in version 3 syntax:
- Brackets
[]
are literal and are used to specify a list in version 3. They are no longer used to signify the optionality of a property.- Curly braces
{}
are literal and are used to specify YAML mappings in version 3 syntax. They are no longer used with the pipe symbol (|
) to identify a list of choices.
You specify the configuration properties for a VMware Tanzu Greenplum streaming server (GPSS) s3 load job in a YAML-formatted configuration file that you provide to the gpsscli submit
or gpsscli load
commands. There are three types of configuration properties in this file - those that identify the Tanzu Greenplum connection and target table, properties specific to the s3 data source that you will load into Greenplum, and job-related properties.
This reference page uses the name s3source-v3.yaml
to refer to this file; you may choose your own name for the file.
The gpsscli
utility processes the YAML configuration file keywords in order, using indentation (spaces) to determine the document hierarchy and the relationships between the sections. The use of white space in the file is significant. Keywords are not case-sensitive.
Keywords and Values
version Property
- version: v3
- The version of the configuration file. You must specify
version: v3
.
targets:gpdb Properties
- host: host
- The host name or IP address of the Tanzu Greenplum coordinator host.
- port: greenplum_port
- The port number of the Tanzu Greenplum server on the coordinator host.
- user: user_name
- The name of the Tanzu Greenplum user/role. This user_name must have permissions as described in the Configuring VMware Tanzu Greenplum Role Privileges.
- password: password
- The password for the Tanzu Greenplum user/role.
- database: db_name
- The name of the Tanzu Greenplum.
- work_schema: work_schema_name
- The name of the Tanzu Greenplum schema in which GPSS creates internal tables. The default work_schema_name is
public
. - error_limit: num_errors | percentage_errors
- The error threshold, specified as either an absolute number or a percentage. GPSS stops running the job when this limit is reached.
- filter_expression: filter_string
- The filter to apply to the input data before GPSS loads the data into Tanzu Greenplum. If the filter evaluates to
true
, GPSS loads the message. If the filter evaluates tofalse
, the message is dropped. filter_string must be a valid SQL conditional expression and may reference one or more source value, key, or meta column names. - tables:
-
The Tanzu Greenplum tables, and the data that GPSS will load into each.
- table: table_name
- The name of the Tanzu Greenplum table into which GPSS loads the data.
- schema: schema_name
- The name of the Tanzu Greenplum schema in which table_name resides. Optional, the default schema is the
public
schema. - mode:
- The table load mode;
insert
,merge
, orupdate
. The default mode isinsert
.
update
andmerge
are not supported if the target table column name is a reserved keyword, has capital letters, or includes any character that requires quotes (" ") to identify the column.- insert:
- Inserts source data into Greenplum.
- update:
- Updates the target table columns that are listed in
update_columns
when the input columns identified inmatch_columns
match the named target table columns and the optionalupdate_condition
is true. - merge:
- Inserts new rows and updates existing rows when:
- columns are listed in
update_columns
, - the
match_columns
target table column values are equal to the input data, and - an optional
update_condition
is specified and met.
Deletes rows when: - the
match_columns
target table column values are equal to the input data, and - an optional
delete_condition
is specified and met.
New rows are identified when thematch_columns
value in the source data does not have a corresponding value in the existing data of the target table. In those cases, the entire row from the source file is inserted, not only thematch_columns
andupdate_columns
. If there are multiple newmatch_columns
values in the input data that are the same, GPSS inserts or updates the target table using a random matching input row. When you specifyorder_columns
, GPSS sorts the input data on the specified column(s) and inserts or updates from the input row with the largest value.
- mode_property_name: value
- The name to value mapping for a mode property. Each
mode
supports one or more of the following properties as specified in the Synopsis. - match_columns: [match_column_names]
- A comma-separated list that specifies the column(s) to use as the join condition for the update. The attribute value in the specified target column(s) must be equal to that of the corresponding source data column(s) in order for the row to be updated in the target table.
- Required when
mode
ismerge
orupdate
. - order_columns: [order_column_names]
- A comma-separated list that specifies the column(s) by which GPSS sorts the rows. When multiple matching rows exist in a batch,
order_columns
is used withmatch_columns
to determine the input row with the largest value; GPSS uses that row to write/update the target. - Optional. May be specified in
merge
mode
to sort the input data rows. - update_columns: [update_column_names]
- A column-sparated list that specifies the column(s) to update for the rows that meet the
match_columns
criteria and the optionalupdate_condition
. - Required when
mode
ismerge
orupdate
. - update_condition: update_condition
- Specifies a boolean condition, similar to that which you would declare in a
WHERE
clause, that must be met in order for a row in the target table to be updated (or inserted, in the case of amerge
). Optional. - delete_condition: delete_condition
- In
merge
mode
, specifies a boolean condition, similar to that which you would declare in aWHERE
clause, that must be met for GPSS to delete rows in the target table that meet thematch_columns
criteria. Optional. - mapping:
- Optional. Overrides the default source-to-target column mapping.
: > Note When you specify a
mapping
, ensure that you provide a mapping for all source data elements of interest. GPSS does not automatically match column names when you provide amapping
block.target\_column\_name: source\_column\_name \| expression : target\_column\_name specifies the target Tanzu Greenplum table column name. GPSS maps this column name to the source column name specified in source\_column\_name, or to an expression. When you specify an expression, you may provide a value expression that you would specify in the `SELECT` list of a query, such as a constant value, a column reference, an operator invocation, a built-in or user-defined function call, and so on.
- filter: output_filter_string
- The filter to apply to the output data before GPSS loads the data into Tanzu Greenplum. If the filter evaluates to
true
, GPSS loads the message. If the filter evaluates tofalse
, the message is dropped. output_filter_string must be a valid SQL conditional expression and may reference one or moreMETA
orVALUE
column names.
sources:s3: Options
- uri
-
The path to an s3 file or bucket.
- s3_file_path
- A URL identifying a file or files on s3 to be loaded. You can specify wildcards in any element of the path. To load all files in a directory, specify
dirname/*
.
- content:
-
The file type, field names, and type-specific properties of the file data. You must specify all data elements in the order in which they appear in the file. And you must specify
csv
format to read CSV- or text-format data.- csv
- When you specify the
csv
data format, GPSS reads the data into the list of columns that you specify. The data content cannot contain line ending characters (CR and LF). - column_spec
- The source to Greenplum column mapping. The supported column specification differs for different data formats as described below.
- The default source-to-target data mapping behaviour of GPSS is to match a column name as defined in
source_column_name
,column:name
, orcolumns:name
with a column name in the target Tanzu Greenplumtable
. You can override the default mapping by specifying amapping:
block. - columns:
- A set of column name/type mappings. The value
[]
specifies all columns. - name: column_name
- The name of a key or value column. column_name must match the column name of the target Tanzu Greenplum table.
- type: column_data_type
- The data type of the column. You must specify an equivalent data type for each data element and the associated Tanzu Greenplum table column.
- delimiter: delim_char
- Specifies a single ASCII character that separates columns within each message or row of data. The default delimiter is a comma (
,
).
- encoding: char_set
- The source data encoding. You can specify an encoding character set when the source data is of the
csv
,custom
,delimited
, orjson
format. GPSS supports the character sets identified in Character Set Support in the Tanzu Greenplum documentation. - s3param
-
The configuration parameters for the s3 data source are the same as those identified for the Tanzu Greenplum
s3
protocol. Refer to the Tanzu Greenplum s3 Protocol Configuration File documentation for the use and description of these parameters.You can specify the s3 configuration parameters individually in the GPSS load configuration file. Alternatively, you can choose to provide the
config=<filepath>
option in the s3_file_path URI to specify the absolute path of a file on the local file system that contains the configuration parameter settings.
option: Properties
- schedule:
-
Controls the frequency and interval of restarting jobs.
- retry_interval: retry_time
- The period of time that GPSS waits before retrying a failed job. You can specify the time interval in day (
d
), hour (h
), minute (m
), second (s
), or millisecond (ms
) integer units; do not mix units. The default retry interval is5m
(5 minutes). - max_retries: num_retries
- The maximum number of times that GPSS attempts to retry a failed job. The default is 0, do not retry. If you specify a negative value, GPSS retries the job indefinitely.
- running_duration: run_time
- The amount of time after which GPSS automatically stops a job. GPSS does not automatically stop a job by default.
- auto_stop_restart_interval: restart_time
- The amount of time after which GPSS restarts a job that it stopped due to reaching
running_duration
. - max_restart_times: num_restarts
- The maximum number of times that GPSS restarts a job that it stopped due to reaching
running_duration
. The default is 0, do not restart the job. If you specify the value-1
, GPSS restarts the job indefinitely. You may usegpsscli stop
to stop the jobs from being restarted indefinitely. - quit_at_eof_after: clock_time
- The clock time after which GPSS stops a job every day when it encounters an EOF. By default, GPSS does not automatically stop a job that reaches EOF. GPSS never stops a job when the current time is before
clock_time
, even when GPSS encounters an EOF.
- alert:
-
Controls notification when a job is stopped for any reason (success, completion, error, user-initiated stop).
- command: command_to_run
- The program that the GPSS server runs on the GPSS server host, including arguments. The command must be executable by GPSS.
- command_to_run has access to job-related environment variables that GPSS sets, including:
$GPSSJOB_NAME
,$GPSSJOB_STATUS
, and$GPSSJOB_DETAIL
. - workdir: directory
- The working directory for command_to_run. The default working directory is the directory from which you started the GPSS server process. If you specify a relative path, it is relative to the directory from which you started the GPSS server process.
- timeout: alert_time
- The amount of time after a job stops, prompting GPSS to trigger the alert (and run command_to_run). You can specify the time interval in day (
d
), hour (h
), minute (m
), or second (s
) integer units; do not mix units. The default alert timeout is-1s
(no timeout).
Template Variables
GPSS supports using template variables to specify property values in the load configuration file.
You specify a template variable value in the load configuration file as follows:
<property>: {{<template_var>}}
For example:
max_retries: {{numretries}}
GPSS substitutes the template variable with a value that you specify via the -p | --property <template_var=value>
option to the gpsscli dryrun
, gpsscli submit
, or gpsscli load
command.
For example, if the command line specifies:
--property numretries=10
GPSS substitutes occurrences of {{numretries}}
in the load configuration file with the value 10
before submitting the job, and uses that value while the job is running.
Notes
If you created a database object name using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes in the load configuration file. For example, if you create a table as follows:
CREATE TABLE "MyTable" (c1 text);
Your YAML configuration file would refer to the table name as:
targets:
- gpdb:
tables:
- table: '"MyTable"'
You can specify backslash escape sequences in the CSV delimiter
, quote
, and escape
options. GPSS supports the standard backslash escape sequences for backspace, form feed, newline, carriage return, and tab, as well as escape sequences that you specify in hexadecimal format (prefaced with \x
). Refer to Backslash Escape Sequences in the PostgreSQL documentation for more information.
Examples
Submit a job to load data from a CSV file on s3 as defined in the v3 load configuration file named loadfroms3_v3.yaml
:
$ gpsscli submit loadfroms3_v3.yaml
Example loadfroms3_v3.yaml
configuration file:
version: v3
targets:
- gpdb:
host: mdw-1
port: 15432
user: gpadmin
password: changeme
database: testdb
work_schema: public
error_limit: "25"
tables:
- table: orders
schema: public
mode:
merge:
match_columns: [pk]
order_columns: [seq]
delete_condition: flag = 0
mapping:
data: data
pk: pk
seq: seq
sources:
- s3:
uri:
- "s3://s3-us-east-1.amazonaws.com/mydir/mybucket/data0000"
content:
csv:
columns:
- name: pk
type: int
- name: seq
type: int
- name: data
type: text
- name: flag
type: int
delimiter: ";"
s3params:
version: 1
accessid: 123
secret: 456
chunksize: 4096
threadnum: 4
gpcheckcloud_newline: "\n"
autocompress: false
encryption: false
verifycert: false
low_speed_limit: 1
option:
schedule: {}
Content feedback and comments