You must configure Greenplum Database client host access and role privileges and attributes before using the VMware Greenplum Connector for Apache Spark to transfer data between your Greenplum Database and Spark clusters.
Once you start running Spark applications that use the Connector, you may be required to perform certain Greenplum Database maintenance tasks.
These Greenplum Database configuration and maintenance tasks, described below, must be performed by a Greenplum user with administrative (SUPERUSER) privileges unless otherwise noted.
Configuring Greenplum Database
Client Host Access
You must explicitly configure Greenplum Database to permit access from all Spark nodes and stand-alone clients. Configure access for each Spark node, Greenplum database, and Greenplum Database role combination in the pg_hba.conf file on the master node.
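For example, a minimal sketch of a pg_hba.conf entry, assuming Spark nodes on the hypothetical subnet 192.168.1.0/24, a database named tutorial, and a role named spark_user:

    # TYPE  DATABASE  USER        ADDRESS          METHOD
    host    tutorial  spark_user  192.168.1.0/24   md5

After editing pg_hba.conf, reload the Greenplum Database server configuration (for example, with gpstop -u) for the change to take effect.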
Refer to Configuring Client Authentication in the Greenplum Database documentation for detailed information on configuring pg_hba.conf.
Role Privileges
The Connector uses JDBC to communicate with the Greenplum Database master node. The Greenplum user/role name that you provide when you use the Connector to transfer data between Greenplum Database and Spark must have certain privileges assigned by the administrator (a consolidated example follows this list):
- The user/role must have SELECT permission on the following Greenplum Database catalogs: information_schema.tables, pg_attribute, pg_class, pg_namespace, gp_segment_configuration, pg_stats, pg_settings, gp_distributed_xacts, and gp_distribution_policy (Greenplum 5 only).
- The user/role must have USAGE privilege on each non-public database schema that has tables that the user will read:

    <db-name>=# GRANT USAGE ON SCHEMA <schema_name> TO <user_name>;
- The user/role must have CREATE privilege on each non-public database schema that contains tables to which the user will write:

    <db-name>=# GRANT CREATE ON SCHEMA <schema_name> TO <user_name>;
- The user/role must have the SELECT privilege on every Greenplum Database table that the user will read into Spark:

    <db-name>=# GRANT SELECT ON <schema_name>.<table_name> TO <user_name>;
- To read a Greenplum Database table into Spark, the user/role must have permission to create writable external tables using the Greenplum Database gpfdist protocol:

    <db-name>=# ALTER USER <user_name> CREATEEXTTABLE(type = 'writable', protocol = 'gpfdist');
- If the user/role writing to Greenplum Database is not a database or table owner, the role must have SELECT and INSERT privileges on each existing Greenplum Database table to which the user will write Spark data:

    <db-name>=# GRANT SELECT, INSERT ON <schema_name>.<table_name> TO <user_name>;
- To write Spark data into a Greenplum Database table, the user/role must have permission to create readable external tables using the Greenplum Database gpfdist protocol:

    <db-name>=# ALTER USER <user_name> CREATEEXTTABLE(type = 'readable', protocol = 'gpfdist');
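As a consolidated sketch of the grants above, assume a hypothetical role named spark_user that reads from and writes to a hypothetical table faa.otp_c; substitute your own object names:

    -- Run as a Greenplum SUPERUSER; all object names below are hypothetical.
    GRANT USAGE ON SCHEMA faa TO spark_user;          -- read tables in schema faa
    GRANT CREATE ON SCHEMA faa TO spark_user;         -- create tables in schema faa
    GRANT SELECT ON faa.otp_c TO spark_user;          -- load faa.otp_c into Spark
    GRANT SELECT, INSERT ON faa.otp_c TO spark_user;  -- write Spark data to faa.otp_c
    -- Permit both transfer directions via the gpfdist protocol:
    ALTER USER spark_user CREATEEXTTABLE(type = 'writable', protocol = 'gpfdist');
    ALTER USER spark_user CREATEEXTTABLE(type = 'readable', protocol = 'gpfdist');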
See the Greenplum Database Managing Roles and Privileges documentation for further information on assigning privileges to Greenplum Database users.
gpfdists TLS Configuration
TLS can only be used when running Spark jobs on a cluster deployed in Kubernetes.
You can configure the Connector to use a TLS-secured connection between Greenplum Database segments and the edge of the Kubernetes cluster as described in Configuring the Connector When Spark is Deployed in Kubernetes (Beta). When you configure the Connector to use gpfdists for Spark access in a Kubernetes cluster, the following client certificate files must reside in the $PGDATA/gpfdists directory on each Greenplum Database segment host:
- The client certificate file, client.crt.
- The client private key file, client.key.
- The trusted certificate authorities file, root.crt.
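For example, assuming a hypothetical segment data directory of /data/primary/gpseg0, the Connector expects to find:

    /data/primary/gpseg0/gpfdists/client.crt
    /data/primary/gpseg0/gpfdists/client.key
    /data/primary/gpseg0/gpfdists/root.crt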
The Connector requires that these files be in place, regardless of the setting of the verify_gpfdist_cert Greenplum server configuration parameter.
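You can check the current value of this parameter from a psql session, for example:

    <db-name>=# SHOW verify_gpfdist_cert;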
Refer to the Greenplum Database gpfdists:// Protocol documentation for more information about configuring the client certificates.
Greenplum Database Maintenance Tasks
The Connector uses Greenplum Database external temporary tables to move data between Greenplum and Spark. When you use the Connector, maintenance tasks may include:
- Periodically checking the status of your Greenplum Database catalogs for bloat, and VACUUM-ing the catalog as appropriate. Refer to the Greenplum Database System Catalog Maintenance and VACUUM documentation for further information.
- Periodically ANALYZE-ing the Greenplum Database tables that applications using the Connector load into Spark. Refer to the Greenplum Database Updating Statistics with ANALYZE and ANALYZE documentation for further information.
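As a sketch, the corresponding commands might look like the following; faa.otp_c is a hypothetical table that a Connector application loads into Spark:

    -- Reclaim space in a bloated system catalog table (run as SUPERUSER):
    VACUUM pg_catalog.pg_attribute;
    -- Refresh optimizer statistics on a table that the Connector reads:
    ANALYZE faa.otp_c;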
In addition to users with Greenplum Database SUPERUSER privileges, database or table owners may perform the maintenance tasks identified above.