Apache Hive is a distributed data warehousing infrastructure. Hive facilitates the management of large data sets and supports multiple data formats, including comma-separated value (.csv) TextFile, RCFile, ORC, and Parquet.
The PXF Hive connector reads data stored in a Hive table. This section describes how to use the PXF Hive connector.
Note: The PXF Hive connector supports using the hive[:*] profiles described below to access Hive 3 external tables only. The connector does not support using the hive[:*] profiles to access Hive 3 managed (CRUD and insert-only transactional, and temporary) tables. Use the PXF JDBC Connector to access Hive 3 managed tables instead.

Prerequisites
Before working with Hive table data using PXF, ensure that you have met the PXF Hadoop Prerequisites.
If you plan to use PXF filter pushdown with Hive integral types, ensure that the configuration parameter hive.metastore.integral.jdo.pushdown exists and is set to true in the hive-site.xml file in both your Hadoop cluster and $PXF_BASE/servers/default/hive-site.xml. Refer to About Updating Hadoop Configuration for more information.
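For reference, the property takes the following form in hive-site.xml. This is a minimal sketch of the relevant stanza only, not a complete configuration file:

```xml
<property>
    <name>hive.metastore.integral.jdo.pushdown</name>
    <value>true</value>
</property>
```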
Hive Data Formats
The PXF Hive connector supports several data formats, and has defined the following profiles for accessing these formats:
File Format | Description | Profile |
---|---|---|
TextFile | Flat file with data in comma-, tab-, or space-separated value format or JSON notation. | hive, hive:text |
SequenceFile | Flat file consisting of binary key/value pairs. | hive |
RCFile | Record columnar data consisting of binary key/value pairs; high row compression rate. | hive, hive:rc |
ORC | Optimized row columnar data with stripe, footer, and postscript sections; reduces data size. | hive, hive:orc |
Parquet | Compressed columnar data representation. | hive |
Avro | Serialization system with a binary data format. | hive |
Note: The hive profile supports all file storage formats. It will use the optimal hive[:*] profile for the underlying file format type.
Data Type Mapping
The PXF Hive connector supports primitive and complex data types.
Primitive Data Types
To represent Hive data in Greenplum Database, map data values that use a primitive data type to Greenplum Database columns of the same type.
The following table summarizes external mapping rules for Hive primitive types.
Hive Data Type | Greenplum Data Type |
---|---|
boolean | bool |
int | int4 |
smallint | int2 |
tinyint | int2 |
bigint | int8 |
float | float4 |
double | float8 |
string | text |
binary | bytea |
timestamp | timestamp |
Note: The hive:orc profile does not support the timestamp data type when you specify vectorized query execution (VECTORIZE=true).
Complex Data Types
Hive supports complex data types including array, struct, map, and union. PXF maps each of these complex types to text. You can create Greenplum Database functions or application code to extract subcomponents of these complex data types.

Examples using complex data types with the hive and hive:orc profiles are provided later in this topic.

Note: The hive:orc profile does not support complex types when you specify vectorized query execution (VECTORIZE=true).
Sample Data Set
Examples presented in this topic operate on a common data set. This simple data set models a retail sales operation and includes fields with the following names and data types:
Column Name | Data Type |
---|---|
location | text |
month | text |
number_of_orders | integer |
total_sales | double |
Prepare the sample data set for use:

- First, create a text file:

- Add the following data to pxf_hive_datafile.txt; notice the use of the comma (,) to separate the four field values:

Make note of the path to pxf_hive_datafile.txt; you will use it in later exercises.
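The two steps above can be sketched as follows. The file path follows the file name mentioned in the text; the sample rows shown here are placeholder values in the described schema (location, month, number_of_orders, total_sales), not necessarily the exact data from the original exercise:

```shell
# Create the sample data file; the /tmp path is an assumption
cat > /tmp/pxf_hive_datafile.txt <<'EOF'
Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67
EOF

# Verify: every line has four comma-separated fields
awk -F',' '{ print NF }' /tmp/pxf_hive_datafile.txt | sort -u
```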
Hive Command Line
The Hive command line is a subsystem similar to that of psql. To start the Hive command line:

The default Hive database is named default.
Example: Creating a Hive Table
Create a Hive table to expose the sample data set.
Create a Hive table to expose the sample data set.

- Create a Hive table named sales_info in the default database:

  Notice that:
  - The STORED AS textfile subclause instructs Hive to create the table in Textfile (the default) format. Hive Textfile format supports comma-, tab-, and space-separated values, as well as data specified in JSON notation.
  - The DELIMITED FIELDS TERMINATED BY subclause identifies the field delimiter within a data record (line). The sales_info table field delimiter is a comma (,).

- Load the pxf_hive_datafile.txt sample data file into the sales_info table that you just created:

  In examples later in this section, you will access the sales_info Hive table directly via PXF. You will also insert sales_info data into tables of other Hive file format types, and use PXF to access those directly as well.

- Perform a query on sales_info to verify that you loaded the data successfully:
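The three steps above can be sketched in HiveQL as follows. The DDL mirrors the sample data set schema; the data file path is assumed from the earlier step, and the exact statements in the original exercise may differ:

```sql
-- Create the table in Textfile format with a comma field delimiter
CREATE TABLE sales_info (
    location string,
    month string,
    number_of_orders int,
    total_sales double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS textfile;

-- Load the sample data file (path is an assumption)
LOAD DATA LOCAL INPATH '/tmp/pxf_hive_datafile.txt'
INTO TABLE sales_info;

-- Verify the load
SELECT * FROM sales_info;
```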
Determining the HDFS Location of a Hive Table
Should you need to identify the HDFS file location of a Hive managed table, you can determine it using the DESCRIBE command. For example:
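A sketch of the command, assuming the sales_info table from the earlier example:

```sql
-- The "location:" value in the Detailed Table Information output
-- identifies the table's HDFS path
DESCRIBE EXTENDED sales_info;
```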
Querying External Hive Data
You can create a Greenplum Database external table to access Hive table data. As described previously, the PXF Hive connector defines specific profiles to support different file formats. These profiles are named hive, hive:text, hive:rc, and hive:orc.

The hive:text and hive:rc profiles are specifically optimized for text and RCFile formats, respectively. The hive:orc profile is optimized for ORC file formats. The hive profile is optimized for all file storage types; you can use the hive profile when the underlying Hive table is composed of multiple partitions with differing file formats.

PXF uses column projection to increase query performance when you access a Hive table using the hive, hive:rc, or hive:orc profiles.
Use the following syntax to create a Greenplum Database external table that references a Hive table:
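A sketch of the syntax, reconstructed from the keywords described in this section; treat it as illustrative rather than exact:

```sql
CREATE EXTERNAL TABLE <table_name> ( <column_name> <data_type> [, ...] )
LOCATION ('pxf://<hive-db-name>.<hive-table-name>
    ?PROFILE=<profile_name>[&SERVER=<server_name>][&PPD=<boolean>][&VECTORIZE=<boolean>]')
FORMAT 'CUSTOM|TEXT' (<formatting-properties>);
```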
Hive connector-specific keywords and values used in the Greenplum Database CREATE EXTERNAL TABLE call are described below.
Keyword | Value |
---|---|
<hive-db-name> | The name of the Hive database. If omitted, defaults to the Hive database named default. |
<hive-table-name> | The name of the Hive table. |
PROFILE=<profile_name> | <profile_name> must specify one of the values hive, hive:text, hive:rc, or hive:orc. |
SERVER=<server_name> | The named server configuration that PXF uses to access the data. PXF uses the default server if not specified. |
PPD=<boolean> | Activate or deactivate predicate pushdown for all queries on this table; this option applies only to the hive, hive:orc, and hive:rc profiles, and overrides a pxf.ppd.hive property setting in the <server_name> configuration. |
VECTORIZE=<boolean> | When PROFILE=hive:orc, a Boolean value that specifies whether or not PXF uses vectorized query execution when accessing the underlying ORC files. The default value is false; by default, PXF does not use vectorized query execution. |
FORMAT (hive and hive:orc profiles) | The FORMAT clause must specify 'CUSTOM'. The CUSTOM format requires the built-in pxfwritable_import formatter. |
FORMAT (hive:text and hive:rc profiles) | The FORMAT clause must specify TEXT. Specify the single ascii character field delimiter in the delimiter='<delim>' formatting option. |
The Greenplum Database external table definition need not match the Hive table definition exactly. You can:

- Create the PXF external table with columns in a different order than the Hive table.
- Create a PXF external table that reads a subset of the columns in the Hive table.
- Read a Hive table where the files backing the table have a different number of columns.
Accessing TextFile-Format Hive Tables
You can use the hive and hive:text profiles to access Hive table data stored in TextFile format.
Example: Using the hive Profile
Use the hive profile to create a readable Greenplum Database external table that references the Hive sales_info textfile format table that you created earlier.

- Create the external table:

- Query the table:
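The two steps might look like the following; the external table name salesinfo_hiveprofile is illustrative:

```sql
-- hive profile: FORMAT must be 'CUSTOM' with the pxfwritable_import formatter
CREATE EXTERNAL TABLE salesinfo_hiveprofile (
    location text,
    month text,
    number_of_orders int,
    total_sales float8)
LOCATION ('pxf://default.sales_info?PROFILE=hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

SELECT * FROM salesinfo_hiveprofile;
```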
Example: Using the hive:text Profile
Use the PXF hive:text profile to create a readable Greenplum Database external table from the Hive sales_info textfile format table that you created earlier.

- Create the external table:

  Notice that the FORMAT subclause delimiter value is specified as the single ascii comma character ','. E escapes the character.

- Query the external table:
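A sketch of the two steps; the external table name salesinfo_hivetextprofile and the WHERE filter are illustrative:

```sql
-- hive:text profile: FORMAT is TEXT, with the comma delimiter escaped by E
CREATE EXTERNAL TABLE salesinfo_hivetextprofile (
    location text,
    month text,
    number_of_orders int,
    total_sales float8)
LOCATION ('pxf://default.sales_info?PROFILE=hive:text')
FORMAT 'TEXT' (delimiter=E',');

SELECT * FROM salesinfo_hivetextprofile WHERE location = 'Beijing';
```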
Accessing RCFile-Format Hive Tables
The RCFile Hive table format is used for row columnar formatted data. The PXF hive:rc profile provides access to RCFile data.
Example: Using the hive:rc Profile
Use the hive:rc profile to query RCFile-formatted data in a Hive table.

- Start the hive command line and create a Hive table stored in RCFile format:

- Insert the data from the sales_info table into sales_info_rcfile:

  A copy of the sample data set is now stored in RCFile format in the Hive sales_info_rcfile table.

- Query the sales_info_rcfile Hive table to verify that the data was loaded correctly:

- Use the PXF hive:rc profile to create a readable Greenplum Database external table that references the Hive sales_info_rcfile table that you created in the previous steps. For example:

- Query the external table:
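The steps above can be sketched as follows; the Greenplum external table name salesinfo_hivercprofile is illustrative:

```sql
-- Hive side: create the RCFile table and copy the sample data into it
CREATE TABLE sales_info_rcfile (
    location string,
    month string,
    number_of_orders int,
    total_sales double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS rcfile;

INSERT INTO TABLE sales_info_rcfile SELECT * FROM sales_info;
SELECT * FROM sales_info_rcfile;

-- Greenplum side: hive:rc uses FORMAT 'TEXT' with a delimiter
CREATE EXTERNAL TABLE salesinfo_hivercprofile (
    location text,
    month text,
    number_of_orders int,
    total_sales float8)
LOCATION ('pxf://default.sales_info_rcfile?PROFILE=hive:rc')
FORMAT 'TEXT' (delimiter=E',');

SELECT location, total_sales FROM salesinfo_hivercprofile;
```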
Accessing ORC-Format Hive Tables
The Optimized Row Columnar (ORC) file format is a columnar file format that provides a highly efficient way to both store and access HDFS data. ORC format offers improvements over text and RCFile formats in terms of both compression and performance. PXF supports ORC version 1.2.1.
ORC is type-aware and specifically designed for Hadoop workloads. ORC files store both the type of and encoding information for the data in the file. All columns within a single group of row data (also known as stripe) are stored together on disk in ORC format files. The columnar nature of the ORC format type enables read projection, helping avoid accessing unnecessary columns during a query.
ORC also supports predicate pushdown with built-in indexes at the file, stripe, and row levels, moving the filter operation to the data loading phase.
Refer to the Apache ORC and the Apache Hive LanguageManual ORC websites for detailed information about the ORC file format.
Profiles Supporting the ORC File Format
When choosing an ORC-supporting profile, consider the following:

- The hive:orc profile:
  - Reads a single row of data at a time.
  - Supports column projection.
  - Supports complex types. You can access Hive tables composed of array, map, struct, and union data types. PXF serializes each of these complex types to text.

- The hive:orc profile with VECTORIZE=true:
  - Reads up to 1024 rows of data at once.
  - Supports column projection.
  - Does not support complex types or the timestamp data type.
Example: Using the hive:orc Profile
In the following example, you will create a Hive table stored in ORC format and use the hive:orc profile to query this Hive table.

- Create a Hive table with ORC file format:

- Insert the data from the sales_info table into sales_info_ORC:

  A copy of the sample data set is now stored in ORC format in sales_info_ORC.

- Perform a Hive query on sales_info_ORC to verify that the data was loaded successfully:

- Start the psql subsystem and turn on timing:

- Use the PXF hive:orc profile to create a Greenplum Database external table that references the Hive table named sales_info_ORC you created in Step 1. The FORMAT clause must specify 'CUSTOM'. The hive:orc CUSTOM format supports only the built-in 'pxfwritable_import' formatter.

- Query the external table:
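A sketch of the Hive and Greenplum statements for the steps above; the external table name salesinfo_hiveORCprofile is illustrative:

```sql
-- Hive side: ORC-format copy of the sample data
CREATE TABLE sales_info_ORC (
    location string,
    month string,
    number_of_orders int,
    total_sales double)
STORED AS ORC;

INSERT INTO TABLE sales_info_ORC SELECT * FROM sales_info;
SELECT count(*) FROM sales_info_ORC;

-- Greenplum side: hive:orc requires the 'CUSTOM' format
CREATE EXTERNAL TABLE salesinfo_hiveORCprofile (
    location text,
    month text,
    number_of_orders int,
    total_sales float8)
LOCATION ('pxf://default.sales_info_ORC?PROFILE=hive:orc')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

SELECT * FROM salesinfo_hiveORCprofile;
```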
Example: Using the Vectorized hive:orc Profile
In the following example, you will use the vectorized hive:orc profile to query the sales_info_ORC Hive table that you created in the previous example.

- Start the psql subsystem:

- Use the PXF hive:orc profile to create a readable Greenplum Database external table that references the Hive table named sales_info_ORC that you created in Step 1 of the previous example. The FORMAT clause must specify 'CUSTOM'. The hive:orc CUSTOM format supports only the built-in 'pxfwritable_import' formatter.

- Query the external table:
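A sketch of the vectorized variant; only the VECTORIZE=true option differs from the previous example, and the table name salesinfo_hiveVectORC is illustrative:

```sql
-- VECTORIZE=true enables batched (up to 1024-row) reads of the ORC files
CREATE EXTERNAL TABLE salesinfo_hiveVectORC (
    location text,
    month text,
    number_of_orders int,
    total_sales float8)
LOCATION ('pxf://default.sales_info_ORC?PROFILE=hive:orc&VECTORIZE=true')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

SELECT * FROM salesinfo_hiveVectORC;
```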
Accessing Parquet-Format Hive Tables
The PXF hive profile supports both non-partitioned and partitioned Hive tables that use the Parquet storage format. Map the table columns using equivalent Greenplum Database data types. For example, if a Hive table is created in the default schema using:
Define the Greenplum Database external table:
And query the table:
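The sequence might look like the following; the table names hive_parquet_table and pxf_parquet_table and the column set are illustrative:

```sql
-- Hive: a Parquet-format table in the default schema
CREATE TABLE hive_parquet_table (
    location string,
    month string,
    number_of_orders int,
    total_sales double)
STORED AS parquet;

-- Greenplum: map columns to equivalent Greenplum types
CREATE EXTERNAL TABLE pxf_parquet_table (
    location text,
    month text,
    number_of_orders int,
    total_sales double precision)
LOCATION ('pxf://default.hive_parquet_table?PROFILE=hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

SELECT month, number_of_orders FROM pxf_parquet_table;
```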
Accessing Avro-Format Hive Tables
The PXF hive profile supports accessing Hive tables that use the Avro storage format. Map the table columns using equivalent Greenplum Database data types. For example, if a Hive table is created in the default schema using:
Define the Greenplum Database external table:
And query the table:
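An Avro version of the same pattern; the table names and columns here are illustrative:

```sql
-- Hive: an Avro-format table in the default schema
CREATE TABLE hive_avro_table (
    id int,
    name string,
    user_id string)
STORED AS avro;

-- Greenplum: map columns to equivalent Greenplum types
CREATE EXTERNAL TABLE pxf_avro_table (
    id int,
    name text,
    user_id text)
LOCATION ('pxf://default.hive_avro_table?PROFILE=hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

SELECT * FROM pxf_avro_table;
```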
Working with Complex Data Types
Example: Using the hive Profile with Complex Data Types
This example employs the hive profile and the array and map complex types, specifically an array of integers and a string key/value pair map.
The data schema for this example includes fields with the following names and data types:
Column Name | Data Type |
---|---|
index | int |
name | string |
intarray | array of integers |
propmap | map of string key and value pairs |
When you specify an array field in a Hive table, you must identify the terminator for each item in the collection. Similarly, you must also specify the map key termination character.
- Create a text file from which you will load the data set:

- Add the following text to pxf_hive_complex.txt. This data uses a comma (,) to separate field values, the percent symbol (%) to separate collection items, and a colon (:) to terminate map key values:

- Create a Hive table to represent this data:

  Notice that:
  - FIELDS TERMINATED BY identifies a comma as the field terminator.
  - The COLLECTION ITEMS TERMINATED BY subclause specifies the percent sign as the collection items (array item, map key/value pair) terminator.
  - MAP KEYS TERMINATED BY identifies a colon as the terminator for map keys.

- Load the pxf_hive_complex.txt sample data file into the table_complextypes table that you just created:

- Perform a query on Hive table table_complextypes to verify that the data was loaded successfully:

- Use the PXF hive profile to create a readable Greenplum Database external table that references the Hive table named table_complextypes:

  Notice that the integer array and map complex types are mapped to Greenplum Database data type text.

- Query the external table:

  intarray and propmap are each serialized as text strings.
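The Hive DDL and Greenplum external table for the steps above might be sketched as follows; the delimiters match the terminators described in the list, the data file path is assumed, and the external table name complextypes_hiveprofile is illustrative:

```sql
-- Hive: complex-type table with comma, percent, and colon terminators
CREATE TABLE table_complextypes (
    index int,
    name string,
    intarray ARRAY<int>,
    propmap MAP<string, string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '%'
MAP KEYS TERMINATED BY ':'
STORED AS textfile;

LOAD DATA LOCAL INPATH '/tmp/pxf_hive_complex.txt'
INTO TABLE table_complextypes;

-- Greenplum: the array and map columns map to text
CREATE EXTERNAL TABLE complextypes_hiveprofile (
    index int,
    name text,
    intarray text,
    propmap text)
LOCATION ('pxf://default.table_complextypes?PROFILE=hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

SELECT * FROM complextypes_hiveprofile;
```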
Example: Using the hive:orc Profile with Complex Data Types
In the following example, you will create and populate a Hive table stored in ORC format. You will use the hive:orc profile to query the complex types in this Hive table.

- Create a Hive table with ORC storage format:

- Insert the data from the table_complextypes table that you created in the previous example into table_complextypes_ORC:

  A copy of the sample data set is now stored in ORC format in table_complextypes_ORC.

- Perform a Hive query on table_complextypes_ORC to verify that the data was loaded successfully:

- Start the psql subsystem:

- Use the PXF hive:orc profile to create a readable Greenplum Database external table from the Hive table named table_complextypes_ORC you created in Step 1. The FORMAT clause must specify 'CUSTOM'. The hive:orc CUSTOM format supports only the built-in 'pxfwritable_import' formatter.

  Notice that the integer array and map complex types are again mapped to Greenplum Database data type text.

- Query the external table:

  intarray and propmap are again serialized as text strings.
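A sketch of the steps above; the external table name complextypes_hiveorc is illustrative:

```sql
-- Hive: ORC-format copy of the complex-type table
CREATE TABLE table_complextypes_ORC (
    index int,
    name string,
    intarray ARRAY<int>,
    propmap MAP<string, string>)
STORED AS ORC;

INSERT INTO TABLE table_complextypes_ORC SELECT * FROM table_complextypes;

-- Greenplum: hive:orc with complex columns mapped to text
CREATE EXTERNAL TABLE complextypes_hiveorc (
    index int,
    name text,
    intarray text,
    propmap text)
LOCATION ('pxf://default.table_complextypes_ORC?PROFILE=hive:orc')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

SELECT * FROM complextypes_hiveorc;
```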
Partition Pruning
The PXF Hive connector supports Hive partition pruning and the Hive partition directory structure. This enables partition exclusion on selected HDFS files comprising a Hive table. To use the partition filtering feature to reduce network traffic and I/O, run a query on a PXF external table using a WHERE clause that refers to a specific partition column in a partitioned Hive table.
The PXF Hive Connector partition filtering support for Hive string and integral types is described below:
- The relational operators =, <, <=, >, >=, and <> are supported on string types.
- The relational operators = and <> are supported on integral types. (To use partition filtering with Hive integral types, you must update the Hive configuration as described in the Prerequisites.)
- The logical operators AND and OR are supported when used with the relational operators mentioned above.
- The LIKE string operator is not supported.
To take advantage of PXF partition filtering pushdown, the Hive and PXF partition field names must be the same. Otherwise, PXF ignores partition filtering and the filtering is performed on the Greenplum Database side, impacting performance.
PXF filter pushdown is enabled by default. You configure PXF filter pushdown as described in About Filter Pushdown.
Example: Using the hive Profile to Access Partitioned Homogenous Data
In this example, you use the hive profile to query a Hive table named sales_part that you partition on the delivery_state and delivery_city fields. You then create a Greenplum Database external table to query sales_part. The procedure includes specific examples that illustrate filter pushdown.

- Create a Hive table named sales_part with two partition columns, delivery_state and delivery_city:

- Load data into this Hive table and add some partitions:

- Query the sales_part table:

  A SELECT * statement on a Hive partitioned table shows the partition fields at the end of the record.

- Examine the Hive/HDFS directory structure for the sales_part table:

- Create a PXF external table to read the partitioned sales_part Hive table. To take advantage of partition filter push-down, define fields corresponding to the Hive partition fields at the end of the CREATE EXTERNAL TABLE attribute list.

- Query the table:

- Perform another query (no pushdown) on pxf_sales_part to return records where the delivery_city is Sacramento and cname is cube:

  The query filters the delivery_city partition Sacramento. The filter on cname is not pushed down, since it is not a partition column. It is performed on the Greenplum Database side after all the data in the Sacramento partition is transferred for processing.

- Query (with pushdown) for all records where delivery_state is CALIFORNIA:

  This query reads all of the data in the CALIFORNIA delivery_state partition, regardless of the city.
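The key DDL and queries from the steps above can be sketched as follows. The non-partition column names (itype, supplier_key, price) are illustrative; cname, delivery_state, and delivery_city come from the example text:

```sql
-- Hive: table partitioned on delivery_state and delivery_city
CREATE TABLE sales_part (
    cname string,
    itype string,
    supplier_key int,
    price double)
PARTITIONED BY (delivery_state string, delivery_city string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Greenplum: partition columns go at the END of the attribute list
CREATE EXTERNAL TABLE pxf_sales_part (
    cname text,
    itype text,
    supplier_key int,
    price float8,
    delivery_state text,
    delivery_city text)
LOCATION ('pxf://default.sales_part?PROFILE=hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

-- Pushdown: the partition-column filter is applied before transfer
SELECT * FROM pxf_sales_part WHERE delivery_state = 'CALIFORNIA';

-- No pushdown on cname: it is not a partition column, so that filter
-- runs on the Greenplum side after the Sacramento partition transfers
SELECT * FROM pxf_sales_part
WHERE delivery_city = 'Sacramento' AND cname = 'cube';
```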
Example: Using the hive Profile to Access Partitioned Heterogeneous Data
You can use the PXF hive profile with any Hive file storage types. With the hive profile, you can access heterogeneous format data in a single Hive table where the partitions may be stored in different file formats.

In this example, you create a partitioned Hive external table. The table is composed of the HDFS data files associated with the sales_info (text format) and sales_info_rcfile (RC format) Hive tables that you created in previous exercises. You will partition the data by year, assigning the data from sales_info to the year 2013, and the data from sales_info_rcfile to the year 2016. (Ignore at the moment the fact that the tables contain the same data.) You will then use the PXF hive profile to query this partitioned Hive external table.
- Create a Hive external table named hive_multiformpart that is partitioned by a string field named year:

- Describe the sales_info and sales_info_rcfile tables, noting the HDFS file location for each table:

- Create partitions in the hive_multiformpart table for the HDFS file locations associated with each of the sales_info and sales_info_rcfile tables:

- Explicitly identify the file format of the partition associated with the sales_info_rcfile table:

  You need not specify the file format of the partition associated with the sales_info table, as TEXTFILE format is the default.

- Query the hive_multiformpart table:

- Show the partitions defined for the hive_multiformpart table and exit hive:

- Start the psql subsystem:

- Use the PXF hive profile to create a readable Greenplum Database external table that references the Hive hive_multiformpart external table that you created in the previous steps:

- Query the PXF external table:

- Perform a second query to calculate the total number of orders for the year 2013:
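The sequence above might be sketched as follows. The HDFS warehouse paths and namenode host are placeholders; substitute the locations reported by DESCRIBE for your own tables. The external table name pxf_multiformpart is illustrative:

```sql
-- Hive: partitioned external table over the existing data files
CREATE EXTERNAL TABLE hive_multiformpart (
    location string,
    month string,
    number_of_orders int,
    total_sales double)
PARTITIONED BY (year string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Map each partition to an existing table's HDFS location (paths assumed)
ALTER TABLE hive_multiformpart ADD PARTITION (year = '2013')
    LOCATION 'hdfs://namenode:8020/apps/hive/warehouse/sales_info';
ALTER TABLE hive_multiformpart ADD PARTITION (year = '2016')
    LOCATION 'hdfs://namenode:8020/apps/hive/warehouse/sales_info_rcfile';

-- Only the RCFile partition needs an explicit format; TEXTFILE is the default
ALTER TABLE hive_multiformpart PARTITION (year = '2016')
    SET FILEFORMAT RCFILE;

-- Greenplum: the partition column year appears at the end
CREATE EXTERNAL TABLE pxf_multiformpart (
    location text,
    month text,
    number_of_orders int,
    total_sales float8,
    year text)
LOCATION ('pxf://default.hive_multiformpart?PROFILE=hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

SELECT sum(number_of_orders) FROM pxf_multiformpart WHERE year = '2013';
```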
Using PXF with Hive Default Partitions
This topic describes a difference in query results between Hive and PXF queries when Hive tables use a default partition. When dynamic partitioning is enabled in Hive, a partitioned table may store data in a default partition. Hive creates a default partition when the value of a partitioning column does not match the defined type of the column (for example, when a NULL value is used for any partitioning column). In Hive, any query that includes a filter on a partition column excludes any data that is stored in the table’s default partition.
Similar to Hive, PXF represents a table's partitioning columns as columns that are appended to the end of the table. However, PXF translates any column value in a default partition to a NULL value. This means that a Greenplum Database query that includes an IS NULL filter on a partitioning column can return different results than the same Hive query.
Consider a Hive partitioned table that is created with the statement:
The table is loaded with five rows that contain the following data:
Inserting row 4 creates a Hive default partition, because the partition column xdate contains a null value.
In Hive, any query that filters on the partition column omits data in the default partition. For example, the following query returns no rows:
However, if you map this Hive table to a PXF external table in Greenplum Database, all default partition values are translated into actual NULL values. In Greenplum Database, running the same query against the PXF external table returns row 4 as the result, because the filter matches the NULL value.
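The scenario can be sketched as follows; the non-partition columns are illustrative, while xdate is the partition column named in the example:

```sql
-- Hive: table partitioned on xdate; a row inserted with a NULL xdate
-- lands in the default partition (__HIVE_DEFAULT_PARTITION__)
CREATE TABLE sales (
    order_id int,
    order_amount double)
PARTITIONED BY (xdate date);

-- In Hive, this filter on the partition column returns no rows,
-- because default-partition data is excluded:
SELECT * FROM sales WHERE xdate IS NULL;

-- The same query against the mapped PXF external table in Greenplum
-- returns row 4, because PXF translates the default-partition value
-- to an actual NULL.
```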
Keep this behavior in mind when you run IS NULL queries on Hive partitioned tables.