You can use the VMware Tanxu Greenplum platform extension framework (PXF) HDFS connector to read one or more multi-line text files in HDFS each as a single table row. This may be useful when you want to read multiple files into the same Greenplum Database external table, for example when individual JSON files each contain a separate record.
PXF supports reading only text and JSON files in this manner.
Note: Refer to the Reading and Writing JSON Data in HDFS topic if you want to use PXF to read JSON files that include more than one record.
Prerequisites
Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read files from HDFS.
Reading Multi-Line Text and JSON Files
You can read single- and multi-line files into a single table row, including files with embedded linefeeds. If you are reading multiple JSON files, each file must be a complete record, and each file must contain the same record type.
PXF reads the complete file data into a single row and column. When you create the external table to read multiple files, you must ensure that all of the files that you want to read are of the same (text or JSON) type. You must also specify a single text
or json
column, depending upon the file type.
The following syntax creates a Greenplum Database readable external table that references one or more text or JSON files on HDFS:
CREATE EXTERNAL TABLE <table_name>
( <column_name> text|json | LIKE <other_table> )
LOCATION ('pxf://<path-to-files>?PROFILE=hdfs:text:multi[&SERVER=<server_name>][&IGNORE_MISSING_PATH=<boolean>]&FILE_AS_ROW=true')
FORMAT 'CSV');
The keywords and values used in the Greenplum Database CREATE EXTERNAL TABLE command are described in the table below.
Keyword | Value |
---|---|
<path‑to‑files> | The path to the directory or files in the HDFS data store. When the <server_name> configuration includes a pxf.fs.basePath property setting, PXF considers <path‑to‑hdfs‑files> to be relative to the base path specified. Otherwise, PXF considers it to be an absolute path. <path‑to‑files> must not specify a relative path nor include the dollar sign ($ ) character. |
PROFILE | The PROFILE keyword must specify hdfs:text:multi . |
SERVER=<server_name> | The named server configuration that PXF uses to access the data. PXF uses the default server if not specified. |
FILE_AS_ROW=true | The required option that instructs PXF to read each file into a single table row. |
IGNORE_MISSING_PATH=<boolean> | Specify the action to take when <path-to-files> is missing or invalid. The default value is false , PXF returns an error in this situation. When the value is true , PXF ignores missing path errors and returns an empty fragment. |
FORMAT | The FORMAT must specify 'CSV' . |
Note: The hdfs:text:multi
profile does not support additional custom or format options when you specify the FILE_AS_ROW=true
option.
For example, if /data/pxf_examples/jdir
identifies an HDFS directory that contains a number of JSON files, the following statement creates a Greenplum Database external table that references all of the files in that directory:
CREATE EXTERNAL TABLE pxf_readjfiles(j1 json)
LOCATION ('pxf://data/pxf_examples/jdir?PROFILE=hdfs:text:multi&FILE_AS_ROW=true')
FORMAT 'CSV';
When you query the pxf_readjfiles
table with a SELECT
statement, PXF returns the contents of each JSON file in jdir/
as a separate row in the external table.
When you read JSON files, you can use the JSON functions provided in Greenplum Database to access individual data fields in the JSON record. For example, if the pxf_readjfiles
external table above reads a JSON file that contains this JSON record:
{
"root":[
{
"record_obj":{
"created_at":"MonSep3004:04:53+00002013",
"id_str":"384529256681725952",
"user":{
"id":31424214,
"location":"COLUMBUS"
},
"coordinates":null
}
}
]
}
You can use the json_array_elements()
function to extract specific JSON fields from the table row. For example, the following command displays the user->id
field:
SELECT json_array_elements(j1->'root')->'record_obj'->'user'->'id'
AS userid FROM pxf_readjfiles;
userid
----------
31424214
(1 rows)
Refer to Working with JSON Data in the Greenplum Documentation for specific information on manipulating JSON data in Greenplum.
Example: Reading an HDFS Text File into a Single Table Row
Perform the following procedure to create 3 sample text files in an HDFS directory, and use the PXF hdfs:text:multi
profile and the default PXF server to read all of these text files in a single external table query.
-
Create an HDFS directory for the text files. For example:
$ hdfs dfs -mkdir -p /data/pxf_examples/tdir
-
Create a text data file named
file1.txt
:$ echo 'text file with only one line' > /tmp/file1.txt
-
Create a second text data file named
file2.txt
:$ echo 'Prague,Jan,101,4875.33 Rome,Mar,87,1557.39 Bangalore,May,317,8936.99 Beijing,Jul,411,11600.67' > /tmp/file2.txt
This file has multiple lines.
-
Create a third text file named
/tmp/file3.txt
:$ echo '"4627 Star Rd. San Francisco, CA 94107":Sept:2017 "113 Moon St. San Diego, CA 92093":Jan:2018 "51 Belt Ct. Denver, CO 90123":Dec:2016 "93114 Radial Rd. Chicago, IL 60605":Jul:2017 "7301 Brookview Ave. Columbus, OH 43213":Dec:2018' > /tmp/file3.txt
This file includes embedded line feeds.
-
Save the file and exit the editor.
-
Copy the text files to HDFS:
$ hdfs dfs -put /tmp/file1.txt /data/pxf_examples/tdir $ hdfs dfs -put /tmp/file2.txt /data/pxf_examples/tdir $ hdfs dfs -put /tmp/file3.txt /data/pxf_examples/tdir
-
Log in to a Greenplum Database system and start the
psql
subsystem. -
Use the
hdfs:text:multi
profile to create an external table that references thetdir
HDFS directory. For example:CREATE EXTERNAL TABLE pxf_readfileasrow(c1 text) LOCATION ('pxf://data/pxf_examples/tdir?PROFILE=hdfs:text:multi&FILE_AS_ROW=true') FORMAT 'CSV';
-
Turn on expanded display and query the
pxf_readfileasrow
table:postgres=# \x on postgres=# SELECT * FROM pxf_readfileasrow;
-[ RECORD 1 ]--------------------------- c1 | Prague,Jan,101,4875.33 | Rome,Mar,87,1557.39 | Bangalore,May,317,8936.99 | Beijing,Jul,411,11600.67 -[ RECORD 2 ]--------------------------- c1 | text file with only one line -[ RECORD 3 ]--------------------------- c1 | "4627 Star Rd. | San Francisco, CA 94107":Sept:2017 | "113 Moon St. | San Diego, CA 92093":Jan:2018 | "51 Belt Ct. | Denver, CO 90123":Dec:2016 | "93114 Radial Rd. | Chicago, IL 60605":Jul:2017 | "7301 Brookview Ave. | Columbus, OH 43213":Dec:2018
Content feedback and comments