The PXF HDFS Connector supports reading and writing fixed-width text using the Greenplum Database fixed width custom formatter. This section describes how to use PXF to access fixed-width text, including how to create, query, and insert data into an external table that references files in the HDFS data store.
PXF supports reading or writing fixed-width text that is compressed with the default
, bzip2
, and gzip
codecs.
Prerequisites
Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read fixed-width text from, or write fixed-width text to, HDFS.
Reading Text Data with Fixed Widths
Use the hdfs:fixedwidth
profile when you read fixed-width text where each line is considered a single record. The following syntax creates a Greenplum Database readable external table that references such a text file on HDFS:
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-file>?PROFILE=hdfs:fixedwidth[&SERVER=<server_name>][&NEWLINE=<bytecode>][&IGNORE_MISSING_PATH=<boolean>]')
FORMAT 'CUSTOM' (FORMATTER='fixedwidth_in', <field_name>='<width>' [, ...] [, line_delim[=|<space>][E]'<delim_value>']);
The specific keywords and values used in the Greenplum Database CREATE EXTERNAL TABLE command are described in the table below.
Keyword | Value |
---|---|
<path‑to‑hdfs‑file> | The path to the directory or file in the HDFS data store. When the <server_name> configuration includes a pxf.fs.basePath property setting, PXF considers <path‑to‑hdfs‑file> to be relative to the base path specified. Otherwise, PXF considers it to be an absolute path. <path‑to‑hdfs‑file> must not specify a relative path nor include the dollar sign ($ ) character. |
PROFILE | Use PROFILE=hdfs:fixedwidth when <path-to-hdfs-file> references fixed-width text data. |
SERVER=<server_name> | The named server configuration that PXF uses to access the data. PXF uses the default server if not specified. |
NEWLINE=<bytecode> | When the line_delim formatter option contains \r , \r\n , or a set of custom escape characters, you must set <bytecode> to CR , CRLF , or the set of bytecode characters, respectively. |
IGNORE_MISSING_PATH=<boolean> | Specify the action to take when <path-to-hdfs-file> is missing or invalid. The default value is false , PXF returns an error in this situation. When the value is true , PXF ignores missing path errors and returns an empty dataset. |
FORMAT ‘CUSTOM’ | Use FORMAT ‘CUSTOM ’ with FORMATTER='fixedwidth_in' (read). |
<field_name>=‘<width>’ | The name and the width of the field. For example: first_name='15' specifies that the first_name field is 15 characters long. By default, when the field value is less than <width> size, Greenplum Database expects the field to be right-padded with spaces to that size. |
line_delim | The line delimiter character in the data. Preface the <delim_value> with an E when the value is an escape sequence. Examples: line_delim=E'\n' , line_delim 'aaa' . The default value is '\n' . |
Note: PXF does not support the (HEADER)
formatter option in the CREATE EXTERNAL TABLE
command.
About Specifying field_name and width
Greenplum Database loads all fields in a line of fixed-width data in their physical order. The <field_name>
s that you specify in the FORMAT
options must match the order that you define the columns in the CREATE [WRITABLE] EXTERNAL TABLE
command. You specify the size of each field in the <width>
value.
Refer to the Greenplum Database fixed width custom formatter documentation for more information about the formatter options.
About the line_delim and NEWLINE Formatter Options
By default, Greenplum Database uses the \n
(LF) character for the new line delimiter. When the line delimiter for the external file is also \n
, you need not specify the line_delim
option. If the line_delim
formatter option is provided and contains \r
(CR), \r\n
(CRLF), or a set of custom escape characters, you must specify the NEWLINE
option in the external table LOCATION
clause, and set the value to CR
, CRLF
or the set of bytecode characters, respectively.
Refer to the Greenplum Database fixed width custom formatter documentation for more information about the formatter options.
Example: Reading Fixed-Width Text Data on HDFS
Perform the following procedure to create a sample text file, copy the file to HDFS, and use the hdfs:fixedwidth
profile and the default PXF server to create a PXF external table to query the data:
-
Create an HDFS directory for PXF example data files. For example:
$ hdfs dfs -mkdir -p /data/pxf_examples
-
Create a plain text data file named
pxf_hdfs_fixedwidth.txt
:$ echo 'Prague Jan 101 4875.33 Rome Mar 87 1557.39 Bangalore May 317 8936.99 Beijing Jul 411 11600.67 ' > /tmp/pxf_hdfs_fixedwidth.txt
In this sample file, the first field is 15 characters long, the second is 4 characters, the third is 6 characters, and the last field is 10 characters long.
Open the
/tmp/pxf_hdfs_fixedwidth.txt
file in the editor of your choice, and ensure that the last field is right-padded with spaces to 10 characters in size. -
Copy the data file to HDFS:
$ hdfs dfs -put /tmp/pxf_hdfs_fixedwidth.txt /data/pxf_examples/
-
Display the contents of the
pxf_hdfs_fixedwidth.txt
file stored in HDFS:$ hdfs dfs -cat /data/pxf_examples/pxf_hdfs_fixedwidth.txt
-
Start the
psql
subsystem:$ psql -d postgres
-
Use the PXF
hdfs:fixedwidth
profile to create a Greenplum Database external table that references thepxf_hdfs_fixedwidth.txt
file that you just created and added to HDFS:postgres=# CREATE EXTERNAL TABLE pxf_hdfs_fixedwidth_r(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://data/pxf_examples/pxf_hdfs_fixedwidth.txt?PROFILE=hdfs:fixedwidth&NEWLINE=CRLF') FORMAT 'CUSTOM' (formatter='fixedwidth_in', location='15', month='4', num_orders='6', total_sales='10', line_delim=E'\r\n');
-
Query the external table:
postgres=# SELECT * FROM pxf_hdfs_fixedwidth_r;
location | month | num_orders | total_sales ---------------+-------+------------+------------- Prague | Jan | 101 | 4875.33 Rome | Mar | 87 | 1557.39 Bangalore | May | 317 | 8936.99 Beijing | Jul | 411 | 11600.67 (4 rows)
Writing Fixed-Width Text Data to HDFS
The PXF HDFS connector hdfs:fixedwidth
profile supports writing fixed-width text to HDFS. When you create a writable external table with the PXF HDFS connector, you specify the name of a directory on HDFS. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified.
Note: External tables that you create with a writable profile can only be used for INSERT
operations. If you want to query the data that you inserted, you must create a separate readable external table that references the HDFS directory.
Use the following syntax to create a Greenplum Database writable external table that references an HDFS directory:
CREATE WRITABLE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-dir>
?PROFILE=hdfs:fixedwidth[&SERVER=<server_name>][&NEWLINE=<bytecode>][&<write-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='fixedwidth_out' [, <field_name>='<width>'] [, ...] [, line_delim[=|<space>][E]'<delim_value>']);
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
The specific keywords and values used in the CREATE EXTERNAL TABLE command are described in the table below.
Keyword | Value |
---|---|
<path‑to‑hdfs‑dir> | The path to the directory in the HDFS data store. When the <server_name> configuration includes a pxf.fs.basePath property setting, PXF considers <path‑to‑hdfs‑dir> to be relative to the base path specified. Otherwise, PXF considers it to be an absolute path. <path‑to‑hdfs‑dir> must not specify a relative path nor include the dollar sign ($ ) character. |
PROFILE | Use PROFILE=hdfs:fixedwidth to write fixed-width data to <path-to-hdfs-file>. |
SERVER=<server_name> | The named server configuration that PXF uses to access the data. PXF uses the default server if not specified. |
NEWLINE=<bytecode> | When the line_delim formatter option contains \r , \r\n , or a set of custom escape characters, you must set <bytecode> to CR , CRLF or the set of bytecode characters, respectively. |
<write‑option>=<value> | <write-option>s are described below. |
FORMAT ‘CUSTOM’ | Use FORMAT ‘CUSTOM ’ with FORMATTER='fixedwidth_out' (write). |
<field_name>=‘<width>’ | The name and the width of the field. For example: first_name='15' specifies that the first_name field is 15 characters long. By default, when writing to the external file and the field value is less than <width> size, Greenplum Database right-pads the field with spaces to <width> size. |
line_delim | The line delimiter character in the data. Preface the <delim_value> with an E when the value is an escape sequence. Examples: line_delim=E'\n' , line_delim 'aaa' . The default value is '\n' . |
DISTRIBUTED BY | If you want to load data from an existing Greenplum Database table into the writable external table, consider specifying the same distribution policy or <column_name> on both tables. Doing so will avoid extra motion of data between segments on the load operation. |
Writable external tables that you create using the hdfs:fixedwidth
profile can optionally use record or block compression. You specify the compression codec via an option in the CREATE WRITABLE EXTERNAL TABLE
LOCATION
clause:
Write Option | Value Description |
---|---|
COMPRESSION_CODEC | The compression codec alias. Supported compression codecs for writing fixed-width text data include: default , bzip2 , gzip , and uncompressed . If this option is not provided, Greenplum Database performs no data compression. |
Example: Writing Fixed-Width Text Data to HDFS
This example utilizes the data schema introduced in Example: Reading Fixed-Width Text Data on HDFS.
Column Name | Width | Data Type |
---|---|---|
location | 15 | text |
month | 4 | text |
number_of_orders | 6 | int |
total_sales | 10 | float8 |
Procedure
Perform the following procedure to create a Greenplum Database writable external table utilizing the same data schema as described above. You will use the PXF hdfs:fixedwidth
profile and the default PXF server to write data to the underlying HDFS directory. You will also create a separate, readable external table to read the data that you wrote to the HDFS directory.
-
Create a Greenplum Database writable external table utilizing the data schema described above. Write to the HDFS directory
/data/pxf_examples/fixedwidth_write
. Create the table specifying\n
as the line delimiter:postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_fixedwidth_w(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://data/pxf_examples/fixedwidth_write?PROFILE=hdfs:fixedwidth') FORMAT 'CUSTOM' (formatter='fixedwidth_out', location='15', month='4', num_orders='6', total_sales='10');
-
Write a few individual records to the
fixedwidth_write
HDFS directory by using theINSERT
command on thepxf_hdfs_fixedwidth_w
table:postgres=# INSERT INTO pxf_hdfs_fixedwidth_w VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 ); postgres=# INSERT INTO pxf_hdfs_fixedwidth_w VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );
-
In another terminal window, use the
cat
command on thefixedwidth_write
directory to display the data that you just added to HDFS:$ hdfs dfs -cat /data/pxf_examples/fixedwidth_write/* Frankfurt Mar 777 3956.98 Cleveland Oct 3812 96645.37
-
Greenplum Database does not support directly querying a writable external table. To query the data that you just added to HDFS, you must create a readable external Greenplum Database table that references the HDFS directory:
postgres=# CREATE EXTERNAL TABLE pxf_hdfs_fixedwidth_r2(location text, month text, num_orders int, total_sales float8) LOCATION ('pxf://data/pxf_examples/fixedwidth_write?PROFILE=hdfs:fixedwidth') FORMAT 'CUSTOM' (formatter='fixedwidth_in', location='15', month='4', num_orders='6', total_sales='10');
-
Query the readable external table:
postgres=# SELECT * FROM pxf_hdfs_fixedwidth_r2 ORDER BY total_sales;
location | month | num_orders | total_sales -----------+-------+------------+------------- Frankfurt | Mar | 777 | 3956.98 Cleveland | Oct | 3812 | 96645.37 (2 rows)
Content feedback and comments