
I am using HAWQ to handle a column-based file. The Pivotal documentation suggests using gpfdist to read and write readable external tables so that the data can be processed quickly and in parallel.

I created a table as recommended in the documentation and confirmed my data with the SQL below.

CREATE EXTERNAL TABLE ext_data
(col1 text, col2 text,col3 text, col4 text, col5 int, col6 int, col7 int,col8 int)
LOCATION ('gpfdist://hawq2:8085/*.csv')
FORMAT 'CSV'(DELIMITER ',');

SELECT gp_segment_id,count(*) from ext_data GROUP BY gp_segment_id;

The data was evenly distributed on all the slave nodes.

My original goal was to create the table, read the data from the file, and verify that the loaded data was well distributed. That was achieved by the procedure above using gpfdist.

But my question is about the difference between an external table and an internal table. Why would you use an external or an internal table when the two seem to offer the same functionality?

I found some blogs in which users follow the procedure below when using HAWQ or the Greenplum database.

1. Create an external table using gpfdist.
2. Create an internal table with the same structure.
3. Read the data from the external table into the internal table (roughly sketched in SQL below).
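
In SQL, that pattern looks roughly like this (the internal table name here is just a placeholder I made up; the external table is the one defined above):

CREATE TABLE int_data
(col1 text, col2 text, col3 text, col4 text, col5 int, col6 int, col7 int, col8 int);

INSERT INTO int_data SELECT * FROM ext_data;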

I still don't fully understand this behavior. Above all, I don't know why both external and internal tables exist, and when each should be used for handling data with Apache HAWQ or the Greenplum database.

sclee1
  • External tables are for data exploration or parallel data loading in most use cases. When you query an external table, data is loaded from gpfdist into segment memory directly. Say you have 10 segments and 1 gpfdist process hosting the files: when you select from this external table, all 10 segments read files from the same gpfdist, so your query speed is limited by the network. Internal table data is stored in HDFS and each segment works on local HDFS data, which is much faster. – Sung Yu-wei Mar 09 '17 at 15:45

1 Answer


An External Table that uses gpfdist

  • Data is in a POSIX filesystem, not HDFS
  • No statistics
  • Files could be on ETL nodes that aren't part of the cluster
  • You could have multiple files across many servers too
  • Ideal solution for loading data in parallel into an Internal Table:

    insert into table_name select * from external_table_name;
    

Internal Table

  • Data is stored in HDFS
  • Statistics are gathered and stored in the catalog
  • Files are HDFS files only
  • You can take advantage of HDFS features like Parquet format and Snappy compression (see the sketch after this list)
  • Provides the best performance for queries
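
For example, a rough sketch of an Internal Table stored as Parquet with Snappy compression (the table and column names are made up for illustration; the WITH options are the relevant part):

    CREATE TABLE sales_fact
    (sale_id int, store text, amount numeric)
    WITH (APPENDONLY=true, ORIENTATION=parquet, COMPRESSTYPE=snappy)
    DISTRIBUTED RANDOMLY;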

External Tables simply make it easier, and faster, to load data into the database.

Think about this scenario. You get a file from your Accounting system that you need to load. You could do this:

  1. scp the file to an edge node
  2. Put the file into HDFS (hdfs dfs -put)
  3. Create an external table in HAWQ using PXF to read the file
  4. Insert the data into a HAWQ table (roughly as sketched below)
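
Steps 3 and 4 might look something like this (the namenode host, PXF port, HDFS path, column list, and table names are all placeholders; HdfsTextSimple is the PXF profile for plain delimited text files in HDFS):

    -- Step 3: external table over the file that was put into HDFS
    CREATE EXTERNAL TABLE ext_accounting
    (txn_id int, account text, amount numeric)
    LOCATION ('pxf://namenode:51200/data/accounting/accounting.csv?PROFILE=HdfsTextSimple')
    FORMAT 'CSV' (DELIMITER ',');

    -- Step 4: load into the (already created) internal HAWQ table
    INSERT INTO accounting SELECT * FROM ext_accounting;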

That will work and PXF will read the file in HDFS in parallel. However, step 2 is a single process and a bottleneck. Instead, do this:

  1. scp the file to an edge node
  2. Start a gpfdist process
  3. Create an external table in HAWQ using gpfdist to read the file
  4. Insert the data into a HAWQ table (sketched below)
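
Steps 2 through 4 might look something like this (the edge node host, gpfdist port, directory, file name, and table names are placeholders):

    -- Step 2, on the edge node: gpfdist -d /data/accounting -p 8081 &
    -- Step 3: external table served by that gpfdist process
    CREATE EXTERNAL TABLE ext_accounting_gpfdist
    (txn_id int, account text, amount numeric)
    LOCATION ('gpfdist://edge-node:8081/accounting.csv')
    FORMAT 'CSV' (DELIMITER ',');

    -- Step 4: parallel load straight into the internal HAWQ table
    INSERT INTO accounting SELECT * FROM ext_accounting_gpfdist;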

Now the "put" into HDFS is done in parallel because HAWQ will start virtual segments on each node to put the data. This are typically 6 virtual segments per data node so in a 10 node cluster, you'll have 60 processes putting data into HDFS rather than a single one.

Jon Roberts
  • Thanks for your comment. In the first scenario, why do you create the external table when all the data is already in HDFS after step 2? Is it impossible to skip step 3 and just create the internal table and insert the data into it? I've almost got the idea of HAWQ and its gpfdist... – sclee1 Mar 09 '17 at 17:33
  • In the first scenario, the files would be in text format in HDFS and not formatted as a HAWQ table. You can read them with an external table using PXF, but as with gpfdist, you don't have statistics. The best way to load data directly from a file into a HAWQ table is to use gpfdist. – Jon Roberts Mar 09 '17 at 17:50
  • If you have data in HDFS, for example data in Hive, HBase, or another data format, you can use a PXF external table to query it without bringing it into a HAWQ internal table. Query performance will not be as good as with an internal table, but it is very convenient if you want to run federated queries across different sources. – Sung Yu-wei Mar 09 '17 at 18:17
  • Again, PXF external tables are for data already on HDFS that you don't want to import into HAWQ internal tables, for exploratory purposes. gpfdist external tables are mostly used for parallel ETL loading directly into HAWQ, where the source data is not on HDFS but on other file systems. – Sung Yu-wei Mar 09 '17 at 18:21
  • @JonRoberts In the second scenario, I didn't get why the "put" command appeared, since there is no Hadoop command in that procedure compared to the first scenario. – sclee1 Mar 10 '17 at 08:12
  • The put command is there to put the data into HDFS. I then read that data in HDFS with an external table using the PXF protocol. This is similar to how you can put data into HDFS and then read it with an external table in Hive. – Jon Roberts Mar 10 '17 at 14:46
  • In the second scenario, I don't have to "put" the data into HDFS. I'm reading the POSIX file with gpfdist and HAWQ is executing the "put". But instead of a single "put", there is one for each vseg. – Jon Roberts Mar 10 '17 at 14:59