
This question is different from what I have found on Stack Overflow because of the size of the data; it is NOT a duplicate.

We are using Cloudera.

I have seen solutions for small XLSX files with only a handful of columns in the header; in my case, the CSV file to be loaded into a new Hive table has 618 columns.

  1. Would it be saved as Parquet by default if I upload it (after saving it to CSV first) through Hue -> File Browser? If not, where can I specify the file format?

  2. What would be the best way to create an external Impala table based on that location? It would definitely be unbelievable if I needed to create the DDL/schema manually, as there are so many columns.

Thank you very much.

  • What I find "definitely unbelievable" is that you cannot just use the header record to generate the _CREATE TABLE_ with a few lines of script. For example, a plain Linux `head -n 1 turd.csv | sed 's/,/ String,\n/g'` command can split the header into 618 lines and append _" String,"_ after each column name. The rest is trivial. – Samson Scharfrichter Jul 28 '18 at 19:37
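For reference, a minimal sketch of the script the comment describes, assuming GNU sed (for `\n` in the replacement), that the header is the first line of the file, and that no column name needs backquoting; the input filename is the commenter's example, and the table name, location, and output file are hypothetical:

```sh
#!/bin/bash
# Minimal sketch: generate CREATE TABLE DDL from a CSV header line.
# Assumes GNU sed and that every column can be typed STRING for now.
FILE=turd.csv                 # the commenter's example filename
TABLE=my_csv_table            # hypothetical table name

{
  echo "CREATE EXTERNAL TABLE ${TABLE} ("
  # Turn "col1,col2,...,col618" into one "colN STRING," line per column.
  head -n 1 "$FILE" | sed 's/,/ STRING,\n/g; s/$/ STRING/'
  echo ") ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
  echo "STORED AS TEXTFILE"
  echo "LOCATION '/user/me/csv_data'   -- hypothetical HDFS path"
  echo "TBLPROPERTIES ('skip.header.line.count'='1');"
} > create_table.sql
```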

1 Answer


Answers:

  • Text file is the default file format for Hive table creation, but this can be configured via `hive.default.fileformat`, or you can specify the format explicitly while creating a table. You can upload the CSV file into any directory you want in HDFS. Once the data is in HDFS, you can create a table over the CSV data, specifying the format at creation time (see the sketch after this list).
  • Use Hue to create the table. It generates column names dynamically based on the header line in the CSV file, and it assumes every field is of the STRING datatype, so the datatypes need to be taken care of explicitly. Once the table is created in the Hive metastore, it can be queried through both Hive and Impala.
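As a minimal sketch of that flow from the shell, assuming hypothetical names throughout (a local file `data.csv`, an HDFS directory `/user/me/csv_data`, and a HiveServer2 at `localhost:10000`), and reusing the `create_table.sql` generated in the sketch above:

```sh
#!/bin/bash
# Minimal sketch, all names hypothetical: upload the CSV to HDFS,
# then run the DDL through beeline against HiveServer2.
hdfs dfs -mkdir -p /user/me/csv_data        # hypothetical HDFS directory
hdfs dfs -put data.csv /user/me/csv_data/   # hypothetical local file

# create_table.sql holds the CREATE EXTERNAL TABLE ... STORED AS TEXTFILE
# statement; the STORED AS clause is where the format is stated explicitly.
# Alternatively, change the session default with:
#   SET hive.default.fileformat=TextFile;
beeline -u jdbc:hive2://localhost:10000 -f create_table.sql
```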

This post will provide a good start: http://gethue.com/hadoop-tutorial-create-hive-tables-with-headers-and/

In a nutshell, the flow is (as shown below): move the data to HDFS => create the table using Hue (taking care of datatypes) => query the data using the Impala editor.
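For the querying step, a minimal sketch using `impala-shell`, with the host and table name carried over as assumptions from the sketches above; Impala does not automatically see a table created through Hive or Hue until its metadata cache is refreshed:

```sh
#!/bin/bash
# Minimal sketch: make a Hive/Hue-created table visible to Impala,
# then query it. Host and table name are hypothetical.
impala-shell -i impalad-host -q "INVALIDATE METADATA my_csv_table"
impala-shell -i impalad-host -q "SELECT COUNT(*) FROM my_csv_table"
```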

  • Thank you. What I still need is how to make files saved on HDFS default to Parquet: would the uploaded file be automatically saved as Parquet if I change the default to hive.default.fileformat = Parquet? – Choix Aug 06 '18 at 21:06