1

I have 53 csv files, each with different column names and contents. I want to write all the data in the csv file into one big table in Hive that contains all the necessary columns. So it may happen that columns are left empty for certain files. This depends on the csv file. Unfortunately I have no idea how to deal with this problem.

Is there a possibility to match column names from csv files to the Hive table? I've only find solutions with non-dynamic columns in the csv.

Do I have to set up a table for each file and join them afterwards?

Thank you!

leftjoin
  • 36,950
  • 8
  • 57
  • 116
Nicko
  • 19
  • 6

1 Answers1

1

Do I have to set up a table for each file and join them afterwards?

Yes, setup a table for each file and put file in each table location.

Some tables can be combined. For example if you have col2, col2, col3 in one CSV file and col1, col2, col3, col4, col5 in second CSV file (common columns are in the same positions and extra columns at the end), then you can create single table with superset of columns, columns absent in first file will be selected as NULL and not NULL from second file, put both file into the same table location.

Also you can combine different files (same columns positioned differently) into single table location (table should have as many string columns as in the widest file), then in the select you can use INPUT__FILE__NAME pseudocolumn, parse filename from it and calculate columns depending on it, for example something like this:

select 
case when INPUT__FILE__NAME rlike 'invoice\\.csv' then col1 
     when INPUT__FILE__NAME rlike 'transaction\\.csv' then col3
     else NULL
 end as invoice_nbr

After you created all tables, you can join them or UNION ALL + aggregate depending on the requirements.

Much better solution is to use JSON instead of CSV in this case:

  • JSON contains names, position does not matter
  • JSON allows absence of attributes
  • If some attribute is not in the JSON record, NULL returned as column value for that record
  • you can put JSON files with different content into the same location and create a table with superset of columns, see how to create table answer 1, answer 2 and answer 3
leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • 1
    Thanks @leftjoin. Maybe I will try another way like Apache NiFi to solve this problem, because I can't say for sure that every file has the same structure every day, because it content changes daily. – Nicko Jul 28 '21 at 10:54
  • 1
    @Nicko In such case JSON is much better format (use JSONSerDe) because you can define table with superset of possible columns and put all JSON files ito it's location, and if column is absent in JSON record, it will be NULL. This perfectly matches your scenario. JSON allows attributes to be absent. You can put different JSON files into the same table location. JSON has names inside and binding is not positional like in CSV – leftjoin Jul 28 '21 at 12:06
  • I appreciate your helpfulness. This works for me! – Nicko Jul 29 '21 at 09:18
  • 1
    @Nicko Added in the answer, hope it will be useful. – leftjoin Jul 30 '21 at 07:51
  • 1
    Thanks for updating your answer. I've just wanted to announce that I know successfully finished my project with your help! @leftjoin – Nicko Sep 16 '21 at 08:24