
Is it possible to get the header of a staged CSV file on Snowflake into an array?

I need to loop over all fields to insert data into our data vault model, so I really need to get these column names into an array.

alim1990

3 Answers


Actually, it was solved by using the following query over a staged file in a JavaScript stored procedure:

// build a query that returns the header's field count and the fields themselves as an array
var get_length_and_columns_array = "select array_size(split($1,',')) as NO_OF_COL, "+
        "split($1,',') as COLUMNS_ARRAY from "+FILE_FULL_PATH+" "+
        "(file_format=>"+ONE_COLUMN_FORMAT_FILE+") limit 1";

The ONE_COLUMN_FORMAT_FILE file format puts all fields into a single column, which makes this query work:

CREATE FILE FORMAT ONE_COLUMN_FILE_FORMAT 
TYPE = 'CSV' COMPRESSION = 'AUTO' FIELD_DELIMITER = '|' RECORD_DELIMITER = '\n' 
SKIP_HEADER = 0 
FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE' 
TRIM_SPACE = FALSE ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE 
ESCAPE = 'NONE' ESCAPE_UNENCLOSED_FIELD = '\134' 
DATE_FORMAT = 'AUTO' TIMESTAMP_FORMAT = 'AUTO' NULL_IF = ('\\N');
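
For reference, once the variables are substituted, the statement the procedure runs looks roughly like this (the stage path @my_stage/data.csv is a placeholder for whatever FILE_FULL_PATH resolves to):

-- sketch of the generated query; the stage and file name are assumptions
select array_size(split($1, ',')) as NO_OF_COL,
       split($1, ',') as COLUMNS_ARRAY
from @my_stage/data.csv (file_format => 'ONE_COLUMN_FILE_FORMAT')
limit 1;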
alim1990

Yes, you can query the following metadata of your staged files:

  • METADATA$FILENAME: Name of the staged data file the current row belongs to. Includes the path to the data file in the stage.
  • METADATA$FILE_ROW_NUMBER: Row number for each record in the staged data file.

So the metadata alone is not enough. But there is the parameter SKIP_HEADER, which can be used in your COPY INTO command. So my suggestion for a workaround is:

  1. Copy your data into a table using SKIP_HEADER = 0, so that the header row is loaded into your table as regular column values
  2. Query the first row, which contains the column names
  3. Use this as input for further processing (see the sketch below)

More info about the parameter is in the COPY INTO command reference: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
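
A minimal sketch of this workaround; the stage, table, and column names here are illustrative assumptions, not part of the original answer:

-- land each line in one column, capturing the file row number during the COPY
create or replace temporary table raw_lines (line string, row_num integer);

copy into raw_lines(line, row_num)
from (select $1, metadata$file_row_number from @my_stage/data.csv)
file_format = (type = 'CSV' field_delimiter = none skip_header = 0);

-- the header is the record on file row 1; split it into an array of column names
select split(line, ',') as columns_array
from raw_lines
where row_num = 1;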

Marcel
  • But how can I copy staged values into a temporary table without explicitly specifying the fields of this table first? – alim1990 Jan 18 '21 at 16:25
  • If I have to prepare DDL for a table with the same number of fields as the headers in the stage, that would not solve my problem. – alim1990 Jan 18 '21 at 16:26

Dynamically generating the column list from a CSV file is not currently available in Snowflake, or in most platforms AFAIK.

CSV is not the ideal format for this kind of schema-on-read operation.

If you are able to work with your input files, I would suggest converting the CSV to JSON. If you use JSON instead, you can then use Snowflake to process the file.

Here is some context:

Load CSV files with dynamic column headers: not supported

Example of loading json data

Example Converting CSV to JSON with Pandas

import pandas as pd

filepath = '/home/username/data/sales.csv'
jsonfilepath = filepath.replace('.csv', '.json')

# read the CSV, then write it back out as JSON records
df = pd.read_csv(filepath)

# alternative: df.to_json(jsonfilepath, orient="table", date_format="iso", index=False)
df.to_json(jsonfilepath, orient="records", date_format="iso")
print("Input File: {}\r\nOutput File: {}".format(filepath, jsonfilepath))

Example Converting CSV to JSON with csvkit

csvjson -i 4 '/home/username/data/sales.csv' > '/home/username/data/sales.csvkit.json'

Querying Semi-Structured Data in Snowflake

Loading JSON Data into Snowflake

/* Create a target relational table for the JSON data. The table is temporary, meaning it persists only for   */
/* the duration of the user session and is not visible to other users.                                        */

create or replace temporary table home_sales (
  city string,
  zip string,
  state string,
  type string default 'Residential',
  sale_date timestamp_ntz,
  price string
  );

/* Create a named file format with the file delimiter set as none and the record delimiter set as the new     */
/* line character.                                                                                            */
/*                                                                                                            */
/* When loading semi-structured data (e.g. JSON), you should set CSV as the file format type (default value). */
/* You could use the JSON file format, but any error in the transformation would stop the COPY operation,     */
/* even if you set the ON_ERROR option to continue or skip the file.                                          */

create or replace file format sf_tut_csv_format
  field_delimiter = none
  record_delimiter = '\\n';

/* Create a temporary internal stage that references the file format object.                                  */
/* Similar to temporary tables, temporary stages are automatically dropped at the end of the session.         */

create or replace temporary stage sf_tut_stage
  file_format = sf_tut_csv_format;

/* Stage the data file.                                                                                       */
/*                                                                                                            */
/* Note that the example PUT statement references the macOS or Linux location of the data file.               */
/* If you are using Windows, execute the following statement instead:                                         */

-- PUT file://%TEMP%/sales.json @sf_tut_stage;

put file:///tmp/sales.json @sf_tut_stage;

/* Load the JSON data into the relational table.                                                              */
/*                                                                                                            */
/* A SELECT query in the COPY statement identifies a numbered set of columns in the data files you are        */
/* loading from. Note that all JSON data is stored in a single column ($1).                                   */

copy into home_sales(city, state, zip, sale_date, price)
   from (select substr(parse_json($1):location.state_city,4), substr(parse_json($1):location.state_city,1,2),
                parse_json($1):location.zip, to_timestamp_ntz(parse_json($1):sale_date), parse_json($1):price
         from @sf_tut_stage/sales.json.gz t)
   on_error = 'continue';

/* Query the relational table                                                                                 */

select * from home_sales;
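
Once the data is JSON, the original question has a direct answer, because each record's keys are the CSV column names. A minimal sketch, assuming the staged file holds one JSON object per line as in the tutorial above (OBJECT_KEYS returns an object's top-level keys as an array):

-- pull the column names out of the first JSON record as an array
select object_keys(parse_json($1)) as columns_array
from @sf_tut_stage/sales.json.gz
limit 1;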
john.da.costa
  • Thanks. Can you check my answer? I found a way around it. – alim1990 Jul 12 '21 at 06:37
  • Looks good. If you use the COPY INTO here, I am curious whether you will have to set FORCE = TRUE the next time you want to load the data. It's definitely interesting. Snowflake is very powerful. Lots of flexibility. – john.da.costa Jul 13 '21 at 21:37
  • "or most platform" - a highly controversial statement; a counter-example that is quite frequently used to import CSV is Spark. Maybe changing it to "most databases" would sound less controversial. – MxR May 23 '22 at 03:49