1

I'm working on moving some Parquet files in S3 over to Snowflake. The Storage Integration & External Stage were created just fine, and when I run the list @mystage command I can see the file that I want to check out in S3 so I know it exists & that my connection works.

When I go to do any kind of querying on the columns or table metadata, nothing is returned.

I'm following the basic documentation from this Snowflake Page.

list @manual_test_stage;

enter image description here

create stage manual_test_stage
  url = 's3://cs-snowflake-poc/manual-loads/'
  storage_integration = cs_snowflake_poc_s3;

create or replace file format parquet_format
  type = 'parquet';

select *
  from table(
    infer_schema(
      location=>'@manual_test_stage/campspot_production.Campsite/test.parquet',
      file_format=>'parquet_format'
      )
    );

-- neither query works
select *
  from table(
    infer_schema(
      location=>'@manual_test_stage',
      file_format=>'parquet_format'
      )
    );

-- end goal is to use this to just create the table for me by automatically reading the schema, but i can't do that when select array_agg returns an empty array.
create table mytable
  using template (
    select array_agg(object_construct(*))
      from table(
        infer_schema(
          location=>'@manual_test_stage/campspot_production.Campsite/test.parquet',
          file_format=>'my_parquet_format'
        )
      ));

Tried all sorts of things: I recreated the stage to not include a / at the end, I tried like 5 different file format options in case that was screwing it up, and I loaded the parquet into Python to make sure it wasn't some issue with the actual file.

If anyone can help show me what I'm missing to get this to work it'd be much appreciated !

jyablonski
  • 711
  • 1
  • 7
  • 17

1 Answers1

2

INFER_SCHEMA allows to define FILES parameter.

select *
  from table(
    infer_schema(
      location=>'@manual_test_stage/campspot_production.Campsite',
      file_format=>'parquet_format',
      files=>'test.parquet'
      )
    );

Additionally checking the direct access is recommeded to confirm it is not a permission issue:

SELECT t.$1 
FROM "@manual_test_stage/campspot_production.Campsite/test.parquet"
(file_format=>'parquet_format') AS t;
Lukasz Szozda
  • 162,964
  • 23
  • 234
  • 275
  • trying the files parameter out but doesn't seem to be working unfortunately. also tried to rename that `campspot_production.Campsite` folder into something more simple like `final` incase the `.` was causing issues but it didn't seem to fix anything ;/ appreciate the suggestion though – jyablonski Sep 08 '22 at 17:38
  • @jyablonski Can you query the data? `SELECT t.$1 FROM "@manual_test_stage/campspot_production.Campsite/test.parquet"(file_format=>'parquet_format') AS t` – Lukasz Szozda Sep 08 '22 at 17:41
  • @jyablonski Second, which version of Parquet file do you use [What is Parquet?](https://docs.snowflake.com/en/user-guide/semistructured-data-formats.html#parquet): "Currently, Snowflake supports the schema of Parquet files produced using the Parquet writer v1. Files produced using v2 of the writer are not supported for data loading." – Lukasz Szozda Sep 08 '22 at 17:48
  • 1
    Figured it out thanks to your query comment! When setting up credentials I only listed 1 specific folder for the IAM Role to have `GET OBJECT` access to, but every other path had `LIST` access. That's why I was able to list out the files in that separate folder with the `@list` command, but not be able to access them. – jyablonski Sep 08 '22 at 17:58