
I am porting a Python project (S3 + Athena) from CSV to Parquet.

I can create the Parquet file, which can be viewed in Parquet View.

I can upload the file to the S3 bucket.

I can create the Athena table pointing to the S3 bucket.

However, when I query the table from the Athena web GUI, the query runs for 10 minutes (it looks like it will never finish) and no results are shown.

The whole project is complicated, so I have tried to simplify the case.

1. Let's say we have the following CSV file (test.csv):

    "col1","col2"
    "A","B"

2. Then I use the following Python (2.7) code to convert it to a Parquet file (test.parquet):

    import fastparquet
    import pandas as pd

    # Read the CSV into a DataFrame, then write it out as a GZIP-compressed Parquet file
    df = pd.read_csv(r"test.csv")
    fastparquet.write(r"test.parquet", df, compression="GZIP")
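
A quick way to sanity-check the file locally before uploading is to read it back with fastparquet, for example:

    import fastparquet

    # Read the freshly written file back to confirm it is a readable Parquet file
    pf = fastparquet.ParquetFile(r"test.parquet")
    print(pf.to_pandas())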

3. Upload test.parquet to the S3 bucket folder "abc_bucket/abc_folder" via the S3 web GUI.
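
The same upload can also be scripted with boto3 (the bucket and key below simply mirror the names above):

    import boto3

    # Upload the local Parquet file to s3://abc_bucket/abc_folder/test.parquet
    s3 = boto3.client("s3")
    s3.upload_file("test.parquet", "abc_bucket", "abc_folder/test.parquet")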

4. Create the following table via the Athena web GUI:

    CREATE EXTERNAL TABLE IF NOT EXISTS abc_folder (
        `col1` string,
        `col2` string)
    STORED AS PARQUET
    LOCATION 's3://abc_bucket/abc_folder/'
    TBLPROPERTIES (
        "parquet.compress"="GZIP"
    );

5. Finally, run the following SQL in Athena. It runs for 10 minutes and seems to go on forever.

    select *
    from abc_folder;
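
For reference, the same query can also be submitted through the Athena API, which at least exposes the query state and any failure reason. The database name "default" and the result location below are placeholder assumptions:

    import time
    import boto3

    athena = boto3.client("athena")

    # Submit the query; the output location is a placeholder for any writable S3 prefix
    qid = athena.start_query_execution(
        QueryString="SELECT * FROM abc_folder",
        QueryExecutionContext={"Database": "default"},
        ResultConfiguration={"OutputLocation": "s3://abc_bucket/athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state and print any failure reason
    while True:
        status = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]
        print(status["State"], status.get("StateChangeReason", ""))
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(5)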

My question is: which of the steps above is wrong, so that I cannot query the table from Athena?

Any help is highly appreciated.

kzfid
  • You can test it by just putting a single file in that location, with a small bit of data. That can verify whether Athena is reading the file format okay. How big is your data? Is it in multiple files? Partitioned? – John Rotenstein Sep 06 '18 at 04:46
  • @JohnRotenstein Thanks for your reply. I have already tried what you suggested. That is the simplified case I mentioned in my question, where there is only one file with 2 rows (1 header row and 1 content row) and no partitions. – kzfid Sep 06 '18 at 06:20
  • Hi @kzfid, did you find a solution to your problem? I am facing the exact same problem. – TechMaster May 11 '22 at 12:01
  • Hi @TechMaster, I tried a few different ways but failed to figure it out. In the end, I gave up on using Parquet. – kzfid May 12 '22 at 02:16
  • Okay @kzfid, my problem has been solved. I had a requirement to apply a transformation, so I converted CSV to JSON and then to Parquet. However, the initial problem in my case was with the column field names; as soon as I corrected them and referred to them according to the file schema, it started to show data. Another difference is that I am using Snappy instead of GZip, and I am also doing the conversion to Parquet differently, providing a schema as well. I believe that in your case the Parquet file wasn't created properly. – TechMaster May 12 '22 at 10:44
  • Hi @TechMaster, thanks very much for the update. From what you described, it looks like the main differences are (1) csv->json->parquet vs csv->parquet and (2) Snappy vs GZip. I no longer have access, as the project has been closed, but it is definitely worth trying if I get the chance later. For the column field name issue, do you mean there is a naming convention that needs to be followed, or was it just a typo (the field name in the CSV not matching the column name in the CREATE TABLE statement in Athena)? And are you also using fastparquet (Python) to do your csv->json->parquet conversion? – kzfid May 13 '22 at 01:29
  • 1
    Hi @kzfid , yes point1&2 is correct. For the column fields it was just a type i believe there isn't any convention. Actually we have a lambda func that does csv to json then we are using AWS Glue job to perform json --> perquet. (This part of code is fully generated via AWS Glue job). – TechMaster May 16 '22 at 12:39

2 Answers


Try to view your Parquet data in the S3 bucket itself with the "Select from" option. If it looks fine, then use Athena to create a table for your Parquet file with the proper table column headers. Then preview the table to view the content.
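
A minimal sketch of that check with boto3's S3 Select API, using the bucket and key from the question and returning the rows as JSON lines:

    import boto3

    s3 = boto3.client("s3")

    # Query the uploaded Parquet object directly with S3 Select
    resp = s3.select_object_content(
        Bucket="abc_bucket",
        Key="abc_folder/test.parquet",
        ExpressionType="SQL",
        Expression="SELECT * FROM s3object s LIMIT 5",
        InputSerialization={"Parquet": {}},
        OutputSerialization={"JSON": {}},
    )
    for event in resp["Payload"]:
        if "Records" in event:
            print(event["Records"]["Payload"].decode("utf-8"))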

Maxx

We can read a Parquet file in Athena by creating a table for the given S3 location.

    CREATE EXTERNAL TABLE abc_new_table (
        dayofweek INT,
        uniquecarrier STRING,
        airlineid INT
    )
    PARTITIONED BY (flightdate STRING)
    STORED AS PARQUET
    LOCATION 's3://abc_bucket/abc_folder/'
    TBLPROPERTIES ("parquet.compression"="SNAPPY");

This assumes the s3://abc_bucket/abc_folder/ prefix contains Parquet files compressed with SNAPPY. Note that flightdate is declared only in PARTITIONED BY (a partition column cannot also appear in the regular column list), and because the table is partitioned, the partitions have to be loaded (for example with MSCK REPAIR TABLE abc_new_table) before any rows are returned.
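
If the files are written with fastparquet as in the question, a Snappy-compressed file can be produced like this (a sketch; the python-snappy package must be installed for this codec):

    import fastparquet
    import pandas as pd

    df = pd.read_csv(r"test.csv")
    # SNAPPY instead of GZIP; requires the python-snappy package
    fastparquet.write(r"test.parquet", df, compression="SNAPPY")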

More details can be found in this AWS document.