I am porting a Python project (S3 + Athena) from CSV to Parquet.
I can create the Parquet file, which can be viewed by Parquet View.
I can upload the file to an S3 bucket.
I can create the Athena table pointing to the S3 bucket.
However, when I query the table in the Athena web console, the query runs for 10 minutes (it seems it will never stop) and no result is shown.
The whole project is complicated, so I have simplified it to the following case.
1. Say we have the following CSV file (test.csv):
"col1","col2"
"A","B"
2. Then I use the following Python (2.7) code to convert it to a Parquet file (test.parquet):
import fastparquet
import pandas as pd

# Read the CSV and write it back out as a GZIP-compressed Parquet file
df = pd.read_csv(r"test.csv")
fastparquet.write(r"test.parquet", df, compression="GZIP")
3. Upload test.parquet to the S3 bucket folder "abc_bucket/abc_folder" via the S3 web console.
4. Create the following table via the Athena web console:
CREATE EXTERNAL TABLE IF NOT EXISTS abc_folder (
`col1` string,
`col2` string)
STORED AS PARQUET
LOCATION 's3://abc_bucket/abc_folder/'
TBLPROPERTIES (
"parquet.compress"="GZIP"
);
5. Finally, run the following SQL in Athena. The query runs for 10 minutes and seems like it will never finish:
select *
from abc_folder;
My question is: which of the steps above is wrong, so that I cannot query the table from Athena?
Any help is highly appreciated.