I have raw data in RDS. I am using AWS Glue to crawl the data, and export the data through an ETL script in Glue (a full exact copy, no transformations or edits for now) to S3 as a single CSV file. I am trying to visualise this data in QuickSight but am getting different insights vs when I run SQL queries on the data itself. Later I will edit the Glue jobs to export different 'cuts' of data to S3 and then visualise.
What I tried and the issues?
- I downloaded the S3 CSV to ensure the CSV is generating the same output as SQL queries on the RDS table and it matches 100%. This means the issue is with QuickSight, not Glue.
I tried to connect the S3 file using 'S3' as a data source in QuickSight. I used the script provided below. The problem is it uses SPICE and I'm trying to run a direct query only (as my data will grow 100x soon, this is just a POC). This code is what is used as the manifest:
{ "fileLocations": [ { "URIs": [ "s3://pathofglueoutput/part-00000-blahblah.csv" ] }, { "URIPrefixes": [ "s3://pathofglueoutput/" ] } ], "globalUploadSettings": { "format": "CSV", "delimiter": ",", "containsHeader": "true" } }
I tried to connect the S3 file to Athena, and then import the data from Athena to QuickSight. My issue is that Athena won't 'refresh daily' so this is not a scalable approach, hence I was looking to use S3 (option 2 above) anyway.
I am struggling why the data from the ETL output of Glue (e.g. select count(field1) of s3data) and the same count on QuickSight (count - not distinct - of field1 of s3data) is completely different.