1

I have raw data in RDS. I am using AWS Glue to crawl the data, and export the data through an ETL script in Glue (a full exact copy, no transformations or edits for now) to S3 as a single CSV file. I am trying to visualise this data in QuickSight but am getting different insights vs when I run SQL queries on the data itself. Later I will edit the Glue jobs to export different 'cuts' of data to S3 and then visualise.

What I tried and the issues?

  1. I downloaded the S3 CSV to ensure the CSV is generating the same output as SQL queries on the RDS table and it matches 100%. This means the issue is with QuickSight, not Glue.
  2. I tried to connect the S3 file using 'S3' as a data source in QuickSight. I used the script provided below. The problem is it uses SPICE and I'm trying to run a direct query only (as my data will grow 100x soon, this is just a POC). This code is what is used as the manifest:

    {
        "fileLocations": [
            {
                "URIs": [
                    "s3://pathofglueoutput/part-00000-blahblah.csv"
                ]
            },
            {
                "URIPrefixes": [
                    "s3://pathofglueoutput/"
                ]
            }
        ],
        "globalUploadSettings": {
            "format": "CSV",
            "delimiter": ",",
            "containsHeader": "true"
        }
    }
    
  3. I tried to connect the S3 file to Athena, and then import the data from Athena to QuickSight. My issue is that Athena won't 'refresh daily' so this is not a scalable approach, hence I was looking to use S3 (option 2 above) anyway.

I am struggling why the data from the ETL output of Glue (e.g. select count(field1) of s3data) and the same count on QuickSight (count - not distinct - of field1 of s3data) is completely different.

Ilya Kisil
  • 2,490
  • 2
  • 17
  • 31
charlesperry
  • 105
  • 1
  • 2
  • 7
  • Turns out it was an ETL issue - the nulls were not actually null in the data causing issues with counts. Used this link to resolve: https://forums.databricks.com/questions/790/how-do-i-replace-nulls-with-0s-in-a-dataframe.html – charlesperry Oct 24 '19 at 10:27

0 Answers0