Data Inconsistencies After ETL in QuickSight?

Question

I have raw data in RDS. I am using AWS Glue to crawl the data, and export the data through an ETL script in Glue (a full exact copy, no transformations or edits for now) to S3 as a single CSV file. I am trying to visualise this data in QuickSight but am getting different insights vs when I run SQL queries on the data itself. Later I will edit the Glue jobs to export different 'cuts' of data to S3 and then visualise.

What I tried and the issues?

I downloaded the S3 CSV to ensure the CSV is generating the same output as SQL queries on the RDS table and it matches 100%. This means the issue is with QuickSight, not Glue.

I tried to connect the S3 file using 'S3' as a data source in QuickSight. I used the script provided below. The problem is it uses SPICE and I'm trying to run a direct query only (as my data will grow 100x soon, this is just a POC). This code is what is used as the manifest:

{
    "fileLocations": [
        {
            "URIs": [
                "s3://pathofglueoutput/part-00000-blahblah.csv"
            ]
        },
        {
            "URIPrefixes": [
                "s3://pathofglueoutput/"
            ]
        }
    ],
    "globalUploadSettings": {
        "format": "CSV",
        "delimiter": ",",
        "containsHeader": "true"
    }
}

I tried to connect the S3 file to Athena, and then import the data from Athena to QuickSight. My issue is that Athena won't 'refresh daily' so this is not a scalable approach, hence I was looking to use S3 (option 2 above) anyway.

I am struggling why the data from the ETL output of Glue (e.g. select count(field1) of s3data) and the same count on QuickSight (count - not distinct - of field1 of s3data) is completely different.

Turns out it was an ETL issue - the nulls were not actually null in the data causing issues with counts. Used this link to resolve: https://forums.databricks.com/questions/790/how-do-i-replace-nulls-with-0s-in-a-dataframe.html — charlesperry, Oct 24 '19 at 10:27

Data Inconsistencies After ETL in QuickSight?

0 Answers0