I've been working on a project that uses a fairly simple data pipeline, built with Python 3.8 and Lambda, to clean and transform raw CSV files into processed data and to create various subsets that are sent to respective S3 buckets. The Lambda function is triggered by uploading a raw CSV file to an intake S3 bucket, which initiates the process.

However, I would like that same Lambda function to also send some of the processed data directly to QuickSight for ingestion, so it can be inspected visually, and that's where I'm currently stuck.

Here is a portion of the function (leaving out the imports) with just the CSV processing and uploading to S3; this is the portion I'd like directly ingested into QuickSight:

def featureengineering(event, context):
    # s3_client / s3_resource are boto3 clients created outside this snippet
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    s3_file_name = event['Records'][0]['s3']['object']['key']
    read_file = s3_client.get_object(Bucket=bucket_name, Key=s3_file_name)

    # turn the CSV into a DataFrame in AWS Lambda
    s3_data = io.BytesIO(read_file.get('Body').read())
    df = pd.read_csv(s3_data, encoding="ISO-8859-1")

    # replace erroneous zero values with NaN (missing), which is more accurate,
    # and create a new column with just three stages for simplification
    df[['Column_A', 'Column_B']] = df[['Column_A', 'Column_B']].replace(0, np.nan)
    # apply the 'newstage' function for feature engineering
    df['NewColumn'] = df.Stage.apply(newstage)

    df1 = df
    csv_buffer1 = io.StringIO()  # buffer was defined outside the original snippet
    df1.to_csv(csv_buffer1)
    s3_resource.Object(bucket1, csv_file_1).put(Body=csv_buffer1.getvalue())  # uploading df1 to S3

At that point df1 is sent to its S3 bucket (which works fine), but I'd also like it ingested directly into QuickSight as an automated SPICE refresh.

In digging around I found a similar question with an answer:

import boto3

client = boto3.client('quicksight')
response = client.create_ingestion(
    DataSetId='<dataset-id>',
    IngestionId='<ingestion-id>',
    AwsAccountId='<aws-account-id>'
)

but the hang-up I'm having is with the DataSetId, or more generally: how do I turn the pandas DataFrame df1 in the Lambda function into something the CreateIngestion API can accept, so that QuickSight automatically runs a SPICE refresh of the most recent processed data?

JLuu

1 Answer

You should first create a QuickSight dataset; quoting from the docs:

A dataset identifies the specific data in a data source that you want to use. For example, the data source might be a table if you are connecting to a database data source. It might be a file if you are connecting to an Amazon S3 data source.

Once you have saved your DataFrame to S3 (as either a CSV or Parquet file), you can create a QuickSight dataset that sources its data from it.
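For instance, here is a minimal sketch of doing that with boto3. All the names here (account ID, region, bucket, keys, dataset and data source IDs, column types) are hypothetical placeholders; note that a QuickSight S3 data source reads a manifest file rather than the CSV directly, so the Lambda also needs to upload one:

import json
import boto3

qs = boto3.client('quicksight')
s3 = boto3.client('s3')
ACCOUNT_ID = '111122223333'   # hypothetical AWS account ID
REGION = 'us-east-1'          # hypothetical region

# Upload a manifest that points at the processed CSV.
manifest = {
    "fileLocations": [{"URIs": ["s3://my-output-bucket/df1.csv"]}],  # hypothetical URI
    "globalUploadSettings": {"format": "CSV", "containsHeader": "true"}
}
s3.put_object(Bucket='my-output-bucket', Key='manifests/df1.json',
              Body=json.dumps(manifest))

# One-time setup: an S3 data source that points at the manifest...
qs.create_data_source(
    AwsAccountId=ACCOUNT_ID,
    DataSourceId='processed-csv-source',      # hypothetical ID
    Name='processed-csv-source',
    Type='S3',
    DataSourceParameters={'S3Parameters': {'ManifestFileLocation': {
        'Bucket': 'my-output-bucket', 'Key': 'manifests/df1.json'}}}
)

# ...and a SPICE dataset that references that data source.
qs.create_data_set(
    AwsAccountId=ACCOUNT_ID,
    DataSetId='processed-df1',                # reuse this ID in create_ingestion
    Name='processed-df1',
    ImportMode='SPICE',
    PhysicalTableMap={'df1': {'S3Source': {
        'DataSourceArn': f'arn:aws:quicksight:{REGION}:{ACCOUNT_ID}:datasource/processed-csv-source',
        'InputColumns': [
            {'Name': 'Column_A', 'Type': 'STRING'},   # column list must match the CSV
            {'Name': 'Column_B', 'Type': 'STRING'},
            {'Name': 'NewColumn', 'Type': 'STRING'},
        ]}}}
)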

You can do so either via the Console or programmatically (probably what you're looking for).

Finally, once you have the dataset ID, you can reference it in other QuickSight API calls, such as CreateIngestion.
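For example, once the dataset exists its ID stays fixed across runs; each Lambda invocation only needs a fresh IngestionId to trigger a SPICE refresh (a sketch, reusing the hypothetical IDs from above):

import time
import boto3

qs = boto3.client('quicksight')

# The DataSetId is stable across runs; only the IngestionId must be unique.
response = qs.create_ingestion(
    AwsAccountId='111122223333',                 # hypothetical account ID
    DataSetId='processed-df1',
    IngestionId=f'refresh-{int(time.time())}'    # unique per invocation
)
print(response['IngestionStatus'])               # e.g. INITIALIZED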

Andre.IDK
  • Interesting. So it cannot be done directly from the processing Lambda function itself? Would the DataSetId have to be changed/updated with each new csv being processed? If so, that would seem to defeat the ability to have the ingestion fully automated. – JLuu Sep 24 '20 at 00:00
  • Yes, it can be done in the same Lambda function. The third link (the one on the word "programmatically") shows how to do it with the AWS SDK. – Andre.IDK Sep 24 '20 at 10:25
  • Hi @Andre.IDK, thank you. Just wanted to clarify, though: your original response and the documentation on creating a QuickSight dataset indicate that it would be pulling from S3, so the DataFrame being processed would first have to be uploaded to an S3 bucket, but the SDK call sending it to QuickSight can still be in the Lambda function doing the processing? I'm just wrapping my mind around the steps and the timing of when things are being triggered and moved through the data pipeline. Interesting that it seems the DataFrame can't be sent directly from Lambda to QuickSight like it is with S3. – JLuu Sep 24 '20 at 16:32
  • To understand what a dataset is, I would suggest you think of it more like a reference to some data somewhere rather than an actual DataFrame. QuickSight itself doesn't store data but uses entities called datasets to display the referenced data. This means that, at least in your case, in order to make some data available to QuickSight you'll have to read the data, process it, store it on S3, create a QuickSight Dataset that points to that data on S3, and then finally tell QuickSight to use it. All this can be done in one Lambda, provided that execution time and complexity can be handled in it (a sketch of this flow follows these comments). – Andre.IDK Sep 24 '20 at 16:37
  • that's super interesting and good to know. Thank you! – JLuu Sep 25 '20 at 15:48
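Putting the thread together, here is a hedged sketch of how the tail of the questioner's featureengineering handler could trigger the refresh after the CSV upload. The create_quicksight_dataset helper is hypothetical (it would wrap the create_data_source/create_data_set calls shown earlier), and the IDs are the same placeholders as above:

import time
import boto3

qs = boto3.client('quicksight')
ACCOUNT_ID = '111122223333'   # hypothetical account ID

def refresh_quicksight(dataset_id='processed-df1'):
    """Called at the end of featureengineering(), after the CSV upload."""
    try:
        create_quicksight_dataset()   # hypothetical helper wrapping the
                                      # create_data_source/create_data_set calls above
    except qs.exceptions.ResourceExistsException:
        pass                          # dataset already created on a previous run
    qs.create_ingestion(
        AwsAccountId=ACCOUNT_ID,
        DataSetId=dataset_id,
        IngestionId=f'refresh-{int(time.time())}'   # unique per invocation
    )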