
I've read a bit about how to load my S3 data into Redshift (the COPY command, Glue, etc.). My pipeline is almost entirely in NiFi, something like: extract data -> insert into S3 -> execute a Lambda process to transform or enrich the data using Athena, in 2 or 3 stages, into another S3 bucket (let's call it the processed bucket).

Now I want to continue this pipeline by loading the data from the processed bucket and inserting it into Redshift. I have an empty table created for this.

The idea is to load some tables incrementally, and for others to delete all the data loaded that day and reload it.

Can anyone give me a hint of where to start? Thank you!

Alejandro

1 Answer


When data lands in your "processed bucket", you can fire a Lambda function that triggers a flow in Apache NiFi by calling an HTTP webhook (a minimal Lambda sketch follows the processor descriptions below). To expose such a webhook, you can use one of the following processors:

ListenHTTP

Starts an HTTP Server and listens on a given base path to transform incoming requests into FlowFiles. The default URI of the Service will be http://{hostname}:{port}/contentListener. Only HEAD and POST requests are supported. GET, PUT, and DELETE will result in an error and the HTTP response status code 405.

HandleHttpRequest

Starts an HTTP Server and listens for HTTP Requests. For each request, creates a FlowFile and transfers to 'success'. This Processor is designed to be used in conjunction with the HandleHttpResponse Processor in order to create a Web Service.
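As a rough sketch of the Lambda side (the NiFi hostname, port, and path below are placeholders, and the path has to match the Base Path you configure on ListenHTTP), a function subscribed to the processed bucket's ObjectCreated events could simply POST the new object's bucket and key to the listener:

```python
import json
import urllib.request

# Placeholder endpoint: host and port depend on your NiFi instance, and the
# path must match the Base Path configured on ListenHTTP (default: contentListener).
NIFI_WEBHOOK = "http://my-nifi-instance.example.com:8081/contentListener"

def lambda_handler(event, context):
    # S3 ObjectCreated notifications carry the bucket and key of each new
    # object that landed in the processed bucket.
    for record in event.get("Records", []):
        payload = {
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
        }
        req = urllib.request.Request(
            NIFI_WEBHOOK,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",  # ListenHTTP only accepts HEAD and POST
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(f"NiFi answered {resp.status} for {payload['key']}")
    return {"statusCode": 200}
```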

So the flow would be ListenHTTP -> FetchS3Object -> Process -> PutSQL (with a Redshift connection pool). The Lambda function would call POST my-nifi-instance.com:PORT/my-webhook (as in the sketch above), so that ListenHTTP creates a FlowFile for the incoming request.
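As for what actually gets executed against Redshift, your two patterns (incremental append vs. delete-today-and-reload) boil down to SQL along these lines. Everything here is a placeholder: schema/table names, the processed-bucket prefixes, the IAM role ARN, the load_date column, and the connection details. It's written as a standalone psycopg2 script only to keep the example self-contained; the same statements could just as well be fed to PutSQL as FlowFile content.

```python
import psycopg2

# All identifiers below are placeholders for illustration only.
COPY_INCREMENTAL = """
    COPY analytics.events
    FROM 's3://processed-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

RELOAD_TODAY = """
    DELETE FROM analytics.daily_snapshot WHERE load_date = CURRENT_DATE;
    COPY analytics.daily_snapshot
    FROM 's3://processed-bucket/daily_snapshot/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

def run(sql: str) -> None:
    # Redshift speaks the PostgreSQL wire protocol, so psycopg2 works.
    # The DELETE and COPY in RELOAD_TODAY run in the same transaction and
    # are committed together.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="loader",
        password="REPLACE_ME",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
        conn.commit()
    finally:
        conn.close()

run(COPY_INCREMENTAL)   # incremental tables: just append the new files
run(RELOAD_TODAY)       # reload tables: wipe today's load, then copy again
```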

DarkLeafyGreen