
I want to receive data from multiple URLs; you can think of each URL as representing one device. I could create a separate flow starting with GetHTTP for each device, but that approach does not scale for me. Another option is a single flow that starts with GenerateFlowFile (with every URL defined in that processor), splits the content, and sends each URL to an InvokeHTTP processor. But then the URLs would be requested sequentially, so I could lose data from the other devices while a request to one URL is in progress.

What can I do in this case?

Edit: For my use case, I first have to receive data from multiple URLs, and then send that data to Kafka after applying some transformations. I have to get data from about 50 or more URLs, and I need to do this in real time and in a scalable way on a NiFi cluster.

lifeisshort

1 Answer


Use the same flow you described in the question.

Flow described in the question:

1. GenerateFlowFile
2. SplitText
3. ExtractText
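As a rough illustration of what these three steps do to the data, here is a minimal Python sketch rather than NiFi configuration. The device URLs, the regular expression, and the `url` attribute name are assumptions made up for the example; in NiFi the list would be the text entered in GenerateFlowFile, and the attribute would come from a property you add to ExtractText.

```python
import re

# Hypothetical device endpoints, one per line, standing in for the text
# that GenerateFlowFile would emit as a single flowfile.
CUSTOM_TEXT = """http://device-1.example/data
http://device-2.example/data
http://device-3.example/data"""

# Assumed pattern: the whole line is one http(s) URL.
URL_PATTERN = re.compile(r"^(https?://\S+)$")


def split_text(content):
    """Return one item per non-empty line, like splitting one line per flowfile."""
    return [line.strip() for line in content.splitlines() if line.strip()]


def extract_text(flowfile_content):
    """Copy the URL from the content into an attribute dict; None if unmatched."""
    match = URL_PATTERN.match(flowfile_content)
    if match is None:
        return None
    return {"url": match.group(1)}


flowfiles = [extract_text(line) for line in split_text(CUSTOM_TEXT)]
print(flowfiles)
```

After this step every URL travels as its own flowfile, which is what allows the later processors to handle them independently.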

Then feed the matched relationship of the ExtractText processor to a Remote Process Group to distribute the load across the cluster.

Then take the distributed flowfiles, feed them to an InvokeHTTP processor, and schedule that processor to run more than one concurrent task in the Scheduling tab.
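To see why more than one concurrent task addresses the sequential-request concern, here is a small Python sketch of the idea. It is only an analogy for the scheduling setting, not how NiFi is implemented: `fake_fetch`, the URLs, and the 0.2 second delay are all hypothetical, and the function sleeps instead of making a real HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical device endpoints; no real network traffic is generated.
URLS = [f"http://device-{i}.example/data" for i in range(1, 6)]


def fake_fetch(url):
    """Stand-in for one HTTP request; sleeps to simulate network latency."""
    time.sleep(0.2)
    return url, f"payload from {url}"


def fetch_all(urls, concurrent_tasks):
    """Fetch every URL using up to `concurrent_tasks` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrent_tasks) as pool:
        return dict(pool.map(fake_fetch, urls))


start = time.perf_counter()
results = fetch_all(URLS, concurrent_tasks=5)
elapsed = time.perf_counter() - start
print(f"fetched {len(results)} urls in {elapsed:.2f}s")
```

With `concurrent_tasks=1` the simulated requests run one after another; with a higher value they overlap, so a slow device does not hold up the others.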

Then use a PublishKafkaRecord processor, define the Record Reader/Writer and schema, and change its scheduling to run more than one concurrent task as well.

Final Flow:

1. GenerateFlowFile
2. SplitText
3. ExtractText
4. Remote Process Group, or a load-balanced connection (starting with NiFi 1.8.0)
5. InvokeHTTP // more than one concurrent task
6. Remote Process Group, or a load-balanced connection (starting with NiFi 1.8.0) // optional
7. PublishKafkaRecord // more than one concurrent task

Try the above flow. I believe the Kafka processors are very scalable and should give you the performance you expect :)

In addition

Starting with NiFi 1.8, we no longer need a Remote Process Group to distribute the load, because connections (relationships) can be configured to load balance across the cluster.
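Round robin is one of the strategies a load-balanced connection offers. The following Python sketch shows only the concept of spreading URL flowfiles evenly over nodes; the node names and URLs are hypothetical, and this is not NiFi's actual implementation.

```python
from itertools import cycle


def round_robin(flowfiles, nodes):
    """Assign each flowfile to the next node in turn."""
    assignment = {node: [] for node in nodes}
    for flowfile, node in zip(flowfiles, cycle(nodes)):
        assignment[node].append(flowfile)
    return assignment


# Hypothetical three-node cluster and seven device URLs.
NODES = ["nifi-node-1", "nifi-node-2", "nifi-node-3"]
URLS = [f"http://device-{i}.example/data" for i in range(1, 8)]

assignment = round_robin(URLS, NODES)
for node, urls in assignment.items():
    print(node, len(urls))
```

Each node then runs its own InvokeHTTP tasks against only its share of the URLs, which is where the cluster-wide scaling comes from.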

Refer to this and the NIFI-5516 links for more details on these additions in NiFi 1.8.

notNull
  • Firstly, thank you so much for your answer. Now I'm starting to understand better. Let me ask again to make sure I understand: when I run InvokeHTTP and PublishKafkaRecord with multiple concurrent tasks on the entire cluster, can I get data from all the URLs fast enough and in a scalable way? (Assume the load is distributed through the relationships.) Finally, assuming I am working with NiFi 1.8, the RemoteProcessorGroup will be removed. – lifeisshort Dec 08 '18 at 18:47
  • @AdamJ. It all depends on how much data is being processed at one time. Distributing the load will not hurt performance; instead it helps to process the data `more efficiently`, and we are also running more than one **concurrent task**. In NiFi 1.8 we don't have to use `RemoteProcessorGroup`, as we can distribute the load in relationships. Please do load/performance tests with your data and tune the processors to get the most out of NiFi. https://community.hortonworks.com/articles/16461/nifi-understanding-how-to-use-process-groups-and-r.html – notNull Dec 08 '18 at 19:12
  • Okay, thank you again. I will create this flow and then run load/performance tests. I will also edit this question again and share the test results afterwards. – lifeisshort Dec 08 '18 at 19:23
  • @AdamJ. Great! I forgot to mention that it would be good to have load balancing after the `InvokeHTTP` processor as well. Test the performance `With Connection Loadbalancing after InvokeHTTP` vs `Without Connection Loadbalancing after InvokeHTTP` and use the load balancing if the performance is significantly different. – notNull Dec 08 '18 at 20:08
  • Okay, I am going to try this. Thank you so much again. – lifeisshort Dec 08 '18 at 20:42