2

I need to process 600 million records in a multithreaded way, and each request takes 5-6 seconds. In my Spring Boot application I need to create 1000 threads, but Tomcat supports only 200. What is the best way to proceed?

Udayan

4 Answers

3

You can totally control the number of threads Tomcat creates in /apache-tomcat/conf/server.xml:

<Connector connectionTimeout="20000"
           maxThreads="1000"
           port="8080"
           protocol="HTTP/1.1"
           redirectPort="8443" />

You can raise this up to your OS's limit for threads, which is about 2000 on a Mac.

But I think creating 1000 threads isn't going to help you very much. Loosely, you can only execute as many simultaneous threads as you have cores on your machine.

So, if each record keeps a core busy for about 5 seconds, a 4-core machine will take ~24 years to process your 600 million records. With 32 cores you will get it down to a single-digit number of years.

What would I do? I would look into something like Apache Beam, which will parallelize your workflow across many, many machines. Take a look at https://cloud.google.com/dataflow/. You can set up your job to requisition 1000 4-core machines. Google will spin them up and tear them down for you. The job would take about 9 days. A back-of-the-envelope calculation shows that getting your answer will cost you about $8,640.
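The estimates above can be reproduced with a few lines of arithmetic. This is only a sketch under the same simplifying assumptions as the answer: 5 seconds per record, one record per core at a time, and perfect scaling. The class and method names are made up for illustration, and the hourly rate is merely what the $8,640 figure implies for 1000 machines over 9 days, not a quoted price.

```java
import java.time.Duration;

class BackOfEnvelope {
    static final long RECORDS = 600_000_000L;
    static final long SECONDS_PER_RECORD = 5; // assumption: lower end of the 5-6 s range

    /** Wall-clock time if `parallelism` records are always being processed at once. */
    static Duration wallClock(long parallelism) {
        return Duration.ofSeconds(RECORDS * SECONDS_PER_RECORD / parallelism);
    }

    static double years(Duration d) {
        return d.getSeconds() / (365.0 * 24 * 3600);
    }

    static double days(Duration d) {
        return d.getSeconds() / (24.0 * 3600);
    }

    /** Hourly price per machine implied by a total cost, machine count, and run length. */
    static double impliedHourlyRate(double totalCost, long machines, long runDays) {
        return totalCost / (machines * runDays * 24.0);
    }

    public static void main(String[] args) {
        System.out.printf("4 cores:    %.1f years%n", years(wallClock(4)));
        System.out.printf("32 cores:   %.1f years%n", years(wallClock(32)));
        System.out.printf("4000 cores: %.1f days%n", days(wallClock(1000 * 4)));
        System.out.printf("implied rate: $%.2f per machine-hour%n",
                impliedHourlyRate(8640.0, 1000, 9));
    }
}
```

The numbers work out to roughly 23.8 years on 4 cores, 3.0 years on 32 cores, and 8.7 days on 4000 cores, with an implied rate of about $0.04 per machine-hour.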

Robert Moskal
1

If you want to stay efficient, you most likely don't want to use 1000 threads unless your machine has 1000 CPUs. If your tasks are CPU-bound, then the number of worker threads should be close to the CPU count; otherwise you will waste cycles on CPU scheduling.
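For the CPU-bound case, a minimal sketch of sizing a pool to the CPU count with the plain JDK might look like this. The `crunch` method is a hypothetical stand-in for real per-record work, and the class name is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class CpuBoundPool {
    /** Hypothetical stand-in for a CPU-bound step on one record. */
    static long crunch(int seed) {
        long acc = seed;
        for (int i = 0; i < 1_000_000; i++) {
            acc = acc * 31 + i;
        }
        return acc;
    }

    /** Runs taskCount tasks on a pool with one worker thread per available CPU. */
    static List<Long> processAll(int taskCount) {
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                final int seed = i;
                Callable<Long> task = () -> crunch(seed);
                futures.add(pool.submit(task));
            }
            List<Long> results = new ArrayList<>();
            for (Future<Long> f : futures) {
                results.add(f.get()); // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("workers: " + Runtime.getRuntime().availableProcessors());
        System.out.println("results: " + processAll(16).size());
    }
}
```

Extra tasks simply queue until a worker is free, so the thread count stays at the CPU count regardless of how many records are submitted.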

Since your question lacks any technical details, I'd suggest closing it. Write a new one explaining the basics of your problem:

  • How are you receiving requests? Over HTTP? LAN or WAN? Can it be changed to something else, e.g. because the request data is generated from an external database?
  • How are you processing the requests? Is it a CPU-bound calculation, or are you making fan-out requests to other systems to enrich the data?
  • How are you saving the processing results?
  • How do you plan to handle failures? If processing one request fails, do you plan to repeat all 600 million requests?
Karol Dowbecki
  • Actually, I am calling SOAP (an HTTP request) and the response time is 3-5 seconds; then the response is analysed and another REST call is made to store the data in Google Cloud. – Udayan Dec 22 '19 at 16:07
  • Failure handling: if one call fails, it will skip it and proceed. – Udayan Dec 22 '19 at 16:07
0

If using Spring is a must, you can check out Spring Cloud Data Flow instead of Apache Beam.

If you want to accomplish this using only Tomcat and Spring Boot, you will have to scale up the number of instances. Scaling up will provide more cores, but it may not be the best way to do it.

I would also suggest using Tomcat with NIO, which should improve performance.

rv.comm
  • OK, I will check this as you have suggested. – Udayan Dec 22 '19 at 16:09
  • Let me explain a bit. I am reading an Excel sheet and, based on the data in the sheet, I am creating a payload and calling SOAP. After getting the response, I check it and store part of the response data in Google Cloud. One thread can execute the process in 5 sec. – Udayan Dec 22 '19 at 17:34
  • @Udayan based on what you said I would suggest this [https://cloud.google.com/functions/use-cases/real-time-data-processing] as the optimal way to do it. This way you do not have to maintain the server, container...you just have to maintain the Cloud Function (your SOAP call and storing it to google cloud). – rv.comm Dec 23 '19 at 13:03
0

What happens in those 5-6 seconds? Does it do a computation using CPU, or is it sending data to somewhere else and waiting for it to return?

In the second case, you don't need to spin up 1000 threads to run 1000 queries in parallel; you can use @Async instead, if the other backend supports it. You would then need only a small pool of input and output threads.

You can use Spring WebFlux for that. By default, however, WebFlux does not run on Tomcat but on a non-blocking HTTP server built on Netty (Reactor Netty); see e.g. https://www.baeldung.com/spring-webflux.

This can only work if you can execute each step in a reactive way. In your case, make the SOAP call with the reactive WebClient so that the data is sent without blocking, and subscribe a second non-blocking step to the SOAP response to upload the data to Google Cloud.
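To illustrate why waiting on I/O does not need one thread per request, here is a rough sketch using only the JDK's CompletableFuture. It is not WebFlux or WebClient code: `remoteCall` is a hypothetical stand-in that simulates a non-blocking remote call with a timer, and the two chained steps only mimic the "SOAP call, analyse, store" flow described in the comments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class NonBlockingPipeline {
    /** Hypothetical non-blocking call: completes with the payload after latencyMillis. */
    static CompletableFuture<String> remoteCall(ScheduledExecutorService timer,
                                                String payload, long latencyMillis) {
        CompletableFuture<String> result = new CompletableFuture<>();
        timer.schedule(() -> { result.complete(payload); }, latencyMillis, TimeUnit.MILLISECONDS);
        return result;
    }

    /** Starts all requests at once on a single timer thread and waits for all of them. */
    static List<String> run(int requests, long latencyMillis) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            List<CompletableFuture<String>> inFlight = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                inFlight.add(
                    remoteCall(timer, "record-" + i, latencyMillis)              // simulated SOAP call
                        .thenApply(String::toUpperCase)                          // analyse the response
                        .thenCompose(r -> remoteCall(timer, r, latencyMillis))); // simulated store
            }
            CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> f : inFlight) {
                results.add(f.join()); // already complete, so this does not block
            }
            return results;
        } finally {
            timer.shutdown();
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        List<String> results = run(1000, 200);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(results.size() + " requests finished in " + elapsedMillis + " ms");
    }
}
```

Run sequentially, 1000 requests with two 200 ms waits each would take about 400 seconds; here all of them are in flight together, so the total should be close to the two latencies added up, using a single thread. The same idea only pays off in the real application if both the SOAP client and the Google Cloud upload offer genuinely non-blocking APIs.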

GeertPt
  • Hi Grey, I am calling SOAP and the response comes back in 3-4 seconds; then I manipulate the response and store it in Google Cloud. – Udayan Dec 22 '19 at 16:30
  • @Udayan your use case is perfect for WebFlux, assuming you can do the upload to Google Cloud using a non-blocking call, too. – GeertPt Dec 28 '19 at 21:53