3

I am using multithreading to process a huge number of records coming from a file. Each line is one record, and I pass each line to a separate thread for processing. The problem is that I have to collect these processed records, plus some additional data generated while processing them, and then apply business logic on the final collection of data. I pass a common ConcurrentHashMap to all the threads to populate with the processed data, and when I debugged the application through VisualVM I found (screenshot below) that these threads spend far more time waiting than running. I suppose that is because of the lock a thread acquires while writing to the ConcurrentHashMap.

Is there a way I could implement completely asynchronous behavior to achieve my goal?

[VisualVM snapshot: worker threads shown mostly in the waiting state]

Varun Maurya
  • 337
  • 4
  • 18
  • 2
    How many threads do you have, and how many cores? I suspect (given your screenshot of threads #188 to #202) that you have far more threads than cores, and consequently have a scheduling issue – Brian Agnew Mar 16 '17 at 15:35
  • There are 15 cores on the Linux box and we are spawning only 10 threads as of now. Cores counted with: **cat /proc/cpuinfo | awk '/^processor/{print $3}' | tail -1** – Varun Maurya Mar 16 '17 at 15:38
  • The thread numbers are higher because we process many files, and every time we kill the executor; so the next time I create a new executor, the thread numbers continue from the last run – Varun Maurya Mar 16 '17 at 15:41
  • Ok. That makes some sense. I'd perhaps post some code then – Brian Agnew Mar 16 '17 at 15:45
  • do the line-processing threads write to the hashmap many times during execution or just once at the end? – nandsito Mar 16 '17 at 15:45
  • Based on the business logic, a thread could write multiple times – Varun Maurya Mar 16 '17 at 15:46
  • 2
    This may be off topic, but isn't that a perfect map-reduce problem that you are trying to recreate? It might be worth looking into splitting the scalability/threading part into one of the existing frameworks and then only focusing on the actual business logic. – pandaadb Mar 16 '17 at 15:50
  • I agree that the threads might be spending too much time blocked by the ConcurrentHashMap. The answer would be to redesign your asynchronous solution with as few critical sections as possible – nandsito Mar 16 '17 at 15:57
  • **What if I use a Future?** Instead of passing a common HashMap to all threads, let each thread create its own map and return it through a Future, which I can add to another collection. That way no one has to wait on any lock. Does that sound good? (See the sketch after these comments.) – Varun Maurya Mar 16 '17 at 16:06
  • Why don't you redesign your approach? You can set up a thread that is solely responsible for writing to a hash map (**ONLY** writing!). Have the data supplied by a fixed set of reading streams that read and handle the data (don't spawn one for each line). Introduce queues, etc. – KarelG Mar 16 '17 at 16:12
  • If each record is independent, then what needs to be synchronized? Can you put the record, as part of a composite object, in the Map before handing off the composite object (not the Map) to a thread? The thread will do its work on the object and be unaware of the Map. Once all the threads are complete, it's safe to use the Map for additional processing. – Andrew S Mar 16 '17 at 16:24
  • 2
    Re, "...everytime we are killing the executor..." That is completely contrary to how a thread pool is supposed to be used. Unless your application is unusually complicated, you should create _just one_ ExecutorService when your program starts up, configure it appropriately for the number of cores on your machine and the type of work that it's going to do, and then use that one ExecutorService to perform all of your "background"/"parallel" tasks. – Solomon Slow Mar 16 '17 at 17:48
  • Does it look suspicious to anybody else that at the points where some thread stops working, a few other threads stop at exactly the same time, and moreover they are all released at the same time afterwards? Also, @Varun says that "we are spawning only 10 threads as of now", but I clearly see 15 threads in the image. To me this looks like either there are actually many more threads than the OP believes, or there is contention around a **single** lock held by some thread for a long time, most probably not the one in the image. – SergGr Mar 16 '17 at 18:10
  • @SergGr In the actual environment we spawn 10 threads. The screenshot I shared is from my local machine, where I changed the number of threads to 15 – Varun Maurya Mar 16 '17 at 19:41
  • 3
    You are on the right track with "let each thread create its own map and return it through a Future, which I can add to another collection". Right now, every time two threads want to write to your HashMap, one will wait for the other to finish, which is definitely an obvious bottleneck in your application – Adonis Mar 19 '17 at 14:28
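
A minimal sketch of the per-thread-map idea from the comments above (assuming Java 9+ for `List.of`; the class name and the `toUpperCase` stand-in for record processing are illustrative, not from the original post). Each worker populates its own local map and hands it back through a Future, so nothing is locked while records are processed; the maps are merged once on the collecting thread:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerThreadMapSketch {
    public static void main(String[] args) throws Exception {
        List<String> lines = List.of("record1", "record2", "record3");
        ExecutorService pool = Executors.newFixedThreadPool(10);

        // Each task fills its own local map -- no shared lock during processing.
        List<Future<Map<String, String>>> futures = new ArrayList<>();
        for (String line : lines) {
            futures.add(pool.submit(() -> {
                Map<String, String> local = new HashMap<>();
                local.put(line, line.toUpperCase()); // stand-in for real processing
                return local;
            }));
        }

        // Merge once, on the collecting thread, after the workers finish.
        Map<String, String> merged = new HashMap<>();
        for (Future<Map<String, String>> f : futures) {
            merged.putAll(f.get());
        }
        pool.shutdown();
        System.out.println(merged);
    }
}
```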

2 Answers

1

I found that these threads spend far more time waiting than running. I suppose that is because of the lock a thread acquires while writing to the ConcurrentHashMap.

That's not a good assumption - ConcurrentHashMap is quite efficient and designed to be used concurrently. Even if it does have some contention, it's far from the first place I would look in a case like this.

What other work are these threads doing? I/O is a blocking operation (and synchronous, if reading/writing to the same disk) and if multiple threads are doing I/O that's going to impact your throughput orders-of-magnitude more than ConcurrentHashMap contention.

Instead of having each thread do its own I/O consider having a dedicated I/O thread that reads what's needed from disk and dispatches that data to dedicated processing threads via an executor. The I/O thread can then write the results back to disk (assuming that's desired) as the futures complete. Using Java's async I/O framework would also allow you to avoid idling threads.
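A rough sketch of that layout (the file name `records.txt` and the `toUpperCase` stand-in for record processing are illustrative, not from the original post): one thread does all the reading and result collection, while a fixed pool does the processing:

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SingleReaderSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        CompletionService<String> results = new ExecutorCompletionService<>(workers);

        // One thread (here: main) does all the disk reads.
        int submitted = 0;
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("records.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                final String record = line;
                results.submit(() -> record.toUpperCase()); // stand-in for processing
                submitted++;
            }
        }

        // The same thread collects results as they complete, and could write
        // them back to disk here, keeping all I/O on one thread.
        for (int i = 0; i < submitted; i++) {
            String processed = results.take().get();
            // handle `processed`
        }
        workers.shutdown();
    }
}
```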

dimo414
  • 47,227
  • 18
  • 148
  • 244
0

Just to let everyone know: the waits in the worker threads were due to the DB calls each worker thread was making, and the pool had only 10 DB connections available at a time. The ConcurrentHashMap did show some waiting when we increased the number of worker threads, but that delay is very small and acceptable.
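For anyone hitting the same symptom, a minimal sketch of the fix this implies: size the worker pool to the connection pool so threads don't queue up waiting for a connection. The figure of 10 connections comes from this answer; the class and method names are illustrative:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizingSketch {
    // Matches the 10 connections the DB pool can hand out at once.
    private static final int DB_CONNECTIONS = 10;

    public static void main(String[] args) {
        // Extra workers beyond the connection count just block waiting for
        // a connection -- which shows up as "waiting" time in VisualVM.
        ExecutorService workers = Executors.newFixedThreadPool(DB_CONNECTIONS);
        for (int i = 0; i < 100; i++) {
            final int recordId = i;
            workers.submit(() -> processWithDb(recordId));
        }
        workers.shutdown();
    }

    private static void processWithDb(int recordId) {
        // stand-in for the real JDBC work
    }
}
```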

Thanks everyone for your suggestions.

Varun Maurya
  • 337
  • 4
  • 18