0

I have a data mining app.

There is 1 Mining Actor which receives and processes a Json containing 1000 objects. I put this into a list and foreach, I log the data by sending it to 1 Logger Actor which logs data into many files.

Processing the list sequentially, my app uses 700MB and takes ~15 seconds of 20% cpu power to process (4 core cpu). When I parallelize the list, my app uses 2GB and ~ the same amount of time and cpu to process.

My questions are:

  1. Since I parallelized the list and thus the computation, shouldn't the compute-time decrease? I think having only one Logger Actor is a bottleneck in this case. The computation may be faster but the bottleneck hides the speed increase. So if I add more Loggers to the pool, the app time should decrease?

  2. Why does the memory usage jump to 2GB? Does the JVM have to store the entire collection in memory to parallelize it? And after the computation is done, the JVM garbage collector should deal with it?

Cœur
  • 37,241
  • 25
  • 195
  • 267
kliew
  • 3,073
  • 1
  • 14
  • 25

2 Answers2

0

Without more details, any answer is a guess. However, even a guess might point you to the right direction.

  1. Parallelized execution should decrease the running time but your problem might lie elsewhere. For some reason, your CPU is idling a lot even in the single-threaded mode. You do not specify whether you read the input from disk or the network or where you write your output to. You explicitly say that you write logs to a lot of files. Disk and network reading/writing might in your case take much longer than data processing. Most probably your process is idle due to this I/O waiting. You should not expect any speedups from parallelizing a job that spends 80% of its time waiting on I/O. I therefore also suspect that loggers are not the bottleneck here.
  2. The memory usage might jump if your threads allocate a lot of memory each. In that case, the more threads you have the more memory will be required. I don't know what kind of collection you are parallelizing on, but most are stored in memory, completely. Yes, the garbage collector will free any resources that do not require you to explicitly free them, such as files.
nedim
  • 1,767
  • 1
  • 18
  • 20
0
  1. How many threads for reading and writing to the hard disk?
  2. The memory increases because I send messages faster than the Logger can write, so the Mailbox balloons in size until the Logger has processed the messages and the GC kicks in.

I solved this by writing state to a protocol buffer file. Before doing any writes, I compare with the protobuf file because reads are significantly cheaper than writes. My resource usage is now 10% for 2 seconds, and less than 400MB RAM.

Community
  • 1
  • 1
kliew
  • 3,073
  • 1
  • 14
  • 25