2

I have a Java program that visits some websites, converts each site's HTML into XML, runs some XQuery commands on the XML, and finally stores the result as CSV, which is then uploaded to cloud file storage (such as Amazon S3).

Now I want to split the work across multiple threads so that it finishes faster, but how do I determine the optimum number of threads for my work?

I want to determine the number of threads that I should allow for each of the different types of Amazon EC2 instances. Is there a library or framework that can help me with this?

Or, do I have to manually run the code on an Amazon EC2 instance, and keep changing the number of threads, and measure the time taken?

Specifically, I want to balance the total time taken to process all the work against the number of threads that are allowed to run simultaneously. It would be great if I could clearly see this relationship for different servers with different CPU/RAM capacities. Any advice or guidance would be appreciated.

Arvind
  • As a general guideline, running more threads than the machine has cores fails to speed up processing, though there are exceptions, especially if the threads spend time waiting on other tasks to complete and so are not fully utilized. – HenryZhang Aug 17 '12 at 16:35

3 Answers

4

The type of work you describe is almost certainly I/O bound: most of the time is spent waiting for data to be downloaded or uploaded. If so, your goal is simply to make full use of the available upload and download bandwidth.

In that case, the optimal number of threads will be greater than the number of physical cores on the machine (the core count would be the right place to start for a CPU-bound process).

It's hard to say from this information what the optimum number of threads will be, as it depends on how much you're downloading and how fast the link is. Try doubling the number of threads until performance starts to suffer.
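The doubling experiment could be scripted along these lines. This is only a minimal sketch: the class name `ThreadSweep`, its helper methods, and the 50 ms sleep that stands in for one download/convert/upload unit are all assumptions for illustration, not part of the original program. Replace the placeholder task with a real unit of work and run it on each EC2 instance type of interest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: times the same batch of tasks at 1, 2, 4, ... threads.
final class ThreadSweep {

    /** Thread counts to try: powers of two that do not exceed maxThreads. */
    static List<Integer> candidates(int maxThreads) {
        List<Integer> counts = new ArrayList<>();
        for (int n = 1; n <= maxThreads; n *= 2) {
            counts.add(n);
        }
        return counts;
    }

    /** Runs taskCount copies of task on a fixed-size pool and returns elapsed milliseconds. */
    static long timeRun(int threads, int taskCount, Runnable task) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                futures.add(pool.submit(task));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for completion and surface any task failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Placeholder for one fetch -> convert -> query -> upload unit (assumption: simulated by a sleep).
        Runnable fakeSite = () -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        for (int threads : candidates(64)) {
            long ms = timeRun(threads, 64, fakeSite);
            System.out.println(threads + " threads: " + ms + " ms");
        }
    }
}
```

With a real task, the point at which the elapsed time stops dropping (or starts rising) is a reasonable upper bound for that machine and network link.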

Sean Owen
2

I think you should profile your app with a single thread using jhat, MAT, etc., and then decide how many threads to run based on the machine's configuration. Profiling will give you a general idea of how expensive each thread is. You can then run a load test (for example, 10,000 items queued up against 10 threads) to validate the limits that you came up with, and tune accordingly.

Nishant
1

To find the number of logical cores available you can use:

int processors = Runtime.getRuntime().availableProcessors();

and create a thread pool with that many threads. See also:

Finding Number of Cores in Java

Java: How to scale threads according to cpu cores?
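A minimal sketch of that approach follows. The class name `CorePool` and the `multiplier` parameter are assumptions added for illustration: a multiplier of 1.0 gives one thread per logical core, and a larger value is one way to oversubscribe when the tasks mostly wait on I/O.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: sizes a fixed thread pool from the logical core count.
final class CorePool {

    /** Pool size = cores * multiplier, rounded, and never less than 1. */
    static int poolSize(int cores, double multiplier) {
        return Math.max(1, (int) Math.round(cores * multiplier));
    }

    public static void main(String[] args) throws InterruptedException {
        // Logical processors visible to the JVM on this machine.
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize(cores, 1.0));

        // Submit a few placeholder tasks; real code would submit one task per website.
        for (int i = 0; i < 10; i++) {
            final int id = i;
            pool.submit(() ->
                System.out.println("task " + id + " on " + Thread.currentThread().getName()));
        }

        pool.shutdown();                             // stop accepting new tasks
        pool.awaitTermination(1, TimeUnit.MINUTES);  // wait for submitted tasks to finish
    }
}
```

A fixed pool keeps the thread count constant no matter how many tasks are queued, so the same code can be run unchanged on machines with different core counts.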

Garrett Hall
  • I have one more question for you. As far as I understand, an Amazon EC2 instance (or any other VPS or cloud-based virtual machine instance) is assigned only part of the CPU and memory of a typical server. Will the above method work correctly with such virtual machines? – Arvind Aug 17 '12 at 16:57
  • I would expect it to work if Amazon's virtual cores map to the actual cores (or hyperthreaded cores) you are given; it would be misleading if they charged you for additional virtual cores but didn't give you real cores. – Garrett Hall Aug 17 '12 at 17:21