I have a Java program that visits some websites, converts each site's HTML into XML, runs some XQuery commands on the XML, and finally stores the result as CSV, which is then uploaded to cloud file storage (like Amazon S3).
Now I want to split the work across multiple threads so that it finishes faster, but how do I determine the optimum number of threads for this workload?
I want to determine the number of threads I should allow for the different types of Amazon EC2 instances. Is there a library or framework that can help me with this?
Or do I have to manually run the code on an Amazon EC2 instance, keep changing the number of threads, and measure the time taken each time?
Specifically, I want to balance the total time taken to process all the work against the number of threads that are allowed to run simultaneously. It would be great if I could clearly see this relationship for different servers with different CPU/RAM capacities. Any advice or guidance would be appreciated.
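For reference, the manual approach I mentioned would look roughly like the sketch below. This is only an illustration, not my real code: `processSite` is a hypothetical placeholder for the fetch/convert/query/upload step (here it just sleeps to simulate waiting on the network), and the batch size, sleep time, and list of thread counts are arbitrary assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class ThreadCountBenchmark {

    // Hypothetical stand-in for the real per-site work (download, HTML to XML,
    // XQuery, CSV, upload). It only sleeps, to simulate I/O wait.
    static void processSite(int siteId, long simulatedIoMillis) throws InterruptedException {
        Thread.sleep(simulatedIoMillis);
    }

    // Runs `sites` tasks on a fixed pool of `threads` threads and returns
    // the wall-clock time for the whole batch in milliseconds.
    static long timeBatch(int threads, int sites, long simulatedIoMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (int i = 0; i < sites; i++) {
                final int siteId = i;
                tasks.add(() -> {
                    processSite(siteId, simulatedIoMillis);
                    return null;
                });
            }
            long start = System.nanoTime();
            // invokeAll blocks until every task has completed.
            for (Future<Void> f : pool.invokeAll(tasks)) {
                f.get(); // rethrows if a task failed
            }
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Available cores: " + Runtime.getRuntime().availableProcessors());
        // Arbitrary thread counts to try; each run processes the same batch.
        for (int threads : new int[] {1, 2, 4, 8, 16, 32}) {
            long ms = timeBatch(threads, 64, 50);
            System.out.println(threads + " threads -> " + ms + " ms");
        }
    }
}
```

Running something like this once per EC2 instance type would give me a table of thread count versus elapsed time, but I would prefer not to hand-tune every instance type if a library or framework can do this for me.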