A Java project I created, currently live in PROD, is I/O intensive. I want to refactor it to improve performance. Nobody has asked me to, but I feel there is still scope for improvement, so I would rather handle it before it's too late. A few steps can be parallelized to make better use of multiple cores.
What does the service do?
It is a web service that simply ingests a file and uploads it via SFTP to a remote SFTP server over the network (over the internet, not within the company intranet). There are 2 SFTP sites, and the service decides which server to upload to based on metadata sent in the request itself. It also has 2 jobs that run periodically, polling these 2 SFTP sites with a delay of 5 minutes and pulling any zip files that are available.
What does the job do? The job pulls all available zips to a local folder one by one. It then processes each zip (by looping over the collection of zips). First it extracts the zip, then takes one PDF file and sends it to another web service (say service 1) within the company network. Then it takes one XML file, parses it, extracts certain data from it, and passes that data to another service (say service 2).
What do I plan to do? That's too much work for a single job, so I plan to split it: the job will just pull zips into the local folder and push their names onto a BlockingQueue, which will trigger another job that does the processing. That way, extracting and processing zips can run in parallel with pulling zips from the remote SFTP server. My question is this: pulling a zip from remote to local and processing a zip locally are both I/O operations, but since the first is network I/O and the second is local file I/O, I think they use different data channels/buses, so parallelizing them should improve performance. I need to do this because in the near future the number of zips is going to increase to, say, thousands at a time, which would be very slow with the current implementation.
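To make the proposed split concrete, here is a minimal sketch of the producer/consumer idea using a bounded BlockingQueue. The class name `ZipPipelineSketch`, the `download` and `process` methods, the queue capacity of 100, and the `DONE` sentinel are all hypothetical placeholders for illustration; the real SFTP pull and the zip/PDF/XML handling are not shown.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class ZipPipelineSketch {
    // Sentinel ("poison pill") telling a worker that no more zips will arrive.
    private static final String DONE = "__DONE__";

    // Hypothetical stand-in for the SFTP download; returns the local file name.
    static String download(String remoteName) {
        return "local/" + remoteName;
    }

    // Hypothetical stand-in for: extract zip, send PDF to service 1, parse XML for service 2.
    static String process(String localZip) {
        return "processed:" + localZip;
    }

    static List<String> runPipeline(List<String> remoteZips, int workers)
            throws InterruptedException {
        // Bounded queue: the puller blocks if processing falls too far behind.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(100);
        List<String> results = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Consumers: each worker takes zip names until it sees the sentinel.
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        String zip = queue.take();
                        if (zip.equals(DONE)) {
                            break;
                        }
                        results.add(process(zip));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer: pull each zip and hand its local name to the workers.
        for (String name : remoteZips) {
            queue.put(download(name));
        }
        // One sentinel per worker so that every worker terminates.
        for (int i = 0; i < workers; i++) {
            queue.put(DONE);
        }

        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            pool.shutdownNow();
        }
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> out = runPipeline(List.of("a.zip", "b.zip", "c.zip"), 2);
        System.out.println(out.size() + " zips processed");
    }
}
```

With this shape, processing of the first zip can start while later zips are still downloading, and the bounded queue keeps the local folder from filling up faster than the workers can drain it.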
I will also implement a connection pool for the SFTP connections (currently there is none, and I realize it's a must). Also, for the 2 proposed jobs
1) pulling zips from remote and
2) processing zips locally
I will use thread pools (according to the tutorial Parallel and Asynchronous Programming, if the service is I/O intensive the number of threads can be as high as 10 times the number of cores. Of course, benchmarking needs to be done, but conceptually that's a good starting point).
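As a rough illustration of where a figure like "10 times the cores" can come from, the sketch below uses the common sizing heuristic threads = cores * (1 + wait time / compute time). The class name `PoolSizingSketch`, the method `ioPoolSize`, and the 90 ms / 10 ms timings are assumptions for illustration only, not measurements from this service.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PoolSizingSketch {
    // Heuristic: threads = cores * (1 + waitTime / computeTime).
    // A task that mostly waits on I/O justifies many more threads than cores.
    static int ioPoolSize(int cores, double waitMillis, double computeMillis) {
        long size = Math.round(cores * (1 + waitMillis / computeMillis));
        return (int) Math.max(1, size);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Assumed (not measured): each task waits ~90 ms on I/O per ~10 ms of CPU work.
        int size = ioPoolSize(cores, 90, 10);
        ExecutorService pool = Executors.newFixedThreadPool(size);
        System.out.println("cores=" + cores + ", pool size=" + size);
        pool.shutdown();
    }
}
```

With the assumed 90:10 wait-to-compute ratio, the formula gives exactly 10 threads per core. The real ratio would have to come from profiling, and the remote SFTP servers may cap the number of concurrent connections well below that.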
Does this restructuring make sense? What else can be done?