
I am working on a plagiarism-detection framework in Java. My document set contains about 100 documents, and I have to preprocess them and store the results in a suitable data structure. My big question is how I am going to process this set of documents efficiently while avoiding bottlenecks. The main focus of my question is how to improve preprocessing performance.

Thanks

Regards, Nuwan

  • Improve what performance? You haven't written anything yet, so you don't know what is or what might be a bottleneck. We don't have enough information to guess what type of preprocessing you are doing. 100 documents doesn't seem like a large number to me. – camickr Apr 17 '11 at 04:26
  • 1
    You should make your question more specific by providing some information about what format the documents start in and what target data structure looks like. In addition, you should provide some information about how long it currently takes the amount of time you need it to take. – ChrisH Apr 17 '11 at 04:27
  • 1
    100 documents is not large. 100,000 documents is large... – Mitch Wheat Apr 17 '11 at 05:06

2 Answers


You're a bit light on specifics there. The appropriate optimizations will depend on things like the document format, the average document size, how you are processing the documents, and what sort of information you are storing in your data structure. Not knowing any of these, some general optimizations are:

  1. Assuming that the preprocessing of a given document is independent of the preprocessing of any other document, and assuming you are running on a multi-core CPU, your workload is a good candidate for multi-threading. Allocate one thread per CPU core and farm out jobs to your threads; then you can process multiple documents in parallel (see the sketch after this list).

  2. More generally, do as much in memory as you can. Try to avoid reading from/writing to disk as much as possible. If you must write to disk, try to wait until you have all the data you want to write, and then write it all in a single batch.
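For example, here is a minimal sketch of both points, assuming Java 8+, plain-text .txt input, and a deliberately naive tokenizer; the docs directory, the tokens.tsv output name, and the tokenize method are illustrative placeholders, not part of any particular library:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class Preprocessor {

        public static void main(String[] args) throws Exception {
            Path docDir = Paths.get("docs");   // placeholder input directory
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);

            // 1. One task per document: documents are independent, so they
            //    can be tokenized in parallel, one thread per core.
            Map<Path, Future<List<String>>> pending = new LinkedHashMap<>();
            try (DirectoryStream<Path> docs = Files.newDirectoryStream(docDir, "*.txt")) {
                for (Path doc : docs) {
                    pending.put(doc, pool.submit(() -> tokenize(doc)));
                }
            }

            // 2. Keep everything in memory, then write once at the end
            //    instead of touching the disk per document.
            List<String> lines = new ArrayList<>();
            for (Map.Entry<Path, Future<List<String>>> e : pending.entrySet()) {
                lines.add(e.getKey() + "\t" + String.join(" ", e.getValue().get()));
            }
            pool.shutdown();
            Files.write(Paths.get("tokens.tsv"), lines, StandardCharsets.UTF_8);
        }

        // Naive tokenizer: lower-case the text and split on anything that
        // is not a letter. Swap in your real preprocessing here.
        static List<String> tokenize(Path doc) throws IOException {
            String text = new String(Files.readAllBytes(doc), StandardCharsets.UTF_8);
            List<String> tokens = new ArrayList<>();
            for (String t : text.toLowerCase().split("[^\\p{L}]+")) {
                if (!t.isEmpty()) tokens.add(t);
            }
            return tokens;
        }
    }

Sizing the pool to availableProcessors() gives one thread per core, and collecting every result in memory before the single Files.write call keeps disk I/O down to one batch.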

aroth
  • Thanks for the reply. My problem is that I want to read 100s of documents, and if I do it in a sequential manner it will take a lot of time. So I want an efficient parallel algorithm (using threads etc.) to read the documents fast. And the tokens read from a document should be stored in a suitable data structure for document comparison purposes. – Nuwan Apr 22 '11 at 03:35

You give very little information on which to base any good suggestions.

My default would be to process them using an executor with a thread pool that has the same number of threads as your machine has cores, with each thread processing one document. A compact sketch of that is below.
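Concretely, that might look like the following, assuming Java 8+; the tokenize call reuses the placeholder tokenizer from the sketch in the other answer and stands in for whatever per-document preprocessing you actually do:

    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelPreprocess {

        public static void main(String[] args) throws Exception {
            List<Path> docs = new ArrayList<>();
            try (DirectoryStream<Path> ds = Files.newDirectoryStream(Paths.get("docs"))) {
                for (Path p : ds) docs.add(p);
            }

            // A fixed pool with as many threads as cores; one Callable per document.
            ExecutorService pool =
                    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (Path doc : docs) {
                tasks.add(() -> Preprocessor.tokenize(doc)); // placeholder tokenizer from the sketch above
            }

            // invokeAll blocks until every document has been processed.
            List<Future<List<String>>> results = pool.invokeAll(tasks);
            pool.shutdown();
            for (int i = 0; i < docs.size(); i++) {
                System.out.println(docs.get(i) + ": " + results.get(i).get().size() + " tokens");
            }
        }
    }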

Tom
  • Ok, I got your point. Thanks. Another thing I want to know is: what is the most efficient data structure for storing the preprocessed tokens of the documents? This is crucial because I have to deal with these preprocessed document tokens (words) frequently in the document comparison stage. Thanks – Nuwan Apr 22 '11 at 03:39
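There is no single "most efficient" structure without knowing the comparison algorithm, but a common choice in plagiarism detection is a per-document term-frequency map plus an inverted index, so that "which other documents share this token?" becomes a cheap lookup. A sketch under those assumptions, with all class and method names illustrative (assumes Java 8+):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class TokenIndex {
        // docId -> (token -> occurrence count), useful for frequency-based similarity
        private final Map<String, Map<String, Integer>> termFreqs = new HashMap<>();
        // token -> ids of documents containing it (the inverted index)
        private final Map<String, Set<String>> invertedIndex = new HashMap<>();

        // HashMap is not thread-safe: call this from a single thread after
        // the parallel tokenization finishes, or add synchronization.
        public void addDocument(String docId, List<String> tokens) {
            Map<String, Integer> freqs = new HashMap<>();
            for (String token : tokens) {
                freqs.merge(token, 1, Integer::sum);
                invertedIndex.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
            }
            termFreqs.put(docId, freqs);
        }

        // All documents sharing at least one token with docId: cheap
        // candidate filtering before any expensive pairwise comparison.
        public Set<String> candidates(String docId) {
            Set<String> result = new HashSet<>();
            for (String token : termFreqs.getOrDefault(docId, Collections.emptyMap()).keySet()) {
                result.addAll(invertedIndex.getOrDefault(token, Collections.emptySet()));
            }
            result.remove(docId);
            return result;
        }
    }

The inverted index lets the comparison stage skip document pairs with no tokens in common, and the term-frequency maps are what you would feed into whatever similarity measure you choose.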