If you have documents d1, d2, d3, d4 and you compared every document with every other document, that would be O(N^2) comparisons. However, assuming that comparing d1 to d2 gives the same result as comparing d2 to d1, you can optimize there: you only need to compare d1-d2, d1-d3, d1-d4, d2-d3, d2-d4, d3-d4, which is N(N-1)/2 pairs. That's still O(N^2), but half the work.
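To make the pair count concrete, here's a minimal Python sketch (the document names are just placeholders) that enumerates each unordered pair exactly once:

```python
from itertools import combinations

docs = ["d1", "d2", "d3", "d4"]

# Every unordered pair exactly once: N * (N - 1) / 2 of them.
pairs = list(combinations(docs, 2))
print(pairs)       # [('d1', 'd2'), ('d1', 'd3'), ('d1', 'd4'), ('d2', 'd3'), ('d2', 'd4'), ('d3', 'd4')]
print(len(pairs))  # 6 == 4 * 3 / 2
```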
Perhaps start by building a map of all comparisons that need to be done. Then, split that map into X equal-sized collections, where X is the number of processes you want to run. Finally, spin off that many threads (or farm the work out to that many servers), let them run, then merge the results back together.
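A rough Python sketch of that idea, assuming everything runs on one machine with a multiprocessing pool (the `compare` function and the chunk count are placeholders; swap in whatever your real comparison does):

```python
from itertools import combinations
from multiprocessing import Pool

def compare(pair):
    # Placeholder: replace with your actual document comparison.
    a, b = pair
    return (a, b, len(set(a) & set(b)))  # e.g. some similarity score

def chunk(items, n):
    """Split items into n roughly equal-sized lists."""
    return [items[i::n] for i in range(n)]

def compare_chunk(pairs):
    return [compare(p) for p in pairs]

if __name__ == "__main__":
    docs = ["d1", "d2", "d3", "d4"]
    pairs = list(combinations(docs, 2))   # the "map" of all work to be done
    num_procs = 2                         # X: however many processes you want

    with Pool(num_procs) as pool:
        partial_results = pool.map(compare_chunk, chunk(pairs, num_procs))

    # Merge the per-process results back together.
    results = [r for part in partial_results for r in part]
    print(results)
```

With `multiprocessing.Pool` you could also just call `pool.map(compare, pairs)` and let it handle the chunking; the explicit split above is closer to what you'd do if you were farming the chunks out to separate servers instead.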
If you need to preprocess each document individually (a step where the pairwise comparisons don't come into play yet), then just break that work up across as many processes as you want, and distribute it among them. Without really knowing what kind of preprocessing, comparison, and document types you're dealing with, I can't get into much more detail than that.
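The per-document preprocessing parallelizes the same way, since each document is independent of the others. A hedged sketch, with `preprocess` standing in for whatever your pipeline actually does:

```python
from multiprocessing import Pool

def preprocess(doc):
    # Placeholder: tokenizing, hashing, feature extraction -- whatever you need.
    return doc.lower().split()

if __name__ == "__main__":
    docs = ["Some Document", "Another Document", "A Third One"]
    with Pool(3) as pool:
        preprocessed = pool.map(preprocess, docs)  # one document per task
    print(preprocessed)
```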