2

I have a set of strings, along with their coordinates and rectangular bounds, in two similar pages. The strings can differ between the two pages in four possible ways: (i) a string can be moved to a new location on the page; (ii) a string is in the same location but modified (say, abc --> abd or ABC); (iii) a combination of (i) and (ii); (iv) a string might be missing.

I tried to use locality-sensitive hashing but couldn't find a good hash function for this. Can anyone suggest a good hash function, or another algorithm, to solve this problem? Thanks in advance.

programer8
  • What's your cost function? For example, how many characters can be different in a match before you'd prefer the algorithm to instead report that the string's missing? – Andy Jones Dec 09 '13 at 23:44
  • For now it's min(wordsize/2, 4). – programer8 Dec 09 '13 at 23:57
  • Couple of other things: is every string in the "target" page present in the "source" page? Do the strings change length, or are there only letter substitutions? Are the strings disjoint? – Andy Jones Dec 10 '13 at 00:07
  • Oh, and do you know the boundaries between the strings in the target page as well as in the source page? – Andy Jones Dec 10 '13 at 00:13
  • Not necessarily; that's listed as case (iv). And no, the strings need not be disjoint: a similar string can be present in many locations. Yes, I know the location of each string in the page, including its rectangular bounds. – programer8 Dec 10 '13 at 00:33
  • Okay. Is this an accurate phrasing of your problem then: "Given a set of strings S and another set T such that |S| >= |T|, find an injective map from S to T that minimizes the sum of a cost function?" – Andy Jones Dec 10 '13 at 00:45
  • Nope, not exactly. A few strings in S might not have a match in T, and a few strings in T might not have a match in S. – programer8 Dec 10 '13 at 00:48

1 Answer

2

So we have a list of source strings S and a list of target strings T of size at most |S|. We want to find a way to assign each string in T to a distinct string in S such that the total number of mismatched characters is minimized.

(Note that because we're looking for a way to match T to S, the case where a string in S is missing is dealt with implicitly.)

If this is an accurate interpretation of your problem @programer8, I believe this is an assignment problem and can be solved by the Hungarian algorithm in cubic time: the "workers" referred to in the wiki article are your target strings, the "tasks" are the source strings, and the number of mismatched characters between a source and a target string is the cost of a worker performing a task.

The only hiccup is you have fewer workers/target strings than tasks/source strings, but you can remedy that by adding dummy workers.
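
As a rough illustration of the idea, here is a minimal sketch in pure Python. It builds the cost matrix, pads it with zero-cost dummy rows or columns until it is square, and runs a standard O(n^3) Hungarian-style solver. The names `mismatch_cost`, `hungarian` and `match_strings` are made up for this example, and the cost (differing positions plus the length difference) is only an assumption about what "mismatched characters" should mean when the lengths differ.

```python
def mismatch_cost(a, b):
    """Assumed cost: differing positions plus the difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def hungarian(cost):
    """Minimum-cost assignment for a square matrix.

    Returns a list where entry r is the column assigned to row r.
    """
    n = len(cost)
    inf = float("inf")
    u = [0] * (n + 1)    # row potentials (1-based)
    v = [0] * (n + 1)    # column potentials (1-based)
    p = [0] * (n + 1)    # p[j] = row currently matched to column j, 0 = none
    way = [0] * (n + 1)  # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:  # reached a free column
                break
        while j0:  # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    row_to_col = [0] * n
    for j in range(1, n + 1):
        row_to_col[p[j] - 1] = j - 1
    return row_to_col


def match_strings(source, target):
    """Map each target index to a distinct source index at minimum total cost."""
    n = max(len(source), len(target))
    # Rows are target strings ("workers"), columns are source strings ("tasks").
    # Cells outside the real data stay 0: those are the dummy workers/tasks.
    cost = [[0] * n for _ in range(n)]
    for i, t in enumerate(target):
        for j, s in enumerate(source):
            cost[i][j] = mismatch_cost(t, s)
    row_to_col = hungarian(cost)
    return {i: j for i, j in enumerate(row_to_col)
            if i < len(target) and j < len(source)}


source = ["abc", "hello", "world", "foo"]
target = ["abd", "worle", "hellp"]
for i, j in match_strings(source, target).items():
    print(target[i], "->", source[j])
```

Any source string that ends up paired with a dummy row is simply left out of the result, which is how the "missing" case falls out of the assignment.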

Andy Jones
  • I checked this algorithm a few days ago. It's not feasible because of the run time, but it would get the job done. – programer8 Dec 10 '13 at 01:41
  • What runtime does it need to be feasible? – Andy Jones Dec 10 '13 at 01:44
  • And are you willing to accept approximations to the solution? If so, how good do they need to be? – Andy Jones Dec 10 '13 at 01:56
  • If I do accept approximations, can you suggest a better algorithm? Do you already have one in mind? – programer8 Dec 10 '13 at 05:37
  • I was thinking along the lines of a greedy algorithm like [this one](http://download.peerialism.com/papers/ExtendedAbstract.pdf), but I've realised that even if you can find a sub-cubic time solution for the assignment problem, it doesn't help much because getting the O(n^2) costs can take up to cubic time anyway. To go any faster I think you're right that we need to exploit something about the string-y nature of the problem, rather than abstracting away to a general assignment problem. Will keep thinking about it. – Andy Jones Dec 10 '13 at 07:47
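
For reference, a generic greedy approximation might look like the sketch below. This is not the algorithm from the linked paper, just a simple cheapest-pair-first heuristic. The function names are hypothetical, the cost (differing positions plus the length difference) is an assumption, and the cutoff reuses the min(wordsize/2, 4) threshold from the comments on the question. As noted above, it still computes every pairwise cost, so it does not get around the cost-matrix bottleneck.

```python
def mismatch_cost(a, b):
    """Assumed cost: differing positions plus the difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def greedy_match(source, target):
    """Greedily pair target strings with source strings, cheapest pair first.

    Returns {target_index: source_index}. Targets with no pair under the
    threshold are left out and can be reported as missing.
    """
    # All (cost, target index, source index) triples, cheapest first.
    pairs = sorted(
        (mismatch_cost(t, s), i, j)
        for i, t in enumerate(target)
        for j, s in enumerate(source)
    )
    used_targets, used_sources, matches = set(), set(), {}
    for cost, i, j in pairs:
        if i in used_targets or j in used_sources:
            continue
        # Threshold from the question's comments: min(wordsize/2, 4).
        if cost > min(len(target[i]) / 2, 4):
            continue
        matches[i] = j
        used_targets.add(i)
        used_sources.add(j)
    return matches


source = ["abc", "hello", "world", "foo"]
target = ["abd", "worle", "hellp", "zzzzzz"]
for i, j in greedy_match(source, target).items():
    print(target[i], "->", source[j])
```

Unlike the assignment-problem formulation, this can lock in a cheap pair early and miss the globally optimal matching, so how acceptable it is depends on how good the approximation needs to be.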