0

Given a set of 1 million (very large) no. of URL's. Find the "first" "unique" URL from the list.

My Approach: Build a hash using perfect hashing function, that can help. But my question is to hash large data is not possible., then how can I solve this question.

Is there any method to do inplace? Please help. Thanks in advance.

devsda
  • 4,112
  • 9
  • 50
  • 87

1 Answers1

1

Given an input list of ["c","a","b","a","c"], my first approach would be:

  1. Convert the list of URLs into a list of tuples which associates each element which its position in the list. Now you have [(0,"c"),(1,"a"),(2,"b"),(3,"a"),(4,"c")].
  2. Sort the list lexicographically by the second tuple element (the URL). Now you have [(1,"a"),(3,"a"),(2,"b"),(0,"c"),(4,"c")].
  3. Group sequences of subsequent equal tuples (a tuple is equal if the second element equals) into sub-lists. Now you have [[(1,"a"),(3,"a")],[(2,"b")],[(0,"c"),(4,"c")]].
  4. Filter the list so that you only have lists of length 1. Now you have [[(2,"b")]].
  5. If the resulting list is empty, there is no unique URL in the list. If it is non-empty, Sort the list by the first tuple element (the position in the string). In this case, you get the same list back - [[(2,"b")]].
  6. take the first element of the list. Now you have [(2,"b")].
  7. The (only) tuple in this list tells you the first unique URL, and the position in the input list: it's the URL b at position 2 in the input list.
Frerich Raabe
  • 90,689
  • 19
  • 115
  • 207
  • 1
    but Frerich, I think you tried to implement Map/Reduce concept. First map, then sort, then make pair. – devsda Aug 31 '12 at 07:14
  • But this can't help, This may possible that Z comes first, A comes last, both occurs only one time. But after sorting A comes before than Z. and gives anser as A, but actually answer is Z – devsda Aug 31 '12 at 07:17
  • @jhamb: So? I was thinking of Haskell while writing this algorithm (that's why the list/tuple notation is like that). I think it would be rather straightforward to implement and I suspect that Haskell's lazyness could make this reasonably efficient, too. – Frerich Raabe Aug 31 '12 at 07:18
  • @jhamb: No, that's why in step (5) you sort the list of unique characters again, this time by their position (the first tuple element), so e.g. `[(2,"a"),(0,"z")]` becomes `[(0,"z"),(2,"a")]`. – Frerich Raabe Aug 31 '12 at 07:19
  • So, which data structure that you use to implement this? To check URL is it necessary to parse URL? – devsda Aug 31 '12 at 07:25