
Suppose we wish to implement Locality-Sensitive Hashing (LSH) with MapReduce. Specifically, assume chunks of the signature matrix consist of columns, and elements are key-value pairs where the key is the column number and the value is the signature itself (i.e., a vector of values).

(a) Show how to produce the buckets for all the bands as output of a single MapReduce process. Hint: Remember that a Map function can produce several key-value pairs from a single element.

(b) Show how another MapReduce process can convert the output of (a) to a list of pairs that need to be compared. Specifically, for each column i, there should be a list of those columns j > i with which i needs to be compared.

WeiYuan

1 Answer


(a)

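  • Map: takes an element (a column number and its signature) as input, splits the signature into bands, and emits one key-value pair (bucket_id, element) per band, where bucket_id identifies the band and the hash of that band's portion of the signature.
  • Reduce: groups the elements by key and outputs the buckets for all the bands, i.e. (bucket_id, list(elements)).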

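map(key, value):    // key = column number, value = signature vector
    bands = split value into b bands of r rows each
    for band_index, band in enumerate(bands):
        // hash the whole band once; keep the band index in the key
        // so that different bands never share a bucket
        bucket_id = (band_index, hash(band))
        collect(bucket_id, key)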

reduce(key, values):
    collect(key, values)
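
To make the idea in (a) concrete, here is a minimal in-memory sketch in Python. It is not tied to any real MapReduce framework: `band_map`, `bucket_reduce`, `run_lsh_buckets` and the parameter `rows_per_band` are hypothetical names chosen for this illustration, the shuffle step is simulated with a dictionary, and Python's built-in `hash` on a tuple stands in for the band hash function.

```python
from collections import defaultdict


def band_map(column_id, signature, rows_per_band):
    """Map: emit ((band_index, band_hash), column_id) once per band."""
    for start in range(0, len(signature), rows_per_band):
        band = tuple(signature[start:start + rows_per_band])
        # The band index is part of the key, so equal hashes in
        # different bands land in different buckets.
        yield (start // rows_per_band, hash(band)), column_id


def bucket_reduce(bucket_id, column_ids):
    """Reduce: a bucket is just the list of columns that share the key."""
    return bucket_id, sorted(column_ids)


def run_lsh_buckets(signatures, rows_per_band):
    """Simulate the shuffle: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for column_id, signature in signatures.items():
        for key, value in band_map(column_id, signature, rows_per_band):
            groups[key].append(value)
    return dict(bucket_reduce(k, v) for k, v in groups.items())


# Toy signature matrix (assumed data): 3 columns, 4 rows, 2 bands of 2 rows.
signatures = {0: [1, 2, 3, 4], 1: [1, 2, 9, 9], 2: [7, 7, 3, 4]}
buckets = run_lsh_buckets(signatures, rows_per_band=2)
```

In this toy input, columns 0 and 1 agree on the first band and columns 0 and 2 agree on the second, so those are the only buckets holding more than one column.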

(b)

  • Map: takes the output of (a) as input and, for each bucket, emits every pair of elements that share that bucket, i.e. every 2-element combination of the bucket's list(elements).
  • Reduce: outputs the pairs that need to be compared. Specifically, for each column i, it produces the list of those columns j > i with which i needs to be compared.

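map(key, value):    // key = bucket id, value = list of column numbers
    for i, j in combinations(sorted(value), 2):    // guarantees i < j
        collect(i, j)

reduce(key, values):    // key = column i, values = every j > i sharing a bucket with i
    // the same pair can come from several bands, so remove duplicates
    collect(key, sorted(unique(values)))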
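
One way to get the per-column lists that (b) asks for is to key the mapper output by the smaller column number and let the reducer deduplicate. The following Python sketch simulates that in memory. The names `pair_map`, `pair_reduce` and `run_candidate_pairs` are hypothetical, and the bucket ids in `example_buckets` are made-up placeholders rather than real hash values.

```python
from collections import defaultdict
from itertools import combinations


def pair_map(bucket_id, column_ids):
    """Map: for every pair i < j in a bucket, emit j keyed by i."""
    for i, j in combinations(sorted(set(column_ids)), 2):
        yield i, j


def pair_reduce(i, js):
    """Reduce: deduplicate partners of column i (a pair may match in several bands)."""
    return i, sorted(set(js))


def run_candidate_pairs(buckets):
    """Simulate the shuffle: group mapper output by column i, then reduce."""
    groups = defaultdict(list)
    for bucket_id, column_ids in buckets.items():
        for i, j in pair_map(bucket_id, column_ids):
            groups[i].append(j)
    return dict(pair_reduce(i, js) for i, js in groups.items())


# Assumed output of (a): keys are (band_index, bucket label) placeholders.
example_buckets = {
    (0, "a"): [0, 1],
    (0, "b"): [2],
    (1, "c"): [0, 2, 1],
    (1, "d"): [3],
}
candidates = run_candidate_pairs(example_buckets)
```

Columns 0 and 1 share a bucket in both bands, but the reducer lists column 1 only once for column 0. Columns that never share a bucket with a higher-numbered column, such as 2 and 3 here, do not appear as keys.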
WeiYuan